Intro to Assembly Optimization

Hello and welcome to the Intro to Assembly Optimization workshop!

The target audience is anyone who has a working understanding of programming and some familiarity with low-level concepts. This workshop explores the optimization of a “Hello World” program, taking it from C all the way down to a highly optimized, handwritten, 64-bit ELF binary.

This documentation contains notes that you can use to follow along during the stream.

Hello World in C: curl -sL n0.lol/i2ao/1
Hello World in Assembly: curl -sL n0.lol/i2ao/2
Optimizing Hello World in Assembly: curl -sL n0.lol/i2ao/3
Writing an ELF Binary from Scratch: curl -sL n0.lol/i2ao/4
Further Optimization: curl -sL n0.lol/i2ao/5

Tools

Required

A Linux System
GCC
GNU binutils
xxd or other hex editor
A text editor

Optional

Radare2 - For debugging
Linux Man Pages

Online Resources

Online Disassembler https://defuse.ca/online-x86-assembler.htm
Syscall Table https://syscalls64.paolostivanin.com/

Part 1: Hello World In C

#include <stdio.h>

int main() {
   printf("[^0^] u!!\n");
   return 0;
}

We begin with our Hello World Program. It is going to print “[^0^] u!!” with a new line at the end. Then it returns 0 to the operating system and exits.

Compile

 gcc hello.c -o hello

Run

./hello

Analyze

readelf -a hello
objdump -d hello -M intel

Part 2: Hello World In Assembly

CODE - bigsmile.asm

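; NASM section names are case-sensitive, so .DATA and .TEXT below are custom
; sections rather than the standard .data and .text. If the binary segfaults
; on a newer kernel, try lowercase .text so NASM marks the code executable.
SECTION .DATA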
  msg:     db "[^0^] u!!",10
  msgLen:  equ $-msg

SECTION .TEXT
GLOBAL _start

_start:
; Print ----------------------------------------------------------------
  mov rax, 1      ; Syscall 1
  mov rdi, 1      ; The File Descriptor - STDOUT
  mov rsi, msg    ; Pointer to the message
  mov rdx, msgLen ; Length of the Message
  syscall

; Exit -----------------------------------------------------------------
  mov rax, 60     ; Syscall 60
  mov rdi, 0      ; Error code - eg return 0;
  syscall

Build and Execute

nasm -f elf64 bigsmile.asm -o bigsmile.o
ld bigsmile.o -o bigsmile
./bigsmile

Read attributes of the ELF binary

readelf -a bigsmile

Disassembly

$ r2 bigsmile
...
[0x00400082]> pd
  ;-- entry0:
  ;-- section..TEXT:
  ;-- rip:
  0x00400082 b801000000     mov eax, 1
  0x00400087 bf01000000     mov edi, 1
  0x0040008c 48be78004000.  movabs rsi, 0x400078 ; section..DATA ;"[^0^] u!!\n"
  0x00400096 ba0a000000     mov edx, 0xa
  0x0040009b 0f05           syscall
  0x0040009d b83c000000     mov eax, 0x3c
  0x004000a2 bf00000000     mov edi, 0
  0x004000a7 0f05           syscall

Changes Made

Using syscalls directly instead of printf
Using nasm to assemble and ld to link
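
As a rough point of comparison, the same write can be made from C without printf. This is only a sketch: the names smile_msg and say_smile are made up for illustration, and write() is the libc wrapper around the write syscall rather than a raw syscall instruction.

```c
#include <string.h>
#include <unistd.h>

/* Same message as bigsmile.asm: nine characters plus a newline. */
static const char smile_msg[] = "[^0^] u!!\n";

/* Write the message to fd with write(2), the libc wrapper around the
 * write syscall. Returns the number of bytes written, or -1 on error. */
ssize_t say_smile(int fd)
{
    return write(fd, smile_msg, strlen(smile_msg));
}
```

Calling say_smile(STDOUT_FILENO) followed by _exit(0) would mirror the Print and Exit blocks of bigsmile.asm.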

Part 3: Optimizing Hello World in Assembly

CODE - smile.asm

SECTION .DATA
  msg:     db "[^0^] u!!",10
  msgLen:  equ $-msg

SECTION .TEXT
GLOBAL _start

_start:
; Print ----------------------------------------------------------------
  mov al, 1       ; RAX holds syscall 1 (write). We are using the lower
                  ; 8 bits of RAX with AL. This takes up fewer bytes.
  mov rdi, rax    ; RDI holds the file descriptor - STDOUT. We copy the
                  ; value in RAX there to save space.
  mov rsi, msg    ; RSI contains the address of our buffer.
  mov dl, msgLen  ; RDX holds the length of the buffer. We are using the
                  ; lower 8 bits, DL, again for this.
  syscall         ; Now we call the kernel.

; Exit -----------------------------------------------------------------
  mov al, 60      ; We are now executing syscall 60 - Exit
  xor rdi, rdi    ; RDI contains the return value, here it will be 0!
  syscall         ; Call the kernel one last time.

Build & Run

nasm -f elf64 smile.asm -o smile.o
ld smile.o -o smile
./smile

Note about Registers

REF: https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture

Registers in x86 can be divided into smaller registers that hold different sized values.

Example

RAX is a 64 bit register. It can be broken down like this.

    RAX 0000000000000000000000000000000000000000000000000000000000000000 64
    EAX                                 00000000000000000000000000000000 32
     AX                                                 0000000000000000 16
     AH                                                 00000000          8
     AL                                                         00000000  8

Here is a table of all the general purpose registers with their respective subdivisions.

   +-------+-------+-------+-------+-------+-------+-------+-------+
64 |   RAX |   RCX |   RDX |   RBX |   RSP |   RBP |   RSI |   RDI |
32 |   EAX |   ECX |   EDX |   EBX |   ESP |   EBP |   ESI |   EDI |
16 |    AX |    CX |    DX |    BX |    SP |    BP |    SI |    DI |
 8 | AH|AL | CH|CL | DH|DL | BH|BL |  |SPL |  |BPL |  |SIL |  |DIL |
   +-------+-------+-------+-------+-------+-------+-------+-------+
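
One way to picture the subdivisions is as masks and shifts on a single 64 bit value. The helpers below are a sketch written for this note (the function names are invented); they return what each sub-register of RAX would expose.

```c
#include <stdint.h>

/* Each helper returns the part of a 64-bit RAX value that the named
 * sub-register exposes. */
uint32_t eax_of(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); }
uint16_t ax_of(uint64_t rax)  { return (uint16_t)(rax & 0xFFFFu); }
uint8_t  ah_of(uint64_t rax)  { return (uint8_t)((rax >> 8) & 0xFFu); }
uint8_t  al_of(uint64_t rax)  { return (uint8_t)(rax & 0xFFu); }
```

For RAX = 0x1122334455667788, EAX is 0x55667788, AX is 0x7788, AH is 0x77 and AL is 0x88.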

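In our newly optimized code, we save space by using the lower 8 bits of RAX and RDX. This works because the values we are moving fit in 8 bits, and because the rest of each register is already zero: writing AL leaves the upper 56 bits of RAX untouched, and in practice Linux starts a new process with these registers cleared. Registers define the total bit width of the number, so using a 64 bit register will make the integer 1 look like: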

0000000000000000000000000000000000000000000000000000000000000001

While using the lower 8 bits will make the integer 1 look like:

00000001

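This reduces the size of the integer on disk by 3 bytes (a 4 byte immediate becomes a 1 byte immediate), and the total instruction size by 5 bytes. This may seem insignificant, but it adds up. Note that nasm may already shorten mov rax, 1 on its own: the Part 2 disassembly shows it assembled as the 5 byte mov eax, 1. The 7 byte form below is the literal 64 bit encoding.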

             48 c7 c0 01 00 00 00    mov    rax, 1   ; 7 Bytes
             b0 01                   mov     al, 1   ; 2 Bytes

Another optimization we are using is copying a register to another, instead of moving a number into a register.

             48 c7 c7 01 00 00 00    mov    rdi, 1   ; 7 Bytes
             48 89 c7                mov    rdi,rax  ; 3 Bytes

The last optimization we did was XORing RDI with itself. This is a common way to create a 0, rather than moving a 0 into the register.

             48 c7 c7 00 00 00 00    mov    rdi,0    ; 7 Bytes
             48 31 ff                xor    rdi,rdi  ; 3 Bytes
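
To tally the three substitutions, the encodings from the listings above can be dropped into byte arrays and compared with sizeof. This is a sketch for bookkeeping only; the array names and bytes_saved are invented here.

```c
#include <stddef.h>

/* Encodings copied from the listings above. */
static const unsigned char mov_rax_1[]   = {0x48, 0xc7, 0xc0, 0x01, 0x00, 0x00, 0x00};
static const unsigned char mov_al_1[]    = {0xb0, 0x01};
static const unsigned char mov_rdi_1[]   = {0x48, 0xc7, 0xc7, 0x01, 0x00, 0x00, 0x00};
static const unsigned char mov_rdi_rax[] = {0x48, 0x89, 0xc7};
static const unsigned char mov_rdi_0[]   = {0x48, 0xc7, 0xc7, 0x00, 0x00, 0x00, 0x00};
static const unsigned char xor_rdi_rdi[] = {0x48, 0x31, 0xff};

/* Total bytes saved by swapping each long form for its short form. */
size_t bytes_saved(void)
{
    return (sizeof mov_rax_1 - sizeof mov_al_1)
         + (sizeof mov_rdi_1 - sizeof mov_rdi_rax)
         + (sizeof mov_rdi_0 - sizeof xor_rdi_rdi);
}
```

With these encodings the three swaps save 5, 4 and 4 bytes, 13 in total.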

We’ll cover even more optimizations later on in the workshop.

Making the binary smaller

You can use strip to reduce the binary’s size. This removes the symbol table and any debug information, neither of which is needed to run the binary.

ls -lah smile
strip smile
ls -lah smile

You’ll see that the binary is now much smaller.

Changes

Use smaller registers
XOR a register with itself to create a 0
Copy data between registers instead of moving a number into it
Use strip to remove symbols from the binary

Part 4: Writing an ELF binary from scratch

In our last two assembly examples, we used nasm to assemble our code, and ld to link it. The binary was very small because it was written and assembled in this way. With the flag -f elf64, nasm produced an ELF object file for us, and ld linked it into an executable that follows the well-defined ELF binary format.

The next part takes this a step further: we no longer rely on nasm and ld to create an ELF binary for us. We will create the binary ourselves using a custom ELF template. The template itself is mainly here for rapid prototyping of shellcode, and allows you to create more reliable payloads using a very well-defined structure.

If you want to read more about the development of this template, you can refer to my previous write ups on ELF Binary Mangling, as well as my golfclub repo. Links are on my website: https://n0.lol

Using the ELF nasm template

There is far more to ELF binaries than we can cover now, so I’ll run down the aspects that matter most for this presentation.

Everything is hand defined according to the ELF spec.
There are locations in the header where we can hide data.
You have 12 bytes to work with, from 0x04 to 0x0F.

We are going to jump right into some dirty tricks to hide data and reference it from our optimized code.

CODE - tiny.asm

BITS 64
        org 0x100000000  ; Where to load this into memory
;----------------------+------+-------------+----------+------------------------
; ELF Header struct    | OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
        db 0x7F, "ELF" ; 0x00 | e_ident     |          | 7f 45 4c 46
msg:    dd 0x5e305e5b  ; 0x04 |             |          | 5b 5e 30 5e "^0^["
        dd 0x2175205d  ; 0x08 |             |          | 5d 20 75 21 "!u ]"
        dw 0x0a21      ; 0x0C |             |          | 21 0a       "\n!"
        dw 0x0         ; 0x0E |             |          | 00 00
;----------------------+------+-------------+----------+------------------------
; ELF Header struct ct.| OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
        dw 2           ; 0x10 | e_type      |          | 02 00
        dw 0x3e        ; 0x12 | e_machine   |          | 3e 00
        dd 1           ; 0x14 | e_version   |          | 01 00 00 00
        dd _start - $$ ; 0x18 | e_entry     |          | 54 00 00 00
;----------------------+------+-------------+----------+------------------------
; Program Header Begin | OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
phdr:   dd 1           ; 0x1C |   ...       | p_type   | 01 00 00 00
        dd phdr - $$   ; 0x20 | e_phoff     | p_flags  | 1c 00 00 00
        dd 0           ; 0x24 |   ...       | p_offset | 00 00 00 00
        dd 0           ; 0x28 | e_shoff     |   ...    | 00 00 00 00
        dq $$          ; 0x2C |   ...       | p_vaddr  | 00 00 00 00
                       ; 0x30 | e_flags     |   ...    | 01 00 00 00
        dw 0x40        ; 0x34 | e_ehsize    | p_paddr  | 40 00
        dw 0x38        ; 0x36 | e_phentsize |   ...    | 38 00
        dw 1           ; 0x38 | e_phnum     |   ...    | 01 00
        dw 2           ; 0x3A | e_shentsize |   ...    | 02 00
        dq 2           ; 0x3C | e_shnum     | p_filesz | 02 00 00 00 00 00 00 00
        dq 2           ; 0x44 |             | p_memsz  | 02 00 00 00 00 00 00 00
        dq 2           ; 0x4C |             | p_align  | 02 00 00 00 00 00 00 00
;--- END OF HEADER -------------------------------------------------------------
_start:
;--- Write -----------------------------------------------------------------//--
  mov al, 1
  mov rdi, rax
  mov rsi, msg
  mov dl, 10
  syscall
;--- Exit ------------------------------------------------------------------//--
  mov al, 60
  xor rdi, rdi
  syscall

Build

  nasm -f bin -o tiny tiny.asm

We are now building using the raw bin format. This instructs nasm to not apply any binary templates to the assembled code. This is similar to how you compile 16 bit COM files for DOS.

Hiding Data

Since we’ve jumped right into this hot mess of a binary, we’re going to use some of the features of the template as described above.

Since we know that our string “[^0^] u!!\n” is 10 bytes long, and we have 12 bytes to hide data from 0x04 to 0x0F, let’s pack our string into this space.
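
As a sanity check on that layout, here is a sketch in C that takes the first 0x14 bytes from the ASSEMBLY OUTPUT column of tiny.asm and confirms that the message sits at offset 0x04 without disturbing the magic bytes or the fields that follow. The names tiny_head, field16 and message_is_hidden are invented for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* First 0x14 bytes of tiny, copied from the ASSEMBLY OUTPUT column. */
static const unsigned char tiny_head[0x14] = {
    0x7f, 0x45, 0x4c, 0x46,   /* 0x00 magic: 0x7F "ELF" */
    0x5b, 0x5e, 0x30, 0x5e,   /* 0x04 "[^0^"            */
    0x5d, 0x20, 0x75, 0x21,   /* 0x08 "] u!"            */
    0x21, 0x0a,               /* 0x0C "!\n"             */
    0x00, 0x00,               /* 0x0E two spare bytes   */
    0x02, 0x00,               /* 0x10 e_type            */
    0x3e, 0x00                /* 0x12 e_machine         */
};

/* Read a little-endian 16-bit field at byte offset off. */
uint16_t field16(const unsigned char *hdr, size_t off)
{
    return (uint16_t)(hdr[off] | hdr[off + 1] << 8);
}

/* Returns 1 if the 10-byte message sits at offset 0x04 and the magic
 * bytes in front of it are intact. */
int message_is_hidden(const unsigned char *hdr)
{
    return memcmp(hdr, "\x7f" "ELF", 4) == 0
        && memcmp(hdr + 0x04, "[^0^] u!!\n", 10) == 0;
}
```

The message uses offsets 0x04 through 0x0D, which leaves 0x0E and 0x0F free before e_type begins at 0x10.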

Nasm can store chunks of data like so

      dq 0x0123456789ABCDEF    ; 8 bytes - Quad Word
      dd 0x01234567            ; 4 bytes - Double Word
      dw 0x0123                ; 2 bytes - Word
      db 0x01                  ; 1 byte  - Byte

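To store our string this way, we need to divide it into chunks and write each chunk as a little endian number. nasm emits the least significant byte first, so the characters appear in reverse order inside each constant.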

Our chunks will look like this


    dd 0x5e305e5b ; "^0^["
    dd 0x2175205d ; "!u ]"
    dw 0x0a21     ; "\n!"

There are other ways to store strings and data, but this will do for our purposes here.
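
If you would rather not reverse the characters by hand, the constants can be derived mechanically. The sketch below uses two hypothetical helpers, pack_le32 and pack_le16, which place the first character in the least significant byte, matching the order in which nasm writes a dd or dw to the file.

```c
#include <stdint.h>

/* Pack four characters into a dword so that the first character ends up
 * in the least significant byte, which is the byte nasm emits first. */
uint32_t pack_le32(const char s[4])
{
    return (uint32_t)(unsigned char)s[0]
         | (uint32_t)(unsigned char)s[1] << 8
         | (uint32_t)(unsigned char)s[2] << 16
         | (uint32_t)(unsigned char)s[3] << 24;
}

/* Same idea for a two-character word. */
uint16_t pack_le16(const char s[2])
{
    return (uint16_t)((unsigned char)s[0] | (unsigned char)s[1] << 8);
}
```

Here pack_le32("[^0^") gives 0x5e305e5b, pack_le32("] u!") gives 0x2175205d, and pack_le16("!\n") gives 0x0a21, the three constants used in tiny.asm.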

We also can use nasm to put a label on this data, so that we can reference the address in our code. Our label in this case is “msg:”, which appears towards the top of the code.

Since we already know that our data is 10 bytes, we can just put a 10 into dl, rather than relying on nasm to calculate that for us. It’s important to keep track of data sizes and lengths when you’re writing things like this!

We’ve covered a lot of weird stuff quickly, so let’s move on to the next section and do our final bits of optimization in this course.

Changes

Use a custom binary template.
Put the string in the ELF header.

Part 5: Further Optimizations

Now that we have a wacky ELF binary to mess around in, and have established methods of referencing data, here is where the more interesting things begin.

Let’s take a look at the code and build it, then discuss what is going on.

Code - smol.asm

BITS 64
        org 0x100000000  ; Where to load this into memory
;----------------------+------+-------------+----------+------------------------
; ELF Header struct    | OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
        db 0x7F, "ELF" ; 0x00 | e_ident     |          | 7f 45 4c 46
msg:    dd 0x5e305e5b  ; 0x04 |             |          | 5b 5e 30 5e "^0^["
        dd 0x2175205d  ; 0x08 |             |          | 5d 20 75 21 "!u ]"
        dw 0x0a21      ; 0x0C |             |          | 21 0a       "\n!"
        dw 0x0         ; 0x0E |             |          | 00 00
;----------------------+------+-------------+----------+------------------------
; ELF Header struct ct.| OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
        dw 2           ; 0x10 | e_type      |          | 02 00
        dw 0x3e        ; 0x12 | e_machine   |          | 3e 00
        dd 1           ; 0x14 | e_version   |          | 01 00 00 00
        dd _start - $$ ; 0x18 | e_entry     |          | 54 00 00 00
;----------------------+------+-------------+----------+------------------------
; Program Header Begin | OFFS | ELFHDR      | PHDR     | ASSEMBLY OUTPUT
;----------------------+------+-------------+----------+------------------------
phdr:   dd 1           ; 0x1C |   ...       | p_type   | 01 00 00 00
        dd phdr - $$   ; 0x20 | e_phoff     | p_flags  | 1c 00 00 00
        dd 0           ; 0x24 |   ...       | p_offset | 00 00 00 00
        dd 0           ; 0x28 | e_shoff     |   ...    | 00 00 00 00
        dq $$          ; 0x2C |   ...       | p_vaddr  | 00 00 00 00
                       ; 0x30 | e_flags     |   ...    | 01 00 00 00
        dw 0x40        ; 0x34 | e_ehsize    | p_paddr  | 40 00
        dw 0x38        ; 0x36 | e_phentsize |   ...    | 38 00
        dw 1           ; 0x38 | e_phnum     |   ...    | 01 00
        dw 2           ; 0x3A | e_shentsize |   ...    | 02 00
        dq 2           ; 0x3C | e_shnum     | p_filesz | 02 00 00 00 00 00 00 00
        dq 2           ; 0x44 |             | p_memsz  | 02 00 00 00 00 00 00 00
        dq 2           ; 0x4C |             | p_align  | 02 00 00 00 00 00 00 00
;--- END OF HEADER -------------------------------------------------------------
_start:
;--- Write -----------------------------------------------------------------//--
  mov al, 1
  ; mov rdi, rax
  mov edi, eax
  ; mov rsi, msg
  mov esi, eax
  shl rsi, 0x20
  mov sil, 4
  mov dl, 10
  syscall
;--- Exit ------------------------------------------------------------------//--
  mov al, 60
  xor rdi, rdi
  syscall

Build

  nasm -f bin -o smol smol.asm

Creating Addresses

Because we’re working on such a small scale, everything feels more immediate. We know what is at every byte in our binary, and we can therefore rely on a consistent location to refer to.

Our previous binary referenced our string at the memory address 0x100000004. The instruction to do this was:

  48be0400000001000000  movabs rsi, 0x100000004

This is a very long instruction, a total of 10 bytes, because it carries the full 64 bit address as an immediate. We can save space by building the address manually. Here’s the process.

     89 c6        mov    esi,eax
     48 c1 e6 20  shl    rsi,0x20
     40 b6 04     mov    sil,0x4

We copy the value in EAX, which is 1, into ESI.

     RSI 0000000000000000000000000000000000000000000000000000000000000001

Then we shift RSI to the left by 32 bits (0x20).

     RSI 0000000000000000000000000000000100000000000000000000000000000000

This creates the value 0x100000000, which is the base address of where we loaded the binary into memory.

Next, we use the lower 8 bits of RSI to load the last value we need, 4. We now have the address of our string, 0x100000004.

     RSI 0000000000000000000000000000000100000000000000000000000000000100
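
The three steps can be replayed on a plain 64 bit integer. This is a sketch only: build_msg_addr is an invented name, and the masking on the last line stands in for the way writing SIL touches only the low byte of RSI.

```c
#include <stdint.h>

/* Replay the three instructions on a 64-bit value standing in for RSI. */
uint64_t build_msg_addr(void)
{
    uint64_t rsi = 1;                   /* mov esi, eax  (EAX holds 1)     */
    rsi <<= 0x20;                       /* shl rsi, 0x20 -> 0x100000000    */
    rsi = (rsi & ~(uint64_t)0xFF) | 4;  /* mov sil, 4    (low byte only)   */
    return rsi;
}
```

The result is 0x100000004, the same address the movabs instruction loaded directly.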

Building the address this way takes 9 bytes instead of 10, saving us an Earth shattering 1 byte, but it also introduces a very important concept in developing exploits.

Shellcode is injected by a variety of means. When creating a payload for a buffer or heap overflow, certain bytes may be dropped or mangled by the application being exploited, or by things like servers that don’t handle bytes such as 00 (NULL), 0A (\n) or 0D (\r) the way you might hope.

Certain tools, such as msfvenom, are capable of creating payloads that avoid specific bytes (“bad chars”) to aid in exploitation. If we were injecting this payload, there are other ways of referencing strings, such as loading the desired bytes directly into registers and then pushing them onto the stack, but I wanted to demonstrate methods of referencing data that can be applied to other programs in assembly.
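
To make the bad chars idea concrete, here is a small sketch that scans an instruction encoding for the three bytes named above. The function first_bad_byte and the array names are made up for this example, and the fixed list of 00, 0A and 0D is an assumption; a real target may reject a different set.

```c
#include <stddef.h>

/* Return the index of the first byte in code[0..len) that is a NUL, LF,
 * or CR, or -1 if none is present. */
long first_bad_byte(const unsigned char *code, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (code[i] == 0x00 || code[i] == 0x0a || code[i] == 0x0d)
            return (long)i;
    }
    return -1;
}

/* Encodings copied from the listings above. */
static const unsigned char movabs_form[] =
    {0x48, 0xbe, 0x04, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00};
static const unsigned char built_form[] =
    {0x89, 0xc6, 0x48, 0xc1, 0xe6, 0x20, 0x40, 0xb6, 0x04};
```

The movabs encoding hits a 00 at index 3, while the mov/shl/mov sequence contains none of the three bytes, so the built address is not just shorter but also free of NULLs.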

Other optimizations

The last optimization we are doing in this lesson is copying EAX to EDI, rather than copying RAX to RDI. When you write to a 32 bit register, the upper 32 bits of the full 64 bit register are zeroed out. This is not the case for the 8 and 16 bit registers: moving data to AL preserves the upper 56 bits of RAX. This is what enabled us to move 4 into SIL in the last section.

  48 89 c7 mov    rdi,rax

becomes

  89 c7    mov    edi,eax

Another two byte instruction that achieves a similar effect is inc edi, which increments EDI from 0 to 1, provided EDI starts out at 0.
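
The difference between 8 bit and 32 bit writes can be modelled on a plain integer. The two helpers below are a sketch with invented names; they mimic what happens to the full 64 bit register in each case.

```c
#include <stdint.h>

/* Writing an 8-bit register such as AL or SIL: only bits 0-7 change and
 * the upper 56 bits are preserved. */
uint64_t write_low8(uint64_t reg, uint8_t value)
{
    return (reg & ~(uint64_t)0xFF) | value;
}

/* Writing a 32-bit register such as EAX or EDI: the upper 32 bits of the
 * full register are zeroed. */
uint64_t write_low32(uint64_t reg, uint32_t value)
{
    (void)reg;  /* the old contents do not survive */
    return value;
}
```

Starting from 0x1122334455667788, write_low8 with 0x01 yields 0x1122334455667701, while write_low32 with 1 yields just 1.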

Changes

Switch RDI with EDI
Instead of loading the address of the string into RSI directly, construct the address by shifting 1 left by 0x20 (32) bits and then moving 4 into SIL
