Hello and welcome to the Intro to Assembly Optimization workshop!
The target audience is anyone that has a working understanding of programming and some familiarity with low level concepts. This workshop will explore the optimization of a “Hello World” program, and take it from C all the way down to a highly optimized, hand written, 64 bit ELF binary.
This documentation contains notes that you can use to follow along during the stream.
Contents
- Hello World in C
curl -sL n0.lol/i2ao/1
- Hello World in Assembly
curl -sL n0.lol/i2ao/2
- Optimizing Hello World in Assembly
curl -sL n0.lol/i2ao/3
- Writing an Elf binary from Scratch
curl -sL n0.lol/i2ao/4
- Further Optimization
curl -sL n0.lol/i2ao/5
Tools
Required
- A Linux System
- GCC
- GNU binutils
- xxd or other hex editor
- A text editor
Optional
- Radare2 - For debugging
- Linux Man Pages
Online Resources
- Online Disassembler https://defuse.ca/online-x86-assembler.htm
- Syscall Table https://syscalls64.paolostivanin.com/
Part 1: Hello World In C
|
|
We begin with our Hello World Program. It is going to print “[^0^] u!!” with a new line at the end. Then it returns 0 to the operating system and exits.
Compile
gcc hello.c -o hello
Run
./hello
Analyze
readelf -a hello
objdump -d hello -M intel
Part 2: Hello World In Assembly
CODE - bigsmile.asm
|
|
Build and Execute
nasm -f elf64 bigsmile.asm -o bigsmile.o
ld bigsmile.o -o bigsmile
./bigsmile
Read attributes about the elf binary
readelf -a bigsmile
Disassembly
$ r2 bigsmile
...
[0x00400082]> pd
;-- entry0:
;-- section..TEXT:
;-- rip:
0x00400082 b801000000 mov eax, 1
0x00400087 bf01000000 mov edi, 1
0x0040008c 48be78004000. movabs rsi, 0x400078 ; section..DATA ;"[^0^] u!!\n"
0x00400096 ba0a000000 mov edx, 0xa
0x0040009b 0f05 syscall
0x0040009d b83c000000 mov eax, 0x3c
0x004000a2 bf00000000 mov edi, 0
0x004000a7 0f05 syscall
Changes Made
- Using syscalls directly
- Using nasm to compile and ld to link
Part 3: Optimizing Hello World in Assembly
CODE - smile.asm
|
|
Build & Run
nasm -f elf64 smile.asm -o smile.o
ld smile.o -o smile
./smile
Note about Registers
REF: https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture
Registers in x86 can be divided into smaller registers that hold different sized values.
Example
RAX is a 64 bit register. It can be broken down like this.
RAX 0000000000000000000000000000000000000000000000000000000000000000 64
EAX 00000000000000000000000000000000 32
AX 0000000000000000 16
AH 00000000 8
AL 00000000 8
Here is a table of all the general purpose registers with their respective subdivisions.
+-------+-------+-------+-------+-------+-------+-------+-------+
64 | RAX | RCX | RDX | RBX | RSP | RBP | RSI | RDI |
32 | EAX | ECX | EDX | EBX | ESP | EBP | ESI | EDI |
16 | AX | CX | DX | BX | SP | BP | SI | DI |
8 | AH|AL | CH|CL | DH|DL | BH|BL | |SPL | |BPL | |SIL | |DIL |
+-------+-------+-------+-------+-------+-------+-------+-------+
In our newly optimized code, we save space by using the lower 8 bits of RAX and RDX. This is because we are only moving 8 byte values. Registers define the total bit width of the number, so using a 64 bit register will make the integer 1 look like:
0000000000000000000000000000000000000000000000000000000000000001
While using the lower 8 bits will make the integer 1 look like:
00000001
This reduces the size of the integer on disk by 3 bytes, and the total instruction size by 5 bytes. This may seem insignificant, but it adds up.
48 c7 c0 01 00 00 00 mov rax, 1 ; 7 Bytes
b0 01 mov al, 1 ; 2 Bytes
Another optimization we are using is copying a register to another, instead of moving a number into a register.
48 c7 c7 01 00 00 00 mov rdi, 1 ; 7 Bytes
48 89 c7 mov rdi,rax ; 3 Bytes
The last optimization we did was XORing RDI with itself. This is a common way to create a 0, rather than moving a 0 into the register.
48 c7 c7 00 00 00 00 mov rdi,0 ; 7 Bytes
48 31 ff xor rdi,rdi ; 3 Bytes
We’ll cover even more optimizations later on in the workshop.
Making the binary smaller
You can use strip to reduce the binary’s size. This removes debug symbols that are unnecessary for running the binary on a system.
ls -lah smile
strip smile
ls -lah smile
You’ll see that the binary is now much smaller.
Changes
- Use smaller registers
- XOR a register with itself to create a 0
- Copy data between registers instead of moving a number into it
- Use Strip to strip the data
Part 4: Writing an ELF binary from scratch
In our last two assembly examples, we used nasm to assemble our code, and ld to link it. The binary was very small, because it was written and assembled in this way. Nasm created an ELF binary for us by using the flag -f elf64, which uses the well defined ELF binary format and loads our code into it.
The next part takes this a step further, where we don’t rely on nasm to create an ELF binary for us. We will create the binary ourselves using a custom ELF template. The template itself is mainly here for rapid prototyping of shell code, and allows you to create more reliable payloads using a very well defined structure.
If you want to read more about the development of this template, you can refer to my previous write ups on ELF Binary Mangling, as well as my golfclub repo. Links are on my website: https://n0.lol
Using the ELF nasm template
There is so much to cover involving ELF binaries that we won’t cover now. I’ll run down the most important aspects of it for the purposes of this presentation
- Everything is hand defined according to the ELF spec
- There are locations in the header that we can hide data.
- You have 12 bytes to work with from 0x04 to 0x0F
We are going to jump right into some dirty tricks to hide data, and reference with our optimized code.
CODE - tiny.asm
|
|
Build
nasm -f bin -o tiny tiny.asm
We are now building using the raw bin format. This instructs nasm to not apply any binary templates to the assembled code. This is similar to how you compile 16 bit COM files for DOS.
Hiding Data
Since we’ve jumped right into this hot mess of a binary, we’re going to use some of the features of the template as described above.
Since we know that our string “[^0^] u!!\n” is 10 bytes long, and we have 12 bytes to hide data from 0x04 to 0x0F, let’s pack our string into this space.
Nasm can store chunks of data like so
dq 0x0123456789ABCDEF 8 bytes - Quad Word
dd 0x012345678 4 bytes - Double Word
dw 0x0123 2 bytes - Word
db 0x01 1 byte - Byte
To store our string, we need to divide it up and store it in little endian format for nasm to assemble it correctly.
Our chunks will look like this
dd 0x5e305e5b ; "^0^["
dd 0x2175205d ; "!u ]"
dw 0x0a21 ; "\n!"
There are other ways to store strings and data, but this will do for our purposes here.
We also can use nasm to put a label on this data, so that we can reference the address in our code. Our label in this case is “msg:”, which appears towards the top of the code.
Since we already know that our data is 10 bytes, we can just put a 10 into dl, rather than relying on nasm to calculate that for us. It’s important to keep track of data sizes and lengths when you’re writing things like this!
We’ve covered a lot of weird stuff quickly, so let’s move on to the next section and do our final bits of optimization in this course.
Changes
- Use custom binary template.
- Put string in the elf header.
Part 5: Further Optimizations
Now that we have a whacky ELF binary to mess around in, and have established methods of referencing data, here is where the more interesting things begin.
Let’s take a look at the code and build it, then discuss what is going on.
Code - smol.asm
|
|
Build
nasm -f bin -o smol smol.asm
Creating Addresses
Because we’re working on such a small scale, everything feels more immediate. We know what is at every byte in our binary, and we can therefore rely on a consistent location to refer to.
Our previous binary referenced the memory address for our string at the address 0x100000004. The instruction to do this was:
48be0400000001000000 movabs rsi, 0x100000004
This is a very long instruction, a total of 10 bytes due to needing a 64 bit register to hold the address. We can create this manually and save space by building the address. Here’s the process.
89 c6 mov esi,eax
48 c1 e6 20 shl rsi,0x20
40 b6 04 mov sil,0x4
We copy the value in EAX, which is 1, into ESI.
RSI 0000000000000000000000000000000000000000000000000000000000000001
Then we shift RSI to the left by 32 bits (0x20).
RSI 0000000000000000000000000000000100000000000000000000000000000000
This creates the value 0x100000000, which is the base address of where we loaded the binary into memory.
Next, we use the lower 8 bits of RSI to load the last value we need, 4. We now have the address of our string, 0x100000004
RSI 0000000000000000000000000000000100000000000000000000000000000100
This saves us an Earth shattering 1 byte, but it also introduces a very important concept in developing exploits.
Shellcode is injected by a variety of means, and when creating a proper payload for a buffer or heap overflow, certain bytes may be ignored by the application that is being exploited, or by other things like servers that don’t handle bytes like 00 (NULL), 0A (\n) or 0D (\r) in a way that you may be hoping for.
Certain tools, such as msfvenom, are capable of creating payloads that avoid specific bytes (“bad chars”), to aid in exploitation. If we were injecting this payload, there are other ways of referencing strings, such as by direct loading of the desired bytes into registers and then pushing onto the stack, but I wanted to demonstrate methods of referencing data that can be applied to other programs in assembly.
Other optimizations
The last optimization we are doing in this lesson is copying EAX to EDI, rather than copying RAX to RDI. When you move data between two 32 bit registers, the upper 32 bits will be zero’d out. This is not the case for the lower registers. Moving data to AL preserves the rest of the data in the upper 56 bits. This is what enabled us to move 4 into SIL in the last section.
48 89 c7 mov rdi,rax
becomes
89 c7 mov edi,eax
Another two byte instruction to achieve a similar effect is inc edi, which increments EDI by 1 from 0.
[ Changes ]
- Switch RDI with EDI
- Instead of loading the address of the string into RSI directly, construct the address by shifting 0x left 20 bits and then moving 4 into SI
X86