Encoding Mutations: A Base64 Case Study



This writeup is a quick rundown of how Base64 works, and how ambiguity in the decoding process can be used to an attacker’s or defender’s advantage.

The ideas here apply to most of the popular encoding schemes (Base32, Ascii85 and others), but the focus is on Base64.

Base64 Overview

Here is a base64 chart based on the most popular spec, the standard alphabet defined in RFC 4648.

┌────┬────────┬───┐ ┌────┬────────┬───┐ ┌────┬────────┬───┐ ┌────┬────────┬───┐
│ ## │ Binary │Chr│ │ ## │ Binary │Chr│ │ ## │ Binary │Chr│ │ ## │ Binary │Chr│
├────┼────────┼───┤ ├────┼────────┼───┤ ├────┼────────┼───┤ ├────┼────────┼───┤
│ 00 │ 000000 │ A │ │ 16 │ 010000 │ Q │ │ 32 │ 100000 │ g │ │ 48 │ 110000 │ w │
│ 01 │ 000001 │ B │ │ 17 │ 010001 │ R │ │ 33 │ 100001 │ h │ │ 49 │ 110001 │ x │
│ 02 │ 000010 │ C │ │ 18 │ 010010 │ S │ │ 34 │ 100010 │ i │ │ 50 │ 110010 │ y │
│ 03 │ 000011 │ D │ │ 19 │ 010011 │ T │ │ 35 │ 100011 │ j │ │ 51 │ 110011 │ z │
│ 04 │ 000100 │ E │ │ 20 │ 010100 │ U │ │ 36 │ 100100 │ k │ │ 52 │ 110100 │ 0 │
│ 05 │ 000101 │ F │ │ 21 │ 010101 │ V │ │ 37 │ 100101 │ l │ │ 53 │ 110101 │ 1 │
│ 06 │ 000110 │ G │ │ 22 │ 010110 │ W │ │ 38 │ 100110 │ m │ │ 54 │ 110110 │ 2 │
│ 07 │ 000111 │ H │ │ 23 │ 010111 │ X │ │ 39 │ 100111 │ n │ │ 55 │ 110111 │ 3 │
│ 08 │ 001000 │ I │ │ 24 │ 011000 │ Y │ │ 40 │ 101000 │ o │ │ 56 │ 111000 │ 4 │
│ 09 │ 001001 │ J │ │ 25 │ 011001 │ Z │ │ 41 │ 101001 │ p │ │ 57 │ 111001 │ 5 │
│ 10 │ 001010 │ K │ │ 26 │ 011010 │ a │ │ 42 │ 101010 │ q │ │ 58 │ 111010 │ 6 │
│ 11 │ 001011 │ L │ │ 27 │ 011011 │ b │ │ 43 │ 101011 │ r │ │ 59 │ 111011 │ 7 │
│ 12 │ 001100 │ M │ │ 28 │ 011100 │ c │ │ 44 │ 101100 │ s │ │ 60 │ 111100 │ 8 │
│ 13 │ 001101 │ N │ │ 29 │ 011101 │ d │ │ 45 │ 101101 │ t │ │ 61 │ 111101 │ 9 │
│ 14 │ 001110 │ O │ │ 30 │ 011110 │ e │ │ 46 │ 101110 │ u │ │ 62 │ 111110 │ + │
│ 15 │ 001111 │ P │ │ 31 │ 011111 │ f │ │ 47 │ 101111 │ v │ │ 63 │ 111111 │ / │
└────┴────────┴───┘ └────┴────────┴───┘ └────┴────────┴───┘ └────┴────────┴───┘

There are only a few rules you need to know to understand base64:

  1. When data is encoded, it is split up into 6 bit chunks.
  2. Each chunk is mapped to one of 64 ASCII characters (Chr), shown above.
  3. Four 6 bit chunks make up a block, representing 3 bytes of data (24 bits).
  4. If there aren’t enough bits in the data being encoded to fit into a 6 bit chunk, the remaining bits are set to 0.
  5. If there aren’t enough 6 bit chunks to make up a 24 bit block, a special padding character is used as a placeholder for remaining chunks.

The padding character is ‘=’. You will often see this at the end of a base64 encoded string, signifying that the data isn’t aligned to (AKA a multiple of) 24 bits. Padding characters have no value, and can be thought of like a NULL.

To illustrate the encoding process, let’s look at how the string ’netspooky’ would be encoded. We can use the Linux base64 program to generate our encoded string. Note that the -n flag is used to prevent echo from printing a newline, which would add a byte to the end of the string!

$ echo -n "netspooky" | base64
bmV0c3Bvb2t5

The base64 program followed the above rules to generate this string. We can understand how these rules were followed if we consider this from a binary perspective.

First, let’s divide the string into three 3-byte chunks, one per block. The string is 9 bytes long, which is divisible by 3, so it needs no padding.

net  spo  oky
bmV0 c3Bv b2t5

Here is the first chunk ’net’, encoded as ‘bmV0’

        ┌─────────┬─────────┬─────────┐
data:   │n        │e        │t        │
binary: │011011│10│0110│0101│01│110100│
base64: │b     │m      │V      │0     │
        └──────┴───────┴───────┴──────┘

The second chunk ‘spo’, encoded as ‘c3Bv’

        ┌─────────┬─────────┬─────────┐
data:   │s        │p        │o        │
binary: │011100│11│0111│0000│01│101111│
base64: │c     │3      │B      │v     │
        └──────┴───────┴───────┴──────┘

The third chunk ‘oky’, encoded as ‘b2t5’

        ┌─────────┬─────────┬─────────┐
data:   │o        │k        │y        │
binary: │011011│11│0110│1011│01│111001│
base64: │b     │2      │t      │5     │
        └──────┴───────┴───────┴──────┘

Since our string was divisible by 3, it fit perfectly, and requires no padding.
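
The walkthrough above can be reproduced in a few lines of Python. This is only a minimal sketch: the names ALPHABET, six_bit_chunks and encode_whole_blocks are made up for illustration, and it deliberately handles only input whose length is a multiple of 3, so no padding is involved yet.

```python
import string

# The standard alphabet from the chart above: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def six_bit_chunks(data: bytes) -> list[str]:
    """Return the input as a list of 6-bit binary strings (hypothetical helper)."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch only handles whole 3-byte blocks")
    bits = "".join(f"{byte:08b}" for byte in data)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]


def encode_whole_blocks(data: bytes) -> str:
    """Map each 6-bit chunk to its character in the chart."""
    return "".join(ALPHABET[int(chunk, 2)] for chunk in six_bit_chunks(data))


if __name__ == "__main__":
    # Print the 6-bit chunks and resulting characters for each 3-byte block.
    for block in (b"net", b"spo", b"oky"):
        print(block.decode(), six_bit_chunks(block), encode_whole_blocks(block))
```

Each printed row should line up with the binary and base64 rows of the three diagrams.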

Now, what if we were encoding the word ’nets’?

$ echo -n "nets" | base64
bmV0cw==

We already know how ’net’ is encoded (‘bmV0’), but what happens when our string isn’t divisible by 3? We use the same process, except now we add padding.

The first char representing the first 6 bit chunk is ‘c’, the same as in the encoding of ‘spo’. This makes sense because the first byte ’s’ hasn’t changed.

That leaves the last two bits of ’s’ (11) without a complete 6 bit chunk. If we remember rule 4 from above, base64 takes any bits not defined within a partially filled 6 bit chunk and turns them into 0s. If every bit in a chunk is undefined, rule 5 applies instead: the null value padding character = is used to show that the chunk isn’t needed.

Example:

        ┌─────────┬──────────────────┐
data:   │s        │                  │
binary: │011100│11│----│------│------│
base64: │c     │w      │=     │=     │
        └──────┴───────┴──────┴──────┘
                 \
                  w maps to 110000, so the 4 undefined bits convert to 0's
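
Rules 4 and 5 can be sketched in Python as well. The function name encode_with_padding is hypothetical, and this is an illustration of the rules rather than how any real library is written.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_with_padding(data: bytes) -> str:
    """Encode bytes by hand, applying rules 4 and 5 (illustrative sketch)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Rule 4: zero-fill the last, partially defined 6-bit chunk.
    bits += "0" * (-len(bits) % 6)
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Rule 5: add '=' placeholders until the output is a whole 4-character block.
    chars += "=" * (-len(chars) % 4)
    return "".join(chars)


if __name__ == "__main__":
    print(encode_with_padding(b"s"))     # the lone 's' from the diagram
    print(encode_with_padding(b"nets"))  # the full example string
```

For the lone ’s’, the 8 data bits are extended with four 0s to make two chunks (‘c’ and ‘w’), and two ‘=’ characters fill out the block.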

Implementation Differences

Nearly every base64 implementation will produce the same encoded output for the same input. Not every implementation will decode the same way, which is where things get interesting. Some less strict implementations of base64 will decode base64 strings with missing padding just fine. For example, JavaScript:

console.log(atob("bmV0"));
net

console.log(atob("cw"));
s

If we use Python in the same way:

$ python3 -c "import base64;print(base64.b64decode('bmV0'.encode('ascii')))"
b'net'

$ python3 -c "import base64;print(base64.b64decode('cw'.encode('ascii')))"
binascii.Error: Incorrect padding

Trying to decode ‘cw’, you get an “Incorrect padding” error.

The Linux base64 binary decodes the ’s’ and writes it out, but reports “invalid input” immediately after:

$ base64 -d <<< "cw"
sbase64: invalid input

There are far too many base64 implementations to discuss here, including some bespoke instances within very important systems.

Obfuscation

Now that we know there are differences in how base64 data is interpreted by different tools and libraries, we can explore some ways to use this to our advantage.

Some implementations effectively treat = as a delimiter between pieces of base64 data. In these cases, two chunks of base64 encoded data, one after another, may or may not be written to a single buffer for processing. This means you can split the original data at arbitrary points, encode each piece on its own, and concatenate the results into a string that such decoders may still accept as valid base64.

Examples

'netspooky'                         │ bmV0c3Bvb2t5
'nets','poo','ky'                   │ bmV0cw==cG9va3k=
'ne','ts','po','ok','y'             │ bmU=dHM=cG8=b2s=eQ==
'n','e','t','s','p','o','o','k','y' │ bg==ZQ==dA==cw==cA==bw==bw==aw==eQ==

Try it out!

base64 -d <<< "bmV0c3Bvb2t5"
base64 -d <<< "bmV0cw==cG9va3k="
base64 -d <<< "bmU=dHM=cG8=b2s=eQ=="
base64 -d <<< "bg==ZQ==dA==cw==cA==bw==bw==aw==eQ=="
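
A decoder that treats padding as a delimiter could look roughly like the sketch below. It is a guess at how such an implementation might behave, not the code of any particular tool, and decode_chunked and PIECE are made-up names.

```python
import base64
import re

# One run of alphabet characters followed by up to two '=' signs.
PIECE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def decode_chunked(text: str) -> bytes:
    """Decode each padding-terminated piece separately and join the results.

    Assumes every piece is itself correctly padded base64.
    """
    return b"".join(base64.b64decode(piece) for piece in PIECE.findall(text))


if __name__ == "__main__":
    for sample in (
        "bmV0c3Bvb2t5",
        "bmV0cw==cG9va3k=",
        "bmU=dHM=cG8=b2s=eQ==",
        "bg==ZQ==dA==cw==cA==bw==bw==aw==eQ==",
    ):
        print(sample, decode_chunked(sample))
```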

Another neat trick that can throw off certain tools is inserting whitespace into base64 text. Check out this pyramid of base64 data and what it decodes to. Some tools and libraries will decode it just fine, but others (such as Linux base64) will consider it invalid input. Software that can decode it may be using a regex to strip out whitespace, or to isolate valid base64 characters only.

Keep in mind that these simple techniques can also be used to ensure that your encoded data isn’t processed by other tools, or is readable only by your own.
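
Python’s base64.b64decode is a handy way to see both behaviors side by side: by default it discards characters outside the alphabet before decoding, while validate=True makes it reject them. The sample string and the try_decode helper below are made up for this sketch.

```python
import base64
import binascii

SPACED = "bmV0 c3Bv\nb2t5"  # 'netspooky' with a space and a newline mixed in


def try_decode(text: str, strict: bool):
    """Return the decoded bytes, or None if the decoder rejects the input."""
    try:
        return base64.b64decode(text, validate=strict)
    except binascii.Error:
        return None


if __name__ == "__main__":
    print(try_decode(SPACED, strict=False))  # lenient: whitespace is discarded
    print(try_decode(SPACED, strict=True))   # strict: input is rejected
```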

PROTIP: Combine this with a custom base64 charset to frustrate most analysts.
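
One way to sketch a custom charset is to shuffle the standard alphabet and translate between the two. Everything here is an assumption for illustration: the seed acts as a value shared between your encoder and decoder, and make_alphabet, encode_custom and decode_custom are hypothetical names.

```python
import base64
import random
import string

STANDARD = (string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/").encode()


def make_alphabet(seed: int) -> bytes:
    """Shuffle the standard alphabet deterministically from a seed."""
    chars = list(STANDARD)
    random.Random(seed).shuffle(chars)
    return bytes(chars)


def encode_custom(data: bytes, alphabet: bytes) -> bytes:
    """Standard encode, then swap each character for its custom counterpart."""
    return base64.b64encode(data).translate(bytes.maketrans(STANDARD, alphabet))


def decode_custom(text: bytes, alphabet: bytes) -> bytes:
    """Map custom characters back to the standard alphabet, then decode."""
    return base64.b64decode(text.translate(bytes.maketrans(alphabet, STANDARD)))


if __name__ == "__main__":
    alphabet = make_alphabet(1337)
    encoded = encode_custom(b"netspooky", alphabet)
    print(encoded, decode_custom(encoded, alphabet))
```

The ‘=’ padding is left alone because it is not part of the 64-character table being translated.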

When approaching a web application that may accept Base64 encoded data, you may run into filters that look for certain Base64 patterns without actually decoding the data itself. Using obfuscation, you may be able to sneak some data past these pesky filters and have it decoded at the heart of the app. The same can be said for YARA rules or other detection software. Decoding can be more expensive than just applying a simple regex ;}

See how your favorite base64 implementation handles encoded data like this! Try leaving off the padding at the end, decoding at arbitrary chunk points, including whitespace and symbols with mixed encoding, etc.
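
To make the filter problem concrete, here is a small sketch comparing a pattern-only check with one that decodes first. The signature, the function names and the assumption that the app behind the filter decodes padding-delimited pieces are all hypothetical.

```python
import base64
import re

SIGNATURE = "bmV0c3Bvb2t5"  # base64 of 'netspooky'
PIECE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def naive_filter(text: str) -> bool:
    """Flag input only if the encoded signature appears verbatim."""
    return SIGNATURE in text


def decoding_filter(text: str) -> bool:
    """Decode each padding-delimited piece, then look for the plaintext."""
    decoded = b"".join(base64.b64decode(piece) for piece in PIECE.findall(text))
    return b"netspooky" in decoded


if __name__ == "__main__":
    for sample in ("bmV0c3Bvb2t5", "bmV0cw==cG9va3k=", "bmU=dHM=cG8=b2s=eQ=="):
        print(sample, naive_filter(sample), decoding_filter(sample))
```

Only the unmodified string trips the naive check, while the decoding check catches all three at the cost of doing the actual decode.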

I wrote a quick and dirty script to generate some mutations based on the uneven chunk obfuscation technique outlined above. Feel free to use it for both attack and defense!
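
For readers who want to experiment, a generator for uneven chunk mutations might look something like this. It is my own minimal sketch of the idea, not the script mentioned above, and mutate and random_sizes are made-up names.

```python
import base64
import random


def mutate(data: bytes, sizes: list[int]) -> str:
    """Encode consecutive slices of `data` separately and concatenate them."""
    if sum(sizes) != len(data) or any(size <= 0 for size in sizes):
        raise ValueError("sizes must be positive and sum to len(data)")
    pieces, offset = [], 0
    for size in sizes:
        pieces.append(base64.b64encode(data[offset:offset + size]).decode("ascii"))
        offset += size
    return "".join(pieces)


def random_sizes(length: int, rng: random.Random) -> list[int]:
    """Pick random slice lengths that add up to `length`."""
    sizes = []
    while length > 0:
        size = rng.randint(1, length)
        sizes.append(size)
        length -= size
    return sizes


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        sizes = random_sizes(len(b"netspooky"), rng)
        print(sizes, mutate(b"netspooky", sizes))
```

Slices whose length is a multiple of 3 produce no padding, so they blend into their neighbors; the other lengths are what introduce the mid-string ‘=’ characters.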

Use Cases:

Thanks to dnz for telling me to actually write this stuff down and turn it into a proper tool. Thanks to remy and gren for feedback. Shoutout tchq, vxug, tcpd.