DEF CON CTF 2023 Qualifiers - IFUCKUP
Overview
Summary
This is a detailed writeup for the “IFUCKUP” challenge at the DEF CON CTF 2023 Qualifiers that I solved together with several team members from Norsecode (a collaboration of several CTF teams, including Kalmarunionen). The challenge binary that had to be exploited permits a buffer overflow on the stack, but also uses a “new and improved ASLR” and relocates stack and binary to random addresses constantly. As we learned afterwards, this challenge was based on an older “F.U.C.K.U.P.” challenge in which the intended solution was to break the pseudorandom number generator (PRNG) that determines the relocation addresses. For this challenge the PRNG was not supposed to be breakable anymore, with the intended solution being to use vdso instead. However, several teams, including us, broke the new PRNG anyway.
Writeup by: shalaamum
Solved by (alphabetical): killerdog, mikbrosim, myldero, Pie, Raptor Jesus, rot256, shalaamum, zanderdk, zopazz, zzz
This writeup was first published in this repository.
Introduction
This was the challenge at the DEF CON CTF Qualifiers for 2023 that I spent most time on, first reversing the binary, and then working out how to break the PRNG, both with killerdog and rot256. Implementing this we had two parallel efforts, with killerdog and rot256 trying with z3, which ran into problems due to what we later found out was a bug in a specific version of z3, and me implementing a linear-algebra solution with Sage. The original version of my script would have been too slow for the timeout of only one minute (at least on a single core), so primarily Pie and Raptor Jesus worked on improving my script, speeding it up very significantly. The end result of this half of the solution was a script that could recover the internal state of the two PRNGs in the running binary. The second half of the solution, using this to obtain a shell, which includes predicting what the binary’s address will be at execution of the payload (using the crypto-half) was then done primarily by rot256, zanderdk, and zzz.
At the time of the DEF CON Quals I was still quite new at using GDB, and that the binary relocates itself constantly made debugging particularly difficult. One thing I was stuck on a while during the contest was that the binary changed the rounding mode in the control word of the x87 FPU, leading to discrepancies to the values I was expecting from the Ghidra decompile1. I did not contribute to crafting the payload used to get a shell, or to solving the problem of predicting future addresses using the recovered internal state of the PRNG, which was done by running a new instance of a patched binary locally in GDB, stopping at the point at which the internal state of the PRNG is known, and then replacing the internal state of the PRNG of the local binary with the recovered one of the binary running on the remote, and then running the local instance up to the point we want to know the base address, and then extracting that information.
The motivation for this writeup was mostly that I wanted to learn how to properly debug the binary and how to do the parts of the solution I wasn’t involved in during the contest. Because of this I rewrote all of the solution code from scratch, and thus this writeup does not reflect the solution Norsecode used during the contest. Some parts of this writeup are also more detailed than would be reasonable during a live contest, where it would be unsound strategically to spend excessive time understanding something that does not contribute to solving the challenge faster.
The bulk of this writeup is contained in the “Static reversing” section below, where we go through the entire program in detail, statically, and explain what everything does. This section also already contains steps towards a solution, by e.g. not only discussing how raw PRNG values get converted to values that are output, but also how to obtain some of the raw bits from the output values. The section after that, “Dynamic debugging” then discusses how to deal with the constant relocations when trying to dynamically understand the behavior of the binary. A solution using a GDB script (using the Python API) with a stop handler that automatically fixes breakpoints when relocating is discussed, which allows debugging across relocations. This is then used to verify some of the understanding gained from the static reversing part. Finally, the “Solution” section discusses the implementation of the actual solution, from recovery of the PRNG state to the simple ROP chain used to obtain a shell.
Static reversing
We are given a statically linked, 32-bit, stripped ELF binary.
Running the binary
If we run the binary we get an initial message and are then dropped into a infinite loop in which we can choose one of 6 options from a menu. The welcome message and the first option gives us some information on what the challenge is about:
| |
If we choose option 2, we obtain a printout claiming App moved to new random location, while for 3, we are printed addresses of the stack and binary, as well as a random number:
| |
When choosing 4 we can enter 100 bytes of data, which judging from the crash appear to overflow a buffer on the stack:
| |
Finally, option 5 prints out 8 lines of 8 random numbers (4 bytes each, in hex), separated into two groups of four on each line:
| |
entry
The decompile of the entry function is initially as follows:
| |
From running the binary previously we can guess what the functions in the switch cases must more or less do, and FUN_080495a4 must print the menu and retrieve the user’s choice.
syscall
FUN_0804902d is called in various places, in particular as FUN_0804902d(1,0,0,0,0,0,0,uVar1); in the entry function, and we know this must quit the program. The decompile of this function initially shows no arguments, but we can see from the disassembly
| |
that this function performs a syscall, and we should set the function signature of this function to
| |
Now the call in entry is syscall(1,0,0,0,0,0,0);, corresponding to the syscall exit(0).
print
If we input a menu option that isn’t valid, a call to FUN_08049051("Unknown command\n"); is made, so presumably this is some kind of print, and after renaming/retyping variables we see that it uses the write syscall to write the string to stdout.
| |
entry again
After renaming the functions, the entry function now looks like this:
| |
We are now going to go through the various functions offered by the menu.
tohex
The view_state function contains lines like
| |
from which we can guess that FUN_08049000 gets an unsigned 32 bit number as first argument, and then converts it to a hex string with 8 characters in the buffer passed to as the second argument. The cleaned up decompile confirms this:
| |
view_state
The view_state function then looks like this:
| |
The addresses and “random” that are printed come from globals, but it is at this point not clear yet where exactly the value comes from.
get_random
The get_random function that can be selected from the main menu and prints 8 times 8 random values looks like this in the Ghidra decompile:
| |
After some cleaning up we obtain this:
| |
Now first we can notice that the general structure seems to be that there are two PRNGs involved here: We get values from the first PRNG, which has state stored where DAT_0804b6c4 points to, and every time we are in the middle of a line (where “- " is printed), the first PRNG is advanced a bunch of times without us getting the value printed, with the amount of times determined by a secod PRNG, whose state is stored where DAT_0804b6c4 + 0x44 points to.
The two lines
| |
are curious: what exactly is going on? We will come back to this point later.
step_prng
First, let us look at the step_prng function. From what was discussed above we can guess that the argument might be a pointer to 0x44 bytes of memory.
| |
We can see here that - up to the very end - operations are integer operations, not floating points. Furthermore, we notice that there are many uses of the array passed as an argument, but with the exception of the lines
| |
the index is always 1 plus something, while in the one case here we see that the first component just gets decremented modulo 0x10. As uVar5 and uVar6 are used to calculate indices in the other uses of param_1, this leads us to guess that the first component is some kind of counter, and separate from the other 0x10 unsigned ints, that make up the rest of the state. We thus define a struct prng_state which has a field uint index, followed by a field uint[0x10] state.
We now have the following decompile:
| |
Let us break this down a bit. The state consists of an integer between 0 and 15 that we provisionally called index, and 16*32 = 512 bits that we called state. Note that the new state after running this function is F_2-linear2 in the previous state, and the linear map being applied depends only on index. We can thus define A_0, ..., A_15 to be the 512 x 512-matrices over F_2 that define the state-change depending on index, i.e. thinking of the effects on prng of this function can be described as:
| |
and the result before the floating-point conversion is then:
| |
Let P be the permutation matrix that cyclically shifts all the bits, so that the bit with index i is afterwards found at index (i - 32) % 512, i.e. P * (v_0, ..., v_511) = (v_32, v_33, ..., v_511, v_0, ..., v_31). Then note that we have A_i = P^-i * A_0 * P^i.
Now define state' = P^index * state. Then note that state' updates in a way that can be described more succinctly:
| |
and for the integer result we have:
| |
The upshot is that if we want to recover the internal state of the PRNG to be able to predict future values, then index is irrelevant, this is just an implementation detail for more efficient handling of the shift part, where instead of applying a permutation matrix that requires modifying 16 doublewords of memory, only decrementing a single doubleword is required.
We will thus define A = P * A_0 and consider state' in our attempt to break this PRNG.
The final thing of note here is what is up with the line
| |
i.e. how the return value depends on result_uint.
This is where the Intel® 64 and IA-32 Architectures Software Developer Manuals come in handy.
In the following, we will write z for result_uint to shorten notation.
The relevant disassembly is the following:
| |
The ADD and the POPs are just recovering registers and stack, the important operations are FILD, FLD and FMULP. What happens is:
- The 32 bit unsigned integer is moved onto the stack, where it beforehand has been zero-extended to 64 bit. Those 64 bits can thus be interpreted as a signed integer and still represent the same value as the unsigned
z. FILDpushes this 64 bit signed integer to the FPU stack. This is a precise operation.FLDpushes a single precision float with raw (byte) valuef8000000to the FPU stack.FMULPmultiplies the two values at the top of the stack, pops both, and pushes the result.
Now let us figure out in detail what the result will be.
We start with 0x2F800000, which is interpreted as single precision float.
The fractional part of the significant is encoded with 23 bits, which happen to be all 0.
A 1 as the integer part is implied (normalized form). Thus the significant is precisely 1.
The top bit is the sign, which is 0, for positive. What remains is the biased exponent, which is
then 95. The biasing constant for single-precision floats is 127, so the actual exponent is 95 - 127 = -32.
This means this floating point number precisely encodes 2^-32.
The multiplication with z then gives 2^-32 * z.
Multiplication is infinitely precise and then
rounded when stored as double extended precision.
There are two cases:
- Case
z == 0: Then we get a precise 0. - Case
z != 0: Then writez = 2^m + z'withz' < 2^mand aszwas 32 bit value, we will havem < 32. Then for a precise fractional part of the significant we will needmbits. But there are 63 available, so this is precise. The exponent will be-32 + m. The biased exponent is thenm - 32 + 16383, which can be represented in the 15 bits available.
The conclusion is that the function step_prng returns the precise value 2^-32 * z.
get_random again: floating point details
The decompile for the printout of one random number was the following:
| |
It is important that we understand precisely how the floating point number 2^-32 * z that gets returned by the call to step_prng gets converted to an unsigend int that is passed to tohex. The relevant assembly is the following:
| |
So as we noted before, after the CALL step_prng, the top of the FPU stack will contain ST(0) = 2^-32 * z, precisely.
The FMUL operation in this function then multiplies this value in-place with the double stored as 0x41EFFFFFFFE00000.
We first need to decode what 0x41EFFFFFFFE00000 represents precisely.
The top bit is zero, so this is positive. The biased exponent is 11 bits and has
value 1054. The biasing constant is 1023, so the exponent is 31.
The fractional part of the significant is 52 bits. Those bits have value
4503599625273344 so the significant is 1 + 4503599625273344 * 2^(-52).
Thus the value represented is
| |
This value is then multiplied with 2^-32 * z (which was precise for all z that we consider).
We get
2^(-32) * z * 4294967295.
As both z and 4294967295 are 32-bit integers, their product fits in 64 bits.
This means 63 bits suffice for representing the fractional part of the significant precisely.
The exponent has absolute value at most 32, so also fits in the available 15 bits after biasing.
Hence this value is precise.
The next operation is FNSTCW, which stores the control word of the x87 FPU in local_2e.
Then local_30 is set to 0xc0 | local_2e. The effect of this is to set bits 10 and 11, which in the control word
correspond to RC, rounding control. See Vol. 1 8.1.5. of the
Intel® 64 and IA-32 Architectures Software Developer Manuals.
FLDCW then updates the control word of the FPU with this value1. RC being 0b11 mean
that the rounding mode is set to rounding towards zero (see Vol. 1 4.8.4).
FISTP then pops ST(0) and stores it after conversion to a signed 64 bit integer in random_int, which now happens by rounding towards zero.
Finally, FLDCW resets the control word to what it was before.
Before FISTP, the value of ST(0) was precisely 2^(-32) * z * 4294967295.
We have to convert this to a 64 bit signed integer by rounding to zero.
We can rewrite the value as:
| |
We distinguish two cases:
- Case
z > 0: We have0 < z < 2^32and thus obtain0 < 2^(-32) * z < 1. Asz >= 1it then follow that0 <= z - 1 < z - 2^(-32) * z < z. Hence the rounded-to-zero integer we obtain isz - 1. - Case
z == 0: Then the value is already0, and rounded to zero this yields0again.
Note that the lower 32 bits of the signed integer z-1 (or 0), where z was an unsigned 32 bit integer, contain the value of z-1 (or 0) when interpreted as an unsigned 32 bit integer.
So we now fully understand how the z from step_prng gets converted into the unsigned integer that we actually get printed. Later we will want to go backwards. For that process we can then distinguish two cases:
- Case value is
v = 0: Then we can’t be sure whetherzwas0or1. - Case value is
v > 0: Then we can conclude thatz = v + 1.
The function get_random also calls step_prng another time: In the loop in which the first PRNG is advanced a number of times, with how often depending on the second PRNG.
The assembly here is very parallel, with the change that 2^-32 * z is multiplied with the single-precision float defined by0x437F0000 instead of the double defined by 0x41EFFFFFFFE00000.
For single-precision floats the fractional part of the significant has 23 bits. The value of the significant is thus
0b1.1111111 = 2 - 2^-7. The top bit, the sign, is zero, so the number is positive. Finally, the biased exponent is 0b10000110 = 134, so with biasing constant 127, the actual exponent is 7. The value is thus
(2 - 2^-7) * 2^7 = 2^8 - 1.
With the same argument as before we can conclude that the result of the multiplication will then be 2^(-32) * z * (2^8 - 1) exactly. It remains to consider the conversion to a signed 64-bit integer while rounding towards zero.
For this, let us write z as z = 2^24 * z' + z'' with 0 <= z' < 2^8 and 0 <= z'' < 2^24. Then we must round towards zero the value
| |
What range can - 2^-8*z' + (2^8 - 1)*2^-32*z'' be in? The value can not become smaller than -2^8 * (2^8 - 1), which is bigger than -1. At the same time, the value can not be bigger than (2^8 - 1) * (2^-32) * (2^24 - 1), which is smaller than 1. Thus z' - 1 < v < z' + 1, so that we can conclude that v rounded towards zero yields either z' - 1 or z'.
So if we want to go backwards and recover bits of z from w, the rounded-towards-zero value, then we know that z' (which are the top 8 bits of the 32-bit unsigned integer z) must be either w + 1 or w. For many w, some of the top bits of w + 1 and w will agree, in which case we thus recover bits of z.
One useful thing to ask here is how often this happens. If we make the simplifying assumption that w is a uniformly random 8-bit unsigned integer, then how many bits of z can we be sure about on average? The most significant bit, i.e. bit 7, is the same for w and w+1 if and only if at least one of the lower bits is 0. This chance of this is 1 - 2^-7, as the chance of all 7 least significant bits being 1 is 2^-7. More generally, bit i is the same with probability 1 - 2^-i. This means the number of bits we can expect to obtain is
| |
So we can expect on average to obtain about 6 bits of information about z from w.
change_random
From what the running binary claims, this function should relocate binary and stack, and then printout having done so. After changing the function signature and name of the called function, the decompile is as follows:
| |
relocate
The decompile initially looks like this:
| |
The parts surrounding step_prng calls are similar to what we saw in get_random. The second PRNG is used to step the first PRNG between 0 and 255 times, and then the first PRNG is used to get a random value for the first syscall. Cleaning up the code and looking up the syscalls we obtain the following:
| |
So the first PRNG is used to mmap at a random address. Because it could happen that the address can not be mmapped, the loop just repeats the attempt with new random addresses until it succeeds. After mmapping something, a first function is called to presumably copy the binary over, and then part of that memory is made executable and readable only. Presumably the last function does the actual relocation.
Because we need it later we should still look a little bit more closely at the floating point stuff surrounding the stepping of the second PRNG.
| |
After the call, the raw random value is in ST(0). The precise value 255 is then pushed. FMUL ST1 then multiplies these two, but this time nothing is popped, and the result is written into ST(1). Then the last two values on the stack are swapped, so that now the top of the stack contains the product, and then we have 255. For some reason the top of the stack is then popped and stored in memory as a double precision floating point value with the previous rounding mode, then loaded again and popped and stored as an integer with rounding mode set to round towards zero. Finally, the value 255 is also popped and stored, to local_54.
memcpy
FUN_08049089 initially looks as follows.
| |
This is clearly an implementation of memcpy:
| |
transition_to_new_base_and_relocate_stack
At the end relocate, the function FUN_0804920c is called, which we now tentatively call transition_to_new_base_and_relocate_stack. In that function, the very first instruction is CALL FUN_08049205. That function in turn has the following disassembly:
| |
So after the CALL, the address of the next instruction, i.e. where this function should return to (usually with RET), is on the top of the stack. This gets popped into EAX, but then we jump to EAX - *EDI + ESI. Given that we expect code around these parts to move execution to the new copy of the binary, this makes sense if EDI stores a pointer to the variable holding the current base address, and ESI is the new base address. Changing the function signature of this function we obtain the following decompile:
| |
Changing the function signature of transition_to_new_base_and_relocate_stack as well the end of the relocate function now looks acceptable:
| |
Let us continue with transition_to_new_base_and_relocate_stack. The function indeed relocates the stack, and is also split out as it’s own function by Ghidra even though it still uses the stack frame of relocate (but given the changing around of EBP, Ghidra presumably would not be able to out of the box recognize the locals anyway). This makes the decompile a mess that is very difficult to follow. I am thus going to mostly follow the listing view here. Note that when doing this challenge during a live contest, one has to decide where to spend time and it was not necessary to fully understand this function to solve this challenge. But now there is no live contest anymore, so I can satisfy my curiosity regarding how this works.
After the call to jump_to_new_binary_location, this is the next block:
| |
Initially, ECX is set to the stack pointer, and EDX is set to the next higher address that is divisible by 1 << 0xc = 4096. The part after that is a loop, where after each iteration ECX is increased by 4 and the loop stops once ECX is not smaller than EDX anymore. What the loop does is loop over each doubleword in the part of the stack up to EDX, and if the value is an address pointing to the memory in which the binary was mapped to before, then the value is adjusted to instead point to the new mapping. Note that there is a bug here: We really should update the entire stack. Depending on e.g. the amount of environment variables etc. the current stack pointer could be just below a multiple of 0x1000 = 4096. In that case, we might not even update the return address of the relocate function, causing a crash on return.
The next block is the following.
| |
Here EBX is modified. What was the previous value and how can we interpret it? For this we have to go back to the start of the relocate function where we see
| |
The function __i686.get_pc_thunk.bx is very simple: It places the address of the next instruction (i.e. the instruction it returns to) into the EBX register. This is used in position-independent code in order to calculate addresses of other parts of the binary. If the binary were loaded at the base address Ghidra uses to display the listing, then the value would be 0x080490e6 after the call. After adding 0x25d2 we get EBX = 0x804b6b8. If we check that address in Ghidra (or use readelf) we can see that this is exactly where the .got.plt section starts, which is followed by the .bss section. Most other functions in the binary also begin by (after saving EBX onto the stack) calling __i686.get_pc_thunk.bx and then adjusting EBX so that it points to .got.plt. EBX is then used to access globals, for example the instruction after the next one after the listing above is
| |
In which the address of the global prngs is obtained by adding 0xc to EBX.
For reading more of the code it will be helpful to have a list of globals as offsets to .got.plt ready for reference:
0x0c:prngs0x10:DAT_0804b6c80x14:current_random0x18:stack_address0x1c:binary_base
Now let us get back to transition_to_new_base_and_relocate_stack. We had the following block.
| |
What this does is first to adjust EBX to now point towards .got.plt in the new, relocated, mapping for the binary. Next, 7 doublewords are released from the stack, which is just to remove the arguments to the syscall(0x7d,binary_base_new,0x3000,5,0,0,0); call in relocate, which had not been done yet.
The next block is then
| |
This is a call to munmap(binary_base, 0x4000), so we now unmap the old mapping for the binary.
| |
As ESP-relative addressing is not used (the stack is either used by addressing relative to EBP, or with PUSH/POP), we will ignore the ESP adjustments. To understand what EAX will be after the first instruction we have to look back to relocate and find the last time a write to EBP - 0x2c happens. The way Ghidra names locals on the stack this will then be local_30. We find that EAX will here get the value of binary_base_new. We then store this in the global binary_base.
| |
The first three instructions push the address of prngs[1] to the stack. Then we call step_prng, so we get a random value from the second PRNG. This value is then multiplied with local_54, which we previously identified as being the precise value 255. The rest is then doing the same as what we saw in relocate, and at the end EAX will have the value of 255 times the return value of step_prng, converted to a 32-bit unsigned integer by rounding towards zero.
The next block is then the following.
| |
This is a loop in which the random value EAX from the previous block, which can be between 0 and 255, is the amount of times we call step_prng(prngs).
| |
Next we call step_prng(prngs) another time, but this time we use the output. The handling of the floating point numbers is again very similar to what we just discussed, just this time we multiply with a stored 4294967295 from the stack. The part depending on the return value of step_prng ultimately writes the obtained unsigned int (for details about what the multiplications and rounding does, see the section on get_random) to EBP - 0x58 and EBP - 0x24. Then EAX and EBO - 0x2c become this value rounded down to the next value divisible by 4096. Interspersed with all this, the arguments to a syscall are pushed to the stack. This call ends up being mmap2(random value, DAT_0804b6c8, 3, 0x22, -1, 0). So we mmap at the random address, of length DAT_0804b6c8, which we thus should rename stack_size.
| |
The return value of the previous syscall is checked against the requested address; if it differs, we jump back and try again (again, just like in relocate). Then we call memcp(new_stack_address, stack_address, stack_size), where new_stack_address is the random address we mapped. We thus copy the stack to the newly mapped memory.
| |
This is a loop. Before the loop starts we set ECX = stack_size, EDX = new_stack_address, and EAX = stack_address. What Ghidra calls current_base_ptr is just the register EDI that is being reused, and new_base is ESI. For some reason Ghidra does not allow splitting this out as a new variable or telling it to forget the name, or at least I did not figure out how to do it. But we can just edit the listing here in the writeup:
| |
EDI is set to stack_address + stack_size. In the loop EDX gets incremented until it hits new_stack_address + stack_size. In each iteration, we load the value at EDX (so one of the doublewords on the new stack) into ESI. Then we check whether this is bigger or equal to EAX = stack_address and also smaller than stack+address + stack+size. If it is, then this means the stack contained what looks like a pointer to somewhere in the old stack. That pointer is then adjusted to instead point to the new stack.
| |
The first five instructions now adjust ESP and EBP to point to the new stack.
Then the syscall munmap(old_stack_address, stack_size) is made.
Finally, the new stack address is written to the global stack_address.
| |
This snipped is identical to one we already had in this function: It calls step_prng with the second PRNG, and converts the return value to a random value from 0 to 255 in EAX.
| |
This continues being identical to previous instructions, and is a loop that calls step_prng(prngs) EAX many times.
| |
This is nearly identical to what we had previously: A call to step_prng(prngs) is made, and the return value multiplied by 4294967295 and rounded down. This time the result is written at EBP - 0x48, and the global current_random.
Now we finally arrived at the last block:
| |
This is just a standard function ending, restoring registers, releasing the stack frame, and returning.
Overview of relocate
Here is an overview of what relocate does:
- Generate a random number between
0and255from the second PRNG and step the first one that many times. - Get a random address from the first PRNG, map it, and copy the binary there. If the address can’t be mapped, try again.
- Continue execution in the new copy of the binary.
- Adjust doublewords on the stack within the same page as ESP that could have been pointers to the old mapped binary and adjust them to point to the new mapping. Also start using globals from the new mapping instead of the old one, and then unmap the old mapping for the binary.
- Update the global
binary_base. - Generate a random number between
0and255from the second PRNG and step the first one that many times. - Get a random address from the first PRNG, map it, and copy the stack there. If the address can’t be mapped, try again.
- Adjust all doublewords on the stack that could have been pointers to the old stack and adjust them to point to the new mapping.
- Adjust
ESPandEBPto use the new stack, update the globalstack_address, and unmap the old mapping for the stack. - Generate a random number between
0and255from the second PRNG and step the first one that many times. - Generate a random 32-bit integer using the first PRNG and store it in the global
current_random.
stack_smash
The last menu-function we have not looked into yet and that does something relevant is the one promising us to be able to smash the stack. After renaming variables and function this is the decompile:
| |
The read function we haven’t looked at yet, which we thus should do next.
read
This function looks like this
| |
The syscall here is read(0, buf, num - i). So there is a loop that reads up to num bytes into the buffer, aborting if the read syscall fails. Once num bytes have been read, the first PRNG is advanced as many times as the number of bytes the last read call actually read. This means that the number of times is not quite deterministic, it can depend on delays in transmitting our data to the remote. However, it seems likely that if we send all num bytes at once, that all will be read in one go. Alternatively one could send num-1 bytes, then wait long enough to ensure these bytes have already been read, then send the last byte. After advancing the PRNG, relocate is called once.
So stack_smash indeed gives us a buffer overflow on the stack, but due to relocate being called twice before the return, we can not use binary and stack addresses from the view_state menu option, we will need to predict addresses by breaking the PRNGs.
entry revisited
entry now looks as follows:
| |
While we understand all the menu options now, there are some calls before the loop that presumably set up the program, and then in the loop there is a function that obtains the option the user chooses. We will now go through them.
set_stack_address_and_size
The decompile of FUN_08049780 looks like this, after cleaning it up:
| |
So an address on the stack is obtained, and then we attempt to use mprotect on 0x1000 byte long ranges of the stack, going downwards and upwards, until we found the limits of the mapping, in order to set the stack_address and stack_size globals correctly.
initialize_prng_random
Next, the function FUN_08049a06 is called twice at the start of entry, once for each of the two PRNGs, and is used to initialize the state with random data.
| |
memset
The function FUN_080495a4 (that is used to make the user choose a menu option) calls two functions FUN_080490c9 and FUN_0804942f that we will look at first. The first of those looks like this after cleaning it up:
| |
So this is clearly memset.
read_and_step
FUN_0804942f looks like this:
| |
A single character is read in from stdin, until the character in the third argument is reached, or the total number of characters were read reaches the second argument. The characters is stored in the first argument, and the first PRNG is stepped the num_read - 1 times.
menu
The FUN_080495a4 function can now be made to look as follows:
| |
So the menu is printed, a buffer for the string read from the user is set to zero and then read in (using the previous function, that also steps the first PRNG), and then the choice is converted to an integer and returned.
entry, last time
For completeness, here is how entry looks at the end:
| |
Dynamic debugging
In the previous section we went through the entire binary and reversed it statically. In a contest setting it would of course be prudent to speed up the process with dynamic debugging, as well as using it to confirm the findings. As the binary constantly relocates itself, this makes just using GDB directly a little more difficult however. During the contest zzz produced a patched version of the binary that did not carry out the relocation, which was very helpful to debug the x87 parts. However, such patching can cause differences in the behavior of the binary, for example by missing invocations of step_prng. My goal for this writeup was thus to do things differently and be able to debug an unpatched binary.
Debug symbols
Given that we have identified the various functions statically already, it would be useful to access the symbols in GDB. To do this, we can use Ghidra2Dwarf. Due to the relocation, we can’t really use it much in this case though.
Avoiding the crash due to the bug in transition_to_new_base_and_relocate_stack
Recall the bug in transition_to_new_base_and_relocate_stack that can cause a crash depending on the current size of the stack. In my case I actually hit this bug in GDB, with the binary crashing when attempting to return from that function, because the return address had not been updated, as the walking-over-the-stack-and-adjusting-addresses-loop had stopped at a lower address. For solving the challenge on the remote there really isn’t anything we can do except hope that the environment happens to be such that the binary does not immediately crash before the first user interaction (unlikely, as the admins would hopefully notice that). If it happens locally, then a simple fix is to remove enough environment variables. GPT4 suggested to me to run python -c 'import os; print("\n".join("unset environment " + x for x in os.environ.keys()))' > unsetenv.gdb to generate the GDB commands to unset all environment variables, and then run GDB with gdb -x unsetenv.gdb ./ifuckup_dbg, and this solves the problem.
The problem with relocations
If we just try to use GDB directly, by running gdb ./ifuckup_dbg, and setting a breakpoint at get_random, then we will get something like the following:
| |
So what happens? Why do we get a segmentation fault, and what is this about removing breakpoints? Here is my understanding what happens (partially based on speculation, you should not count on this being true in details, please let me know if you have any corrections):
What GDB does when setting a breakpoint at an address (in our case 0x8049813, the first instruction of get_random), then the byte at that address is replaced by 0xCC, the opcode for int3. Execution of that instruction raises a SIGTRAP, and the program is stopped. GDB then replaces the int3 by the original byte at that spot and rewinds the instruction pointer by one instruction (so that it points to the instruction where we wanted to break and that we just repaired, as it was not executed yet). If we have more breakpoints, then likely GDB also repairs them so that everything looks normal when the user inspects memory. Then the user can do what they want and inspect the situation etc.. Once the user commands GDB to continue, GDB presumably steps a single instruction, and then adds in the 0xCC for all the breakpoints again, and then resumes execution.
Now in this case, we set a breakpoint to 0x8049813 so GDB places a int3 in that spot, replacing the previous PUSH EBP. The binary then runs, and in particular relocates the binary. The int3 is copied with everything else of course, but is now located at a different address, in this case 0xe1a10814. So far this does not cause a problem, and execution proceeds normally, and we can select the Get random values option from the menu. Once execution reached the int3 at 0xe1a10814, this causes a SIGTRAP, which GDB announces to us. Three lines below that we see 0xe1a10814 in ?? (). The address here is one higher than the address at which the int3 was located: this is the address of the next instruction to be executed (int3 has a one-byte encoding). Now GDB knows of a single breakpoint at 0x8049813, and none at 0xe1a10813, so it does not do anything with regards to 0xe1a10813, but tries to change the byte at 0x8049813 back to its original value so that everything looks normal if we were to inspect that part of memory. However, that address is now not mapped anymore! So GDB can not do that and hence does the sensible thing and informs us that
| |
Yet, we can still ask to continue. In itself this does not cause a problem. However, note that because GDB did not know of a breakpoint it manages at 0xe1a10813, it did not rewind the instruction pointer by one operation, and neither does it fix the byte at 0xe1a10813 (and how could it, it has no way of knowing what the value should be). Thus PUSH EBP was not executed and is thus missing on the stack. The execution can still continue as normal, as PUSH EBP is also encoded as just a single byte. But at the very end, one entry missing on the stack means that the wrong EBP and return address get popped from the stack, causing a segmentation fault when trying to return to that address.
To confirm, we can after the SIGSEGV check whether int3 is still in that place, and it is:
| |
Automatically fixing breakpoints during relocation
In order to fix the problem just described, we can employ the following strategy. We set a breakpoint at exactly the instruction that passes execution from the old to the new copy of the binary. If we stop at that breakpoint, we remove all the breakpoints so far, and recreate them in the new copy. Additionally, there is one thing we must do before recreating the breakpoints: The old version of the binary had an int3 at the address of the breakpoint instead of the correct byte from the binary. This got copied over to the new version. When stopping, GDB recovers the previous byte at the old version of the binary, but does not do so for the new, copied version (because it does not know it as a breakpoint). We thus need to not only delete the breakpoint and add a new one, we also need to fix this byte in the copied version, before adding it as a new breakpoint.
Of course, we do not want to do this manually. The following Python script will thus do this for us, if we load it into GDB before executing the binary.
| |
With this we have now bypassed the relocation issue and can more normally debug the binary with GDB. The situation is not quite as good as without relocations, because GDB’s knowledge of symbols does not survive the relocation, but in this case we do not have e.g. complicated data structures for which this would be a huge usability improvement.
Figuring out why the binary sometimes misunderstands our menu choice
During the contest, my script for recovering the PRNG states failed sometimes (less than half of the time), with a UnicodeDecodeError, due to the remote sending back a non-decodable byte. As the solution of just running it a second time was more efficient than spending time debugging this, I did not figure out what the cause for this was during the contest. But now we can look into this. Checking when the decoding error happens we can quickly see that the byte in question comes from the menu function, where the input we send gets repeated back to us:
| |
This is then usually followed by Unknown command\n, as is to be expected if the menu choice contains a random byte.
We can use the following script to debug this:
| |
This will repeatedly call the get_random function, and print the data we receive from the running process, but stop once we hit the Unknown command case.
The output of the script will end something like this:
| |
The number of calls until the error occurs is not constant, but is often in the range of a couple hundred. A guess we can make here is that it might occur on average every 256th call.
To understand why this might happen, we can start statically. The choice_str buffer, where the first byte evidently weirdly seems to be something different than expected at printout, is stored in the stack frame of the menu function:
| |
One thing to immediately notice here is that choice_str is located at 0x11 below the top of the stack frame, so not doubleword-aligned. This means that the very first byte of choice_str is the most significant byte of the aligned doubleword that starts 0x14 below the top of the stack frame. As we can see in the code snippet below, the buffer first gets zeroed, then the read_and_step function gets called, and then it is printed out.
| |
It is thus likely that something weird happens in the read_and_step function, which looks as follows:
| |
This in itself looks like it should, however a call to relocate happens at the end. Could that be the problem? Relocation changes addresses on the stack that look like pointers to the stack or binary to adjust them to the new addresses for the new mappings for stack and binary, so perhaps this is what happens? To check this we can use GDB. We add the following to our Python script loaded to GDB, in order to print out the crucial doubleword just before and after the call to relocate.
| |
In GDB we then see output like the following:
| |
Note that the output from before relocation always has 0x35 as the most significant byte, which corresponds to the character “5”. The output also confirms that our speculation was correct: Things fail exactly when the base address of the binary also happens to have most significant byte 0x35. The other three bytes are left over from previous function calls, so it seems there just happens to be left an address of the binary in that spot, where the first byte of our input then overwrites the most significant byte. Checking the corresponding offset in the binary, we see it is the address of the next instruction after the call to __i686.get_pc_thunk.bx in get_random, so in the previous call to get_random from the main loop in entry, the return address of this call happens to be placed in exactly the double word in which the first byte of our input is stored later.
| |
Collecting data to verify static reversing
We want to recover the internal state of the PRNGs from the Get random values menu option. For that it will be important that we understood step_prng and the relationship between it’s return value to the values actually printed correctly. For this we will generate some output directly from the binary, like this:
- We stop at the start of the
get_randomfunction, and extract the current state of the two PRNGs - We stop near the end of the
step_prngfunction and extract the value ofresult_uint. - We stop in
get_randoma bit after the two spots in whichstep_prngis called, just after conversion of the return value to an integer, and extract that value.
The following stop handlers will do this for us, writing the data into a file debugdata.py in a way we can easily import later.
| |
Using this together with the following script we can then gather data to test our understanding of the PRNGs and their relation to the output data.
| |
We can then implement the PRNG as follows
| |
and then test this against the data obtained before using the following script.
| |
This will output Success!, confirming the understanding obtained from the static reversing described in the previous section was correct.
Solution
Via the Test stack smash menu option we can write a ROP chain onto the stack. To be able to construct a working ROP chain we will need to know some addresses though, and the binary relocates itself constantly to random addresses obtained from the two PRNGs. As the raw output bits of the PRNGs depend F_2-linearly on previous states2, and the Get random values provides us with outputs from the first PRNG, interspersed with gaps depending on outputs of the second PRNG, we can recover the PRNG states and thus predict future values.
Recovering the PRNG states
The internal state of the PRNGs that we wish to recover consist of 512 bits. Both state updates as well as the raw output bits (by this I mean the variable called result_uint in the reversing section) depend F_2-linearly on the state. Thus, if we know the value of the i-th bit of raw output of one of the PRNGs, after stepping j times previously, then this gives us a F_2-linear equation satisfied by the original state. The get_random menu option gives us bits of output of the first PRNG: We first get 4 consecutive outputs that leak (unless the value is 0) all 32 bits of raw output. Then there is a gap of up to 255 outputs we do not get, followed by 8 consecutive outputs, then there is another gap, and then another 8 consecutive outputs, and so on. 8 consecutive outputs of 32 bits give us 256 linear equations satisfied by the 512 bits of the original state. Guessing the length of the gap, we can then add another 256 linear equations from the next block of 8 consecutive outputs. With this bruteforce over 256 possible values for the gap, we thus have a simultaneous system of 512 F_2-linear equations satisfied by 512 unknowns. Usually, this linear system will not have full rank, so it does not fully determine the original state. However the probability of getting full rank, assuming random linear equations, is about 29%, and it is extremely unlikely that the rank would be smaller than 504. Thus, rather than bruteforcing over one further gap to use more output values, it is faster if we instead just bruteforce over the remaining indeterminancy, by trying all solutions of the linear system. Testing whether a solution is correct can be done by checking whether the next block of values occurs after a gap of no more than 255 values - this is very unlikely to happen by coincidence.
Having recovered the state of the first PRNG, we can then determine the precise length of the 8 gaps in the output at every call to get_random. While each gap is a 8-bit number, we on average learn 6 bits of information about the raw output of the PRNG from this (as explained in the reversing section). Thus each call to get_random allows us to learn about 48 bits of information about the second PRNG, so that we expect that roughly 11 calls should be enough to recover the state of the second PRNG as well.
The mathematics behind the PRNG solver I wrote at the contest is the same as the one here in this writeup, however everything was reimplemented from scratch. During the contest we glued the various scripts together, and due to the simulation-method used to predict the relevant binary address, this meant that the solution needed a SageMath and GDB setup at the same time. For this writeup I thus decided to make it possible to run the PRNG solver separately, by providing a simple interface over a network socket. This means it is possible to for example run the PRNG solver in a Docker container with SageMath, and have the script interacting with the binary/remote running elsewhere with e.g. a GDB setup as desired. Another advantage is that the PRNG solver can run some precomputations once and keep them in memory, and then use them multiple times when testing the interaction script (saving the startup delay for import sage.all is also not bad).
The first step in the PRNG solver is then to precompute, for each gap length from 0 to 255, the matrix that maps the original state to the sequence of raw bits the PRNG outputs, while skipping those that are part of the gap. For this it is most efficient to first compute the linear relations for 32*(8 + 255 + 8) successive raw output bits, and then construct the “gap matrices” from this by taking only a subset of the rows. Calculating the full matrix can be done as follows:
| |
We obtain the columns of the matrix we want by evaluating the outputs of the PRNG on states with only a single bit set.
Constructing a matrix from columns takes some time with Sage, so to obtain the gap matrices it is then more efficient to use the matrix_from_rows function on this full matrix than construct the gap matrices from their rows. We also precompute the kernels:
| |
To use the data that is output from get_random for solving for the original state of the first PRNG, we first need to convert the values to the raw values (i.e. result_uint). This is simple: As we discussed earlier, if the output value is not 0, then we can obtain the raw value by adding 1. Should we get an output value of 0, then we can not know whether the raw value was 0 or 1, meaning we are missing one bit of information. Rather than implement this special case, I decided to just abort in that case, as this is very unlikely to happen.
| |
For each possible gap length g we now have a “gap matrix” B_g. If the gap really was of length g, then the 512 raw output bits we obtained make up the components of a vector t that satisfies t = B_g * s, with s the original state of the first PRNG. To solve for all solutions s, we can first solve for one solution s_ and then consider all s = s_ + k with k an element in the kernel, i.e. such that B_g * k = 0. If the gap g we guessed was incorrect, then it can happen that there is no solution at all. But it also possible that there is a solution even though g is incorrect, and for the correct g we still do not know which element of the kernel is the correct one beforehand. To test which state is the correct state, we check whether a candidate state fits with the entire data we obtained, i.e. wheither the blocks of consecutive values appear as expected with gaps of at most 255 values. This is extremely unlikely to be the case for incorrect candidate states.
| |
After we obtained the state for the first PRNG, we still need to find the state for the second PRNG. But now we are able to determine the 8 gaps in each of the get_random outputs, and as discussed this gives us about 48 linear equations for the original state of the second PRNG on average. If we repeatedly ask for get_random in the main menu, then between two such calls the second PRNG will be advanced three times in relocate called from read_and_step, in turn called from menu. Similarly to the first PRNG, we can precompute the relevant matrix for the second PRNG:
| |
This matrix contains rows for each of the top 8 bits of the raw outputs, leaving out the three steppings that happen in relocate. From the gaps lengths we can observe we can not in general determine all top 8 bits of the original raw output, but instead only obtain those bits that would not change when adding 1. The following function implements extraction of the relevant indices and solving for the original state:
| |
To communicate with the interaction script, the PRNG solver listens to a port passed as an argument, after precomputing the relevant matrices:
| |
On a connection, the PRNG solver requests the output values from another get_random call by sending NEXT. Once both states have been recovered, the PRNG solver sends FINISH, receives the output of one more get_random call that is used as the last verification the values get correctly predicted, and then sends base addresses of binary and stack that will be current when returning a call to stack_smash after selecting that menu option at the very next opportunity.
| |
The last part here requires knowing how many times the two PRNGs are advanced until the crucial values used for the base address of binary and stack are obtained. At a high level, the two calls that happen are menu and stack_smash.
In menu, the PRNGs are only interacted with in the one call to read_and_step. Here the first PRNG is advanded num_read - 1 times with num_read the number of characters before the newline that we send. As we will only send a single character, this will be zero. Then there is a call to relocate.
stack_smash has two calls to relocate first, then the call to read, where the first PRNG is again advanced num_read - 1 times, which this time is 993. Then a relocate follows (within read), and then finally another relocate in stack_smash.
Thus in total we have:
- 3 calls to
relocate - Advance the first PRNG 99 times
- Two calls to
relocate
In each relocate we have (see the section on relocate above for the details), assuming that any address that is attempted to be mapped can be mapped:
- The second PRNG is stepped once to obtain a random number between
0and255, and the first PRNG is stepped this many times. - The first PRNG is used to obtain the new base address for the binary.
- The second PRNG is stepped once to obtain a random number between
0and255, and the first PRNG is stepped this many times. - The first PRNG is used to obtain the new base address for the stack.
- The second PRNG is stepped once to obtain a random number between
0and255, and the first PRNG is stepped this many times. - The first PRNG is used to obtain the new value for the global
current_random.
We can implement this as follows:
| |
From the interaction script side, we automate connecting to the PRNG solver and retrieving the required data. After obtaining the predicted binary and stack base addresses, we can ask for a stack smash. Using the scripted stop handler for GDB that handles relocation, we can without problem debug our payloads by stopping before return of the stack_smash function.
| |
ROP chain
After the previous steps, we can now overwrite parts of the stack with whatever we want, while knowing the base address of the binary at the time of return to the address we now control. We also know the base address of the mapping the stack is in, but this is less useful because we do not know how much the stack grew, as this depends on the environment, so we can not directly predict the stack pointer.
The binary contains a function syscall(uint number,undefined4 arg0,undefined4 arg1,undefined4 arg2,undefined4 arg3,undefined4 arg4, undefined4 arg5) that we can use to obtain a shell, by calling syscall(11, "/bin/sh", argv, envp,...). Here, 11 is the syscall number for execve. The man page for execve notes:
| |
So on Linux we can do syscall(11, "/bin/sh", 0, 0), but not necessarily on other UNIX systems. As we do not know on what system the remote runs, for a more generic solution we should use syscall(11, "/bin/sh", p, p), where p is the address of a doubleword with value 0 (argv and envp will then be interpreted as the empty list).
Since the binary is 32-bit, arguments are passed on the stack. Before we can call syscall, we need to get /bin/sh and a zero doubleword somewhere we know the address of. For this we need a gadget to do this, and a place to put this that is writable and won’t mess anything up.
Looking at the output of readelf -l ./ifuckup:
| |
we see that the first 0x3000 bytes of the loaded binary are not writable, but the following page is writable, but with relevant content only starting at offset 0x36b8. We thus have plenty of space in between 0x3000 and 0x36b7 for writing what we want.
To read something in, we can use read(byte *buf, uint num), which reads num bytes into buf from stdin. This function will call relocate and thus relocate the binary again. But this is not a problem: Like other pointers to the binary on the stack, ours will also be adjusted by relocate, so we can ignore this.
We will thus begin our ROP chain with the address for read. The next doubleword will be the return value for the return from read, and then the next two doublewords will be the arguments for read. Because the arguments are part of the caller’s responsibility and hence will not be popped by read, we make the return address for read the address of a pop-pop-ret-gadget, to get rid of the two arguments. We then follow this up with the address for syscall with arguments 11, the address where we wrote /bin/sh, and then twice the address of where we wrote a zero doubleword.
| |
Attachments
The attachment contains the following files:
ifuckup: The challenge binary.ifuckupgdb.py: The [handlers for GDB scripted in Python].ifuckup.gdb: Tiny GDB script to import two other files (in case you want to use GDB manually, to have less typing).prng.py: Python module with an implementation of the PRNG.gatherdebugdata.py: Script to gather basic example data.testrev.py: Script to use the gather data to check if it fits with expected behavior.debugunknown.py: Script used to debug the “unknown command” bug.solveprng.py: Solver for the PRNG states.solve.py: Main solve script.
Note, the scripts that use GDB expect to find a file unsetenv.gdb that can be generated with python -c 'import os; print("\n".join("unset environment " + x for x in os.environ.keys()))' > unsetenv.gdb, but it might also work if you leave the file empty.
Footnotes
Unfortunately Ghidra ignores changes to the rounding mode, there is an open issue for this. ↩︎ ↩︎
A bit depending
F_2-linearly on some other bits means concretely that this bit is given by xoring together (which is the same as adding modulo 2) a subset of those other bits. ↩︎ ↩︎The reality is slightly more complicated then this always being
99, see the subsection onreadin the reversing section for more details. ↩︎