The Loader's Job — PE Loading, Part 4

Relocations, imports, TLS callbacks, and the path from "image mapped" to "entry point runs." The kernel built the foundation; the user-mode loader assembles the rest of the house before the program can start.

In Part 3 we watched the kernel finish its mapping work. By the time control returns to user mode, our hello.exe sits at a randomized base address with its sections in the right page ranges, each page protected according to its Characteristics flags, and the demand-paging machinery ready to bring bytes off the disk as the program touches them. Everything the file format demands of the kernel is done.

Two things are missing, and both are concrete. First, some of the absolute addresses the linker baked into the binary may be wrong. The linker assumed ImageBase = 0x140000000, and the kernel honored ASLR by loading the image at, say, 0x7FF6A8B00000 instead. Every 64-bit pointer the linker hard-coded against the old base now points at empty memory. The .reloc section we noted back in Part 1, then dissected in Part 2, exists specifically to tell the loader what to patch. Forty-seven such pointers exist in our binary; the loader has to find and adjust each one.

Second, every ordinary statically-imported cross-module function call in the program (every call to printf, malloc, Sleep, GetLastError in our binary's source) currently goes through an Import Address Table slot that doesn't yet hold a function address. The loader has to find each imported DLL, recursively map any DLLs that DLL depends on, walk each DLL's export table, look up the requested function, compute its actual address in memory, and write that address into the right IAT slot. Two DLLs and 47 imports in our case; in a real production binary it can be dozens of DLLs and thousands of imports. (Calls resolved dynamically at runtime via GetProcAddress, delay-loaded imports, COM vtables, direct syscalls, or shellcode-style PEB walks bypass the IAT entirely; those mechanisms exist precisely because the IAT machinery doesn't fit every case.)

And it has to do both of those things before any of the program's own code runs.

This is the user-mode loader's job. It's the last leg of the journey we started in Part 1, and it's where every architectural decision of the PE format finally pays off. The loader is also where the most analyst-relevant mechanisms live: the IAT is the hooking surface that both EDRs and malware exploit, the DLL search order is the basis of an entire class of persistence attacks, and TLS callbacks are the canonical pre-entry-point execution window. By the end of this post you'll know all of them: how they work mechanically, what the loader actually does, and why these structures got shaped the way they did.

Where the user-mode loader actually is

The "loader" is not a separate program. It's code that lives inside a DLL named ntdll.dll, and it runs in the address space of the process being created. ntdll.dll is special in two ways: it's mapped into every Windows process unconditionally (no matter what the binary's import table says), and it's mapped by the kernel itself, as part of the same setup pass that mapped the EXE, rather than by the user-mode loader.

When the kernel finishes building the new process, it sets the CPU's instruction pointer not to the EXE's entry point but to a function inside ntdll called LdrInitializeThunk. That function calls into LdrpInitializeProcess and a family of related Ldrp* routines that together implement the user-mode loader. The loader runs entirely in user mode (no syscalls except where it specifically needs the kernel), reads the EXE's PE headers that the kernel has already mapped, and performs the work we'll walk through in the next sections.

The loader keeps its bookkeeping in a structure called the Process Environment Block, or PEB. Every process has exactly one PEB, allocated by the kernel and reachable through a CPU register — at gs:[0x60] on x86-64 (or via a documented field in the Thread Environment Block). The PEB contains a field called Ldr, of type PEB_LDR_DATA, that holds three linked lists of every module currently loaded in the process. The lists are kept in three different orders — load order, memory order, and initialization order — so different consumers can walk them efficiently.

This data structure is at the center of two important things. First, the loader itself uses it: when the loader encounters an import on, say, kernel32.dll, it consults PEB_LDR_DATA to check whether kernel32 is already mapped (because some earlier DLL needed it) before going to the trouble of locating and mapping it again. Second, anyone who needs to find loaded modules at runtime walks the same lists. Shellcode that needs to call GetProcAddress without an import table walks PEB.Ldr.InMemoryOrderModuleList. The Windows API GetModuleHandle walks the same lists. Process inspection tools walk them too. We'll see the shellcode walk concretely later in this part.

For now, the picture to hold in your head is: ntdll.dll is mapped into every process by the kernel; the kernel transfers initial control to a function inside it; that function reads the EXE's PE headers, performs the loader's work, and finally transfers control to the EXE's entry point. The kernel's mapping work is done before user-mode code runs at all. Everything we describe from here on happens in user mode, in functions that live inside ntdll.

Phase 1 — Base relocations

The first concrete work the loader does is fix the addresses the linker got wrong.

To understand what "wrong" means here, recall what the linker did back in Part 1. When the linker emitted our binary, it had to bake addresses into the machine code: addresses of global variables that other code references, addresses of string constants, function pointers in vtables, and so on. For most code on x86-64, the linker uses RIP-relative addressing — "the variable is 0x2EDA bytes after this instruction" — and these references survive ASLR untouched, because the distance between two things inside the same image doesn't change when the image is loaded somewhere new.

But not every reference can be RIP-relative. A 64-bit absolute pointer in a data table (a function pointer, say, or a string-table entry pointing into .rdata) has to hold a full 64-bit virtual address. The linker computes that address as ImageBase + RVA, with ImageBase = 0x140000000 for our binary. If the kernel honored ASLR and mapped the image at 0x7FF6A8B00000 instead, every such pointer is off by a delta of 0x7FF6A8B00000 - 0x140000000 = 0x7FF568B00000. The values in those slots are now garbage: they reference memory that doesn't belong to our process.

The Base Relocation Table — the .reloc section in our file layout, pointed to by data directory entry 5 — exists for exactly this purpose. It's a list of every location inside the image that holds a baked-in absolute address, organized in a format that lets the loader apply fixups efficiently. The loader reads the table, computes the actual delta (actual_base - preferred_base), and adds that delta to every value the table points at.
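The delta arithmetic is small enough to check by hand, and a minimal C sketch makes it concrete. The function names here are illustrative, not loader internals, and the sample pointer in the usage note is a made-up value.

```c
#include <assert.h>
#include <stdint.h>

/* Delta the loader adds to every relocated pointer:
   actual load address minus the ImageBase the linker assumed.
   Unsigned wraparound keeps this correct even when the image
   lands below its preferred base. */
uint64_t reloc_delta(uint64_t actual_base, uint64_t preferred_base)
{
    return actual_base - preferred_base;
}

/* Rebase one absolute pointer that the linker computed as
   preferred_base + RVA. */
uint64_t rebase_pointer(uint64_t linked_value,
                        uint64_t actual_base,
                        uint64_t preferred_base)
{
    assert(linked_value >= preferred_base); /* must point into the image */
    return linked_value + reloc_delta(actual_base, preferred_base);
}
```

With the bases used in this post, reloc_delta(0x7FF6A8B00000, 0x140000000) is 0x7FF568B00000, and a linked pointer of 0x140009040 becomes 0x7FF6A8B09040: same RVA, new base.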

The .reloc block format

The relocation table is structured as a sequence of variable-length blocks. Each block describes fixups for one 4 KB page of the image. The reason for the page-based grouping is space efficiency: instead of storing a full 32-bit RVA for every fixup, each block stores one 8-byte header (the page's 32-bit RVA plus the block's 32-bit size), followed by 16-bit entries that each carry only the 12 bits needed to address a location within that page. Once the header is amortized across a page's entries, each fixup costs 16 bits instead of 32.

Each entry is a 16-bit value. The top 4 bits encode the relocation type. The bottom 12 bits encode an offset within the page named by the block header.

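On x86-64, two types matter: IMAGE_REL_BASED_DIR64 (type 10) marks a 64-bit absolute address to patch, and IMAGE_REL_BASED_ABSOLUTE (type 0) is a no-op entry used to pad a block to a 32-bit boundary. Other types exist for 32-bit binaries (HIGHLOW, type 3, for 32-bit absolute fixups) and for non-x86 architectures (ARM_MOV32, THUMB_MOV32, various MIPS and IA64 types), but in modern x86-64 you'll see DIR64 and ABSOLUTE almost exclusively.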
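Splitting a 16-bit entry into its two fields is a shift and a mask. The sketch below shows one way to write it; the enum and function names are chosen for illustration, while the numeric type values are the ones the PE format defines.

```c
#include <stdint.h>

/* Relocation type values from the PE format. */
enum {
    REL_BASED_ABSOLUTE = 0,   /* padding entry, no fixup */
    REL_BASED_HIGHLOW  = 3,   /* 32-bit absolute fixup */
    REL_BASED_DIR64    = 10   /* 64-bit absolute fixup */
};

/* Top 4 bits of a 16-bit relocation entry: the type. */
unsigned reloc_entry_type(uint16_t entry)
{
    return (unsigned)entry >> 12;
}

/* Bottom 12 bits: byte offset inside the block's 4 KB page. */
unsigned reloc_entry_offset(uint16_t entry)
{
    return entry & 0x0FFFu;
}
```

An entry of 0xA1C8, for example, decodes as type 10 (DIR64) at offset 0x1C8 within the block's page.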

What the loader does with it

The loader iterates the blocks. For each block, it iterates the entries. For each DIR64 entry, it computes the target address as actual_base + block_VA + entry_offset, reads the 64-bit value at that address, adds the delta (actual_base - preferred_base), and writes the result back. ABSOLUTE entries are skipped. When the entire table has been processed, every baked-in 64-bit pointer in the image matches the image's actual load address.
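The loop just described can be sketched in portable C over a byte buffer that stands in for the mapped image. This is not the ntdll implementation: it assumes a little-endian host, handles only DIR64 and ABSOLUTE entries, and the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REL_BASED_ABSOLUTE 0u
#define REL_BASED_DIR64    10u

/* Apply every DIR64 fixup described by a relocation table.
   image      : the mapped image, indexed by RVA
   image_size : size of the mapped image in bytes
   reloc      : start of the relocation table
   reloc_size : its size in bytes (the data directory's Size field)
   delta      : actual_base - preferred_base
   Returns the number of pointers patched. */
size_t apply_relocations(uint8_t *image, size_t image_size,
                         const uint8_t *reloc, size_t reloc_size,
                         uint64_t delta)
{
    size_t patched = 0, pos = 0;

    while (pos + 8 <= reloc_size) {
        uint32_t page_rva, block_size;
        memcpy(&page_rva, reloc + pos, 4);        /* header: page RVA */
        memcpy(&block_size, reloc + pos + 4, 4);  /* header: block size */
        if (block_size < 8 || block_size > reloc_size - pos)
            break;                                /* malformed table: stop */

        for (size_t i = 8; i + 2 <= block_size; i += 2) {
            uint16_t entry;
            memcpy(&entry, reloc + pos + i, 2);
            unsigned type = (unsigned)entry >> 12;
            size_t target = (size_t)page_rva + (entry & 0x0FFFu);

            if (type == REL_BASED_ABSOLUTE)
                continue;                         /* padding, nothing to do */
            assert(type == REL_BASED_DIR64);      /* only type handled here */
            assert(target + 8 <= image_size);

            uint64_t value;
            memcpy(&value, image + target, 8);    /* read the linked pointer */
            value += delta;                       /* rebase it */
            memcpy(image + target, &value, 8);    /* write it back */
            patched++;
        }
        pos += block_size;
    }
    return patched;
}
```

The memcpy calls avoid unaligned access; the real loader additionally changes page protections around the writes, which a plain buffer does not need.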

Two practical notes. First, the pages the loader writes to during relocation are usually code and data pages — which are PAGE_EXECUTE_READ and PAGE_WRITECOPY respectively when the kernel mapped them. The loader temporarily promotes these pages to a writable protection while it applies fixups, then restores the original protection when it's done. The copy-on-write mechanism from Part 3 still applies: the writes give this process its own private copies of the patched pages, and other processes sharing the same image are unaffected. Second, if the image happens to load at its preferred ImageBase — which is rare with ASLR but not impossible — the delta is zero and the loader can skip the table entirely. Older Windows versions (and binaries built without DYNAMIC_BASE) made this the common case; modern ASLR makes it the rare case.

What it looks like in hello.exe

Our binary has a .reloc section of 132 bytes (0x84), pointed to by data directory entry 5 at RVA 0x10000. It contains four blocks, describing fixups in pages 0x7000, 0x8000, 0x9000, and 0xE000. Together they hold 47 DIR64 fixups and 3 ABSOLUTE padding entries.

The pages being patched correspond to the sections containing 64-bit absolute pointers. Block 0 (page 0x7000) holds one stray absolute reference inside the .text page: not part of a RIP-relative call or jump (those never need a base-relocation fixup), but an embedded 64-bit pointer the compiler emitted into the code region. Block 1 (page 0x8000) is in .data and contains 9 absolute pointers to other data. Block 2 (page 0x9000) is in .rdata and is the biggest block, with 33 pointers, mostly string-table entries pointing into the read-only data section itself. Block 3 (page 0xE000) is in .CRT: four pointers to C runtime initialization routines that _initterm will later walk. Most fixups live in data-like regions; for the one in the .text page, the loader temporarily promotes that page to a writable protection while it applies the fixup, then restores the original PAGE_EXECUTE_READ protection.
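The 132-byte total can be reproduced from the per-block fixup counts (1, 9, 33, and 4). The sketch below assumes the usual layout: an 8-byte header, 2 bytes per entry, and each block rounded up to a 4-byte boundary with one ABSOLUTE entry where needed. The function names are illustrative.

```c
#include <stddef.h>

/* Size in bytes of one relocation block holding `fixups` real entries:
   8-byte header, 2 bytes per entry, rounded up to a 4-byte boundary.
   The rounding is what the ABSOLUTE padding entries fill. */
size_t reloc_block_size(size_t fixups)
{
    size_t raw = 8 + 2 * fixups;
    return (raw + 3) & ~(size_t)3;
}

/* Number of ABSOLUTE padding entries that block needs (0 or 1). */
size_t reloc_block_padding(size_t fixups)
{
    return (reloc_block_size(fixups) - (8 + 2 * fixups)) / 2;
}
```

The four blocks come to 12 + 28 + 76 + 16 = 132 bytes, with one padding entry in each of the first three blocks and none in the last, which accounts for the 3 ABSOLUTE entries.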

Phase 2 — Loading the imported DLLs

With relocations applied, the loader knows every address in the image is now correct for the chosen base. The next problem is that the image references functions in other DLLs that haven't been loaded yet. hello.exe imports from KERNEL32.dll and msvcrt.dll; a real production application might import from several dozen DLLs. None of those are mapped into the process yet (except ntdll, which the kernel always pre-maps). The loader has to find each one, map it in, and recursively repeat the process for whatever that DLL imports.

The mechanical part is straightforward. The loader walks the Import Directory — a sequence of IMAGE_IMPORT_DESCRIPTOR structures pointed to by data directory entry 1 — and for each descriptor reads the DLL name string. Locating the DLL on disk is the interesting part: Windows uses a specific search order that's been carefully designed (and contested, and tweaked) over decades. Once the DLL is located, the loader maps it using the same NtCreateSection(SEC_IMAGE) / NtMapViewOfSection mechanism we walked through in Part 3, applies relocations to it (computed against its preferred base, not the EXE's), and records it in PEB_LDR_DATA. Then the loader recurses: it walks that DLL's own import table and loads its dependencies, depth-first.

For hello.exe, the recursive chain is short. KERNEL32.dll imports from ntdll.dll (already mapped) and a few other system DLLs. msvcrt.dll imports from KERNEL32.dll (which the loader is already in the process of loading, so it just resolves to the existing entry) and a couple of others. After a handful of recursive descents, every DLL the process needs is mapped in. A real Win32 application might end up with 50+ DLLs loaded by the time the recursion completes — every shell extension, every input-method DLL, every common-controls helper, every dependency-of-a-dependency.

The DLL search order

When the loader has a DLL name like VERSION.dll or WINMM.dll or some application-specific plugin DLL and no path, it consults a defined search order. The order matters because it determines which copy of the DLL gets loaded if multiple copies exist on the system, and it's a famously hijackable attack surface: place a malicious file with the requested name in a directory that gets searched before the legitimate copy, and you can intercept the program's calls into that DLL. (Names from the KnownDLLs list — KERNEL32.dll, NTDLL.dll, USER32.dll, and dozens of others — are immune to this trick because they bypass the directory search entirely; we'll see why in a moment. Search-order hijacking targets the DLLs not on that list.)

The full decision tree the loader walks has several gates before it ever consults a directory: DLL redirection rules (.local files and registry redirects), API-set name resolution (logical names like api-ms-win-core-processthreads-l1-1-0.dll that map to real DLLs via the schema), SxS manifest redirection (side-by-side assemblies declared in the application's manifest), the already-loaded-module list (if a DLL by that name is already in the PEB's loader data, the loader reuses it), the KnownDLLs section (a pre-mapped set of critical system DLLs), and on modern Windows the package dependency graph for packaged apps. Only when none of those resolve the name does the loader walk the classic directory ladder, the part most relevant to search-order hijacking. For a desktop application with Safe DLL Search Mode enabled (the default since Windows XP SP2), that ladder is: (1) the directory the application was loaded from, (2) the system directory (System32), (3) the 16-bit system directory, (4) the Windows directory, (5) the current working directory, and (6) the directories listed in the PATH environment variable. Three points are worth noting.

First, KnownDLLs is a privileged shortcut. Microsoft maintains a list of system DLLs that are pre-mapped into a special section object at boot time and shared across every process. When the loader is asked for one of these, it doesn't search anywhere — it just maps the pre-existing section. The KnownDLLs list includes kernel32.dll, user32.dll, ntdll.dll, gdi32.dll, and several dozen others. A common beginner question is "can a malicious kernel32.dll placed in the application directory hijack a real program?" The answer is no, because kernel32 is a KnownDLL — the search order never gets consulted for it. Search-order hijacking targets non-system DLLs that legitimate applications load by name.

Second, Safe DLL Search Mode moves the current directory from position 2 to position 5. The unsafe ordering — where the current working directory is searched right after the application directory — was the default behavior on early Windows versions. When applications could be launched from a working directory the user controlled (downloads folder, USB drive, network share), an attacker who dropped a same-named DLL there could intercept calls before any of the system directories were checked. Safe DLL Search Mode, default since Windows XP SP2, pushes the CWD down the list. The registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SafeDllSearchMode controls this; it remains enabled by default on current Windows.
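To make the reordering concrete, here is a small model of the classic directory ladder in both modes. It models only the ordering of the final directory search for a desktop application; the enum and function names are invented for this sketch, and none of the earlier gates (KnownDLLs, manifests, API sets) are represented.

```c
#include <stddef.h>

/* Symbolic names for the steps of the classic directory ladder. */
typedef enum {
    DIR_APPLICATION,   /* directory the EXE was loaded from */
    DIR_SYSTEM32,      /* system directory */
    DIR_SYSTEM16,      /* 16-bit system directory */
    DIR_WINDOWS,       /* Windows directory */
    DIR_CURRENT,       /* current working directory */
    DIR_PATH           /* directories listed in PATH */
} search_dir;

#define LADDER_LEN 6

/* Fill `out` with the ladder for the given SafeDllSearchMode setting. */
void dll_search_ladder(int safe_mode, search_dir out[LADDER_LEN])
{
    static const search_dir safe[LADDER_LEN] = {
        DIR_APPLICATION, DIR_SYSTEM32, DIR_SYSTEM16,
        DIR_WINDOWS, DIR_CURRENT, DIR_PATH
    };
    static const search_dir unsafe[LADDER_LEN] = {
        DIR_APPLICATION, DIR_CURRENT, DIR_SYSTEM32,
        DIR_SYSTEM16, DIR_WINDOWS, DIR_PATH
    };
    const search_dir *src = safe_mode ? safe : unsafe;
    for (size_t i = 0; i < LADDER_LEN; i++)
        out[i] = src[i];
}

/* 1-based position of `dir` in a ladder, or 0 if absent. */
size_t ladder_position(const search_dir ladder[LADDER_LEN], search_dir dir)
{
    for (size_t i = 0; i < LADDER_LEN; i++)
        if (ladder[i] == dir)
            return i + 1;
    return 0;
}
```

In this model the current directory sits at position 2 with safe mode off and position 5 with it on, and the application directory is first either way, which is why a planted DLL next to the executable remains a hijacking vector in both modes.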

Third, modern Windows offers a more restrictive opt-in API. SetDefaultDllDirectories lets a process declare "search only System32 and directories I explicitly add via AddDllDirectory." Used together, these functions remove the application directory, current directory, and PATH from the search entirely, eliminating most search-order hijacking surface at the cost of breaking applications that depend on DLLs shipped alongside the executable. The recommendation in Microsoft's security guidance is to call SetDefaultDllDirectories(LOAD_LIBRARY_SEARCH_SYSTEM32) at process startup if you're writing a hardened application.

Packaged apps (UWP, MSIX) bypass all of this. They use a different search order that consults the package dependency graph declared in the application manifest, then System32. There's no PATH search and no current-directory search; the packaging system handles dependency resolution at install time.

Phase 3 — The import/export handshake

Once the loader has located and mapped every dependency DLL, the next job is to connect the executable's function references to the actual function addresses in those DLLs. This is the most analyst-relevant mechanism in the entire loader, and it's where the format's design choices have the biggest practical consequences: every statically-imported cross-DLL function call your program makes — the calls the compiler emitted, recorded in the import table — routes through this machinery, and anyone who can write to the machinery controls those calls. (Calls resolved at runtime by other means — GetProcAddress, delay-loaded imports, COM vtables, direct syscalls — sidestep this path; we'll touch on the shellcode-style version of that bypass later in this part.)

The mechanism is a three-party contract. The executable says what it needs through its Import Table. Each DLL advertises what it offers through its Export Table. The loader sits between them, walks both, and fills in the executable's Import Address Table with the resolved function pointers. After the loader is done, the IAT is a flat array of function addresses, and every CALL [IAT+offset] instruction in the program's code reads the right address.

The import side: what the EXE needs

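The Import Directory (data directory entry 1) is an array of IMAGE_IMPORT_DESCRIPTOR structures, one per imported DLL, terminated by a zeroed descriptor. Each descriptor has five 32-bit fields: OriginalFirstThunk (the RVA of the Import Lookup Table), TimeDateStamp, ForwarderChain, Name (the RVA of the DLL's name string), and FirstThunk (the RVA of the Import Address Table).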

For each imported DLL there are two parallel arrays of 64-bit thunks (on PE32+): the Import Lookup Table at OriginalFirstThunk and the Import Address Table at FirstThunk. On disk, both arrays hold identical contents: each entry identifies a function by name or by ordinal. After the loader runs, the ILT still holds those name/ordinal references — preserved forever as a record of what the binary was asking for — while the IAT has been overwritten with real function pointers.

Each thunk is a 64-bit value, and the topmost bit determines how to interpret the rest. If the top bit is set (0x8000000000000000), the low 16 bits are an ordinal number — the function is being imported by its position in the DLL's export table. If the top bit is clear, the thunk value is an RVA pointing to an IMAGE_IMPORT_BY_NAME structure, which is just a 2-byte hint followed by a null-terminated function name. Most imports use the by-name variant; ordinals appear mostly for stable APIs in system DLLs.
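The thunk encoding reduces to one bit test and two masks. A minimal sketch for PE32+ follows; the helper names are made up for illustration, and the sample RVA in the usage note is hypothetical rather than taken from hello.exe.

```c
#include <stdint.h>

#define ORDINAL_FLAG64 0x8000000000000000ULL

/* Nonzero if a PE32+ thunk imports by ordinal (top bit set). */
int thunk_is_ordinal(uint64_t thunk)
{
    return (thunk & ORDINAL_FLAG64) != 0;
}

/* Ordinal carried in the low 16 bits of a by-ordinal thunk. */
uint16_t thunk_ordinal(uint64_t thunk)
{
    return (uint16_t)(thunk & 0xFFFFu);
}

/* RVA of the hint/name structure carried by a by-name thunk
   (the low 31 bits). */
uint32_t thunk_name_rva(uint64_t thunk)
{
    return (uint32_t)(thunk & 0x7FFFFFFFu);
}
```

A thunk of 0x8000000000000010 imports ordinal 16, while a thunk of 0xD3C0 points at a hint/name structure at RVA 0xD3C0.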

For hello.exe, the Import Directory has two non-zero descriptors. The first names KERNEL32.dll; its ILT at RVA 0xD040 and IAT at RVA 0xD1C8 each hold 13 thunks (plus a terminating zero), pointing at name structures for DeleteCriticalSection, EnterCriticalSection, GetLastError, InitializeCriticalSection, Sleep, TlsGetValue, VirtualProtect, and a handful of others. The second names msvcrt.dll; its ILT at RVA 0xD0B0 and IAT at RVA 0xD238 hold 34 thunks for functions like fprintf, malloc, exit, __getmainargs — the C-runtime functions our program uses, directly or indirectly. All are imported by name; none use ordinals. The IAT data directory at RVA 0xD1C8 has Size 0x188 (392 bytes), accounting for both DLLs' 47 patched entries plus the two terminating zeros — 49 64-bit slots in total.
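The sizes and RVAs above are consistent with each other, which is easy to confirm with a few lines of arithmetic. The helper below is purely illustrative: it counts one slot per import plus one terminating zero thunk per DLL.

```c
#include <stddef.h>

/* Bytes occupied by thunk arrays: each DLL contributes its import count
   plus one terminating zero thunk, at `thunk_size` bytes per slot. */
size_t thunk_table_bytes(const size_t *imports_per_dll, size_t dll_count,
                         size_t thunk_size)
{
    size_t slots = 0;
    for (size_t i = 0; i < dll_count; i++)
        slots += imports_per_dll[i] + 1;   /* +1 for the terminator */
    return slots * thunk_size;
}
```

For 13 and 34 imports at 8 bytes per thunk this gives 49 slots and 392 bytes (0x188). The same arithmetic explains the layout: the KERNEL32 ILT at 0xD040 spans 14 slots and ends at 0xD0B0, where the msvcrt ILT begins, and that one ends at 0xD1C8, where the IAT begins.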

The export side: what a DLL offers

The other half of the handshake is the DLL's IMAGE_EXPORT_DIRECTORY, pointed to by data directory entry 0 in the DLL's own headers. It advertises which symbols this DLL makes available and where each one lives. The structure has several fields, but the three that matter for the lookup are the RVAs of three arrays: AddressOfFunctions, AddressOfNames, and AddressOfNameOrdinals.

The design looks convoluted on first read, but each array has a job. AddressOfFunctions is the canonical address book: the function at EAT index N lives at DLL_base + AddressOfFunctions[N]. AddressOfNames is an alphabetically-sorted directory of names; the sort order lets the loader use binary search instead of a linear scan, which matters in DLLs with thousands of exports (ntdll.dll has over 2,000). AddressOfNameOrdinals is the bridge between the two: AddressOfNameOrdinals[i] tells you which EAT index corresponds to AddressOfNames[i].

Why the indirection? Because every export has an ordinal, but a name is optional: a DLL can export a function by both an ordinal and a name, by an ordinal alone, or under several names that all map to the same entry. The three-array scheme handles all of these cases uniformly. The flat AddressOfFunctions table is the source of truth, indexed by EAT index; the name and name-ordinal arrays are added on top to support name-based lookups.

To resolve "Sleep" in kernel32.dll, the loader binary-searches AddressOfNames for the string, finds it at some index i, reads AddressOfNameOrdinals[i] to get the Export Address Table index — say, 1407 — then reads AddressOfFunctions[1407] to get the function's RVA, and finally computes DLL_base + RVA as the function's actual virtual address. That address is what gets written into the EXE's IAT slot for Sleep. A small precision note: the value in AddressOfNameOrdinals is the EAT index, not the publicly-exposed ordinal number. The PE spec calls this the "unbiased" ordinal. When code imports by public ordinal directly (instead of by name), the loader subtracts the DLL's OrdinalBase from the requested ordinal first to get the same EAT index. For most DLLs OrdinalBase is 1, so the two values look identical in practice — but the distinction matters when reading the spec or analysing unusual binaries.
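The two lookup paths can be sketched over an already-parsed view of the export directory. The export_view struct is a simplification invented for this example (a real image reaches the arrays through RVAs), and the sketch deliberately ignores forwarded exports.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, already-parsed view of an export directory. */
typedef struct {
    uint32_t ordinal_base;          /* OrdinalBase */
    size_t function_count;          /* NumberOfFunctions */
    size_t name_count;              /* NumberOfNames */
    const uint32_t *functions;      /* AddressOfFunctions: RVA per EAT index */
    const char *const *names;       /* AddressOfNames: sorted ascending */
    const uint16_t *name_ordinals;  /* AddressOfNameOrdinals: EAT index per name */
} export_view;

/* Resolve by name: binary search, then name -> EAT index -> RVA.
   Returns 0 if the name is not exported. */
uint64_t resolve_by_name(uint64_t dll_base, const export_view *ev,
                         const char *name)
{
    size_t lo = 0, hi = ev->name_count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, ev->names[mid]);
        if (cmp == 0) {
            uint16_t index = ev->name_ordinals[mid];
            if (index >= ev->function_count)
                return 0;
            return dll_base + ev->functions[index];
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}

/* Resolve by public ordinal: subtract OrdinalBase to get the EAT index. */
uint64_t resolve_by_ordinal(uint64_t dll_base, const export_view *ev,
                            uint32_t ordinal)
{
    if (ordinal < ev->ordinal_base)
        return 0;
    uint32_t index = ordinal - ev->ordinal_base;
    if (index >= ev->function_count)
        return 0;
    return dll_base + ev->functions[index];
}
```

Both functions end at the same place, dll_base plus an entry of AddressOfFunctions; they differ only in how they arrive at the index.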

Putting the two sides together

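The full loader algorithm for our binary's KERNEL32 imports looks like this. For each thunk in the ILT at RVA 0xD040: if the top bit is set, take the low 16 bits as a public ordinal and subtract OrdinalBase to get the EAT index; otherwise follow the RVA to the IMAGE_IMPORT_BY_NAME structure, binary-search AddressOfNames for the name, and read the EAT index from the matching AddressOfNameOrdinals slot. Then read the function's RVA from AddressOfFunctions at that index, add the DLL's base address, and write the result into the IAT slot at the same position as the thunk.

In compact form, a by-name lookup is DLL_base + AddressOfFunctions[AddressOfNameOrdinals[i]], where i is the index of the name in AddressOfNames, and a by-ordinal lookup is DLL_base + AddressOfFunctions[ordinal - OrdinalBase].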

Repeat for the 12 other KERNEL32 thunks. Then move on to the next import descriptor (msvcrt.dll) and repeat for all 34 of its thunks. When the loop terminates, every IAT slot in hello.exe contains a real 64-bit function pointer, and the program's CALL [rip+disp] instructions will dereference those pointers to actual code in the loaded DLLs.
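The per-descriptor binding loop can be sketched as a walk over a zero-terminated ILT that fills the parallel IAT. The resolver is passed in as a callback so the example stays self-contained; resolve_fn and demo_resolve are hypothetical stand-ins for the export lookup, not loader APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical resolver callback: returns the address for a by-name
   import (name_rva != 0) or a by-ordinal import, or 0 on failure. */
typedef uint64_t (*resolve_fn)(void *ctx, uint32_t name_rva, uint16_t ordinal);

/* Walk a zero-terminated ILT and fill the parallel IAT.
   Returns the number of slots bound, or (size_t)-1 if a lookup fails. */
size_t bind_imports(const uint64_t *ilt, uint64_t *iat,
                    resolve_fn resolve, void *ctx)
{
    size_t n = 0;
    for (; ilt[n] != 0; n++) {
        uint64_t thunk = ilt[n];
        uint64_t addr;
        if (thunk >> 63)                      /* by ordinal */
            addr = resolve(ctx, 0, (uint16_t)(thunk & 0xFFFFu));
        else                                  /* by name */
            addr = resolve(ctx, (uint32_t)(thunk & 0x7FFFFFFFu), 0);
        if (addr == 0)
            return (size_t)-1;
        iat[n] = addr;                        /* the loader's IAT write */
    }
    return n;
}

/* Toy resolver for demonstration only: pretends every import lives at
   base + 0x1000 + its name RVA or ordinal. ctx points at the base. */
uint64_t demo_resolve(void *ctx, uint32_t name_rva, uint16_t ordinal)
{
    uint64_t base = *(const uint64_t *)ctx;
    return base + 0x1000 + (name_rva ? name_rva : ordinal);
}
```

The ILT is never written, which is why it still records the original name and ordinal references after the IAT has been overwritten.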

One last mechanical detail. The IAT pages were mapped PAGE_WRITECOPY when the kernel loaded the image (this was Part 3's story — writable image pages start out shared and become private on first write). The loader's IAT patching is exactly those first writes. They trigger copy-on-write faults that give this process its own private copies of the IAT pages. By the time the loader finishes, those pages have been promoted to PAGE_READWRITE and the IAT contents are private to this process. Some hardened loaders and runtimes take an extra step here: after patching, they re-protect the IAT pages back to PAGE_READONLY, which prevents later in-process IAT writes from succeeding without an explicit VirtualProtect call. Where this happens depends on configuration; modern process-mitigation features can opt into related hardening, and the IAT data directory entry's existence (separate from any individual import descriptor's FirstThunk) is part of what lets the loader find and re-protect the IAT region as a unit.

IAT hooking and the shellcode workaround

The IAT design has a property worth dwelling on. Every statically-imported cross-DLL function call in the executable — every call the compiler emitted from a header-declared prototype — goes through exactly one specific memory location, the IAT slot for that function, and the CALL instruction reads that location at runtime, every single call. The compiler doesn't bake function addresses into the call sites; it bakes the IAT slot's address into the call sites and lets the loader supply the function address through the slot. From the program's perspective this is just an extra layer of indirection. From an attacker's or defender's perspective it's a hook point — overwrite the slot, and every IAT-routed call to that function now goes through your code instead. (Calls that bypass the IAT — dynamic GetProcAddress resolution, delay-loaded imports, direct syscalls — bypass the hook too, which is exactly why modern adversaries reach for those techniques and why modern defenders watch elsewhere.)

The IAT hook

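Implementing an IAT hook is mechanically simple. The attacker (or defender; the technique is symmetric) locates the IAT slot for the target function by walking the module's import descriptors, makes the containing page writable with VirtualProtect if it isn't already, saves the original pointer from the slot, writes the address of the hook function in its place, and restores the page's original protection.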

From that point on, every call to that imported function dispatches to the hook function. The hook does whatever it wants — log the call, modify parameters, decline the call entirely, return a fake value — and optionally chains to the original function using the saved pointer.
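The pointer swap at the heart of the technique can be demonstrated without any Windows APIs. The sketch below uses a single function-pointer variable as a stand-in for an IAT slot; every name in it is invented for the example, and a real hook would also have to locate the slot and deal with page protection.

```c
#include <stddef.h>

/* Signature of the imported function in this toy example. */
typedef int (*api_fn)(int);

/* Stand-in for the real DLL export. */
int real_api(int x) { return x + 1; }

/* One-slot stand-in for an Import Address Table. */
api_fn iat_slot = real_api;

/* Pointer saved at hook time so the hook can chain to the original. */
static api_fn saved_original = NULL;

/* Counter the hook updates, standing in for logging. */
int call_count = 0;

/* Hook: observes the call, then chains to the original function. */
int hook_api(int x)
{
    call_count++;
    return saved_original(x);
}

/* Install: save the old pointer, then overwrite the slot. */
void install_hook(api_fn *slot, api_fn hook)
{
    saved_original = *slot;
    *slot = hook;
}

/* A call site: it always reads the slot, never a baked-in address. */
int call_through_iat(int x)
{
    return iat_slot(x);
}
```

Before install_hook runs, call_through_iat(41) reaches real_api directly; afterwards the same call site returns the same value but passes through hook_api first, and nothing about the call site itself has changed.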

The technique is invisible to code that doesn't go looking for it. The call site doesn't change; only the bytes in a specific memory slot do, and inspecting whether an IAT entry points "outside" its expected DLL is something most programs never bother to do. Both security tools and malware have leaned on IAT hooks for this reason: any API a target program reaches through the IAT — CreateFile, VirtualAllocEx, NtCreateThread on the defensive side, or FindFirstFileW, RegQueryValueExW, WSAConnect on the offensive side — can be intercepted with a single pointer overwrite.

One important historical correction is worth making here. Older AV products, classic rootkits, and pre-2015-era endpoint agents did use user-mode IAT hooks as a primary monitoring mechanism. Modern EDR products generally do not rely on IAT hooks as their primary telemetry source, and the reason is exactly what the next section walks through: any malware that resolves APIs dynamically via GetProcAddress or by walking the PEB to find a DLL's exports bypasses the IAT entirely and never reads the hooked slots. Modern EDR telemetry more commonly combines kernel-side callbacks (ETW Threat Intelligence, ObRegisterCallbacks, PsSetCreateProcessNotifyRoutine, image-load notify routines), ETW-based event subscriptions, user-mode instrumentation, memory scanning, and inline hooks placed directly into the first few bytes of sensitive ntdll.dll functions, which catch calls regardless of how the caller resolved the address. If you read a security blog post that says "the EDR hooked our import table," it is almost certainly describing one of those other mechanisms rather than user-mode IAT patching.

IAT hooking still appears in the wild — in malware that wants to subvert a single target program's behavior, in game-cheat engines, in older rootkits, and in some application-shimming and instrumentation tools. It remains the cleanest worked example of "the same indirection the loader uses to do its job is the indirection an attacker uses to redirect calls," which is why it's worth understanding mechanically even when it isn't the technique of choice anymore.

There are countermeasures. Hardened loaders re-protect the IAT to PAGE_READONLY after patching, making any hook attempt require an explicit VirtualProtect call (and turning hook installation into a noisier action). Process-integrity scans look for IAT entries pointing outside their declared module. Modern Windows offers Control Flow Guard (CFG) and Code Integrity Guard (CIG), which raise the bar in different ways. But the fundamental architectural fact remains: any call that does route through the IAT is one indirect dereference away from being redirected, and that is what makes the IAT a perennial hooking surface on Windows, even as the primary defensive use of the technique has moved elsewhere.

What shellcode does instead

Shellcode — code injected into a running process via buffer overflow, process injection, reflective DLL injection, or any other technique — has the opposite problem from a normal executable. It has no PE headers of its own, no IAT, no import descriptors. It's just a raw blob of instructions sitting somewhere in the target process's memory. When it needs to call a Win32 API, there's no IAT slot to dereference: it has to find the function address some other way.

The canonical technique is the PEB walk. Every Windows process has its ntdll, its kernel32, and its PEB_LDR_DATA linked lists; the lists name every loaded module and record where it lives. The shellcode walks the lists, finds kernel32, parses its export directory, locates GetProcAddress, and from there resolves everything else.
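The list traversal at the core of that walk can be modeled with mock structures. The sketch below is not the real ntdll layout: the struct fields are simplified stand-ins, names are plain C strings rather than UNICODE_STRING, and the comparison is case-sensitive, whereas real module names are compared case-insensitively.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-ins for LIST_ENTRY and a loader module record. */
typedef struct list_entry {
    struct list_entry *flink, *blink;
} list_entry;

typedef struct {
    list_entry load_order_links;
    list_entry memory_order_links;   /* the list this walk follows */
    uint64_t dll_base;
    const char *base_name;
} module_record;

/* Append a node to a circular list headed by `head`. */
void list_append(list_entry *head, list_entry *node)
{
    node->flink = head;
    node->blink = head->blink;
    head->blink->flink = node;
    head->blink = node;
}

/* Walk the memory-order list and return the base of the named module,
   or 0 if it is not loaded. The list is circular: stop at the head. */
uint64_t find_module_base(list_entry *head, const char *name)
{
    for (list_entry *e = head->flink; e != head; e = e->flink) {
        /* Recover the record from the embedded link, in the style of
           the CONTAINING_RECORD macro. */
        module_record *m = (module_record *)
            ((char *)e - offsetof(module_record, memory_order_links));
        if (strcmp(m->base_name, name) == 0)
            return m->dll_base;
    }
    return 0;
}
```

The pointer subtraction matters because the links are embedded partway into each record, so a list pointer is not a pointer to the start of the record. Getting that offset wrong is a classic bug in hand-written walkers.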

Three facts make this work. First, on x86-64 the gs segment base always points to the current thread's TEB; the TEB has the PEB pointer at offset 0x60, so a single read of gs:[0x60] yields it; the PEB has the loader data pointer at offset 0x18. These offsets have been stable across Windows versions for many years. Second, in ordinary Win32 processes kernel32.dll is almost always already loaded by the time any user code runs: it's a KnownDLL, so the directory search never gets to it, and most processes import from it directly or indirectly. (Robust shellcode still walks the PEB to confirm, and shouldn't assume kernel32 is present in unusual process types; ntdll.dll is the only DLL the kernel maps into every user-mode process unconditionally, so it's the more reliable starting point for the lowest-level resolution.) Third, PE headers are mapped at the module's base address (we saw this in Part 3: the headers region is the first page of the mapped image, read-only and accessible at runtime). So once shellcode has the base address, it can navigate to the export directory using the same field-arithmetic any PE parser uses.

IAT-based calls and shellcode-style resolution are the same mechanism running at different times. Imports are early binding: the loader resolves every function once at startup, writes the addresses into the IAT, and the program reads them cheaply at runtime. Shellcode is late binding: each call resolves its target on demand by walking the loader's data structures itself. The export table doesn't care which one is reading it; both follow the same export-resolution path, and in the non-forwarded case both end up at DLL_base + function_RVA.

Phase 4 — TLS callbacks and DllMain

Two things happen between import resolution and the executable's entry point, both of which give code a chance to run before main(): TLS callbacks, and DllMain dispatch for every loaded DLL. Both are legitimate features. Both have been abused for anti-analysis purposes often enough that any malware analyst learns to look for them.

TLS callbacks

The TLS Directory (data directory entry 9, at RVA 0x9040 in this binary, which falls inside .rdata) is the PE format's mechanism for thread-local storage initialization. Its core job is mundane: it describes per-thread variables that the runtime should allocate and initialize from a template when a new thread starts; that template data is what lives in the .tls section (just 16 bytes in our binary). The interesting part of the directory is one of its fields, AddressOfCallBacks, which points at a null-terminated array of function pointers. Those functions, the TLS callbacks, get invoked by the loader at specific lifecycle events: process attach, process detach, thread attach, thread detach.

The lifecycle event we care about is process attach. When the loader is preparing the image to run, it walks the TLS callback arrays for the EXE and every loaded DLL, in load order, and invokes each callback with the DLL_PROCESS_ATTACH reason. This happens before the loader transfers control to AddressOfEntryPoint. The callback receives a module handle, an integer reason code, and a reserved parameter — and from inside the callback, the program is fully mapped, imports are fully resolved, and arbitrary code can run.
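The dispatch over a null-terminated callback array is simple to sketch. The callback shape below mirrors the three parameters described above, with plain C types standing in for the Windows ones; run_tls_callbacks and the demo callbacks are invented for this example and say nothing about how ntdll structures its own loop.

```c
#include <stddef.h>

#define DLL_PROCESS_ATTACH 1u   /* reason code for process attach */

/* Shape of a TLS callback: module handle, reason, reserved. */
typedef void (*tls_callback)(void *module, unsigned reason, void *reserved);

/* Invoke every callback in a null-terminated array, in order.
   Returns how many ran. A NULL array means the image has none. */
size_t run_tls_callbacks(const tls_callback *callbacks,
                         void *module, unsigned reason)
{
    size_t n = 0;
    if (callbacks == NULL)
        return 0;
    for (; callbacks[n] != NULL; n++)
        callbacks[n](module, reason, NULL);
    return n;
}

/* Demo callbacks: record the order in which they were invoked. */
int tls_trace[4];
int tls_trace_len = 0;

void demo_cb_a(void *module, unsigned reason, void *reserved)
{
    (void)module; (void)reserved;
    if (reason == DLL_PROCESS_ATTACH)
        tls_trace[tls_trace_len++] = 1;
}

void demo_cb_b(void *module, unsigned reason, void *reserved)
{
    (void)module; (void)reserved;
    if (reason == DLL_PROCESS_ATTACH)
        tls_trace[tls_trace_len++] = 2;
}
```

Each callback sees the same reason code and runs to completion before the next one starts; nothing in the array says which callback is "the" initializer, so an analyst has to inspect every entry.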

One constraint matters here. TLS callbacks execute while the loader still holds the loader lock — the same global serialization lock that protects the PEB's loader data structures during module load and unload. The same restrictions that apply to DllMain apply here: a TLS callback should not call LoadLibrary, should not spawn threads it then waits on, should not call into other DLLs that themselves might take the loader lock, and should not perform synchronization that could lead another loader-lock holder to wait on it. Doing any of those things risks deadlock or undefined loader state. In practice the rule reduces to: do simple, self-contained, fast initialization, and defer anything ambitious to after main starts. Malware authors don't always follow this rule, which is part of why TLS-callback-based unpackers occasionally hang or crash on systems with different loader timings.

The fact that TLS callbacks run before the entry point is what makes them a famous anti-debugging technique. A debugger that launches a program normally sets an initial breakpoint at AddressOfEntryPoint and pauses there. A TLS callback runs before the breakpoint hits. If a binary uses a TLS callback to check for debugger presence (via IsDebuggerPresent, the BeingDebugged field of the PEB, NtQueryInformationProcess, or any of the dozens of other techniques), the check happens before the analyst has a chance to inspect anything. The callback can exit the process, jump to garbage, decrypt the actual payload, or rewrite the entry point. By the time the analyst's debugger pauses, the malware has already had its say.

This technique isn't new. The Ursnif/Gozi-ISFB malware family used TLS callbacks for anti-analysis and process injection; the GRUM botnet used them to execute its unpacking code; many packers have leaned on them over the years. The pattern is old enough that modern debuggers (x64dbg, IDA Pro, WinDbg) all have explicit "break on TLS callback" options that pause execution at the first TLS callback rather than at AddressOfEntryPoint. If you're analyzing an unknown PE, that option is one of the first things to enable.

For our hello.exe, the TLS Directory is populated and AddressOfCallBacks points at a small array of two callback functions emitted by the MinGW C runtime (RVAs 0x1600 and 0x15D0, both inside .text). These are CRT housekeeping — they initialize per-thread runtime state for any threads the program later creates — and don't do anything anti-analysis-like. But they do run before AddressOfEntryPoint. A debugger configured to break at the entry point of our binary would miss them entirely. This is a useful reminder that "TLS callbacks exist in this binary" is not by itself a malware signal; many benign C runtimes emit them. The signal is what the callbacks do, which requires actually disassembling them.

DllMain and the loader lock

After TLS callbacks fire, the loader makes one more pass: it calls each loaded DLL's DllMain function with the DLL_PROCESS_ATTACH reason code. This is the DLL's chance to do per-process initialization — allocate internal buffers, register with subsystems, initialize singletons, set up thread-local storage that wasn't covered by the TLS template. The DllMain calls happen in dependency order: if foo.dll imports from bar.dll, the loader calls bar.dll!DllMain first, so by the time foo.dll!DllMain runs it can rely on bar.dll being fully initialized.
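The dependency-order rule is a post-order walk of the import graph. The sketch below illustrates the ordering described above and is not the loader's actual algorithm; the module names and import lists are hypothetical.

```python
def init_order(imports, root):
    """Post-order DFS: every module appears after the modules it imports."""
    order, visited = [], set()

    def visit(module):
        if module in visited:
            return  # already scheduled; this also stops on import cycles
        visited.add(module)
        for dependency in imports.get(module, []):
            visit(dependency)
        order.append(module)

    visit(root)
    return order


# Hypothetical import graph: app.exe imports both DLLs, foo.dll imports bar.dll.
imports = {
    "app.exe": ["foo.dll", "bar.dll"],
    "foo.dll": ["bar.dll"],
    "bar.dll": [],
}

order = init_order(imports, "app.exe")
dll_main_calls = [module for module in order if module.endswith(".dll")]
print(dll_main_calls)  # ['bar.dll', 'foo.dll']
```

bar.dll comes first because foo.dll depends on it, which matches the guarantee that foo.dll!DllMain can rely on bar.dll being initialized.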

One feature of DllMain dispatch is worth knowing about: the loader lock. This is a process-wide critical section that the loader holds for the entire duration of any loader operation, including DllMain dispatch. It exists to serialize loader work — if two threads both tried to load DLLs concurrently, the bookkeeping in PEB_LDR_DATA would corrupt. The lock guarantees that loader operations are mutually exclusive.

The consequence is that DllMain runs under the loader lock, so code inside DllMain has to avoid anything that re-enters the loader or waits on work that needs the lock. Most notably, it shouldn't call LoadLibrary: the loader is in the middle of its own bookkeeping, and Microsoft's documentation warns that loading a DLL from inside DllMain can cause a deadlock or a crash. It also shouldn't wait on threads created from within DllMain, since those threads' own DllMain invocations would also need the loader lock. Microsoft's documentation calls out a long list of "things you must not do in DllMain"; in practice, most DLLs put a minimal amount of code there and defer real initialization to a function the program calls explicitly after the loader is done.
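The "don't wait on a thread you created" rule is easy to model. In the sketch below a Python lock stands in for the loader lock and a worker thread stands in for a new thread's attach notification; the timeouts exist only so the model terminates, where the real wait would never return. All names are invented for the example.

```python
import threading

loader_lock = threading.RLock()  # stand-in for the process-wide loader lock
events = []


def thread_attach_notification():
    # A new thread must take the loader lock to deliver its attach calls.
    # The timeout is a modelling device; the real acquisition is unbounded.
    if loader_lock.acquire(timeout=0.2):
        events.append("attach ran")
        loader_lock.release()
    else:
        events.append("attach blocked")


def start_worker_and_wait():
    worker = threading.Thread(target=thread_attach_notification)
    worker.start()
    worker.join(timeout=1.0)


# Case 1: waiting inside DllMain, while the loader still holds the lock.
with loader_lock:
    start_worker_and_wait()

# Case 2: the same work deferred until after the loader released the lock.
start_worker_and_wait()

print(events)  # ['attach blocked', 'attach ran']
```

The second case is the pattern most DLLs follow: keep DllMain minimal and do thread-creating initialization from a function the program calls later.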

From an analyst's perspective, DllMain is a second pre-main execution window, after TLS callbacks. Malware that drops or sideloads a DLL and induces a target program to load it gets to run code inside that DLL's DllMain before the target's own logic resumes, which is why DLL hijacking attacks are often paired with malicious DllMain initialization. The principle is the same as for TLS callbacks: code that runs during loader dispatch executes before a debugger's default entry-point breakpoint is hit.

Phase 5 — Entry point and CRT startup

With imports resolved, TLS callbacks run, and DllMain dispatched for every loaded DLL, the loader has done its job. The last step is the simplest mechanically and the most consequential conceptually: jump to the executable's entry point.

The entry point address is computed as actual_base + AddressOfEntryPoint, where AddressOfEntryPoint is the RVA we saw back in Part 2 sitting in the Optional Header. For our binary it's 0x1410, which we converted to file offset 0x810 back in Part 2 and found held the bytes 55 48 89 E5 48 83 EC 20 (push rbp / mov rbp,rsp / sub rsp,0x20). The loader sets RIP to that virtual address and returns; the next instruction the CPU fetches is the first instruction of our program.
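The two conversions in that paragraph are each a single line of arithmetic. The sketch below uses the example ASLR base from the start of this post and an assumed .text section header (RVA 0x1000, raw data at file offset 0x400, size 0x6000); those section values are not quoted from the binary, only chosen to be consistent with 0x1410 mapping to 0x810.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    virtual_address: int      # RVA where the section starts in memory
    virtual_size: int
    pointer_to_raw_data: int  # file offset where the section's bytes start


def rva_to_file_offset(rva, sections):
    """Find the section containing the RVA and rebase it onto the file."""
    for s in sections:
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            return rva - s.virtual_address + s.pointer_to_raw_data
    raise ValueError(f"RVA {rva:#x} is not inside any section")


# Assumed .text header (illustrative values, see the note above).
sections = [Section(".text", 0x1000, 0x6000, 0x400)]

ADDRESS_OF_ENTRY_POINT = 0x1410
actual_base = 0x7FF6A8B00000  # example ASLR base from the introduction

entry_va = actual_base + ADDRESS_OF_ENTRY_POINT
entry_offset = rva_to_file_offset(ADDRESS_OF_ENTRY_POINT, sections)
print(hex(entry_va), hex(entry_offset))  # 0x7ff6a8b01410 0x810
```

The virtual address changes on every launch because the base does; the file offset never changes, which is why static tools and live debuggers show different numbers for the same instruction.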

Except — that first instruction isn't main. Almost no compiled C or C++ program has main at its entry point. The linker fills in AddressOfEntryPoint with the address of a CRT startup function (the exact name depends on the toolchain and subsystem: mainCRTStartup for console MSVC programs, WinMainCRTStartup for GUI ones, wmainCRTStartup and wWinMainCRTStartup for the wide-char variants; MinGW uses its own variants of the same idea). That function is the C runtime's bootstrap. It runs before main and does work that main assumes is already done.

C++ static-constructor invocation is the third pre-main execution window, after TLS callbacks and DllMain. Anything that constructs a global C++ object (an std::string at namespace scope, a logging singleton, a registration helper that registers itself with some other subsystem) runs during the CRT's _initterm pass, before main. Malware can use the same mechanism (an unusual but not unheard-of pattern) to hide its activation in the constructor of a global object that is never explicitly referenced anywhere.
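The idea behind _initterm is small enough to model in a few lines: walk a table of function pointers and call every non-null entry. This is a Python sketch of that idea, not the CRT's source; the two "constructors" are hypothetical stand-ins for initializers a compiler would emit for global objects.

```python
log = []


def initterm(table):
    """Call every non-null entry in an initializer table, in order."""
    for initializer in table:
        if initializer is not None:
            initializer()


# Hypothetical initializers for two global objects.
def construct_logger():
    log.append("ctor: logger")


def construct_registry():
    log.append("ctor: registry")


def main():
    log.append("main")
    return 0


def crt_startup():
    # Null entries model the empty slots that bracket the table.
    initterm([None, construct_logger, construct_registry, None])
    return main()


exit_code = crt_startup()
print(log)  # ['ctor: logger', 'ctor: registry', 'main']
```

Nothing in main refers to either constructor, yet both have run by the time main starts, which is exactly what makes this window easy to overlook.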

For analysts: a breakpoint on main won't catch any of this. A breakpoint on _initterm or on the entry-point function itself (whatever AddressOfEntryPoint resolves to) will. x64dbg can be configured to break on the "system breakpoint", a stop earlier than the entry point, inside ntdll's loader initialization, before any user code has run, which lets you inspect the process before any of the pre-main windows open. Knowing which breakpoint to set is part of doing serious analysis.

Once main is called, control passes to the program's own logic. The loader's work is complete. From this point forward the loader stays in the process, dormant, only reactivating if the program calls LoadLibrary or FreeLibrary or some other operation that requires module bookkeeping. The PEB and its linked lists remain populated, queryable, and walkable — by the program itself, by injected code, by analysis tools.

The complete timeline

We've now walked through every phase of the process between "user double-clicks hello.exe" and "main() runs." Here's the full sequence in one place, with the work split across kernel, ntdll loader, and program phases.

Try it yourself, on Windows

Inspect a binary's imports, exports, and relocations

Several of the structures we've walked through are easy to inspect on a real binary. The standard tools are all already on a Windows development machine.

dumpbin /imports hello.exe dumps the Import Directory. You'll see one block per imported DLL, with each block listing the function names, ordinals, and (after binding) the resolved addresses. Run this against our hello.exe and you'll get exactly the KERNEL32.dll and msvcrt.dll blocks we walked through above.

dumpbin /exports kernel32.dll (or any DLL) dumps the Export Directory. You'll see the ordinal-to-name mapping in tabular form. Try it against ntdll.dll to see the 2,000+ exports it advertises.

dumpbin /relocations hello.exe dumps every entry in the Base Relocation Table. For our binary you'll see the same 4 blocks and 47 fixups we counted above, broken out by virtual address.
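What dumpbin prints here is a decoding of a simple on-disk layout: each block is an 8-byte header (page RVA, block size) followed by 16-bit entries whose top 4 bits are the fixup type and whose low 12 bits are the offset within the page. The sketch below decodes one synthetic block and applies it; the page RVA, offsets, and pointer values are made up, while the two base addresses are the ones used in this post.

```python
import struct

IMAGE_REL_BASED_ABSOLUTE = 0  # padding entry, no fixup
IMAGE_REL_BASED_DIR64 = 10    # patch a full 64-bit pointer


def parse_reloc_block(data, offset=0):
    """Decode one base-relocation block: (page RVA, fixup RVAs, block size)."""
    page_rva, block_size = struct.unpack_from("<II", data, offset)
    count = (block_size - 8) // 2
    entries = struct.unpack_from(f"<{count}H", data, offset + 8)
    fixups = [page_rva + (entry & 0x0FFF) for entry in entries
              if (entry >> 12) == IMAGE_REL_BASED_DIR64]
    return page_rva, fixups, block_size


def apply_fixups(image, fixups, delta):
    """Add delta (actual base minus preferred base) to each 64-bit pointer."""
    for rva in fixups:
        (old,) = struct.unpack_from("<Q", image, rva)
        struct.pack_into("<Q", image, rva, (old + delta) & 0xFFFFFFFFFFFFFFFF)


PREFERRED_BASE = 0x140000000
ACTUAL_BASE = 0x7FF6A8B00000

# Synthetic block: page 0x3000, three DIR64 entries, one padding entry.
block = struct.pack("<II4H", 0x3000, 16, 0xA010, 0xA018, 0xA020, 0x0000)

# Synthetic mapped image holding three pointers linked against the old base.
image = bytearray(0x4000)
for slot, target in ((0x3010, 0x2000), (0x3018, 0x2040), (0x3020, 0x2080)):
    struct.pack_into("<Q", image, slot, PREFERRED_BASE + target)

page_rva, fixups, size = parse_reloc_block(block)
apply_fixups(image, fixups, ACTUAL_BASE - PREFERRED_BASE)
patched = [struct.unpack_from("<Q", image, rva)[0] for rva in fixups]
```

After the call, each patched pointer equals the actual base plus its original target RVA, which is all a base relocation ever does.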

For inspecting a running process rather than a static binary, x64dbg is the canonical free choice. The crucial configuration step: open Options → Preferences → Events, and check "System breakpoint" and "TLS callbacks". With those enabled, x64dbg pauses before the entry point on every launched binary, so you can step through TLS callbacks and the loader's final transitions instead of skipping them. Process Hacker (now called System Informer) shows the PEB and loader linked lists directly — open any process, navigate to the General tab to see PEB → Ldr → InMemoryOrderModuleList as a tree, and to the Modules tab to see every loaded DLL with its base address and path.

For looking inside a DLL's export directory in even more detail than dumpbin offers, CFF Explorer and PE-bear both provide GUI inspection of every PE structure including the export directory's three-array layout.

If you don't have a Windows box to hand, most of these inspections are doable from WSL or Linux too. objdump -p hello.exe shows the import directory, and Python's pefile module (pip install pefile) walks every structure in the format. x64dbg-style live debugging requires Windows, but static structural inspection works anywhere.
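As a taste of how little machinery static inspection needs, here is a stdlib-only sketch that lists the populated data directories of a 64-bit PE file. It is deliberately narrow: it handles only PE32+ images, does no bounds checking beyond what struct enforces, and the function names are my own rather than part of any library. pefile does the same job with far better error handling.

```python
import struct

# Names for the directory indexes discussed in this series.
DIRECTORY_NAMES = {0: "Export", 1: "Import", 5: "Base Relocation",
                   9: "TLS", 13: "Delay Import"}


def read_data_directories(data):
    """Return [(index, rva, size)] for non-empty data directories (PE32+ only)."""
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    optional = e_lfanew + 4 + 20  # skip the signature and the COFF file header
    (magic,) = struct.unpack_from("<H", data, optional)
    if magic != 0x20B:
        raise ValueError("this sketch only handles PE32+ (64-bit) images")
    # In PE32+, NumberOfRvaAndSizes sits at offset 108 of the optional header
    # and the directory array starts at offset 112.
    (count,) = struct.unpack_from("<I", data, optional + 108)
    found = []
    for index in range(min(count, 16)):
        rva, size = struct.unpack_from("<II", data, optional + 112 + index * 8)
        if rva or size:
            found.append((index, rva, size))
    return found


def dump(path):
    """Print the populated data directories of the file at path."""
    with open(path, "rb") as handle:
        for index, rva, size in read_data_directories(handle.read()):
            label = DIRECTORY_NAMES.get(index, "")
            print(f"{index:2d} {label:16s} rva={rva:#x} size={size:#x}")
```

Calling dump("hello.exe") on our binary should show the TLS entry at index 9 with RVA 0x9040, alongside the import and relocation directories.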

What we've built across four parts

We started in Part 1 with a 99-line C source file and four command-line tools. We end here with a fully running process whose every byte we can account for.

Part 1 traced the path from source code to executable. The preprocessor expanded macros and inlined headers. The compiler turned each C statement into machine instructions, leaving placeholder references for things it didn't yet know (external function addresses, string-constant locations). The assembler packed those instructions into .obj files with their metadata. The linker stitched the object files together, resolved internal references, and recorded external dependencies as imports that the loader would later satisfy.

Part 2 dissected the PE file format itself: the DOS stub, the PE signature, the COFF file header, the Optional Header with its 16 data directories, and the section headers and section bodies. Every field had a job; every byte had a purpose. We learned to convert between three coordinate systems (file offset, RVA, virtual address) and saw how the section table is the bridge between them.

Part 3 watched the kernel transform the file into a mapped image. NtCreateSection built a section object describing the image; NtMapViewOfSection wired its prototype PTEs into a new process's address space; the kernel applied per-section page protections. The image was now present in virtual address space but not yet in physical memory — demand-paging would bring pages in lazily as the program touched them, and copy-on-write would make writable pages private to each process that wrote to them. The mapping operation itself read almost nothing from disk.

Part 4 — this post — has filled in everything the user-mode loader does after the kernel returns. Relocations correct the absolute addresses ASLR invalidated. The recursive DLL loader walks the import directory, navigates the DLL search order, and maps each dependency. The import/export handshake walks both sides' tables to write resolved function pointers into the IAT. TLS callbacks and DllMain dispatch run pre-main initialization code (and pre-main anti-analysis tricks). Finally, the loader hands control to AddressOfEntryPoint, the CRT startup function does runtime initialization, C++ constructors fire, and main is called.

The picture across the four parts is unified. Every field in the PE format exists because something downstream — the linker, the loader, the kernel — needs to consume it. Every step of the loader's work has a corresponding piece of metadata in the file. The format isn't arbitrary; it's the structured language by which the compiler-and-linker pipeline communicates with the operating system about what kind of program this is and how to bring it to life.

What we deliberately didn't cover

Delay-loaded imports are a linker-supported optimization where DLLs aren't loaded until first use. They appear in data directory entry 13 (Delay Import Descriptor) with a structure parallel to but distinct from the regular Import Directory; the linker and its delay-load helper supply thunk stubs that invoke LoadLibrary and GetProcAddress lazily. Worth a separate post.

Packaged apps (UWP/MSIX) follow a different loader path with their own search order, container, and integrity model. The kernel mechanisms we walked through still apply, but the user-mode layer is significantly different.

The .NET CLR's secondary loader picks up where the native loader leaves off, doing JIT compilation, type loading, and assembly resolution for managed code. We touched on managed images in Part 3 but didn't dig into the CLR's own machinery.

Manifest-based DLL redirection, side-by-side assemblies, API sets, safe-mode loader paths, delayed CFG dispatch, image load notifications, EDR userland hooks, ETW-TI telemetry: each is its own world.

What we did cover is the load-bearing core: the four tools, the file format, the kernel mapping, and the user-mode loader. With that foundation, every more advanced topic builds on structures we've already walked through. The IAT you can hook is the same IAT the loader patches. The TLS callbacks anti-debug malware uses are the same TLS callbacks that legitimate code uses for per-thread setup. The PEB that shellcode walks is the same PEB the loader maintains and Process Hacker displays.

Where to go next

For deeper study: Solomon, Russinovich, and Ionescu's Windows Internals (now in its 7th edition, split into Parts 1 and 2) is the canonical reference for everything kernel-side. Matt Pietrek's classic articles, Peering Inside the PE (1994) and An In-Depth Look into the Win32 Portable Executable File Format (2002), are still excellent on the file format. The Microsoft PE specification is the authoritative document for the format itself. Maurice Lambert's open-source PE loader implementations on GitHub are good study material for seeing the loader's algorithms in code. ReactOS mirrors much of the Windows loader's structure with readable C source. And the SANS Internet Storm Center, the Mandiant blog, and individual security researchers' write-ups regularly cover specific tricks, techniques, and CVEs that exercise the structures we've walked through.

If you wanted to actually write a PE loader — a reflective loader for in-process module loading, a manual mapper for analysis, a custom runtime for a research project — you now have the conceptual scaffolding. The implementation is mostly bookkeeping: walk the headers, apply the relocations, resolve the imports. The hard work is understanding what each structure is for, and you have that now.

The PE format is not arbitrary. Most fields exist because some producer or consumer — the compiler, the assembler, the linker, the loader, the memory manager, the signer, the debugger, the CLR, a compatibility layer, or security tooling — needs them as a stable contract. Now you can read the format and predict what each of those will do with it, and from there understand the implications for malware analysis, reverse engineering, and exploit development.

The full picture, and what comes next

We've zoomed in on hello.exe for four parts, one mechanism at a time. The last thing worth doing is zooming out. By the time the entry point is about to fire, hello.exe shares its address space with everything the operating system put there to make it possible to run: the loader's own libraries, the C runtime, the kernel's contributions to user mode, and the working memory the program will use once it starts. All of it, together:

The executable is small. hello.exe is a 17-page block in a 256-terabyte address space, surrounded by infrastructure: four loaded DLLs supplying everything from the syscall gateway to printf, kernel-shared data at the only non-ASLR'd address you'll ever see, the PEB and TEB that Windows maintains for every running process and thread, the heap waiting for the first malloc, and the stack waiting for the first function call. We spent most of four parts on the executable, but in some sense the executable is a passenger. The address space and its loaded infrastructure are the bigger story.

The entry point hasn't run yet. The kernel built the address space. The loader patched the executable. Everything is in place. Depending on the binary, TLS callbacks may already have executed and DLL initialization (each loaded DLL's DllMain) has already happened; those run before the EXE's AddressOfEntryPoint. What hasn't happened yet is the CRT startup path that eventually calls main. The CPU is about to begin executing it: the instruction pointer is set to AddressOfEntryPoint, the stack pointer is at the top of the main thread's stack, and the very next instruction will be the first byte of CRT startup. From that moment on, execution uses the stack heavily: frames pushed and popped, return addresses laid down by every call, arguments passed on the stack when there are more than four of them, locals allocated, and unwinders reconstructing call history when exceptions or debuggers need it. Code itself executes from executable pages (almost always .text); the stack is where the shape of execution lives. Every exception throw walks the stack. Every debugger backtrace reads it. Every crash dump is, fundamentally, a snapshot of stacks.

And the .pdata and .xdata sections we kept calling "exception tables" without explaining what they actually do? They're the metadata that makes walking those stacks correctly possible — across every function in the process, with no frame pointers required.

The next series picks that up: how x86-64 stack frames are laid out, how the Microsoft x64 calling convention shapes them, how locals and arguments and return addresses interleave, and how the unwinder uses .pdata and .xdata to walk a stack it has never seen before — for exceptions, for debuggers, for ETW, for crash dumps. The PE loading series got the program loaded; the next series watches it run.