Relocations, imports, TLS callbacks, and the path from "image mapped" to "entry point runs." The kernel built the foundation; the user-mode loader assembles the rest of the house before the program can start.
In Part 3 we watched the kernel finish its mapping work. By the time control returns to user mode, our hello.exe sits at a randomized base address with its sections in the right page ranges, each page protected according to its Characteristics flags, and the demand-paging machinery ready to bring bytes off the disk as the program touches them. Everything the file format demands of the kernel is done.
The program still can't run.
Two things are missing, and both are concrete. First, some of the absolute addresses the linker baked into the binary may be wrong. The linker assumed ImageBase = 0x140000000, and the kernel honored ASLR by loading the image at, say, 0x7FF6A8B00000 instead. Every 64-bit pointer the linker hard-coded against the old base now points at empty memory. The .reloc section we noted back in Part 1, then dissected in Part 2, exists specifically to tell the loader what to patch. Forty-seven such pointers exist in our binary; the loader has to find and adjust each one.
Second, every cross-module function call in the program — every call to printf, malloc, Sleep, GetLastError — currently goes through an Import Address Table slot that doesn't yet hold a function address. The loader has to find each imported DLL, recursively map any DLLs that DLL depends on, walk each DLL's export table, look up the requested function, compute its actual address in memory, and write that address into the right IAT slot. Two DLLs and roughly fifty imports in our case; in a real production binary it can be dozens of DLLs and thousands of imports.
And it has to do both of those things before any of the program's own code runs.
This is the user-mode loader's job. It's the last leg of the journey we started in Part 1, and it's where every architectural decision of the PE format finally pays off. The loader is also where the most analyst-relevant mechanisms live: the IAT is the hooking surface that both EDRs and malware exploit, the DLL search order is the basis of an entire class of persistence attacks, TLS callbacks are the canonical pre-entry-point execution window. By the end of this post you'll know all of them — how they work mechanically, what the loader actually does, and why these structures got shaped the way they did.
The "loader" is not a separate program. It's code that lives inside a DLL named ntdll.dll, and it runs in the address space of the process being created. ntdll.dll is special in two ways: it's mapped into every Windows process unconditionally (no matter what the binary's import table says), and it's mapped by the kernel itself, during the same setup pass that mapped the EXE, rather than by the user-mode loader.
When the kernel finishes building the new process, it sets the CPU's instruction pointer not to the EXE's entry point but to a function inside ntdll called LdrInitializeThunk. That function calls into LdrpInitializeProcess and a family of related Ldrp* routines that together implement the user-mode loader. The loader runs entirely in user mode (no syscalls except where it specifically needs the kernel), reads the EXE's PE headers that the kernel has already mapped, and performs the work we'll walk through in the next sections.
The loader keeps its bookkeeping in a structure called the Process Environment Block, or PEB. Every process has exactly one PEB, allocated by the kernel and reachable through the Thread Environment Block: on x86-64 the GS segment base points at the current thread's TEB, and the TEB field at offset 0x60 (read as gs:[0x60]) holds the PEB pointer. The PEB contains a field called Ldr, a pointer to a PEB_LDR_DATA structure that heads three linked lists of every module currently loaded in the process. The lists are kept in three different orders (load order, memory order, and initialization order) so different consumers can walk them efficiently.
This data structure is at the center of two important things. First, the loader itself uses it: when the loader encounters an import on, say, kernel32.dll, it consults PEB_LDR_DATA to check whether kernel32 is already mapped (because some earlier DLL needed it) before going to the trouble of locating and mapping it again. Second, anyone who needs to find loaded modules at runtime walks the same lists. Shellcode that needs to call GetProcAddress without an import table walks PEB.Ldr.InMemoryOrderModuleList. The Windows API GetModuleHandle walks the same lists. Process inspection tools walk them too. We'll see the shellcode walk concretely later in this part.
For now, the picture to hold in your head is: ntdll.dll is mapped into every process by the kernel; the kernel transfers initial control to a function inside it; that function reads the EXE's PE headers, performs the loader's work, and finally transfers control to the EXE's entry point. The kernel's mapping work is done before user-mode code runs at all. Everything we describe from here on happens in user mode, in functions that live inside ntdll.
The first concrete work the loader does is fix the addresses the linker got wrong.
To understand what "wrong" means here, recall what the linker did back in Part 1. When the linker emitted our binary, it had to bake addresses into the machine code: addresses of global variables that other code references, addresses of string constants, function pointers in vtables, and so on. For most code on x86-64, the linker uses RIP-relative addressing — "the variable is 0x2EDA bytes after this instruction" — and these references survive ASLR untouched, because the distance between two things inside the same image doesn't change when the image is loaded somewhere new.
But not every reference can be RIP-relative. A 64-bit absolute pointer in a data table (a function pointer, say, or a string-table entry pointing into .rdata) has to hold a full 64-bit virtual address. The linker computes that address as ImageBase + RVA, with ImageBase = 0x140000000 for our binary. If the kernel honored ASLR and mapped the image at 0x7FF6A8B00000 instead, every such pointer is off by a delta of 0x7FF6A8B00000 - 0x140000000 = 0x7FF568B00000. The values in those slots are now garbage: they reference memory that doesn't belong to our process.
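The delta arithmetic is easy to get wrong by a digit, so here is a minimal C sketch of it. The function names are illustrative only; they are not loader APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Delta the loader adds to every baked-in absolute pointer.
   Unsigned wraparound keeps this correct even when the image
   lands below its preferred base. */
uint64_t rebase_delta(uint64_t preferred_base, uint64_t actual_base)
{
    return actual_base - preferred_base;
}

/* Rebase one pointer that the linker computed as preferred_base + rva. */
uint64_t rebase_pointer(uint64_t linked_value,
                        uint64_t preferred_base, uint64_t actual_base)
{
    assert(linked_value >= preferred_base); /* must point into the image */
    return linked_value + rebase_delta(preferred_base, actual_base);
}
```

With the two bases used above, rebase_delta(0x140000000, 0x7FF6A8B00000) comes out to 0x7FF568B00000, and a slot that held 0x140008010 ends up holding 0x7FF6A8B08010.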
The Base Relocation Table — the .reloc section in our file layout, pointed to by data directory entry 5 — exists for exactly this purpose. It's a list of every location inside the image that holds a baked-in absolute address, organized in a format that lets the loader apply fixups efficiently. The loader reads the table, computes the actual delta (actual_base - preferred_base), and adds that delta to every value the table points at.
The relocation table is structured as a sequence of variable-length blocks. Each block describes fixups for one 4 KB page of the image. The reason for the page-based grouping is space efficiency: instead of storing a full 32-bit RVA for every fixup, each block stores one 32-bit page RVA in its header (alongside a 32-bit size), followed by 16-bit entries that each carry a 4-bit type and only the 12 bits needed to address a location within that page. Each fixup costs 16 bits instead of 32, with the 8-byte header amortized across every fixup on the page.
A block looks like this:
+0 DWORD VirtualAddress // RVA of the 4 KB page this block describes
+4 DWORD SizeOfBlock // total size including header and entries
+8 WORD Entry[0] // 4-bit type, 12-bit offset within the page
+10 WORD Entry[1]
...
Each entry is a 16-bit value. The top 4 bits encode the relocation type. The bottom 12 bits encode an offset within the page named by the block header.
For x86-64 binaries you'll see exactly two relocation types in practice:
IMAGE_REL_BASED_DIR64 (type 10): the location named by this entry holds a 64-bit absolute address. Add the delta to the 64-bit value there.
IMAGE_REL_BASED_ABSOLUTE (type 0): no-op. Used as padding when a block has an odd number of real entries, because SizeOfBlock has to be a multiple of 4 bytes.
Other types exist for 32-bit binaries (HIGHLOW, type 3, for 32-bit absolute fixups) and for non-x86 architectures (ARM_MOV32, THUMB_MOV32, various MIPS and IA64 types), but in modern x86-64 you'll see DIR64 and ABSOLUTE almost exclusively.
The loader iterates the blocks. For each block, it iterates the entries. For each DIR64 entry, it computes the target address as actual_base + block_VA + entry_offset, reads the 64-bit value at that address, adds the delta (actual_base - preferred_base), and writes the result back. ABSOLUTE entries are skipped. When the entire table has been processed, every baked-in 64-bit pointer in the image matches the image's actual load address.
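The loop just described can be sketched in portable C. This is an illustration under stated assumptions, not the loader's actual code: the image is a flat, already-mapped byte buffer indexed by RVA, the host is little-endian, and apply_relocs is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REL_ABSOLUTE = 0, REL_DIR64 = 10 };

/* Walk the base relocation table and patch every DIR64 target.
   Returns the number of fixups applied, or -1 if the table is malformed. */
long apply_relocs(uint8_t *image, size_t image_size,
                  uint32_t reloc_rva, uint32_t reloc_size,
                  uint64_t preferred_base, uint64_t actual_base)
{
    uint64_t delta = actual_base - preferred_base;
    uint32_t pos = 0;
    long applied = 0;

    assert(image != NULL);
    if ((uint64_t)reloc_rva + reloc_size > image_size)
        return -1;

    while (pos + 8 <= reloc_size) {
        const uint8_t *block = image + reloc_rva + pos;
        uint32_t page_rva, block_size;
        memcpy(&page_rva, block, 4);        /* VirtualAddress */
        memcpy(&block_size, block + 4, 4);  /* SizeOfBlock */
        if (block_size < 8 || block_size > reloc_size - pos)
            return -1;

        for (uint32_t off = 8; off + 2 <= block_size; off += 2) {
            uint16_t entry;
            memcpy(&entry, block + off, 2);
            unsigned type = entry >> 12;                     /* top 4 bits */
            uint32_t target = page_rva + (entry & 0x0FFFu);  /* low 12 bits */

            if (type == REL_ABSOLUTE)
                continue;                   /* padding entry */
            if (type != REL_DIR64 || (uint64_t)target + 8 > image_size)
                return -1;

            uint64_t value;
            memcpy(&value, image + target, 8);
            value += delta;
            memcpy(image + target, &value, 8);
            applied++;
        }
        pos += block_size;
    }
    return applied;
}
```

The sketch deliberately rejects any type other than DIR64 and ABSOLUTE, and it leaves out the page-protection changes a real loader has to make before writing.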
Two practical notes. First, the pages the loader writes to during relocation are usually code and data pages — which are PAGE_EXECUTE_READ and PAGE_WRITECOPY respectively when the kernel mapped them. The loader temporarily promotes these pages to a writable protection while it applies fixups, then restores the original protection when it's done. The copy-on-write mechanism from Part 3 still applies: the writes give this process its own private copies of the patched pages, and other processes sharing the same image are unaffected. Second, if the image happens to load at its preferred ImageBase — which is rare with ASLR but not impossible — the delta is zero and the loader can skip the table entirely. Older Windows versions (and binaries built without DYNAMIC_BASE) made this the common case; modern ASLR makes it the rare case.
Our binary has a .reloc section of 132 bytes (0x84), pointed to by data directory entry 5 at RVA 0x10000. It contains four blocks, describing fixups in pages 0x7000, 0x8000, 0x9000, and 0xE000. Together they hold 47 DIR64 fixups and 3 ABSOLUTE padding entries.
The pages being patched correspond to the sections containing 64-bit absolute pointers. Block 0 (page 0x7000) is in .text: one stray absolute reference in the code. Block 1 (page 0x8000) is in .data, containing 9 absolute pointers to other data. Block 2 (page 0x9000) is in .rdata, the biggest block, containing 33 pointers; these are mostly string-table entries pointing into the read-only data section itself. Block 3 (page 0xE000) is in .CRT: four pointers to C runtime initialization routines that _initterm will later walk. Apart from the single entry in .text, every fixup sits in a data section, and none of the code's RIP-relative references needs patching.
Let's look at one of those blocks in detail.
Figure: one block of the .reloc section (the block for page 0x8000), decoded from its 28 bytes on disk. The 8-byte header names the page; the ten 16-bit entries each name an offset within that page and a relocation type. Nine of these are real DIR64 fixups targeting absolute 64-bit pointers in .data; the tenth is a no-op ABSOLUTE padding entry that brings the block size up to a multiple of 4.
With relocations applied, the loader knows every address in the image is now correct for the chosen base. The next problem is that the image references functions in other DLLs that haven't been loaded yet. hello.exe imports from KERNEL32.dll and msvcrt.dll; a real production application might import from several dozen DLLs. None of those are mapped into the process yet (except ntdll, which the kernel always pre-maps). The loader has to find each one, map it in, and recursively repeat the process for whatever that DLL imports.
The mechanical part is straightforward. The loader walks the Import Directory — a sequence of IMAGE_IMPORT_DESCRIPTOR structures pointed to by data directory entry 1 — and for each descriptor reads the DLL name string. Locating the DLL on disk is the interesting part: Windows uses a specific search order that's been carefully designed (and contested, and tweaked) over decades. Once the DLL is located, the loader maps it using the same NtCreateSection(SEC_IMAGE) / NtMapViewOfSection mechanism we walked through in Part 3, applies relocations to it (computed against its preferred base, not the EXE's), and records it in PEB_LDR_DATA. Then the loader recurses: it walks that DLL's own import table and loads its dependencies, depth-first.
For hello.exe, the recursive chain is short. KERNEL32.dll imports from ntdll.dll (already mapped) and a few other system DLLs. msvcrt.dll imports from KERNEL32.dll (which the loader is already in the process of loading, so it just resolves to the existing entry) and a couple of others. After a handful of recursive descents, every DLL the process needs is mapped in. A real Win32 application might end up with 50+ DLLs loaded by the time the recursion completes — every shell extension, every input-method DLL, every common-controls helper, every dependency-of-a-dependency.
When the loader has a DLL name like KERNEL32.dll and no path, it consults a defined search order. The order matters because it determines which copy of the DLL gets loaded if multiple copies exist on the system, and it's a famously hijackable attack surface: place a malicious KERNEL32.dll-named file somewhere that gets searched before the real one, and you can intercept the entire program's calls to Win32 APIs.
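The standard search order for desktop applications (with Safe DLL Search Mode enabled, which has been the default since Windows XP SP2) goes like this:
1. The directory the application was loaded from.
2. The system directory (System32).
3. The 16-bit system directory.
4. The Windows directory.
5. The current working directory.
6. The directories listed in the PATH environment variable.
Figure: the DLL search cascade. A KnownDLLs check runs first: it recognizes core system DLLs (kernel32, user32, ntdll, and dozens of others) and routes them straight to System32, bypassing the cascade. For every other DLL, the loader walks the six steps in order until it finds a file with the requested name. Without Safe DLL Search Mode, the current directory moves up to position 2, making search-order hijacking from a user-writable working directory trivial.
There are three things to take away from the search order.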
First, KnownDLLs is a privileged shortcut. Microsoft maintains a list of system DLLs that are pre-mapped into a special section object at boot time and shared across every process. When the loader is asked for one of these, it doesn't search anywhere — it just maps the pre-existing section. The KnownDLLs list includes kernel32.dll, user32.dll, ntdll.dll, gdi32.dll, and several dozen others. A common beginner question is "can a malicious kernel32.dll placed in the application directory hijack a real program?" The answer is no, because kernel32 is a KnownDLL — the search order never gets consulted for it. Search-order hijacking targets non-system DLLs that legitimate applications load by name.
Second, Safe DLL Search Mode moves the current directory from position 2 to position 5. The unsafe ordering — where the current working directory is searched right after the application directory — was the default behavior on early Windows versions. When applications could be launched from a working directory the user controlled (downloads folder, USB drive, network share), an attacker who dropped a same-named DLL there could intercept calls before any of the system directories were checked. Safe DLL Search Mode, default since Windows XP SP2, pushes the CWD down the list. The registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SafeDllSearchMode controls this; in 2026 the default is still enabled.
Third, modern Windows offers a more restrictive opt-in API. SetDefaultDllDirectories lets a process declare "search only System32 and directories I explicitly add via AddDllDirectory." Used together, these functions remove the application directory, current directory, and PATH from the search entirely — eliminating most search-order hijacking surface at the cost of breaking applications that depend on side-by-side DLLs. The recommendation in Microsoft's security guidance is to call SetDefaultDllDirectories(LOAD_LIBRARY_SEARCH_SYSTEM32) at process startup if you're writing a hardened application.
Packaged apps (UWP, MSIX) bypass all of this. They use a different search order that consults the package dependency graph declared in the application manifest, then System32. There's no PATH search and no current-directory search; the packaging system handles dependency resolution at install time.
Once the loader has located and mapped every dependency DLL, the next job is to connect the executable's function references to the actual function addresses in those DLLs. This is the most analyst-relevant mechanism in the entire loader, and it's where the format's design choices have the biggest practical consequences: every cross-DLL function call your program makes routes through this machinery, and anyone who can write to the machinery controls those calls.
The mechanism is a three-party contract. The executable says what it needs through its Import Table. Each DLL advertises what it offers through its Export Table. The loader sits between them, walks both, and fills in the executable's Import Address Table with the resolved function pointers. After the loader is done, the IAT is a flat array of function addresses, and every CALL [IAT+offset] instruction in the program's code reads the right address.
The Import Directory (data directory entry 1) is an array of IMAGE_IMPORT_DESCRIPTOR structures, one per imported DLL, terminated by a zeroed descriptor. Each descriptor has five fields:
typedef struct _IMAGE_IMPORT_DESCRIPTOR {
DWORD OriginalFirstThunk; // RVA of the Import Lookup Table (ILT)
DWORD TimeDateStamp; // 0 unless this is a bound import
DWORD ForwarderChain; // 0 unless this DLL forwards exports
DWORD Name; // RVA of the DLL name string ("KERNEL32.dll")
DWORD FirstThunk; // RVA of the Import Address Table (IAT)
} IMAGE_IMPORT_DESCRIPTOR;
For each imported DLL there are two parallel arrays of 64-bit thunks (on PE32+): the Import Lookup Table at OriginalFirstThunk and the Import Address Table at FirstThunk. On disk, both arrays hold identical contents: each entry identifies a function by name or by ordinal. After the loader runs, the ILT still holds those name/ordinal references — preserved forever as a record of what the binary was asking for — while the IAT has been overwritten with real function pointers.
Each thunk is a 64-bit value, and the topmost bit determines how to interpret the rest. If the top bit is set (0x8000000000000000), the low 16 bits are an ordinal number — the function is being imported by its position in the DLL's export table. If the top bit is clear, the thunk value is an RVA pointing to an IMAGE_IMPORT_BY_NAME structure, which is just a 2-byte hint followed by a null-terminated function name. Most imports use the by-name variant; ordinals appear mostly for stable APIs in system DLLs.
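A minimal C sketch of that decoding follows. The helper names are made up for illustration; the only facts relied on are the PE32+ layout described above (top bit as the ordinal flag, ordinal in the low 16 bits, name RVA in the low 31 bits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ORDINAL_FLAG64 0x8000000000000000ULL

/* True when the thunk imports by ordinal rather than by name. */
bool thunk_is_ordinal(uint64_t thunk)
{
    return (thunk & ORDINAL_FLAG64) != 0;
}

/* Ordinal number carried in the low 16 bits of a by-ordinal thunk. */
uint16_t thunk_ordinal(uint64_t thunk)
{
    assert(thunk_is_ordinal(thunk));
    return (uint16_t)(thunk & 0xFFFFu);
}

/* RVA of the IMAGE_IMPORT_BY_NAME structure for a by-name thunk. */
uint32_t thunk_name_rva(uint64_t thunk)
{
    assert(!thunk_is_ordinal(thunk));
    return (uint32_t)(thunk & 0x7FFFFFFFu);
}
```

A parser would call thunk_is_ordinal first and then exactly one of the other two; the asserts only catch misuse during development.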
For hello.exe, the Import Directory has two non-zero descriptors. The first names KERNEL32.dll; its ILT at RVA 0xD040 and IAT at RVA 0xD1C8 each hold 13 thunks (plus a terminating zero), pointing at name structures for DeleteCriticalSection, EnterCriticalSection, GetLastError, InitializeCriticalSection, Sleep, TlsGetValue, VirtualProtect, and a handful of others. The second names msvcrt.dll; its ILT at RVA 0xD0B0 and IAT at RVA 0xD238 hold 34 thunks for functions like fprintf, malloc, exit, __getmainargs — the C-runtime functions our program uses, directly or indirectly. All are imported by name; none use ordinals. The IAT data directory at RVA 0xD1C8 has Size 0x188 (392 bytes), accounting for both DLLs' 47 patched entries plus the two terminating zeros — 49 64-bit slots in total.
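The RVAs and sizes quoted above all follow from one rule: each DLL's thunk array is its import count plus one terminator, at 8 bytes per slot. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define THUNK_SIZE 8u  /* PE32+ thunks are 64-bit */

/* Bytes occupied by one DLL's thunk array, including the zero terminator. */
uint32_t thunk_array_bytes(uint32_t import_count)
{
    assert(import_count < 0x10000000u);  /* keep the multiply in range */
    return (import_count + 1u) * THUNK_SIZE;
}

/* RVA of the IAT slot for the thunk at thunk_index. */
uint32_t iat_slot_rva(uint32_t iat_rva, uint32_t thunk_index)
{
    return iat_rva + thunk_index * THUNK_SIZE;
}
```

For 13 KERNEL32 imports the array is 0x70 bytes, which is why the msvcrt ILT starts at 0xD040 + 0x70 = 0xD0B0 and the msvcrt IAT at 0xD1C8 + 0x70 = 0xD238; adding the 0x118 bytes for msvcrt's 34 imports gives the 0x188 total.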
Figure: the KERNEL32.dll import descriptor and its two parallel thunk arrays. Before the loader runs, the ILT and IAT hold identical name-reference RVAs. After the loader runs, the IAT has been overwritten with the actual virtual addresses of Sleep, GetLastError, and the other 11 imported functions; the ILT is preserved as a record of what was requested. The compiler-emitted CALL [rip+disp] instructions always dereference the IAT, so the call site stays the same and only the IAT contents matter at runtime.
The other half of the handshake is the DLL's IMAGE_EXPORT_DIRECTORY, pointed to by data directory entry 0 in the DLL's own headers. It advertises which symbols this DLL makes available and where each one lives. The structure has several fields, but the three that matter for the lookup are arrays:
AddressOfFunctions // array of RVAs; each = "where this function lives" (indexed by ordinal)
AddressOfNames // array of RVAs; each → a null-terminated function name string
AddressOfNameOrdinals // array of WORDs; each = an ordinal (index into AddressOfFunctions)
The design looks convoluted on first read, but each array has a job. AddressOfFunctions is the canonical address book: ordinal N's function lives at DLL_base + AddressOfFunctions[N]. AddressOfNames is an alphabetically-sorted directory of names; the sort order lets the loader use binary search instead of a linear scan, which matters in DLLs with thousands of exports (ntdll.dll has over 2,000). AddressOfNameOrdinals is the bridge between the two: AddressOfNameOrdinals[i] tells you which ordinal corresponds to AddressOfNames[i].
Why the indirection? Because a DLL can export the same function by both an ordinal and a name, or by an ordinal alone, or by a name alone. The three-array scheme handles all three cases uniformly. The flat AddressOfFunctions table is the source of truth for ordinals; the name/ordinal arrays are added on top to support name-based lookups.
To resolve "Sleep" in kernel32.dll, the loader binary-searches AddressOfNames for the string, finds it at some index i, reads AddressOfNameOrdinals[i] to get the ordinal — say, 1407 — then reads AddressOfFunctions[1407] to get the function's RVA, and finally computes DLL_base + RVA as the function's actual virtual address. That address is what gets written into the EXE's IAT slot for Sleep.
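The name lookup can be sketched in C over a simplified, already-parsed view of the export directory. Several things here are assumptions made for the sake of a short example: the struct export_view type is hypothetical, names are held as C string pointers rather than RVAs, the values in name_ordinals are treated as zero-based indices into the function array, and forwarded exports are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of an export directory (hypothetical type). */
struct export_view {
    const char *const *names;       /* AddressOfNames, sorted ascending */
    const uint16_t *name_ordinals;  /* AddressOfNameOrdinals, parallel to names */
    const uint32_t *functions;      /* AddressOfFunctions, RVAs */
    size_t name_count;
    size_t function_count;
};

/* Returns dll_base + RVA for name, or 0 if the name is not exported. */
uint64_t resolve_export(const struct export_view *ex,
                        uint64_t dll_base, const char *name)
{
    assert(ex != NULL && name != NULL);
    size_t lo = 0, hi = ex->name_count;

    while (lo < hi) {                       /* binary search over sorted names */
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, ex->names[mid]);
        if (cmp == 0) {
            uint16_t index = ex->name_ordinals[mid];
            if (index >= ex->function_count)
                return 0;                   /* malformed table */
            return dll_base + ex->functions[index];
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}
```

The binary search only works because the name array is sorted in plain byte order, which is the same ordering strcmp uses.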
Figure: resolving an import by name. A binary search finds the name in AddressOfNames; the same index in AddressOfNameOrdinals gives the ordinal; the ordinal indexes AddressOfFunctions to retrieve the function's RVA within the DLL. Two array reads after the binary search. The same machinery runs for every import in every PE you've ever launched.
The full loader algorithm for our binary's KERNEL32 imports looks like this. For each thunk in the ILT at RVA 0xD040:
1. If the thunk's top bit is set, it is an ordinal import: index directly into AddressOfFunctions, skip the name search. Otherwise, the thunk is an RVA pointing at an IMAGE_IMPORT_BY_NAME structure; read the name string.
2. Binary-search kernel32's AddressOfNames for that string. The names are sorted, so this is O(log N), fast even when N is in the thousands.
3. Read AddressOfNameOrdinals at the same index to get the ordinal.
4. Read AddressOfFunctions at that ordinal to get the function's RVA within kernel32.
5. Compute kernel32_base + function_RVA. That's the absolute virtual address of the function in this process's address space.
6. Write that address into the IAT slot at RVA 0xD1C8 + thunk_index * 8.
Repeat for the 12 other KERNEL32 thunks. Then move on to the next import descriptor (msvcrt.dll) and repeat for all 34 of its thunks. When the loop terminates, every IAT slot in hello.exe contains a real 64-bit function pointer, and the program's CALL [rip+disp] instructions will dereference those pointers to actual code in the loaded DLLs.
One last mechanical detail. The IAT pages were mapped PAGE_WRITECOPY when the kernel loaded the image (this was Part 3's story — writable image pages start out shared and become private on first write). The loader's IAT patching is exactly those first writes. They trigger copy-on-write faults that give this process its own private copies of the IAT pages. By the time the loader finishes, those pages have been promoted to PAGE_READWRITE and the IAT contents are private to this process. Some hardened loaders and runtimes take an extra step here: after patching, they re-protect the IAT pages back to PAGE_READONLY, which prevents later in-process IAT writes from succeeding without an explicit VirtualProtect call. Where this happens depends on configuration; modern process-mitigation features can opt into related hardening, and the IAT data directory entry's existence (separate from any individual import descriptor's FirstThunk) is part of what lets the loader find and re-protect the IAT region as a unit.
The IAT design has a property worth dwelling on. Every cross-DLL function call in the executable goes through exactly one specific memory location — the IAT slot for that function — and the CALL instruction reads that location at runtime, every single call. The compiler doesn't bake function addresses into the call sites; it bakes the IAT slot's address into the call sites and lets the loader supply the function address through the slot. From the program's perspective this is just an extra layer of indirection. From an attacker's or defender's perspective it's a hook point — overwrite the slot, and every call to that function now goes through your code instead.
Implementing an IAT hook is mechanically simple. The attacker (or defender; the technique is symmetric) does this:
1. Find the base address of the module whose imports are to be hooked: call GetModuleHandle for an in-process hook, or read it from the PEB if working in a remote process.
2. Parse that module's PE headers, walk its Import Directory, and find the descriptor for the DLL that exports the target function (for example, kernel32.dll).
3. Locate the function's IAT slot, make it writable (calling VirtualProtect to go from PAGE_READONLY back to PAGE_READWRITE, if the hardened-loader case applies), save the original function pointer, overwrite the slot with the hook function's address, and restore the original protection.
From that point on, every call to that imported function dispatches to the hook function. The hook does whatever it wants (log the call, modify parameters, decline the call entirely, return a fake value) and optionally chains to the original function using the saved pointer.
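The heart of the technique, swapping one pointer in a table that every call site reads, can be shown without any Windows APIs. The sketch below uses a plain array of function pointers as a stand-in for a resolved IAT; every name in it is hypothetical, and it leaves out the PE parsing and the VirtualProtect calls a real hook would need.

```c
#include <assert.h>
#include <stddef.h>

#define IAT_SLOTS 2

typedef int (*api_fn)(int);

/* Stand-in for a resolved IAT: call sites only ever name a slot. */
struct fake_iat {
    api_fn slot[IAT_SLOTS];
};

/* Overwrite one slot and return the previous pointer so the hook can chain. */
api_fn iat_hook(struct fake_iat *iat, size_t index, api_fn hook)
{
    assert(iat != NULL && hook != NULL && index < IAT_SLOTS);
    api_fn original = iat->slot[index];
    iat->slot[index] = hook;
    return original;
}

/* A call site: it dereferences the slot on every call. */
int call_through_iat(const struct fake_iat *iat, size_t index, int arg)
{
    return iat->slot[index](arg);
}

/* Demo target and hook (both hypothetical). */
api_fn saved_original;

int real_api(int x)
{
    return x + 1;
}

int hooked_api(int x)
{
    return saved_original(x) * 100;  /* chain to the original, then alter the result */
}
```

Installing the hook is one assignment, saved_original = iat_hook(&iat, 0, hooked_api), after which calls through slot 0 reach hooked_api while the call site itself is unchanged.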
EDRs use IAT hooks legitimately to monitor security-sensitive APIs: every call to CreateFile, VirtualAllocEx, or NtCreateThread can be intercepted, inspected, and either allowed or blocked. Malware uses the same mechanism for unsavory purposes: hide files from FindFirstFileW, fake registry contents from RegQueryValueExW, redirect network traffic from WSAConnect. The technique is invisible to code that doesn't go looking for it, because the call site doesn't change — only the bytes in a specific memory slot do, and inspecting whether an IAT slot points "outside" its expected DLL is something most programs never do.
There are countermeasures. Hardened loaders re-protect the IAT to PAGE_READONLY after patching, making the hook attempt require an explicit VirtualProtect call (and turning hook installation into a noisier action). Some EDRs do their own IAT-integrity scans. Modern Windows offers Control Flow Guard (CFG) and Code Integrity Guard (CIG), which raise the bar in different ways. But the fundamental architectural fact remains: every call through the IAT is one indirect dereference away from being redirected, and that's what makes the IAT the canonical hooking surface on Windows.
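An IAT-integrity scan of the kind mentioned above boils down to a range check on each slot. The following is a rough sketch with hypothetical names, operating on slot values that have already been read out of the process. One caveat it does not handle: forwarded exports legitimately resolve into a different DLL than the one named in the import descriptor, so a real scanner needs an allow-list rather than a single expected range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address range occupied by one loaded module. */
struct module_range {
    uint64_t base;
    uint64_t size;
};

/* Nonzero if addr lies inside [base, base + size). */
int address_in_module(const struct module_range *m, uint64_t addr)
{
    return addr >= m->base && addr - m->base < m->size;
}

/* Returns the index of the first IAT slot that points outside the
   expected module, or -1 if every slot falls inside it. */
long first_foreign_slot(const uint64_t *iat, size_t slot_count,
                        const struct module_range *expected)
{
    assert(iat != NULL && expected != NULL);
    for (size_t i = 0; i < slot_count; i++) {
        if (!address_in_module(expected, iat[i]))
            return (long)i;
    }
    return -1;
}
```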
Shellcode — code injected into a running process via buffer overflow, process injection, reflective DLL injection, or any other technique — has the opposite problem from a normal executable. It has no PE headers of its own, no IAT, no import descriptors. It's just a raw blob of instructions sitting somewhere in the target process's memory. When it needs to call a Win32 API, there's no IAT slot to dereference: it has to find the function address some other way.
The canonical technique is the PEB walk. Every Windows process has its ntdll, its kernel32, and its PEB_LDR_DATA linked lists; the lists name every loaded module and record where it lives. Walk the lists, find kernel32, parse its export directory, locate GetProcAddress, and from there resolve everything else. The pseudocode looks like this:
; x86-64, conceptual; actual implementations vary in length and obfuscation
mov rax, gs:[0x60] ; GS base = TEB; TEB+0x60 holds the PEB pointer
mov rax, [rax + 0x18] ; PEB → PEB_LDR_DATA (the Ldr field)
mov rax, [rax + 0x20] ; → InMemoryOrderModuleList head (Flink = first entry)
; iterate the doubly-linked list of LDR_DATA_TABLE_ENTRY structures
next_module:
mov rbx, [rax] ; next entry (Flink)
; pseudo-step, not a real instruction: compare the entry's BaseDllName
; with "kernel32.dll" (real shellcode does a UTF-16 compare or a name hash)
cmp <BaseDllName of rbx>, "kernel32.dll"
je found_kernel32
mov rax, rbx
jmp next_module
found_kernel32:
; rbx points at the entry's InMemoryOrderLinks field; DllBase sits 0x20 bytes past it
mov rcx, [rbx + 0x20] ; kernel32 base address
; now parse PE headers from rcx, walk Export Directory,
; find GetProcAddress by name, compute its absolute address
; use GetProcAddress to resolve LoadLibraryA, then anything else
Three facts make this work. First, on x86-64 the GS segment base always points to the current thread's TEB, so gs:[0x60] reads the TEB field at offset 0x60, which holds the PEB pointer; the PEB in turn has the loader data pointer at offset 0x18. These offsets have been stable across Windows versions for many years. Second, kernel32.dll is loaded into essentially every Win32 process during loader initialization, so shellcode can rely on finding it. Third, PE headers are mapped at the module's base address (we saw this in Part 3: the headers region is the first page of the mapped image, read-only and accessible at runtime). So once shellcode has the base address, it can navigate to the export directory using the same field-arithmetic any PE parser uses.
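The list walk itself is ordinary intrusive-list traversal, which a portable C sketch can show without touching a real PEB. The structures below are simplified stand-ins, not the Windows definitions: the real LDR_DATA_TABLE_ENTRY has more fields and stores names as UTF-16 UNICODE_STRINGs, and find_module_base is a hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the Windows LIST_ENTRY: a circular doubly-linked list node. */
struct list_entry {
    struct list_entry *flink, *blink;
};

/* Simplified stand-in for LDR_DATA_TABLE_ENTRY. */
struct module_entry {
    struct list_entry in_memory_order_links;
    uint64_t dll_base;
    const char *base_dll_name;
};

/* Case-insensitive ASCII comparison; module names are not case-sensitive. */
static int names_equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Walk the circular list from its head; return the module's base or 0. */
uint64_t find_module_base(const struct list_entry *head, const char *name)
{
    assert(head != NULL && name != NULL);
    for (const struct list_entry *e = head->flink; e != head; e = e->flink) {
        /* Step back from the embedded link to the start of the entry. */
        const struct module_entry *m = (const struct module_entry *)
            ((const char *)e - offsetof(struct module_entry, in_memory_order_links));
        if (names_equal_nocase(m->base_dll_name, name))
            return m->dll_base;
    }
    return 0;  /* walked back around to the head without a match */
}
```

Unlike the conceptual assembly, this version stops when the walk returns to the list head, so a missing module yields 0 instead of an endless loop.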
The relationship between IAT-based calls and shellcode-style resolution is the same mechanism running at different times. Imports are early binding: the loader resolves every function once at startup, writes the addresses into the IAT, and the program reads them cheaply at runtime. Shellcode is late binding: each call resolves its target on demand by walking the loader's data structures itself. The export table doesn't care which one is reading it; both end up at DLL_base + function_RVA.
Two things happen between import resolution and the executable's entry point, both of which give code a chance to run before main(): TLS callbacks, and DllMain dispatch for every loaded DLL. Both are legitimate features. Both have been abused for anti-analysis purposes often enough that any malware analyst learns to look for them.
The TLS Directory (data directory entry 9, lives in the .tls section in our binary) is the PE format's mechanism for thread-local storage initialization. Its core job is mundane: it describes per-thread variables that the runtime should allocate and zero-initialize when a new thread starts. The interesting part is one of its fields, AddressOfCallBacks, which points at a null-terminated array of function pointers. Those functions — the TLS callbacks — get invoked by the loader at specific lifecycle events: process attach, process detach, thread attach, thread detach.
The lifecycle event we care about is process attach. When the loader is preparing the image to run, it walks the TLS callback arrays for the EXE and every loaded DLL, in load order, and invokes each callback with the DLL_PROCESS_ATTACH reason. This happens before the loader transfers control to AddressOfEntryPoint. The callback receives a module handle, an integer reason code, and a reserved parameter — and from inside the callback, the program is fully mapped, imports are fully resolved, and arbitrary code can run.
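The dispatch over one module's callback array is a short loop. The sketch below models it in portable C: the callback type mirrors the three parameters described above, REASON_PROCESS_ATTACH carries the same numeric value as DLL_PROCESS_ATTACH, and the two demo callbacks and the trace variable are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REASON_PROCESS_ATTACH 1u  /* numeric value of DLL_PROCESS_ATTACH */

typedef void (*tls_callback)(void *module, uint32_t reason, void *reserved);

/* Invoke every callback in a null-terminated array; return how many ran. */
size_t run_tls_callbacks(const tls_callback *callbacks,
                         void *module, uint32_t reason)
{
    size_t n = 0;
    if (callbacks == NULL)
        return 0;                 /* module has no callback array */
    while (callbacks[n] != NULL) {
        callbacks[n](module, reason, NULL);
        n++;
    }
    return n;
}

/* Demo callbacks that record the order in which they ran. */
unsigned tls_demo_trace = 0;

void tls_demo_first(void *module, uint32_t reason, void *reserved)
{
    (void)module;
    (void)reserved;
    assert(reason == REASON_PROCESS_ATTACH);
    tls_demo_trace = tls_demo_trace * 10 + 1;
}

void tls_demo_second(void *module, uint32_t reason, void *reserved)
{
    (void)module;
    (void)reserved;
    assert(reason == REASON_PROCESS_ATTACH);
    tls_demo_trace = tls_demo_trace * 10 + 2;
}
```

Callbacks run in array order, so an array holding tls_demo_first and then tls_demo_second leaves the trace reading 12.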
That last fact is what makes TLS callbacks a famous anti-debugging technique. A debugger that launches a program normally sets an initial breakpoint at AddressOfEntryPoint and pauses there. A TLS callback runs before the breakpoint hits. If a binary uses a TLS callback to check for debugger presence (via IsDebuggerPresent, the BeingDebugged field of the PEB, NtQueryInformationProcess, or any of the dozens of other techniques), the check happens before the analyst has a chance to inspect anything. The callback can exit the process, jump to garbage, decrypt the actual payload, or rewrite the entry point — by the time the analyst's debugger pauses, the malware has already had its say.
This technique isn't new. The Ursnif/Gozi-ISFB malware family used TLS callbacks for anti-analysis and process injection; the GRUM botnet used them to execute its unpacking code; many packers have leaned on them over the years. The pattern is old enough that modern debuggers (x64dbg, IDA Pro, WinDbg) all have explicit "break on TLS callback" options that pause execution at the first TLS callback rather than at AddressOfEntryPoint. If you're analyzing an unknown PE, that option is one of the first things to enable.
For our hello.exe, the TLS Directory is populated and AddressOfCallBacks points at a small array of two callback functions emitted by the MinGW C runtime (RVAs 0x1600 and 0x15D0, both inside .text). These are CRT housekeeping — they initialize per-thread runtime state for any threads the program later creates — and don't do anything anti-analysis-like. But they do run before AddressOfEntryPoint. A debugger configured to break at the entry point of our binary would miss them entirely. This is a useful reminder that "TLS callbacks exist in this binary" is not by itself a malware signal; many benign C runtimes emit them. The signal is what the callbacks do, which requires actually disassembling them.
[Figure; start of caption missing] ... DllMain for code that hides in supporting libraries, and C++ static constructors invoked by the CRT before main itself. Debugger configuration matters here: the default "break on main" misses everything left of that final dot.

After TLS callbacks fire, the loader makes one more pass: it calls each loaded DLL's DllMain function with the DLL_PROCESS_ATTACH reason code. This is the DLL's chance to do per-process initialization: allocate internal buffers, register with subsystems, initialize singletons, set up thread-local storage that wasn't covered by the TLS template. The DllMain calls happen in dependency order: if foo.dll imports from bar.dll, the loader calls bar.dll!DllMain first, so by the time foo.dll!DllMain runs it can rely on bar.dll being fully initialized.
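The dependency-order rule can be sketched as a post-order walk of the import graph. This is a simplified model, not the loader's actual algorithm: the function `init_order` and the three-module graph are hypothetical, and the real loader has more to deal with (circular imports, already-loaded modules).

```python
def init_order(imports: dict[str, list[str]], root: str) -> list[str]:
    """Return modules so that every dependency precedes its importer."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(module: str) -> None:
        if module in seen:  # already scheduled; also stops import cycles
            return
        seen.add(module)
        for dep in imports.get(module, []):
            visit(dep)
        order.append(module)  # post-order: dependencies come first

    visit(root)
    return order

# Hypothetical import graph: hello.exe -> foo.dll -> bar.dll
imports = {"hello.exe": ["foo.dll"], "foo.dll": ["bar.dll"], "bar.dll": []}
dll_order = [m for m in init_order(imports, "hello.exe") if m != "hello.exe"]
print(dll_order)  # ['bar.dll', 'foo.dll']
```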
One feature of DllMain dispatch is worth knowing about: the loader lock. This is a process-wide critical section that the loader holds for the entire duration of any loader operation, including DllMain dispatch. It exists to serialize loader work: if two threads both tried to load DLLs concurrently, the bookkeeping in PEB_LDR_DATA could be corrupted. The lock guarantees that loader operations are mutually exclusive.
The consequence is that DllMain runs under the loader lock. Code inside DllMain has to avoid anything that re-enters the loader or waits on work that itself needs the loader lock. Most notably, it shouldn't call LoadLibrary. The lock is a critical section, so the owning thread can re-acquire it, but loading another DLL from inside DllMain re-enters the loader in the middle of initialization, and Microsoft's guidance warns that this can cause a deadlock or a crash. It also shouldn't wait on threads created from within DllMain, since those threads' own DllMain invocations would also need the loader lock. Microsoft's documentation calls out a long list of "things you must not do in DllMain"; in practice, most DLLs put a minimal amount of code there and defer real initialization to a function the program calls explicitly after the loader is done.
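The thread-wait hazard is easy to model outside Windows. The sketch below is an analogy only: a Python `RLock` stands in for the loader lock, and the names `new_thread_attach` and `dll_main_waits_on_thread` are made up for illustration. A timeout replaces what would be an indefinite hang.

```python
import threading

# Stand-in for the loader lock (a model, not the Windows API).
loader_lock = threading.RLock()

def new_thread_attach(done: threading.Event) -> None:
    # Models a new thread needing the loader lock before its
    # per-thread DllMain notification can run.
    with loader_lock:
        done.set()

def dll_main_waits_on_thread(timeout: float = 0.2) -> bool:
    """Return True if the worker finished while 'DllMain' held the lock."""
    done = threading.Event()
    with loader_lock:  # DllMain runs with the loader lock held
        worker = threading.Thread(target=new_thread_attach, args=(done,))
        worker.start()
        # An untimed wait here would never return: the worker is
        # blocked on the lock this thread still owns.
        finished = done.wait(timeout)
    worker.join()  # lock released, so the worker can now complete
    return finished

print(dll_main_waits_on_thread())  # False
```

The worker only makes progress once the "DllMain" scope exits, which is the reason for deferring thread synchronization to an explicit initialization function.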
From an analyst's perspective, DllMain is a second pre-main execution window. Malware that drops or sideloads a DLL and induces a target program to load it gets to run code inside that DLL's DllMain before the target's own logic resumes, which is why DLL hijacking attacks are often paired with malicious DllMain initialization. The principle is the same as with TLS callbacks: any code that runs during loader dispatch executes before a debugger in its default configuration pauses.
With imports resolved, TLS callbacks run, and DllMain dispatched for every loaded DLL, the loader has done its job. The last step is the simplest mechanically and the most consequential conceptually: jump to the executable's entry point.
The entry point address is computed as actual_base + AddressOfEntryPoint, where AddressOfEntryPoint is the RVA we saw back in Part 2 sitting in the Optional Header. For our binary it's 0x1410, which we converted to file offset 0x810 back in Part 1 and found held the bytes 55 48 89 E5 48 83 EC 20 (push rbp / mov rbp,rsp / sub rsp,0x20). The loader sets RIP to that virtual address and returns; the next instruction the CPU fetches is the first instruction of our program.
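The arithmetic is small enough to check by hand, but a sketch makes the two conversions explicit. The load address is the example ASLR base used earlier in this post; the single `.text` row (RVA 0x1000, size 0x1000, raw pointer 0x400) is an assumed layout chosen to be consistent with the 0x1410 to 0x810 conversion, not a value read from the real section table.

```python
def rva_to_va(actual_base: int, rva: int) -> int:
    """Virtual address = base the image was actually loaded at + RVA."""
    return actual_base + rva

def rva_to_file_offset(rva: int, sections: list[tuple[int, int, int]]) -> int:
    """Map an RVA to a file offset using (virt_addr, virt_size, raw_ptr) rows."""
    for virt_addr, virt_size, raw_ptr in sections:
        if virt_addr <= rva < virt_addr + virt_size:
            return rva - virt_addr + raw_ptr
    raise ValueError(f"RVA {rva:#x} is not inside any section")

ACTUAL_BASE = 0x7FF6A8B00000             # example ASLR base from this post
ENTRY_RVA = 0x1410                       # AddressOfEntryPoint
SECTIONS = [(0x1000, 0x1000, 0x400)]     # assumed .text layout

print(hex(rva_to_va(ACTUAL_BASE, ENTRY_RVA)))        # 0x7ff6a8b01410
print(hex(rva_to_file_offset(ENTRY_RVA, SECTIONS)))  # 0x810
```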
Except — that first instruction isn't main. Almost no compiled C or C++ program has main at its entry point. The linker fills in AddressOfEntryPoint with the address of a CRT startup function (the exact name depends on the toolchain and subsystem: mainCRTStartup for console MSVC programs, WinMainCRTStartup for GUI ones, wmainCRTStartup and wWinMainCRTStartup for the wide-char variants; MinGW uses its own variants of the same idea). That function is the C runtime's bootstrap. It runs before main and does work that main assumes is already done.
The CRT startup function's responsibilities, in rough order:
1. Set up argv / argc / environment.
2. Initialize the standard streams (stdin, stdout, stderr).
3. Run static initializers, including C++ static constructors. Pointers to them live in sections named .CRT$XCA, .CRT$XCC, ..., .CRT$XCZ; the linker merges these into a single contiguous range in the final .CRT section, and a helper function called _initterm walks that range and calls each non-null function pointer.
4. Call main (or WinMain, etc.), passing the parsed arguments.
5. After main returns, run C atexit handlers and C++ destructors via the parallel .CRT$XPA..XPZ table, then call ExitProcess with main's return value.

The C++ static-constructor invocation is the third pre-main execution window we mentioned earlier. Anything that constructs a global C++ object (an std::string at namespace scope, a logging singleton, a registration helper that registers itself with some other subsystem) runs during _initterm, before main. Malware that's been compiled with the same trick (an unusual but not unheard-of pattern) can hide its activation in a class constructor whose object is never explicitly referenced anywhere.
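The walk that _initterm performs is a plain loop over a table of function pointers. The Python model below mirrors only that shape; the function `initterm`, the table contents, and the `log` list are illustrative stand-ins, with `None` playing the role of a null slot.

```python
from typing import Callable, Optional

def initterm(table: list[Optional[Callable[[], None]]]) -> None:
    """Call every non-null entry in order, as the CRT helper does."""
    for entry in table:
        if entry is not None:
            entry()

log: list[str] = []

# Hypothetical merged table: null slots mixed in with initializers.
table = [
    None,
    lambda: log.append("logger singleton constructed"),
    None,
    lambda: log.append("registration helper constructed"),
    None,
]

initterm(table)
log.append("main called")
print(log)
```

Everything appended before "main called" corresponds to code that a breakpoint on main would never see.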
For analysts: a breakpoint on main won't catch any of this. A breakpoint on _initterm or on the entry-point function itself (whatever AddressOfEntryPoint resolves to) will. x64dbg can also be configured to break on the "system breakpoint", a stop earlier than the entry point, inside ntdll's loader initialization, before any user code has run; from there you can inspect the process before any of the pre-main windows open. Knowing which breakpoint to set is part of doing serious analysis.
Once main is called, control passes to the program's own logic. The loader's work is complete. From this point forward the loader stays in the process, dormant, only reactivating if the program calls LoadLibrary or FreeLibrary or some other operation that requires module bookkeeping. The PEB and its linked lists remain populated, queryable, and walkable — by the program itself, by injected code, by analysis tools.
We've now walked through every phase of the process between "user double-clicks hello.exe" and "main() runs." Here's the full sequence in one place, with the work split across kernel, ntdll loader, and program phases.
[Figure; start of caption missing] ... ntdll, does the relocation-imports-TLS work in the middle. The program's own code, starting with the CRT startup function, then ranging through C++ static constructors, and finally arriving at main, fills the bottom. Each pre-main execution window is marked. Everything from your hello.c source code in Part 1 to this point has been a structured transformation, and this is where it lands.

Several of the structures we've walked through are easy to inspect on a real binary. The standard tools are all already on a Windows development machine.
dumpbin /imports hello.exe dumps the Import Directory. You'll see one block per imported DLL, with each block listing the function names, ordinals, and (after binding) the resolved addresses. Run this against our hello.exe and you'll get exactly the KERNEL32.dll and msvcrt.dll blocks we walked through above.
dumpbin /exports kernel32.dll (or any DLL) dumps the Export Directory. You'll see the ordinal-to-name mapping in tabular form. Try it against ntdll.dll to see the 2,000+ exports it advertises.
dumpbin /relocations hello.exe dumps every entry in the Base Relocation Table. For our binary you'll see the same 4 blocks and 47 fixups we counted above, broken out by virtual address.
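For a sense of what that listing is decoded from, here is a sketch that parses raw base relocation blocks. Each block is an 8-byte header (page RVA, block size) followed by 16-bit entries whose top 4 bits are the type and low 12 bits the offset within the page. The function name and the two demo blocks are made up for illustration; they are not our binary's real .reloc contents.

```python
import struct

IMAGE_REL_BASED_ABSOLUTE = 0  # padding entry, not a real fixup

def parse_reloc_blocks(data: bytes) -> list[tuple[int, list[int]]]:
    """Return (page_rva, [fixup RVAs]) for each base relocation block."""
    blocks = []
    pos = 0
    while pos + 8 <= len(data):
        page_rva, block_size = struct.unpack_from("<II", data, pos)
        if block_size < 8 or pos + block_size > len(data):
            break  # zero-filled tail or malformed block: stop parsing
        count = (block_size - 8) // 2
        entries = struct.unpack_from(f"<{count}H", data, pos + 8)
        fixups = [
            page_rva + (entry & 0x0FFF)          # low 12 bits: page offset
            for entry in entries
            if (entry >> 12) != IMAGE_REL_BASED_ABSOLUTE  # top 4 bits: type
        ]
        blocks.append((page_rva, fixups))
        pos += block_size
    return blocks

# Hypothetical data: type 10 (DIR64) entries plus one padding entry.
demo = (
    struct.pack("<II4H", 0x3000, 16, 0xA010, 0xA018, 0xA020, 0x0000)
    + struct.pack("<II2H", 0x5000, 12, 0xA000, 0xA008)
)
parsed = parse_reloc_blocks(demo)
print(len(parsed), sum(len(fixups) for _, fixups in parsed))  # 2 5
```

Run against the real .reloc bytes of hello.exe, the same two counts should line up with the block and fixup totals that dumpbin reports.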
For inspecting a running process rather than a static binary, x64dbg is the canonical free choice. The crucial configuration step: open Options → Preferences → Events, and check "System breakpoint" and "TLS callbacks". With those enabled, x64dbg pauses before the entry point on every launched binary, so you can step through TLS callbacks and the loader's final transitions instead of skipping them. Process Hacker (now called System Informer) shows the PEB and loader linked lists directly — open any process, navigate to the General tab to see PEB → Ldr → InMemoryOrderModuleList as a tree, and to the Modules tab to see every loaded DLL with its base address and path.
For looking inside a DLL's export directory in even more detail than dumpbin offers, CFF Explorer and PE-bear both provide GUI inspection of every PE structure including the export directory's three-array layout.
If you don't have a Windows box to hand, most of these inspections are doable from WSL or Linux too. objdump -p hello.exe shows the import directory, and Python's pefile module (pip install pefile) walks every structure in the format. x64dbg-style live debugging requires Windows, but static structural inspection works anywhere.
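As an illustration of how little is needed for static inspection, the sketch below pulls three loader-relevant fields out of a PE32+ header using only Python's standard library. The offsets follow the PE32+ Optional Header layout (AddressOfEntryPoint at 16, ImageBase at 24, NumberOfRvaAndSizes at 108, data directories from 112, TLS at index 9). The function name is invented here, and the sketch deliberately rejects anything that is not PE32+.

```python
import struct

TLS_DIRECTORY_INDEX = 9

def read_pe_summary(data: bytes) -> dict[str, int]:
    """Extract entry point RVA, ImageBase, and TLS directory RVA (PE32+)."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    opt = pe_offset + 4 + 20  # skip the signature and 20-byte COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic != 0x20B:
        raise ValueError("this sketch handles PE32+ images only")
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)
    (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    (dir_count,) = struct.unpack_from("<I", data, opt + 108)
    tls_rva = 0
    if dir_count > TLS_DIRECTORY_INDEX:
        tls_rva, _size = struct.unpack_from(
            "<II", data, opt + 112 + TLS_DIRECTORY_INDEX * 8
        )
    return {"entry_rva": entry_rva, "image_base": image_base, "tls_rva": tls_rva}
```

Calling `read_pe_summary(open("hello.exe", "rb").read())` on a real binary is the intended use; a nonzero `tls_rva` is the cue to go looking for callbacks. For anything beyond these few fields, pefile is the more practical choice.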
We started in Part 1 with a 99-line C source file and four command-line tools. We end here with a fully running process whose every byte we can account for.
Part 1 traced the path from source code to executable. The preprocessor expanded macros and inlined headers. The compiler turned each C statement into machine instructions, leaving placeholder references for things it didn't yet know (external function addresses, string-constant locations). The assembler packed those instructions into .obj files with their metadata. The linker stitched the object files together, resolved internal references, and recorded external dependencies as imports that the loader would later satisfy.
Part 2 dissected the PE file format itself. The DOS stub, the PE signature, the COFF file header, the Optional Header with its 16 data directories, and the section headers and section bodies. Every field had a job; every byte had a purpose. We learned to convert between three coordinate systems (file offset, RVA, virtual address) and saw how the section table is the bridge between them.
Part 3 watched the kernel transform the file into a mapped image. NtCreateSection built a section object describing the image; NtMapViewOfSection wired its prototype PTEs into a new process's address space; the kernel applied per-section page protections. The image was now present in virtual address space but not yet in physical memory — demand-paging would bring pages in lazily as the program touched them, and copy-on-write would make writable pages private to each process that wrote to them. The mapping operation itself read almost nothing from disk.
Part 4 — this post — has filled in everything the user-mode loader does after the kernel returns. Relocations correct the absolute addresses ASLR invalidated. The recursive DLL loader walks the import directory, navigates the DLL search order, and maps each dependency. The import/export handshake walks both sides' tables to write resolved function pointers into the IAT. TLS callbacks and DllMain dispatch run pre-main initialization code (and pre-main anti-analysis tricks). Finally, the loader hands control to AddressOfEntryPoint, the CRT startup function does runtime initialization, C++ constructors fire, and main is called.
The picture across the four parts is unified. Every field in the PE format exists because something downstream — the linker, the loader, the kernel — needs to consume it. Every step of the loader's work has a corresponding piece of metadata in the file. The format isn't arbitrary; it's the structured language by which the compiler-and-linker pipeline communicates with the operating system about what kind of program this is and how to bring it to life.
Several substantial topics fell outside the scope we set:
Delay-loaded imports are an optimization implemented by the toolchain (the linker plus a small runtime helper) rather than by the OS loader: DLLs aren't loaded until first use. They appear in data directory entry 13 (Delay Import Descriptor) with a structure parallel to but distinct from the regular Import Directory; the runtime supplies thunk stubs that invoke LoadLibrary and GetProcAddress lazily. Worth a separate post.
Exception handling and SEH — the .pdata and .xdata sections we noted in Parts 2 and 3 are the runtime function table and unwind data that x86-64 uses for table-driven exception handling. The mechanism is rich enough to deserve its own treatment.
Packaged apps (UWP/MSIX) follow a different loader path with their own search order, container, and integrity model. The kernel mechanisms we walked through still apply, but the user-mode layer is significantly different.
The .NET CLR's secondary loader picks up where the native loader leaves off, doing JIT compilation, type loading, and assembly resolution for managed code. We touched on managed images in Part 3 but didn't dig into the CLR's own machinery.
Manifest-based DLL redirection, side-by-side assemblies, API sets, safe-mode loader paths, delayed CFG dispatch, image load notifications, EDR userland hooks, ETW-TI telemetry: each is its own world.
What we did cover is the load-bearing core: the four tools, the file format, the kernel mapping, and the user-mode loader. With that foundation, every more advanced topic builds on structures we've already walked through. The IAT you can hook is the same IAT the loader patches. The TLS callbacks anti-debug malware uses are the same TLS callbacks that legitimate code uses for per-thread setup. The PEB shellcode walks is the same PEB the loader bookkeeps and Process Hacker displays.
For deeper study: Solomon, Russinovich, and Ionescu's Windows Internals (now in its 7th edition, split into Parts 1 and 2) is the canonical reference for everything kernel-side. Matt Pietrek's classic articles Peering Inside the PE (1994, 2002) are still excellent on the file format. The Microsoft PE specification is the authoritative document for the format itself. Maurice Lambert's open-source PE loader implementations on GitHub are good study material for seeing the loader's algorithms in code. ReactOS mirrors much of the Windows loader's structure with readable C source. And the SANS Internet Storm Center, the Mandiant blog, and individual security researchers' write-ups regularly cover specific tricks, techniques, and CVEs that exercise the structures we've walked through.
If you wanted to actually write a PE loader — a reflective loader for in-process module loading, a manual mapper for analysis, a custom runtime for a research project — you now have the conceptual scaffolding. The implementation is mostly bookkeeping: walk the headers, apply the relocations, resolve the imports. The hard work is understanding what each structure is for, and you have that now.
The PE format is not arbitrary. Every field exists because the loader needs it. Now you can read the format and predict what the loader will do — and from there, understand the implications for malware analysis, reverse engineering, and exploit development.