What changes when 39 KB on disk becomes 68 KB in memory. Alignment, page permissions, and the surprising fact that Windows doesn't eagerly copy your executable into memory.
In Part 2 we took apart a PE file on disk. We can now point at any byte of hello.exe and name what it is — headers, section table, code, data, relocations, imports. We can convert between the three coordinate systems. We know which fields the loader reads when, and what each does. But everything we described was static: the file as bytes, not the program as a running process.
This part walks across the gap. The 39,424-byte file on disk becomes a 69,632-byte image in the process's virtual address space — 30,208 bytes larger. None of those extra bytes are new content from the linker; they're alignment padding and memory-only zero-fill. The mapped image starts out backed by the same file bytes on disk for everything that has on-disk content, but the image won't stay literally identical to the file forever: the loader patches relocations and IAT entries, and any writable section that gets modified diverges through copy-on-write. We'll see all of that in detail. Where does the initial 30 KB size difference come from?
The short answer is alignment. The PE format packs sections tightly on disk and spreads them out on page boundaries in memory, because the CPU enforces permissions at the page level and you can't make half a page executable. The long answer is more interesting, and the long answer is the subject of this post.
There's a second thing to know about the file-to-memory transformation. Most tutorials describe it as a copy operation: "the loader reads each section from disk and writes it to memory." That's a useful first mental model, but it's not how normal Windows image loading is implemented. Windows does not eagerly copy the entire executable into private memory. It creates an image section backed by the file, maps that section into the process, and lets the memory manager fault pages in on demand as the program touches them.
The result is that the image can be fully mapped in virtual address space without all of its bytes being resident in physical memory. A code path the program never executes may never be faulted in at all. Later loader writes — relocations, IAT patching, TLS setup — modify private copy-on-write pages rather than rewriting the original file-backed image for everyone.
By the end of Part 3 you'll know exactly what fills the 30 KB difference in our binary, exactly what page permissions every region of the running image gets, and which kernel data structures and mapping steps are involved. You'll also know why two processes that both load kernel32.dll share the same physical pages for its code — and what happens the moment one of them tries to modify a global variable.
The PE format stores two alignment values in the Optional Header, and they govern almost everything about the file-to-memory transformation. We met both in Part 2; here's why they're set the way they are.
FileAlignment in our binary is 0x200 — 512 bytes. Every section's PointerToRawData is a multiple of 512. Sections pack tightly on disk, with at most 511 bytes of padding between any two. The PE specification defines this field as a power-of-two value between 512 bytes and 64 KiB, with 512 as the default — small enough to keep alignment padding cheap and large enough to play nicely with the I/O stack on the path between the file and the CPU.
SectionAlignment in our binary is 0x1000 — 4,096 bytes, the size of one memory page on x86-64. Every section's VirtualAddress is a multiple of 4,096. Sections do not share pages with each other in the running image. If a section's actual content is 200 bytes, it still occupies a full 4 KB page in memory, with the remaining 3,896 bytes of that page belonging to that section but containing no useful content.
Two questions follow. Why doesn't the on-disk format also use 4 KB alignment, since that's what the running image needs anyway? And why do sections need to live on separate pages in memory at all?
The answer to the first question is space. Our hello.exe has ten sections, several of which are tiny: .tls is 16 bytes of real content, .CRT is 96 bytes, .reloc is 132 bytes. With FileAlignment = 0x200, each of these costs 512 bytes of disk space. With FileAlignment = 0x1000, each would cost 4,096 bytes. Across the whole file that would add roughly 26 KB (26,112 bytes), about a 66% size increase. Real applications with dozens of sections would pay a much larger tax. Disk space costs nothing in 2026, but disk I/O does: smaller files load faster, take less time to transfer over a network, and waste less of the disk cache. The format chose to pack tightly on disk and stretch in memory.
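The arithmetic behind that trade-off is just rounding up to a power-of-two boundary. A minimal sketch, using the three content sizes quoted above; `align_up` is a helper name chosen here for illustration, not something taken from the PE specification or any library:

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

# Real content sizes quoted in the text for three small sections.
content = {".tls": 16, ".CRT": 96, ".reloc": 132}

for name, size in content.items():
    on_disk_512 = align_up(size, 0x200)    # cost with FileAlignment = 0x200
    on_disk_4k = align_up(size, 0x1000)    # cost if FileAlignment were 0x1000
    print(f"{name:7} {size:4} bytes -> {on_disk_512} vs {on_disk_4k}")
```

Each of the three sections rounds to one 512-byte chunk in the first case and one full page in the second, which is the per-section difference the paragraph describes.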
The answer to the second question is permissions, and it deserves its own section.
Modern CPUs do not enforce memory permissions byte by byte. They enforce them in fixed-size chunks called pages. On x86-64, the standard page size is 4,096 bytes — the same value as SectionAlignment. (Larger pages exist; "huge pages" of 2 MB or 1 GB are used for performance in specific scenarios, but for ordinary application code the unit is 4 KB.) A page is the smallest unit at which the operating system can say "this is executable" or "this is read-only" and have the CPU honor it.
The mechanism is hardware-level. The CPU's memory management unit, the MMU, conceptually walks a per-process page table on every memory access, checking whether the page is mapped, whether the access type is allowed, and whether the right privilege level is in effect. In practice, the CPU caches recent translations in a Translation Lookaside Buffer (TLB), so a full page-table walk only happens on a TLB miss; accesses to recently used pages are resolved from the TLB with little or no added latency. The page table entries (PTEs) encode the permissions. If a program tries to write to a read-only page, the MMU traps before any byte hits memory; the kernel sees a page fault and decides what to do (typically: terminate the offending program with an access violation). If a program tries to execute an instruction from a page that isn't marked executable, the outcome is the same.
This is the core insight behind DEP (Data Execution Prevention) and W^X (write XOR execute) policies: stacks, heaps, and data sections get pages marked writable but not executable, so even if an attacker overflows a buffer and writes shellcode onto the stack, the CPU refuses to execute it. The PTE is enforced before the first byte of the would-be shellcode gets fetched.
The page-level granularity has a direct consequence for executable formats. If two regions of the same program want different permissions (code wants R+X, data wants R+W), they cannot share a page. Forcing them to share a page would require the OS to mark that page R+W+X, which is the worst case for security (an attacker who overwrites bytes anywhere on that page could execute them as code) and is something modern toolchains and Windows mitigation policies work to avoid for normal image-mapped pages.
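To make the page-sharing problem concrete, here is a toy calculation (not loader code) that lays out two regions and reports the union of permissions each 4 KB page would need. The section sizes and the `pages_needing` helper are hypothetical, chosen only for illustration:

```python
PAGE = 0x1000

def pages_needing(layout):
    """Map page number -> set of permissions required by regions touching it.

    layout is a list of (start_address, size, permissions) tuples.
    """
    needs = {}
    for start, size, perms in layout:
        first = start // PAGE
        last = (start + size - 1) // PAGE
        for page in range(first, last + 1):
            needs.setdefault(page, set()).update(perms)
    return needs

# Hypothetical sizes: 0x600 bytes of code followed by 0x200 bytes of data.
packed = [(0x1000, 0x600, "RX"), (0x1600, 0x200, "RW")]   # data right after code
aligned = [(0x1000, 0x600, "RX"), (0x2000, 0x200, "RW")]  # data on its own page

for label, layout in (("packed", packed), ("aligned", aligned)):
    for page, perms in sorted(pages_needing(layout).items()):
        print(label, hex(page * PAGE), "".join(sorted(perms)))
```

In the packed layout the single page ends up needing R, W, and X at once; in the aligned layout each page needs only what its own section asks for.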
In ordinary executable images like ours, SectionAlignment is the page size (4 KB on x86-64), so each section starts on its own page boundary. The linker computes RVAs to satisfy this constraint, padding between sections with as many zero bytes as needed to reach the next multiple of SectionAlignment. Once mapped, every section lives in its own page range, and the loader is free to apply per-section permissions to each range. The PE spec also allows SectionAlignment values smaller than a page in special cases (the spec requires only that SectionAlignment ≥ FileAlignment), but binaries with such alignments can't get per-section page protections from the OS, and they're vanishingly rare in modern Windows.
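The RVA assignment described above can be sketched in a few lines. This illustrates the rounding rule only and is not a real linker's algorithm; the section names and virtual sizes are made up, and the first section is assumed to start at RVA 0x1000, directly after a one-page headers region:

```python
SECTION_ALIGNMENT = 0x1000

def assign_rvas(sections, first_rva=0x1000):
    """Give each (name, virtual_size) the next SectionAlignment-aligned RVA."""
    rva = first_rva
    placed = []
    for name, virtual_size in sections:
        placed.append((name, rva))
        # Advance past this section, rounded up to whole pages (at least one).
        pages = max(1, -(-virtual_size // SECTION_ALIGNMENT))
        rva += pages * SECTION_ALIGNMENT
    return placed, rva  # rva now equals the total image size

# Hypothetical virtual sizes, for illustration only.
layout, size_of_image = assign_rvas(
    [(".text", 0x1A40), (".data", 0x90), (".rdata", 0xC8)]
)
for name, rva in layout:
    print(f"{name:7} RVA {rva:#07x}")
print(f"SizeOfImage {size_of_image:#x}")
```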
Here's the basic shape, illustrated with a generic image whose section sizes have been compressed to fit on the page. Our actual hello.exe follows the same pattern but spreads .text across seven pages instead of two; we'll see its full page-by-page map later in this post.
Our image is 30,208 bytes larger than our file. We can account for every byte of that difference. Let's tally.
The right way to think about it is: for each region of the image, how many bytes does it contribute that aren't in the file? Two things contribute. First, the trailing alignment padding on every section's last page — bytes in the section's page range that lie beyond SizeOfRawData. Second, sections whose VirtualSize exceeds SizeOfRawData entirely (the BSS case), where the whole memory region is bytes-not-in-file.
For our binary, the per-region memory-only contributions are:
headers padding 3,072 bytes (page is 4,096, file headers + pad fill 1,024)
.text trailing padding 1,024 bytes (page slack beyond SizeOfRawData)
.data padding to page 3,584 bytes
.rdata padding to page 512 bytes
.pdata padding to page 2,560 bytes
.xdata padding to page 2,560 bytes
.bss (entire section) 4,096 bytes (no on-disk content)
.idata padding to page 2,048 bytes
.CRT padding to page 3,584 bytes
.tls padding to page 3,584 bytes
.reloc padding to page 3,584 bytes
─────────────
total memory-only 30,208 bytes
That accounts for every byte of the file-to-image delta. Two things to notice. First, alignment padding dwarfs the actual .bss contribution by a factor of more than six (26,112 bytes of padding versus 4,096 bytes of .bss). In binaries with a lot of small sections (and our ten-section hello.exe is on the small end of typical), alignment padding becomes a significant fraction of the image. A binary with a hundred sections that each contained 200 bytes of real content would have 100 sections × roughly 3,900 bytes of padding each, almost 400 KB, of which about 350 KB (3,584 bytes per section, beyond the 512-byte file chunk) would exist only in memory. Second, the memory-only padding doesn't occupy file space at all. The file packs sections to FileAlignment = 0x200 boundaries, so a small section costs one 512-byte chunk on disk even though it occupies a full 4 KB page in memory. The PE format gets to have the dense disk layout and the spaced-out memory layout simultaneously, and the only thing that pays for it is the memory manager's mapping work, which is cheaper than an eager copy.
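The tally can be reproduced mechanically. In the sketch below, the raw sizes are not read from hello.exe; they are back-derived from the padding figures in the table above (pages occupied × 4,096, minus the listed padding), so treat them as assumptions consistent with the table rather than as dumped header values:

```python
PAGE = 0x1000

# (region, bytes present in the file, pages occupied in memory).
# Raw sizes are back-derived from the table in the text, not dumped from the binary.
regions = [
    ("headers", 1024, 1),
    (".text", 27648, 7),
    (".data", 512, 1),
    (".rdata", 3584, 1),
    (".pdata", 1536, 1),
    (".xdata", 1536, 1),
    (".bss", 0, 1),
    (".idata", 2048, 1),
    (".CRT", 512, 1),
    (".tls", 512, 1),
    (".reloc", 512, 1),
]

file_size = sum(raw for _, raw, _ in regions)
image_size = sum(pages * PAGE for _, _, pages in regions)
memory_only = {name: pages * PAGE - raw for name, raw, pages in regions}

for name, extra in memory_only.items():
    print(f"{name:8} {extra:6,} bytes")
print(f"file {file_size:,}  image {image_size:,}  delta {image_size - file_size:,}")
```

Under those assumed raw sizes, the three totals line up with the 39,424-byte file, the 69,632-byte image, and the 30,208-byte difference quoted throughout this part.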
One detail worth knowing: the bytes in the padding regions aren't undefined. The Windows loader explicitly zero-fills the trailing portion of each section's last page. A section with 200 bytes of content gets those 200 bytes mapped from the file, and the remaining 3,896 bytes of its page are zero. This zero-fill behavior ensures that nothing leaks from one process to another through stale page contents. (The closely related term "demand-zero" usually refers to entire pages that exist only in memory, like the .bss pages — pages that have no file backing at all, just a kernel guarantee that they materialize zero-filled on first access. We'll see that case in detail later.)
Checkpoint: up to now we've explained the layout difference — why the image is larger than the file, what fills the gap, and why page boundaries matter. The next few sections shift gears, from PE format into Windows memory-manager mechanism: section objects, demand paging, prototype PTEs, and copy-on-write. These are the data structures and mechanisms that make the mapping cheap.
Most explanations of executable loading describe a copy operation. The loader, the story goes, reads each section from disk and writes it into the allocated virtual memory region; once the copy is complete, control transfers to the entry point. This description is intuitive and almost always wrong.
Windows performs the file-to-memory transformation through a different mechanism, called memory-mapped files. The kernel does not allocate physical memory and copy bytes into it. Instead, it creates a data structure called a section object that records "this range of virtual address space is logically equivalent to these bytes on disk." It then sets up the process's page tables so that each virtual page in the image's range points at — but is not yet backed by — a corresponding chunk of the file. Until the program touches a page, no physical memory is consumed and no file I/O happens. When the program does touch a page, the CPU raises a page fault, and the kernel reads the relevant 4 KB chunk from disk on demand. This is called demand paging.
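The same idea is available to ordinary programs as a plain data mapping. The sketch below uses Python's standard `mmap` module on a temporary file. This is a flat file mapping, not the SEC_IMAGE mapping the loader uses, and whether any given page is resident at a given moment is up to the operating system. It only shows that bytes become readable through the mapping without an explicit read of the whole file:

```python
import mmap
import os
import tempfile

PAGE = 4096

# Build a 4-page scratch file in which every page starts with its own index.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) + b"\x00" * (PAGE - 1))

with open(path, "rb") as f:
    # Map the whole file read-only; no f.read() call is made.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching a byte in page 2 is what makes the OS supply that page.
    first_byte_of_page_2 = view[2 * PAGE]
    mapped_length = len(view)
    view.close()

os.remove(path)
print(mapped_length, first_byte_of_page_2)
```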
The consequence is striking. Code paths your program never executes never enter physical memory. The error-handling code your program would only run if a system call failed sits on the disk, indexed but unloaded, until the moment that failure actually happens. If your program runs and exits without that failure ever occurring, the bytes of that code path never leave the disk. The same applies to entire DLLs that get loaded but never called into, to .rdata strings the program never reads, to TLS initialization data for threads that never get spawned. The image is fully mapped — the virtual address space is fully reserved — but the physical memory footprint can be a small fraction of the image size.
At the API boundary, image mapping is built around two ideas: create an image section from the executable file, and map a view of that section into a process address space. In user-mode terms, the closest documented analogy is CreateFileMapping(..., SEC_IMAGE) followed by MapViewOfFile. In native terms, the related routines are NtCreateSection and NtMapViewOfSection; inside the kernel, the memory manager implements the details using section objects, control areas, subsections, and prototype PTEs. The exact call path during process creation is more complex than a user-mode app's manual call sequence — image mapping happens deep inside NtCreateUserProcess — but the two ideas above describe the shape of what's happening.
Step by step:
First, the kernel opens the PE file and creates a section object with the SEC_IMAGE attribute. SEC_IMAGE is special. It tells the kernel: this isn't an ordinary file you're going to map as a flat array of bytes; this is a Portable Executable, and I want you to interpret it as such. The kernel parses the PE headers right there, validates them, and uses the VirtualAddress, VirtualSize, SizeOfRawData, and Characteristics from each section header to build an internal data structure called a control area. The control area knows how to map each section to its on-disk location and what page permissions each section should get. Critically, the control area is shared across all processes that load the same image — if a hundred processes all import kernel32.dll, there's exactly one control area in the kernel for it.
Second, the kernel maps a view of the section object into the new process's address space. This reserves SizeOfImage bytes of contiguous virtual address space at ImageBase (or at a randomized base if ASLR moved it), and sets up the process's page table entries so that each page in the range points back to the control area's subsection records. At the end of the mapping step, the image is "mapped" in every sense the process can observe. The virtual addresses are valid, the page permissions are correct, the entry point's address is computable. But no section content has yet been read from disk into RAM.
Third (and this is the part most people don't realize is happening), when the CPU first tries to fetch an instruction from the entry point, the page table entry for that page is not present: its hardware valid bit is clear, which Windows calls an invalid PTE. The MMU traps, generating a page fault. The kernel's page-fault handler examines the PTE, sees that the page belongs to a section object with on-disk backing, computes which 4 KB chunk of the file corresponds to this virtual page, allocates a physical page, reads the chunk into it, updates the PTE to mark the page present, and dismisses the fault. The CPU restarts the original instruction fetch, which now succeeds. The program runs, never knowing any of this happened.
The data structure that makes shared image mapping efficient is the prototype PTE. A regular PTE — a page table entry in a process's page tables — is private to that process. It records the physical page backing one virtual page for one process. A prototype PTE is different: it's a PTE-shaped record that lives in the kernel's control area for the image, shared by every process that maps the image, and records the canonical "where this page should come from" information.
When the kernel maps an image into a process, the per-process PTEs are set to a special state that says "consult the prototype PTE for this virtual address." If the prototype PTE says the page is in physical memory, the per-process PTE is updated to point at the same physical page directly. If the prototype PTE says the page is still on disk, a page fault triggers the disk read, the physical page gets allocated, and the prototype PTE is updated — once, for all processes that share this image. Other processes that later fault on the same page find it already resident and just adopt the physical address.
The architectural benefit: a 100 MB DLL loaded by fifty processes does not consume 5 GB of RAM. The physical pages are shared. Each process pays the cost of its per-process page table entries (a few bytes per virtual page), but the physical pages backing the DLL's .text are loaded once and referenced fifty times. Two processes calling the same function in ntdll.dll are literally executing the same physical bytes; the MMU just translates their different virtual addresses to the same physical location.
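A toy model makes the sharing visible. Everything here is invented for illustration: `ControlArea`, `Process`, and the dictionary-based "page tables" are stand-ins, not the kernel's real structures. The model only tracks which fault triggers the one disk read and which ones simply adopt the already-resident page:

```python
class ControlArea:
    """Stand-in for the shared per-image record holding prototype PTEs."""

    def __init__(self, page_count):
        self.prototype = [None] * page_count  # None = still on disk
        self.disk_reads = 0
        self.next_frame = 0

    def resolve(self, page):
        """Return the physical frame for a page, reading it in once if needed."""
        if self.prototype[page] is None:
            self.disk_reads += 1              # hard fault: one read for everyone
            self.prototype[page] = self.next_frame
            self.next_frame += 1
        return self.prototype[page]


class Process:
    """Stand-in for a process whose private PTEs defer to the prototype."""

    def __init__(self, image):
        self.image = image
        self.ptes = {}                        # page -> frame, filled on first touch

    def touch(self, page):
        if page not in self.ptes:             # per-process fault
            self.ptes[page] = self.image.resolve(page)
        return self.ptes[page]


dll = ControlArea(page_count=8)
procs = [Process(dll) for _ in range(50)]
frames = {p.touch(3) for p in procs}          # all fifty touch the same page
print(dll.disk_reads, frames)
```

Fifty touches of the same page produce a single simulated disk read and a single frame number, which is the property the paragraph above relies on.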
Here's the structural picture of what's in the kernel when an image is mapped.
Here's what happens in detail the first time our running hello.exe touches a code page.
The total time per fault depends on whether the file's data is already in the system file cache and what kind of storage backs it. A soft fault — one satisfied from a page already resident in physical memory (perhaps because another process loaded the same DLL, or the kernel cached the file) — completes very quickly; the kernel just patches up the per-process PTE to share the existing physical page. A hard fault that requires a real disk read takes much longer, with the exact latency depending on storage type, file cache state, antivirus and EDR scanning, memory pressure, and the CPU. The important distinction isn't a specific timing number; it's that the page is materialized lazily, on first access.
Two related observations. First, the first time you run a large application after booting your machine, every page of code that runs has to come off the disk through a page fault. Subsequent runs reuse the file-cached pages, which is why "warm" launches feel snappier than "cold" launches. That observation is the visible surface of the demand-paging mechanism. Second, this is why working set size (the actual number of physical pages a process has resident at a given moment) is usually much smaller than the process's virtual address space. Tools like VMMap make this distinction explicit; Task Manager exposes related working-set and private-bytes numbers depending on which columns you choose. A process with a 100 MB virtual footprint might have a 12 MB working set if it never touched 88 MB of its mapped code.
It also explains a behavior that puzzles people new to Windows internals: why does opening a single Notepad window show "ntdll.dll" loaded? Because every Windows process maps ntdll.dll as part of its startup, but the actual physical pages backing ntdll.dll are shared by every process system-wide. Adding another Notepad doesn't double ntdll.dll's memory cost; it pays only for the per-process PTEs, a few kilobytes.
The picture so far has a puzzle in it. If every process that loads a DLL shares its physical pages, and writable global variables in .data are part of the mapped image, what stops one process from corrupting another process's globals? Process A and Process B both load kernel32.dll, both see its .data section at the same RVA, both can write to it — and yet they don't interfere with each other. How?
The mechanism is called copy-on-write. The initial page protection for writable image-mapped pages is not PAGE_READWRITE but PAGE_WRITECOPY (or, for executable sections that are also writable, PAGE_EXECUTE_WRITECOPY). The "WRITECOPY" suffix is a memory-manager notion rather than a hardware one. It tells the kernel: this page is currently shared and effectively read-only; on the first write attempt from any process, do something special.
The "something special" is straightforward. When a process attempts to write to a copy-on-write page, the MMU raises a page fault (because the page is currently marked not-writable in the per-process PTE). The kernel's page-fault handler sees the fault is for a writable-image page, allocates a fresh physical page, copies the contents of the original shared page into it, updates the writing process's per-process PTE to point at the new private copy, and promotes the permissions to plain PAGE_READWRITE. The instruction restarts and the write goes through, but it goes through to the process's private copy. Other processes that haven't written to that page still share the original.
The result is that writable image pages start out shared and silently become private to a process the first time that process modifies them. From the program's perspective, this is completely invisible. The program writes to its global variable; the value sticks; the variable is private to this process. The fact that the page was shared with thirty other processes a moment ago and is now private to this one is the kernel's business, not the program's.
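Python's standard `mmap` module exposes a comparable private copy-on-write mode for ordinary file mappings through `mmap.ACCESS_COPY`. It is an analogy rather than the image-section mechanism itself, but it shows the observable contract: the write lands in the mapping, and the backing file stays untouched.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"shared global" + b"\x00" * 4083)   # one 4,096-byte page

with open(path, "r+b") as f:
    # ACCESS_COPY: writes go to a private copy, never back to the file.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    view[0:6] = b"PRIVAT"
    seen_in_mapping = bytes(view[0:13])
    view.close()

with open(path, "rb") as f:
    seen_in_file = f.read(13)

os.remove(path)
print(seen_in_mapping, seen_in_file)
```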
A subtler case is executable sections that are also writable — a section with both MEM_EXECUTE and MEM_WRITE set in its Characteristics. The kernel maps those pages with PAGE_EXECUTE_WRITECOPY, the executable analog of PAGE_WRITECOPY. Pages stay shared and executable until something writes; the first write promotes them to plain PAGE_EXECUTE_READWRITE in the writing process. This is rare in production binaries because writable executable sections are a security smell — anything that writes to its own code is doing something unusual — but it exists, and tools like VMMap will show PAGE_EXECUTE_WRITECOPY if you happen to load a binary that has such a section.
There's a special case worth knowing about. The PE format defines an IMAGE_SCN_MEM_SHARED flag (0x10000000) that, when set in a section's Characteristics, asks the loader to make that section actually shared across processes, bypassing copy-on-write entirely. Writes from any process become visible to every other process that has mapped the image. It's how some legacy DLLs implemented inter-process data sharing before named shared memory became commonplace. The flag still works, but Microsoft's Security Development Lifecycle prohibits shared sections in shipping binaries: a writable-and-shared section is a cross-process write surface, useful to an attacker who can persuade their target to load the same DLL. SDL-compliant code reviews and secure build processes flag them as a vulnerability. If you encounter one in malware analysis, it's worth a hard look, because it may exist for the same reason the SDL forbids it.
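Spotting such sections is a simple bit test on the Characteristics field. In this sketch the flag values are the ones from the PE specification; the `is_shared_writable` helper and the sample Characteristics values are made up for illustration:

```python
IMAGE_SCN_MEM_SHARED = 0x10000000
IMAGE_SCN_MEM_WRITE = 0x80000000

def is_shared_writable(characteristics: int) -> bool:
    """True when a section is both writable and shared across processes."""
    wanted = IMAGE_SCN_MEM_SHARED | IMAGE_SCN_MEM_WRITE
    return (characteristics & wanted) == wanted

# Hypothetical values: an ordinary writable data section
# (initialized data, readable, writable), and the same section
# with the shared bit added.
ordinary = 0xC0000040
shared = ordinary | IMAGE_SCN_MEM_SHARED
print(is_shared_writable(ordinary), is_shared_writable(shared))
```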
The mapping story we've described is for native PE images — code that the CPU can execute directly. Managed .NET binaries (the ones whose Optional Header data directory 14, the COM Descriptor, is populated) have a different relationship with memory. The CIL bytecode inside a .NET image is mapped read-only into the process address space just like any other section content. But the CPU can't execute CIL directly; the bytes need to be just-in-time compiled to native machine code first.
When the CLR JIT compiles a method, the generated native code lives outside the original image mapping, in private runtime-managed memory allocated via VirtualAlloc. Depending on runtime version and configuration, those code pages may be writable during generation or patching and executable when run; modern runtimes increasingly avoid long-lived writable-and-executable pages, using protection transitions or separate writable/executable views. The important point for PE analysis is that the executed native code is not the original CIL bytes inside the mapped PE section. These regions show up in VMMap as private memory regions with no associated image, a distinctive signature that tools use to identify JIT-compiled code at runtime. Managed binaries on disk are mostly inert data; the actual running code is generated dynamically and lives outside the file-backed mapping.
The mapping mechanism described above handles ordinary sections cleanly: each section's pages get prototype PTEs pointing into the file. But three regions of a running image work differently, and they're worth naming explicitly.
The .bss section has SizeOfRawData = 0 in its section header. There are no on-disk bytes to map. When the kernel sets up the control area, the prototype PTEs for the .bss page range are marked demand-zero: they don't point at any file offset. On first access, the page-fault handler doesn't read from the disk — it allocates a fresh physical page, zero-fills it, and updates the PTE. The "file" backing for these pages is the implicit guarantee that uninitialized data starts as zero. Demand-zero pages are extraordinarily cheap: no I/O, no shared backing, just one zeroed page per first-touching process.
This is how C's promise about uninitialized static and global variables ("they start as zero") gets fulfilled. The compiler emits the variables into .bss, the linker computes the section size, the PE format records a non-zero VirtualSize with zero SizeOfRawData, and the loader arranges for the pages to materialize zero-filled on first use. No code anywhere needs to actually zero anything; the page just arrives that way.
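User-mode code can observe the same guarantee with an anonymous mapping, which has no file behind it at all. A small sketch using Python's standard `mmap` module; this is an analogy for demand-zero pages, not the loader's .bss path:

```python
import mmap

PAGE = 4096

# fileno -1 requests an anonymous mapping: memory with no backing file.
scratch = mmap.mmap(-1, PAGE)
all_zero = scratch[:] == b"\x00" * PAGE   # nothing has written to it yet
scratch[0:4] = b"\x01\x02\x03\x04"        # first write to the mapping
first_bytes = bytes(scratch[0:4])
scratch.close()
print(all_zero, first_bytes)
```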
The headers region (the first page or two of the mapped image, from RVA 0 to SizeOfHeaders) gets mapped read-only and is backed by the file in the ordinary way. The interesting fact about this region is that it's accessible at runtime. A program can read its own PE headers by looking at the bytes starting at its ImageBase. Windows provides GetModuleHandle(NULL) to retrieve the base address of the process's executable image, and ImageNtHeader() to navigate from there to the NT headers. Many runtime introspection techniques rely on this: anything that reflects on the binary's own structure, looks up its imports, or finds its resource section is reading bytes that the loader mapped at startup and that are still sitting in their original page. The rest of the headers page past the end of the section table (at 0x318 in our binary) is zero: the file itself supplies zero padding up to SizeOfHeaders at 0x400, and the range from 0x400 to the page boundary at 0x1000 is zero-filled when the image is mapped. The structures up to 0x318 are intact and readable.
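Walking from the image base to the end of the section table takes only a few fixed offsets. The sketch below parses a synthetic header buffer so that it runs anywhere; on Windows the same bytes could be read from the address returned by GetModuleHandle(NULL). The buffer's field values (e_lfanew = 0x80, a 240-byte optional header, ten sections) are assumptions chosen to be consistent with the 0x318 figure quoted above, not values dumped from hello.exe:

```python
import struct

def end_of_section_table(image: bytes) -> int:
    """Return the offset just past the last section header."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: NumberOfSections at +2, SizeOfOptionalHeader at +16.
    coff = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", image, coff + 2)
    (opt_size,) = struct.unpack_from("<H", image, coff + 16)
    section_table = coff + 20 + opt_size
    return section_table + 40 * num_sections  # each section header is 40 bytes

# Synthetic headers: only the fields this function reads are filled in.
buf = bytearray(0x400)
buf[0:2] = b"MZ"
struct.pack_into("<I", buf, 0x3C, 0x80)           # e_lfanew (assumed)
buf[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<H", buf, 0x80 + 4 + 2, 10)     # NumberOfSections
struct.pack_into("<H", buf, 0x80 + 4 + 16, 240)   # SizeOfOptionalHeader (PE32+)
print(hex(end_of_section_table(bytes(buf))))
```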
Trailing partial pages are the subtle case. A section whose VirtualSize isn't a multiple of SectionAlignment has its last page only partially filled with real content; the remaining bytes — from the end of VirtualSize to the next page boundary — are not in the file at all, but they belong to the section's page range. The Windows loader zero-fills those trailing bytes so that no stale data from elsewhere leaks into the process. The same goes for any whole pages that fall within the section's range but past SizeOfRawData on disk — which is exactly what happens with .bss when SizeOfRawData is zero, and can also happen when VirtualSize is otherwise larger than SizeOfRawData.
This gap matters to analysts because malware sometimes hides things in it. Some packers store unpacking-stub data in the gap between VirtualSize and the next page boundary, relying on the fact that disassemblers focused on VirtualSize won't see it but the actual runtime page still contains useful (or harmful) bytes. The Windows loader's zero-fill defeats that for honest binaries; packed binaries sometimes manipulate VirtualSize values to control exactly which bytes survive in memory.
In Part 2 we read the Characteristics bitfield of each section header. The loader's job is to translate those flags into the Windows page protection constants the kernel actually enforces. Here's the full translation table for our binary, based on each section's flag combination:
Section   Characteristics flags                               Page protection
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
headers   (loader-controlled, not from Characteristics)       PAGE_READONLY
.text     CNT_CODE | CNT_INITIALIZED_DATA | EXECUTE | READ    PAGE_EXECUTE_READ
.data     CNT_INITIALIZED_DATA | READ | WRITE                 PAGE_WRITECOPY → PAGE_READWRITE on first write
.rdata    CNT_INITIALIZED_DATA | READ                         PAGE_READONLY
.pdata    CNT_INITIALIZED_DATA | READ                         PAGE_READONLY
.xdata    CNT_INITIALIZED_DATA | READ                         PAGE_READONLY
.bss      CNT_UNINITIALIZED_DATA | READ | WRITE               PAGE_READWRITE (demand-zero, no file backing)
.idata    CNT_INITIALIZED_DATA | READ | WRITE                 PAGE_WRITECOPY → PAGE_READWRITE on first write
.CRT      CNT_INITIALIZED_DATA | READ | WRITE                 PAGE_WRITECOPY → PAGE_READWRITE on first write
.tls      CNT_INITIALIZED_DATA | READ | WRITE                 PAGE_WRITECOPY → PAGE_READWRITE on first write
.reloc    CNT_INITIALIZED_DATA | DISCARDABLE | READ           PAGE_READONLY (released after relocation)
A few things to notice. The .text section gets PAGE_EXECUTE_READ, never PAGE_EXECUTE_READWRITE. Ordinary modern compilers don't produce writable code sections; if you see one, you're probably looking at an obfuscated binary, a self-modifying loader, or a JIT compiler that uses an image-mapped scratch region. The .rdata, .pdata, and .xdata sections all get PAGE_READONLY because they contain read-only data; the read-only protection is what makes attempts to overwrite a string constant or an exception-handler table fail loudly.
The four file-backed writable sections (.data, .idata, .CRT, .tls) all start out as PAGE_WRITECOPY and become PAGE_READWRITE the first time each individual page is written; .bss is writable too, but it is demand-zero rather than copy-on-write. .idata is interesting here: it holds the Import Address Table (IAT), which the loader patches during startup. Each IAT entry gets overwritten by the loader with the resolved function pointer, which triggers the copy-on-write fault and gives each process its own private copy of the IAT page. (We'll dig into the IAT patching itself in Part 4.)
The .reloc section is special because of the IMAGE_SCN_MEM_DISCARDABLE flag. The Microsoft PE specification defines this flag as "the section can be discarded as needed" — permissive language that lets the loader free the section's physical backing once it has applied the relocation entries it contains. In practice, .reloc is read by the loader during startup, the fixups are applied, and the kernel is then free to reclaim those physical pages. The virtual address range itself usually remains reserved, but the working-set footprint of .reloc at steady state is near zero. The discardable flag is the format's way of saying "the loader needs this for startup but the process doesn't need it after that."
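The translation in the table above can be written as a small lookup. This sketch mirrors the table as printed; it is not the kernel's actual logic, and `initial_protection` and its `has_file_backing` parameter are names invented here. The flag values are the IMAGE_SCN_MEM_* constants from the PE specification:

```python
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

def initial_protection(characteristics: int, has_file_backing: bool = True) -> str:
    """Name the initial page protection for a section, following the table."""
    execute = bool(characteristics & IMAGE_SCN_MEM_EXECUTE)
    read = bool(characteristics & IMAGE_SCN_MEM_READ)
    write = bool(characteristics & IMAGE_SCN_MEM_WRITE)
    if write:
        # File-backed writable pages start shared (copy-on-write); the table
        # lists demand-zero pages such as .bss as plain read-write.
        kind = "WRITECOPY" if has_file_backing else "READWRITE"
        return f"PAGE_EXECUTE_{kind}" if execute else f"PAGE_{kind}"
    if execute:
        return "PAGE_EXECUTE_READ" if read else "PAGE_EXECUTE"
    return "PAGE_READONLY" if read else "PAGE_NOACCESS"

print(initial_protection(IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ))   # .text
print(initial_protection(IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE))     # .data
print(initial_protection(IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE,
                         has_file_backing=False))                       # .bss
```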
Here's the complete page-by-page map of hello.exe as it lives in memory after the kernel finishes mapping it.
hello.exe as it lives in memory, with its initial protection. .text is seven pages of executable code; everything else is one page each. Writable sections start as PAGE_WRITECOPY and become private on first write. The .bss page is demand-zero and has no file backing. .reloc is released once relocations are applied.
We can now sequence the events that take place between the moment you double-click hello.exe and the moment your screen says "hello." The mapping work we've been describing is one part of a longer process-creation pipeline; here's where it fits.
When Explorer asks Windows to run a program, the user-mode CreateProcess API issues a call to the kernel — NtCreateUserProcess on modern Windows. That kernel call is responsible for the whole process-creation pipeline, but the steps directly relevant to PE mapping are:
1. The kernel opens hello.exe as a file. It needs a handle that will outlive the call, because the running image will continue to be backed by this file for the life of the process.
2. The kernel creates a section object for the file with the SEC_IMAGE attribute. This is the operation that NtCreateSection exposes at the API boundary; inside NtCreateUserProcess the equivalent happens internally. The kernel parses the PE headers. If validation fails (wrong machine type, malformed headers, conflicting fields), the operation fails and the process is never created. If validation succeeds, the kernel builds a control area data structure describing the image, including one subsection record per PE section and one prototype PTE per page in the image.
3. The kernel chooses a virtual address range for the image. The range is SizeOfImage bytes long and starts either at ImageBase (if the binary asks for that exact address and the OS hasn't been told to randomize) or at a randomized base (if ASLR is in effect for this image). For our binary, with HIGH_ENTROPY_VA and DYNAMIC_BASE both set in DllCharacteristics, the kernel picks a randomized 64-bit address each run.
4. The kernel maps a view of the section into that range, the operation exposed as NtMapViewOfSection at the API boundary. This is the moment the image becomes "loaded" from the operating system's bookkeeping perspective. The kernel walks the section table and installs per-process PTEs for every page in the image's range, with each PTE pointing at the corresponding prototype PTE. It also applies per-section page protections: PAGE_EXECUTE_READ for .text, PAGE_WRITECOPY for .data, and so on.
At this point, the image is mapped but it can't run yet. The next stages of process creation handle relocations (if the actual base address differs from the preferred ImageBase), import resolution (filling in the IAT with addresses from the DLLs the binary imports), TLS initialization, and so on. Those stages happen partly in the kernel, partly in user mode: the user-mode loader, ntdll!LdrInitializeThunk and related routines, finishes the job that the kernel started. This is the territory of Part 4.
The thing to take away from this part is that the mapping operation is, in concrete terms, two ideas: create an image section from the file and map a view of it into the process. Those two operations correspond to the NtCreateSection and NtMapViewOfSection Native API surface and to kernel data structures like the control area and prototype PTEs. The PE format and the kernel are co-designed: every field we walked through in Part 2 has a purpose at one of these two steps. The user-mode loader's job — the topic of Part 4 — is to finish the work after the kernel has done the structural mapping.
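A loose user-mode analogy for "create a section, then map a view" is Python's mmap module, which on Windows is built on file-mapping objects and views. The sketch below maps an ordinary data file, not a SEC_IMAGE section, so it shows only the general shape of the idea; the helper name first_bytes_via_mapping and the temporary stand-in file are made up for this example.

```python
import mmap
import os
import tempfile


def first_bytes_via_mapping(path: str, count: int) -> bytes:
    """Return the first `count` bytes of a file by mapping it, not read()ing it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the view read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[:count])


# Hypothetical stand-in file: an "MZ" signature followed by zero padding.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"MZ" + b"\x00" * 4094)

signature = first_bytes_via_mapping(demo_path, 2)
os.remove(demo_path)
```

No explicit read call touches the file contents here; the bytes arrive because the program indexed into the mapped view.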
Unlike Parts 1 and 2, the experiments in this part require an actual Windows system — the things we've described (page protections, working sets, shared bits, COW state) only exist on a running Windows process. The most informative free tools are VMMap and Process Hacker (now System Informer).
VMMap (from Microsoft Sysinternals) shows the virtual address space of a running process in a hierarchical view: total committed memory, mapped images, image sections within each image, and the protection of every region. To use it, launch any program (Notepad will do), open VMMap, and select the process. The "Image" rows show every mapped DLL and EXE; expanding one shows the per-section breakdown, with protections that follow the same pattern we tabulated for our hello.exe. The "Details" pane shows working-set information for each region, which is how you can see what is actually resident in physical memory versus merely mapped.
Process Hacker / System Informer is a more general-purpose process inspector with a "Memory" tab on every process's properties dialog. It shows the same information VMMap shows but with the option to right-click any region and read or write its memory directly. It also displays the Shared column, which is the copy-on-write status we discussed — pages that haven't yet been written show as Shared; pages that have been COW'd to private copies show as Private.
For an even closer look at the kernel side, WinDbg in kernel-debugging mode lets you inspect control areas and prototype PTEs directly. The !ca extension dumps a control area; !pte walks the page tables for a virtual address; !vad lists the Virtual Address Descriptors that describe the mapped image ranges. This is more work to set up — you need a kernel-debugger connection — but it shows the data structures we described in this part as actual bytes in kernel memory.
One specific thing worth trying: launch the same program twice from two separate console windows, then look at the Shared column in Process Hacker's Memory tab for both instances. The pages backing each instance's mapped image will mostly be marked Shared. Modify a global variable in one instance (if you can — most ordinary programs don't expose this) and watch the corresponding page transition from Shared to Private. This is copy-on-write in action.
If you don't have a Windows machine handy, the next best thing is to study screenshots of these tools online; the Sysinternals documentation and Hunt & Hackett's blog have several walkthroughs. The concepts transfer, even if you can't poke at the bytes directly.
You now know what the file-to-memory transformation actually is. The 39 KB file becomes a 68 KB image not by getting larger but by getting spread out: alignment padding fills the gaps between sections, .bss appears as a zero-filled region with no on-disk presence, and the headers region rounds up to a page. None of those extra bytes are new content. They're just the consequence of a layout that's optimized for page-level permission enforcement.
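The "spread out, not larger" arithmetic can be sketched in a few lines. Everything in this example is an assumption made for illustration: the alignment values are common defaults, and the section sizes are invented rather than read from hello.exe.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


FILE_ALIGNMENT = 0x200      # common linker default, assumed here
SECTION_ALIGNMENT = 0x1000  # one 4 KB page, assumed here

# Hypothetical sections: (name, VirtualSize, bytes of real content on disk).
# These numbers are invented for illustration, not taken from hello.exe.
sections = [
    (".text", 0x1234, 0x1234),
    (".data", 0x0300, 0x0100),  # partly zero-fill: larger in memory than on disk
    (".bss", 0x0800, 0x0000),   # memory-only: no on-disk bytes at all
]
HEADERS_SIZE = 0x280  # hypothetical combined size of all headers

file_size = align_up(HEADERS_SIZE, FILE_ALIGNMENT)
image_size = align_up(HEADERS_SIZE, SECTION_ALIGNMENT)
for name, virtual_size, raw_bytes in sections:
    file_size += align_up(raw_bytes, FILE_ALIGNMENT)
    image_size += align_up(virtual_size, SECTION_ALIGNMENT)
```

For this toy layout the file comes to 0x1A00 bytes and the image to 0x5000 bytes, with the difference made up entirely of padding and zero-fill. The real totals in this post have the same shape: 69,632 is exactly 17 pages of 4,096 bytes, and 39,424 is an exact multiple of 512.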
The mapping itself is not a copy. Windows creates a section object, arranges for the process's pages to resolve through prototype PTEs that describe how to fetch each page from the file on demand, applies per-section protections, and then returns. The actual bytes don't enter physical memory until something tries to access them. Code paths your program never executes never enter physical memory. DLLs loaded by many processes share their pages. Pages in writable sections start out shared and become private, one page at a time, through copy-on-write on first write. The PE format and the Windows memory manager are co-designed to make all of this nearly free.
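Copy-on-write can be observed from user mode with an ordinary file mapping. The sketch below is an analogy rather than image loading itself: it uses Python's mmap with ACCESS_COPY on a temporary data file (a hypothetical stand-in for one page of a writable section) and checks that a write changes the in-memory view while the backing file stays untouched.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a writable section: one page of zero bytes on disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    # ACCESS_COPY: writes change this mapping's memory but never the file.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as view:
        view[0:4] = b"\xde\xad\xbe\xef"  # the write lands in a private copy
        in_memory = bytes(view[0:4])

# Re-read the file through a normal handle to see what is really on disk.
with open(path, "rb") as f:
    on_disk = f.read(4)

os.remove(path)
```

After the write, in_memory holds the new bytes while on_disk is still four zero bytes, which is the same divergence a written .data page goes through.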
But this mapped image still can't run. Several things are missing. If the binary didn't load at its preferred ImageBase, the absolute addresses baked into the code by the linker are wrong, and the Base Relocation Table — the .reloc section we mapped earlier — has to drive a sweep through the image patching them. The Import Address Table still contains unresolved thunk entries: the loader has to walk the import descriptors, load each required DLL (recursively), resolve each imported function, and overwrite the IAT entries with real function addresses. TLS callbacks registered in the .tls section need to run before the entry point. The DLL search order has to be followed correctly so that dependencies resolve to the intended modules rather than attacker-controlled lookalikes earlier in the search path. None of that has happened yet.
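As a preview of the relocation sweep, here is a minimal sketch of the core arithmetic only: add the load delta to each 64-bit absolute address. It does not parse a real .reloc section; the function name apply_relocations, the 16-byte toy "image", and both base addresses are assumptions chosen for illustration.

```python
import struct


def apply_relocations(image: bytearray, offsets, preferred_base: int,
                      actual_base: int) -> None:
    """Add the load delta to each 64-bit absolute address stored at `offsets`."""
    delta = actual_base - preferred_base
    for offset in offsets:
        (value,) = struct.unpack_from("<Q", image, offset)
        # Mask to 64 bits so a negative delta wraps like machine arithmetic.
        patched_value = (value + delta) & 0xFFFFFFFFFFFFFFFF
        struct.pack_into("<Q", image, offset, patched_value)


# Hypothetical values: a 16-byte "image" holding one absolute pointer at offset 8.
PREFERRED_BASE = 0x140000000
ACTUAL_BASE = 0x7FF6A0000000
image = bytearray(16)
struct.pack_into("<Q", image, 8, PREFERRED_BASE + 0x3000)  # pointer to RVA 0x3000

apply_relocations(image, [8], PREFERRED_BASE, ACTUAL_BASE)
(patched,) = struct.unpack_from("<Q", image, 8)
```

After the sweep, patched equals ACTUAL_BASE + 0x3000: the pointer still refers to the same RVA, now expressed relative to where the image actually landed.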
That's Part 4: the loader's actual work, after the kernel has done the structural mapping.