The File-to-Memory Stretch — PE Loading, Part 3

What changes when 39 KB on disk becomes 68 KB in memory. Alignment, page permissions, and the surprising fact that Windows doesn't eagerly copy your executable into memory.

In Part 2 we took apart hello.exe on disk. We can name every byte of it — headers, section table, code, data, relocations, imports. We can translate between the three coordinate systems (file offset, RVA, virtual address). What we have not done is watch the file become a running program.

That transformation is stranger than it looks. A 39 KB file on disk becomes a 68 KB image in memory — bigger, somehow, even though the operating system has not added any code or data. Stranger still, the OS does not really copy the file into memory; it sets up some bookkeeping and lets the bytes come in lazily, only when the program touches them. And then there is a small magic trick that lets twenty processes share the same loaded DLL without stepping on each other's writable globals.

Three facts carry the whole story. On disk, sections pack tightly — at 512-byte boundaries, to save space. In memory, they spread out — to 4 KB boundaries, so the CPU can enforce permissions per page. And the mapping is lazy — bytes don't enter physical RAM until the program touches them.

By the end of this post you'll understand all three, you'll know exactly where the extra 30 KB comes from, and you'll have a mental model of what Windows is actually doing when you double-click a binary.

Two layouts, one binary

The PE format stores two alignment values in the Optional Header, and almost everything about the file-to-memory transformation follows from them.

FileAlignment in hello.exe is 0x200 — 512 bytes. Every section's content starts at a multiple of 512 on disk. Sections pack tightly, with at most 511 bytes of padding between any two.

SectionAlignment is 0x1000 — 4,096 bytes, the size of one memory page on x86-64. Every section's in-memory address is a multiple of 4 KB. Sections don't share pages. If a section's real content is 16 bytes (yes, really, some are), it still occupies a full 4 KB page in memory.

On disk, space matters. Our binary has ten sections, and several are tiny. .tls holds 16 bytes of real content. .CRT holds 96. .reloc holds 132. With 4 KB alignment on disk, each of those would balloon to 4,096 bytes. Across ten sections that's a 65% file-size penalty for no functional gain. The format picked the smaller alignment for disk.

In memory, permissions matter. The CPU only enforces memory permissions in 4 KB chunks (we'll see why in "Pages and permissions" below). For each section to get its own permission set (.text executable, .data writable, .rdata read-only), each section has to live on its own pages. The format picked the larger alignment for memory.

The same binary, two layouts. Same code, same data — packed tight in one form, spread out in the other.
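
Both layouts come from the same one-line rounding rule, applied with two different alignment values. Here is a minimal sketch in Python; the helper name align_up is my own and not part of the PE format, and it assumes the alignment is a power of two, which both of hello.exe's values are.

```python
FILE_ALIGNMENT = 0x200      # 512 bytes, from hello.exe's Optional Header
SECTION_ALIGNMENT = 0x1000  # 4,096 bytes, one x86-64 page


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


# The same 16 bytes of real content, laid out two ways.
print(align_up(16, FILE_ALIGNMENT))     # 512  (space taken on disk)
print(align_up(16, SECTION_ALIGNMENT))  # 4096 (space taken in memory)
```

The only thing that differs between the two calls is the alignment argument, which is the whole point: one binary, two rounding rules.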

Here is the section table from hello.exe, straight from the file. Every number we work with in this post comes from this table; I'll keep pointing back to it.

Four numbers per section. VirtualSize is how much real content the section has. VirtualAddress is where it lives in memory (always page-aligned — every value ends in three hex zeros). SizeOfRawData is how much of the file the section occupies (rounded up to FileAlignment). PointerToRawData is the section's file offset.
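
If you want to pull those four numbers out of a file yourself, each section-table entry is a fixed 40-byte record. The sketch below parses one with Python's struct module. The entry it parses is hypothetical: the two sizes match what this post says about .tls, but the VirtualAddress, PointerToRawData, and Characteristics values are made up for illustration, and parse_section_header is my own helper name.

```python
import struct

# IMAGE_SECTION_HEADER: 8-byte name, six 32-bit fields, two 16-bit counts,
# and a 32-bit Characteristics field, all little-endian. 40 bytes in total.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def parse_section_header(raw: bytes) -> dict:
    """Return the name and the four layout fields of one section-table entry."""
    (name, virtual_size, virtual_address, size_of_raw_data,
     pointer_to_raw_data, _ptr_relocs, _ptr_linenums,
     _num_relocs, _num_linenums, _characteristics) = SECTION_HEADER.unpack(raw)
    return {
        "Name": name.rstrip(b"\0").decode("ascii"),
        "VirtualSize": virtual_size,
        "VirtualAddress": virtual_address,
        "SizeOfRawData": size_of_raw_data,
        "PointerToRawData": pointer_to_raw_data,
    }


# Hypothetical entry: sizes as described for .tls, addresses invented.
raw = SECTION_HEADER.pack(b".tls", 16, 0xA000, 0x200, 0x8000,
                          0, 0, 0, 0, 0xC0000040)
header = parse_section_header(raw)
print(header)
```

In a real file you would read NumberOfSections of these records back to back, starting right after the Optional Header.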

One thing to notice immediately: .bss has SizeOfRawData = 0 and PointerToRawData = 0. It exists in memory but takes zero bytes on disk. We'll come back to why.

The picture below shows the two layouts side by side, drawn at the same scale, so you can see what the format is actually trading.

What's a process, anyway?

Before we go further, the words process and address space are going to do a lot of heavy lifting. Let's pin them down.

A process is a running instance of a program. When you double-click hello.exe, Windows creates a process to run it. If you double-click it again while the first one is still running, you get a second process — same code on disk, two independent runs, each with its own state. Two Notepad windows on your screen, two Chrome tabs in separate processes, two terminal sessions: every one of those is a separate process.

Each process gets its own private virtual address space — a range of memory addresses the program can refer to. The "virtual" matters. The addresses look real to the program (you can write a pointer, print its value, follow it), but they're not RAM addresses. The hardware translates each virtual address to wherever the operating system has decided to put the real bytes, and the OS can put them anywhere — or nowhere, until the program actually touches them. Each process's address space is its own. The address 0x140001000 in Process 1 has nothing to do with the address 0x140001000 in Process 2; they're two unrelated locations in two unrelated worlds.
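
A toy model makes the "two unrelated worlds" point concrete. The sketch below gives each process its own lookup table from virtual page number to physical frame number. The frame numbers are invented, and a real x86-64 page table is a multi-level hardware structure rather than a Python dict; only the idea of per-process translation carries over.

```python
PAGE_SIZE = 0x1000

# Hypothetical per-process tables: virtual page number -> physical frame number.
page_table_process_1 = {0x140001: 0x2A7}
page_table_process_2 = {0x140001: 0x913}


def translate(page_table: dict, virtual_address: int) -> int:
    """Translate a virtual address using one process's own table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page_number] * PAGE_SIZE + offset


# Same virtual address, two different physical locations.
print(hex(translate(page_table_process_1, 0x140001000)))  # 0x2a7000
print(hex(translate(page_table_process_2, 0x140001000)))  # 0x913000
```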

A thread is a single line of execution inside a process. Every process has at least one — the "main" thread that runs the program's main function. Programs can spawn more threads to do work in parallel. All threads inside the same process share that process's address space; they look at the same memory, the same globals, the same mapped image. We won't lean on threads much in this post, but they matter in Part 4.

From here on, when the post says "the process's address space" or "Process A and Process B," this is what it means.

Five more words we'll lean on

With process, address space, and thread established, five more terms deserve to be pinned down. They're the words I'll use over and over for the rest of this post, and getting them straight up front saves confusion later.

A page is the smallest chunk of memory the CPU can manage as a unit. On x86-64 a page is 4,096 bytes (= 0x1000 = 4 KB). The CPU sets permissions per page: a page can be read-only, or readable and executable, or readable and writable. It can't be "executable in the first half and read-only in the second half" — the granularity is the whole page.

One Windows-specific quirk worth knowing: Windows enforces permissions at 4 KB page granularity, but it allocates virtual address space at a coarser 64 KB granularity. VirtualAlloc, MapViewOfFile, and the image-mapping path all round their base addresses down to a 64 KB boundary. This is why preferred image bases like 0x140000000 are always 64 KB-aligned, and why ASLR picks new bases that are also 64 KB-aligned — even though the page-level protections inside the mapped image still change every 4 KB. You'll see the 64 KB number appear in tools like VMMap as the "allocation base," distinct from the 4 KB pages inside it.
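
The two granularities are easy to see with a little bit masking. This is only a sketch of the arithmetic, not of anything Windows exposes: the address is a made-up example and round_down is my own helper name.

```python
PAGE_SIZE = 0x1000                # 4 KB: granularity of page protections
ALLOCATION_GRANULARITY = 0x10000  # 64 KB: granularity of address-space allocation


def round_down(address: int, granularity: int) -> int:
    """Round address down to a multiple of granularity (a power of two)."""
    return address & ~(granularity - 1)


address = 0x140012345  # hypothetical address somewhere inside a mapped image
print(hex(round_down(address, PAGE_SIZE)))               # 0x140012000
print(hex(round_down(address, ALLOCATION_GRANULARITY)))  # 0x140010000

# The preferred base mentioned above already sits on a 64 KB boundary.
print(0x140000000 % ALLOCATION_GRANULARITY == 0)         # True
```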

A page is mapped in a process when the process has a valid virtual address for it. Being mapped is a bookkeeping fact: the OS has reserved that address and recorded what should appear there. It doesn't mean the bytes are in RAM. They might still be on disk; the OS just knows where to find them.

A mapped page is resident when its contents have actually been brought into physical RAM. Resident pages cost real memory; mapped-but-not-resident pages don't. The collection of pages a process has resident at this moment is its working set — usually a small fraction of the total mapped size.

And copy-on-write is the mechanism that lets several processes share the same physical page of memory as long as none of them writes to it. The first time any one process tries to modify the page, the kernel quietly gives that process its own private copy and lets it write to that. The other processes keep sharing the original. We'll cover this in detail in its own section.

Here is the single most important thing to take from this vocabulary list: mapped size is not RAM usage. The 68 KB image we keep mentioning (69,632 bytes, to be exact) is the size of hello.exe's mapped virtual address space. The actual RAM that a running hello.exe uses for its image is almost always less, sometimes much less, because most of the pages get pulled into RAM only on demand. Internalize this distinction now and the rest of the post will land more cleanly.

Pages and permissions

Now we can answer the question that the alignment section left dangling. Why must sections live on their own pages?

Because the CPU enforces permissions per page, and only per page. A page can be readable, or readable and executable, or readable and writable — but not all three at once for ordinary code (the combination has a name, "RWX," and it's a known security smell). Permissions are encoded in the page table, the hardware data structure the CPU consults on every memory access. If a program tries to write to a read-only page, the CPU traps the write before any byte actually moves. If it tries to execute an instruction from a page that isn't marked executable, the CPU traps that too. This is the foundation of modern memory safety: the hardware refuses to do the wrong thing.

Now apply that constraint to a PE file. .text wants to be executable and read-only. .data wants to be writable. .rdata wants to be plain read-only. If two of those sections shared a page, the OS would have to grant the page both sets of permissions — which is exactly the dangerous combination the format is trying to avoid. So sections don't share pages. The linker pads every section out to the next 4 KB boundary, and the kernel applies one permission set to each page range. Simple, mechanical, and the entire reason for SectionAlignment = 0x1000.

Where the 30 KB comes from

We can now answer the first puzzle. The file is 39,424 bytes; the in-memory image is 69,632 bytes. The difference is 30,208 bytes. Where does it come from?

The answer has two parts. Most of it is alignment padding: each section's content gets rounded up to fill its 4 KB page (or pages), and the unused tail is zero. A small remainder comes from .bss — that one section with SizeOfRawData = 0 we noticed earlier. .bss exists only in memory.

The picture below makes both ideas concrete. For each section, it shows three things drawn to scale: the real content (in dark color), any file-alignment slack already on disk (medium shade), and the memory-only padding the loader has to add to bring the section up to its page boundary (pale, dashed outline). The number on the right is that section's contribution to the 30,208-byte total. Read row by row.

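Most of the 30 KB is alignment padding, not memory-only content. Of the 30,208 bytes, only the 4,096-byte .bss section is a memory-only region. The other 26,112 bytes are padding: the zero-filled space that rounds the headers and each file-backed section up from their on-disk size to a 4 KB boundary.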

Tiny sections pay disproportionately. .tls has 16 bytes of real content but occupies a full 4 KB page; same for .CRT with its 96 bytes, and .reloc with its 132. Each of these three sections individually contributes more memory-only bytes than .text, which holds 27,496 bytes of actual code. That's the cost of being able to set permissions per page. Real applications with dozens of sections pay this tax many times over — but on a 64-bit machine with terabytes of virtual address space, nobody cares.
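
You can check the "tiny sections pay more" claim with a few lines of arithmetic. The sketch below uses the content sizes quoted in this post and assumes each section's SizeOfRawData is simply its content size rounded up to FileAlignment; the helper names are mine.

```python
FILE_ALIGNMENT = 0x200
SECTION_ALIGNMENT = 0x1000


def align_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) & ~(alignment - 1)


def memory_only_padding(content_size: int) -> int:
    """Bytes a section occupies in memory beyond what it occupies on disk."""
    on_disk = align_up(content_size, FILE_ALIGNMENT)      # assumed SizeOfRawData
    in_memory = align_up(content_size, SECTION_ALIGNMENT)
    return in_memory - on_disk


for name, size in [(".text", 27_496), (".tls", 16), (".CRT", 96), (".reloc", 132)]:
    print(name, memory_only_padding(size))
# .text 1024
# .tls 3584
# .CRT 3584
# .reloc 3584

file_size, image_size = 39_424, 69_632
difference = image_size - file_size   # 30,208 bytes in total
padding = difference - 4_096          # 26,112 bytes once .bss is set aside
print(difference, padding)            # 30208 26112
```

The 27,496-byte .text section adds 1,024 memory-only bytes; each of the three tiny sections adds 3,584.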

One useful detail: the padding bytes aren't garbage. The Windows loader zero-fills the trailing portion of each section's last page, so the gaps appear as zeros in the running image. Nothing from elsewhere leaks in.

Mapping, not copying

The intuitive answer — the one most tutorials give — is that the loader reads each section from disk and writes it into memory. That intuition is wrong. Windows doesn't copy the file into memory at load time. It does something stranger and cheaper.

The kernel sets up a mapping. It tells itself, in effect: "the bytes for these virtual addresses are at these file offsets — when somebody asks, fetch them." That's all. No section-body pages are eagerly copied into the process. The loader and memory manager read only the metadata they need to create the image section (headers, the section table, signature checks); executable and data pages get faulted in lazily as the program touches them. The image's virtual address range is reserved, the page permissions are configured, the entry point's address is computable — but the bulk of the actual bytes haven't moved yet.

The bytes come in lazily. The first time the program tries to execute an instruction or read from a memory address, the CPU finds that the page isn't in RAM and raises a page fault; the kernel reads 4 KB from the file into a fresh physical page, and the program continues. This is called demand paging. It's the third of our three founding facts.
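
Demand paging is easier to believe once you have seen it in miniature. The class below is a toy model of my own, not a picture of the Windows memory manager: the "file" is an in-memory byte string already laid out like the image, and a "page fault" is just a dictionary miss. What it does capture is the separation between being mapped and being resident.

```python
PAGE_SIZE = 0x1000


class LazyMapping:
    """Toy model: records where pages come from, loads none of them up front."""

    def __init__(self, backing: bytes, base: int):
        self.backing = backing
        self.base = base
        self.page_count = (len(backing) + PAGE_SIZE - 1) // PAGE_SIZE
        self.resident = {}  # page index -> bytes that are "in RAM"
        self.faults = 0

    def read_byte(self, address: int) -> int:
        index, offset = divmod(address - self.base, PAGE_SIZE)
        if not 0 <= index < self.page_count:
            raise ValueError("address is not mapped")
        if index not in self.resident:  # the toy "page fault"
            self.faults += 1
            start = index * PAGE_SIZE
            page = self.backing[start:start + PAGE_SIZE]
            self.resident[index] = page.ljust(PAGE_SIZE, b"\0")
        return self.resident[index][offset]


image = bytes(range(256)) * 272  # 69,632 bytes of stand-in content, 17 pages
mapping = LazyMapping(image, base=0x140000000)
print(mapping.page_count, len(mapping.resident))  # 17 0  (mapped, nothing resident)

mapping.read_byte(0x140001000)  # first touch: one fault, one page loaded
mapping.read_byte(0x140001005)  # same page: no new fault
print(mapping.faults, len(mapping.resident))      # 1 1
```

Seventeen pages are mapped the moment the object exists; only the one page the "program" touched ever becomes resident.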

Two operations, not one

The mapping is actually built from two distinct steps, and it pays to see them separately.

Step 1: create an image section. The kernel opens hello.exe, parses its headers, and builds a record that describes the file as an executable image: where each PE section starts in the virtual layout, how big it is, what page permissions it needs. This record lives in the kernel and describes the file. It's not attached to any particular process yet. Windows calls this record an image section. (In the Windows kernel it's a section object created through NtCreateSection with the SEC_IMAGE flag; the data structures it owns are called a control area and a set of prototype PTEs — one per page. You don't need to remember those names to follow this post, but you'll see them in debuggers and in Windows Internals.)

Step 2: map a view of that section. Now the kernel takes the image section and attaches it to a specific process — installs the page-table bookkeeping in that process's address space so the process can refer to the image by virtual addresses. Windows calls this attachment a map view: a view of the image section, seen from inside one process. (From user-mode code this is what MapViewOfFile does; under the hood it's the syscall NtMapViewOfSection.)

Why split it in two? Because a single image section can be mapped into many processes at once. The kernel creates one image-section record per loaded file, and every process that loads that file gets its own map view onto the same record. That's the trick that makes DLL sharing cheap. The first process to load kernel32.dll causes the kernel to build the image section. Every subsequent process just gets a map view onto the existing record: no re-parsing, no duplicate state, and (because the file is the source of truth for all of them) no duplicate RAM either. We'll see in "Sharing without conflict" what happens when one of those processes wants to write.
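
The one-record, many-views relationship can be sketched in a few lines. Everything here is a toy of my own invention (ToyKernel, ImageSection, the base address); the real objects are the section object, control area, and prototype PTEs named above. The sketch only shows the shape of the relationship: step 1 happens once per file, step 2 happens once per process.

```python
class ImageSection:
    """Toy stand-in for the kernel's per-file record (step 1)."""

    def __init__(self, path: str):
        self.path = path
        self.view_count = 0


class ToyKernel:
    def __init__(self):
        self._sections = {}  # path -> ImageSection, one per loaded file

    def create_image_section(self, path: str) -> ImageSection:
        # Step 1: build the record the first time, reuse it afterwards.
        if path not in self._sections:
            self._sections[path] = ImageSection(path)
        return self._sections[path]

    def map_view(self, address_space: dict, path: str, base: int) -> ImageSection:
        # Step 2: attach the shared record to one process's address space.
        section = self.create_image_section(path)
        section.view_count += 1
        address_space[base] = section
        return section


kernel = ToyKernel()
process_a, process_b = {}, {}
BASE = 0x7FF800000000  # hypothetical load address
kernel.map_view(process_a, "kernel32.dll", BASE)
kernel.map_view(process_b, "kernel32.dll", BASE)

print(process_a[BASE] is process_b[BASE])  # True: two views, one record
print(process_a[BASE].view_count)          # 2
```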

Three states of one mapping

To make this concrete, let's watch the same image at three moments in time. The left column shows what the process's address space contains; the right column shows the on-disk file. The columns stay fixed across all three states; what changes is what's inside them.

State 1 is the boring initial condition: the file exists, the process exists, but nothing connects them yet. State 2 is the result of the mapping work — the kernel has built the bookkeeping that says "this virtual address comes from this file offset," but no actual content has been pulled into RAM. State 3 is the first time the program touches a mapped address; one page (and only one) gets brought into RAM by a page fault, while the rest stay on disk.

Code paths the program never executes may never become resident in this process's working set. The error-handling code that only runs when a system call fails sits on disk, indexed but unloaded, until the failure happens. If the failure never happens during this process's life, those bytes may never be paged in for this process. The same goes for DLLs that get loaded but never called, for string constants the program never reads, for TLS initialization data for threads that never spawn. The image is fully mapped (the virtual address space is fully reserved), but the actual RAM footprint is often a small fraction of the mapped size.

"May never" rather than "never" because real systems blur the lines. Windows' SuperFetch / SysMain service watches launch patterns and speculatively prefetches pages of frequently launched applications. The memory manager typically pulls in a small cluster of neighbouring pages on a fault rather than exactly one. An EDR or AV scan can fault entire executables in for inspection at load time. And memory compression can keep "evicted" pages around in a compressed form. The architectural truth (pages don't enter the working set until something causes them to) holds. The real-system behaviour is just messier than the bare model suggests.

This is the difference between mapped and resident that we pinned down in the vocabulary section, in concrete terms. A process can have a 68 KB mapped image but only 12 KB of it resident in RAM at any moment.

It's also why "warm" launches feel snappier than "cold" launches. First launch after boot: every page has to be faulted in from disk. Second launch: most of those pages are still cached in RAM from last time, so the faults are satisfied without disk I/O. Same program, same code path, very different perceived speed.

Sharing without conflict

The mapping story has a puzzle hidden in it. If every process that loads kernel32.dll shares its physical pages, and kernel32.dll has a writable .data section with global variables, what stops Process A from overwriting Process B's globals? They're literally looking at the same memory.

When the kernel maps a writable image-page, it doesn't actually mark the page writable. It marks the page shared and read-only, with a special note: "if anyone tries to write here, don't just refuse — call me." Process A and Process B can read the page freely. Both see the same bytes, both share the same physical RAM. So far, no problem.

The interesting moment is when one of them writes. Process A executes a write instruction, but the page is marked read-only, so the CPU traps before any byte hits memory. The kernel's page-fault handler runs. It looks at the trap, sees the page is supposed to be writable (just shared until now), and does something quiet: it allocates a fresh physical page, copies the contents of the shared page into it, points Process A's bookkeeping at the new page, and marks the new page genuinely read-write. Then the kernel returns and lets the original write instruction complete — to the new page, not the shared one.

From the program's perspective this is completely invisible. The program writes to its global variable. The value sticks. The variable is private to this process. Whether the page was shared with thirty other processes a moment ago or already private — that's the kernel's business, not the program's.
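
You can watch the same behaviour from user mode without writing a loader. Python's mmap module offers a copy-on-write access mode, ACCESS_COPY, for ordinary file mappings. The sketch below is an analogy rather than the image-loading path itself: it uses a throwaway data file instead of a DLL, and two views inside one process instead of two processes. The observable result is the one described above, though: a write through one view stays private to that view.

```python
import mmap
import os
import tempfile

# A throwaway file standing in for a writable section's initial contents.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"shared global = 0")
    path = f.name

try:
    with open(path, "rb") as file_a, open(path, "rb") as file_b:
        # Two copy-on-write views onto the same file.
        view_a = mmap.mmap(file_a.fileno(), 0, access=mmap.ACCESS_COPY)
        view_b = mmap.mmap(file_b.fileno(), 0, access=mmap.ACCESS_COPY)

        view_a[16:17] = b"7"  # the write lands in a private copy

        a_sees = view_a[:]
        b_sees = view_b[:]
        view_a.close()
        view_b.close()

    with open(path, "rb") as f:
        file_holds = f.read()
finally:
    os.remove(path)

print(a_sees)      # b'shared global = 7'
print(b_sees)      # b'shared global = 0'
print(file_holds)  # b'shared global = 0'
```

The writer sees its own change; the other view and the file on disk never do.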

Three states tell the whole story. The middle state — the moment of the write — is where the trick actually happens:

Copy-on-write is one of the cleverest mechanisms in modern operating systems, and almost every reader has been quietly using it for years without thinking about it. Every time you run two copies of a program, every time the same DLL gets loaded into a dozen processes, the same trick is doing the same work. The OS lies a little about sharing, then quietly stops lying the moment the lie becomes a problem.

The scale of what this saves is worth pausing on. A 100 MB DLL loaded by fifty processes does not consume 5 GB of RAM. The physical pages backing the DLL's .text are loaded once and referenced fifty times; each process pays only the cost of its own per-process page-table entries — a few bytes per virtual page. Two processes calling the same function in ntdll.dll are literally executing the same physical bytes; the MMU just translates their different virtual addresses to the same physical location. The writable sections start out shared the same way and only diverge under copy-on-write. Without this mechanism, the modern Windows desktop — with dozens of processes simultaneously mapping ntdll, kernel32, user32, gdi32, and the rest — would not fit in RAM.
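
The numbers are worth working through once. This back-of-envelope sketch makes three simplifying assumptions of mine: the whole DLL is resident, nothing has been copy-on-written, and the only per-process cost counted is one 8-byte page-table entry per 4 KB page (upper levels of the page-table tree are ignored).

```python
PAGE_SIZE = 4096
PTE_SIZE = 8  # bytes per page-table entry on x86-64

dll_size = 100 * 1024 * 1024  # the 100 MB DLL
processes = 50

naive_total = dll_size * processes       # a private copy in every process
pages = dll_size // PAGE_SIZE            # 25,600 pages
pte_cost_per_process = pages * PTE_SIZE  # 204,800 bytes, about 200 KB
shared_total = dll_size + processes * pte_cost_per_process

print(naive_total)   # 5242880000  (about 5 GB)
print(shared_total)  # 115097600   (about 110 MB)
```

Under those assumptions, sharing turns roughly 5 GB into roughly 110 MB, and almost all of that is the single shared copy of the DLL.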

(There is one subtle exception worth knowing in passing: the PE format has a flag that asks for a section to be actually shared across processes — writes from one process become visible to all the others. Modern toolchains don't produce these, because they're a cross-process write surface useful to attackers. If you see one in a binary you're analyzing, look closer.)
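
If you want to screen a binary for that exception, the flag in question is IMAGE_SCN_MEM_SHARED in the section's Characteristics field. Here is a minimal check, assuming you already have the Characteristics value in hand; the function name and the two sample values are illustrative, not taken from hello.exe.

```python
IMAGE_SCN_MEM_SHARED = 0x10000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def is_shared_writable(characteristics: int) -> bool:
    """True if the section asks to be both writable and shared across processes."""
    wanted = IMAGE_SCN_MEM_SHARED | IMAGE_SCN_MEM_WRITE
    return (characteristics & wanted) == wanted


print(is_shared_writable(0xC0000040))  # False: ordinary read/write data
print(is_shared_writable(0xD0000040))  # True: read/write plus the shared bit
```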

Page permissions, section by section

We've covered the mechanism. Now the result. After the kernel finishes mapping hello.exe, here's what each section's pages look like:

These protections aren't arbitrary or kernel-determined — each one is derived from the section's Characteristics field, which we walked through in Part 2. The header's IMAGE_SCN_MEM_EXECUTE, IMAGE_SCN_MEM_READ, and IMAGE_SCN_MEM_WRITE bits map almost mechanically onto the runtime page flags the kernel installs. The PE file already declares what protection each section deserves; the kernel just honors that declaration when it sets up the page tables.
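
The "almost mechanically" part fits in one small function. This is a simplified sketch of the idea rather than the memory manager's actual logic: it looks only at the three MEM bits, and it reports writable image sections as the copy-on-write flavour of the protection, in line with the section above. The sample Characteristics values are typical of what toolchains emit, not values read from hello.exe.

```python
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def initial_protection(characteristics: int) -> str:
    """Name the page protection an image section starts out with (simplified)."""
    executable = bool(characteristics & IMAGE_SCN_MEM_EXECUTE)
    readable = bool(characteristics & IMAGE_SCN_MEM_READ)
    writable = bool(characteristics & IMAGE_SCN_MEM_WRITE)
    if writable:
        # Writable image pages begin life as copy-on-write.
        return "PAGE_EXECUTE_WRITECOPY" if executable else "PAGE_WRITECOPY"
    if executable:
        return "PAGE_EXECUTE_READ" if readable else "PAGE_EXECUTE"
    return "PAGE_READONLY" if readable else "PAGE_NOACCESS"


print(initial_protection(0x60000020))  # PAGE_EXECUTE_READ  (a typical .text)
print(initial_protection(0x40000040))  # PAGE_READONLY      (a typical .rdata)
print(initial_protection(0xC0000040))  # PAGE_WRITECOPY     (a typical .data)
```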

Read-only is the default for non-code. Strings, exception tables, unwind metadata — none of these should ever be modified. Marking them read-only means an accidental write fails loudly with an access violation instead of silently corrupting the program.

Code is read + execute, never write. Writable code is a security smell. Modern compilers don't produce it. If you see .text with write permission, the binary is doing something unusual (a self-modifying loader, an obfuscator, a JIT).

Writable sections start as copy-on-write. They become genuinely writable for a process only after that process actually writes to them. This is what lets two processes share the same loaded DLL without trampling each other.

And .bss is the special case we've been promising. It has no file backing (SizeOfRawData = 0, remember?). When the kernel sets up the mapping for the .bss page range, it marks those pages "no file; materialize as zeros on first access." On first access, whether a read or a write, the kernel allocates a fresh physical page, zero-fills it, and hands it over. This is the mechanism that fulfills C's old promise: uninitialized globals start as zero. No code zeroes them; the page just arrives that way.
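
The same zero-page guarantee is visible from any user-mode program that asks the OS for memory with no file behind it. The sketch below uses an anonymous mapping from Python's mmap module as a stand-in for .bss; it is an analogy for the demand-zero behaviour, not the loader's own code path.

```python
import mmap

PAGE_SIZE = 4096

# A file descriptor of -1 asks for an anonymous mapping: memory with no file behind it.
page = mmap.mmap(-1, PAGE_SIZE)
all_zero = page[:] == bytes(PAGE_SIZE)
page.close()

print(all_zero)  # True: nobody wrote those zeros, the page arrived that way
```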

.reloc has a quieter special case worth knowing. Its Characteristics field has the IMAGE_SCN_MEM_DISCARDABLE bit set, which tells the loader: once you've finished using me, you can throw me away. After the loader walks the Base Relocation Table and patches the absolute addresses scattered through the image (Part 4), the original relocation entries are no longer needed for anything. Windows is free to reclaim that page and reuse it for something else. A nice cleanup detail: the work the section exists to support happens once, at load time, and then the bytes are gone.

From double-click to mapped image

One last picture to tie everything together: the sequence of events when you double-click hello.exe, up to the moment the image is mapped and the user-mode loader takes over (which is Part 4's job).

Steps 1 and 2 are in user mode, where Explorer asks Windows for a launch and kernel32.dll prepares the request. Step 3 is the transition into the kernel — everything from there through step 7 is kernel work. Step 8 hands control back to user mode, where the user-mode loader takes over to do all the work that's still missing (which is a lot — see the wrap-up).

The single sentence to remember: mapping is two operations — set up the bookkeeping that says "this file becomes this image," then attach that bookkeeping to a process. Everything else in steps 4 through 7 is the kernel doing the homework around those two ideas.

Try it yourself, on Windows

Inspect the mapping of a running process

The things we've described — page protections, working sets, copy-on-write state — only exist on a running Windows process, so the experiments here need an actual Windows machine. Two free Sysinternals-class tools are worth installing.

VMMap shows the virtual address space of a running process in a hierarchical view. Attach it to Notepad and expand any "Image" row to see the per-section protections that match what we tabulated for hello.exe. The Details pane indicates which pages are resident versus just reserved.

Process Hacker (now System Informer) has a Memory tab on every process's properties dialog. Its "Shared" column reflects copy-on-write status — Shared means still on the shared backing, Private means already copy-on-written.

The single experiment worth running: launch the same program twice, open the Memory tab for both instances, and check the Shared column on each writable section. Most pages will still be shared between the two — copy-on-write in action.

For a closer look at the kernel side, WinDbg in kernel-debugging mode lets you inspect the data structures behind everything we've described. !ca dumps a control area; !pte walks the page tables for a virtual address; !vad lists the Virtual Address Descriptors that describe each mapped region. This is more setup work — you need a kernel-debugger connection to a target VM — but it shows the section objects, prototype PTEs, and per-process page tables as actual bytes in kernel memory. If "image section" and "prototype PTE" still feel like abstractions after this post, looking at them in the debugger turns them concrete fast.

What we have at the end of Part 3

Recall the three facts we opened with. Tight on disk. Spread to pages in memory. Mapping is lazy. Now you know all three concretely.

You also know exactly where the 30 KB difference between file and image comes from. The bulk of it, 23,040 bytes, is per-section page padding: every section gets rounded up to fill its 4 KB page allocation, and small sections pay a disproportionate tax. .tls's 16 bytes of content takes a full 4 KB page; .reloc's 132 bytes takes another. Add up that rounding across the nine file-backed sections and you reach 23,040 bytes. Another 4,096 bytes come from .bss, which has no on-disk presence at all and materializes as a zero-filled page on first access. The remaining 3,072 bytes are the headers region rounding up: the PE headers occupy 1,024 bytes on disk but get a full 4 KB page in memory. 23,040 + 4,096 + 3,072 = 30,208. The whole difference is accounted for. None of the extra bytes is new content; they're the consequence of a layout optimized for page-level permission enforcement.

You know that mapping is not copying. The kernel manufactures a record describing how each page of the image relates to bytes in the file, attaches that record to a new process, and lets the memory manager pull pages into RAM only when the program touches them. DLLs loaded by many processes share their pages. Writable sections start out shared and become private through copy-on-write on first write.

What you don't have yet is a program that can run; getting there is the user-mode loader's job, and Part 4's. If the binary didn't load at its preferred ImageBase (and on modern Windows with ASLR enabled, it usually doesn't), the absolute addresses baked into the code by the linker are wrong, and the Base Relocation Table has to drive a sweep through the image patching every one of them. The Import Address Table is still full of placeholder thunk entries; the loader has to walk the import descriptors, load each required DLL (recursively, because they have their own imports), resolve each imported function, and overwrite the IAT entries with real function pointers. TLS callbacks need to run before the entry point. The DLL search order has to be followed correctly so dependencies resolve to the intended modules, not attacker-controlled lookalikes earlier in the path. None of that has happened yet.