Complete Memory Hierarchy and Components
Memory Hierarchy Overview
The memory system is organized as a hierarchy from fastest/smallest to slowest/largest:
CPU Registers (Hardware) ←── Fastest, Smallest
↓
L1 Cache (Hardware)
↓
L2 Cache (Hardware)
↓
L3 Cache (Hardware)
↓
TLB (Hardware; caches address translations, not data)
↓
Main Memory/RAM (Hardware)
↓
Page Cache (Software in RAM)
↓
Swap Space (Software on Disk)
↓
Disk Storage (Hardware) ←── Slowest, Largest
Physical Components (Hardware)
1. CPU Registers
Location: Inside CPU die
Size: 64-bit × ~16 general purpose registers
Speed: 1 CPU cycle (~0.3ns at 3GHz)
Purpose: Hold currently executing instruction operands
Example:
RAX = 0x7f0000001000 (virtual address)
RBX = 0x12345678 (data value)
2. CPU Caches (L1, L2, L3)
L1 Cache:
├── Size: 32KB instruction + 32KB data per core
├── Speed: 2-4 CPU cycles (~1ns)
├── Location: Inside each CPU core
└── Purpose: Cache recently used instructions/data
L2 Cache:
├── Size: 256KB-512KB per core
├── Speed: 10-20 CPU cycles (~5ns)
├── Location: Inside CPU, per core or shared
└── Purpose: Backup for L1 cache misses
L3 Cache:
├── Size: 8MB-32MB shared
├── Speed: 30-40 CPU cycles (~10ns)
├── Location: Inside CPU, shared among cores
└── Purpose: Last level before main memory
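You can see these latency steps from user space with a rough timing experiment: do the same number of memory touches on a buffer that fits in L1 and on one far larger than L3. A minimal sketch (buffer sizes and iteration counts are illustrative; results vary by CPU and compiler):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch one byte per 64-byte cache line across 'size' bytes, 'iters' times,
 * and return the average nanoseconds per access. */
static double touch_ns(volatile char *buf, size_t size, size_t iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        for (size_t off = 0; off < size; off += 64)
            buf[off]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)iters * (size / 64));
}

int main(void)
{
    size_t small = 16 * 1024;            /* fits comfortably in L1 */
    size_t large = 256 * 1024 * 1024;    /* far larger than L3     */
    volatile char *a = malloc(small);
    volatile char *b = malloc(large);
    if (!a || !b) return 1;

    printf("L1-sized buffer : %.2f ns/access\n", touch_ns(a, small, 100000));
    printf("RAM-sized buffer: %.2f ns/access\n", touch_ns(b, large, 8));
    return 0;
}
The volatile accesses keep the loads and stores from being optimized away; the small buffer should report low single-digit nanoseconds per access, the large one noticeably more.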
3. Translation Lookaside Buffer (TLB)
TLB (Hardware):
├── Size: 64-1024 entries
├── Speed: 1-2 CPU cycles
├── Location: Inside CPU (part of MMU)
├── Purpose: Cache virtual → physical address translations
└── Structure: [Virtual Page → Physical Frame + Permissions]
Example TLB Entry:
Virtual Page: 0x7f0000001000 → Physical Frame: 0x12345000 + RWX
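One practical consequence of a fixed number of TLB entries is limited "TLB reach": the amount of memory that can be accessed without a page-table walk. A tiny back-of-the-envelope sketch (the entry count is an assumption, not a measured value) showing why 2MB huge pages help the hugemmap test discussed later:
#include <stdio.h>

/* "TLB reach" = entries x page size: how much memory can be accessed
 * without falling back to a page-table walk. Entry count is assumed. */
int main(void)
{
    unsigned long entries  = 1024;
    unsigned long reach_4k = entries * 4096UL;               /* 4KB pages      */
    unsigned long reach_2m = entries * 2UL * 1024 * 1024;    /* 2MB huge pages */

    printf("reach with 4KB pages: %4lu MB\n", reach_4k >> 20);  /*    4 MB */
    printf("reach with 2MB pages: %4lu MB\n", reach_2m >> 20);  /* 2048 MB */
    return 0;
}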
4. Main Memory (RAM)
DDR4/DDR5 DRAM:
├── Size: 8GB-128GB typical
├── Speed: 100-300 CPU cycles (~50-100ns)
├── Location: DIMM slots on motherboard
├── Purpose: Primary working storage
└── Organization: Banks, Rows, Columns
Physical addressing:
[Channel][DIMM][Rank][Bank][Row][Column]
5. Storage Devices
NVMe SSD:
├── Size: 512GB-8TB typical
├── Speed: 300,000+ CPU cycles (~100μs)
├── Location: M.2 or PCIe slot
└── Purpose: Persistent storage
SATA SSD:
├── Size: 256GB-4TB typical
├── Speed: 600,000+ CPU cycles (~200μs)
Traditional HDD:
├── Size: 1TB-18TB typical
├── Speed: 30,000,000+ CPU cycles (~10ms)
├── Location: SATA connection
└── Purpose: Bulk storage
Software Components (Kernel Managed)
1. Virtual Memory System
Per-Process Virtual Address Space:
┌─────────────────────────────────────────────┐
│ 0x00400000 - 0x00600000: Code (.text) │
│ 0x00600000 - 0x00800000: Data (.data) │
│ 0x00800000 - 0x00a00000: Heap │
│ 0x7f0000000000 - 0x7f8000000000: mmap area │ ← Our test
│ 0x7fffffffe000 - 0x7ffffffff000: Stack │
└─────────────────────────────────────────────┘
Managed by: the kernel's memory-management subsystem, which builds the mappings that the hardware MMU walks
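You can inspect this layout for any live process: the kernel exports every VMA through /proc/<pid>/maps. A minimal sketch that dumps its own table (addresses will differ from the simplified picture above, especially with ASLR):
#include <stdio.h>

int main(void)
{
    /* Each line is one VMA: start-end perms offset dev inode [path] */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}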
2. Page Tables (Software Data Structure in RAM)
Multi-Level Page Table (x86-64):
PML4 → PDPT → PD → PT → Physical Page
│ │ │ │
│ │ │ └─ Page Table Entry (PTE)
│ │ └────── Page Directory Entry (PDE)
│ └──────────── Page Directory Pointer Table Entry (PDPTE)
└─────────────────── Page Map Level 4 (PML4E)
Location: Stored in kernel's physical RAM
Size: ~1MB per process (typical)
Swappable: ❌ NEVER swapped (kernel needs immediate access)
Each entry contains:
├── Physical address (bits 12-51)
├── Present bit (P)
├── Read/Write bit (R/W)
├── User/Supervisor bit (U/S)
├── Accessed bit (A)
├── Dirty bit (D)
└── Other control bits
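As a concrete illustration of that bit layout, the sketch below decodes a raw x86-64 PTE. The value itself is made up for the example; real PTEs are only visible from kernel mode (user space gets a reduced view via /proc/self/pagemap):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t pte = 0x0000000012345067ULL;          /* hypothetical PTE value    */

    uint64_t frame = pte & 0x000FFFFFFFFFF000ULL;  /* bits 12-51: frame address */

    printf("physical frame : 0x%012llx\n", (unsigned long long)frame);
    printf("present  (P)   : %d\n", (int)((pte >> 0) & 1));
    printf("writable (R/W) : %d\n", (int)((pte >> 1) & 1));
    printf("user     (U/S) : %d\n", (int)((pte >> 2) & 1));
    printf("accessed (A)   : %d\n", (int)((pte >> 5) & 1));
    printf("dirty    (D)   : %d\n", (int)((pte >> 6) & 1));
    return 0;
}
For this example value the frame decodes to 0x12345000 with P, R/W, U/S, A, and D all set.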
Physical Layout in RAM:
┌─────────────────┐ ← PML4 table (4KB page)
│ 512 × 8-byte │
│ entries │
├─────────────────┤
│ PDPT table │ ← Referenced by PML4[0]
├─────────────────┤
│ Page Directory │ ← Referenced by PDPT[0]
├─────────────────┤
│ Page Table │ ← Referenced by PD[0]
└─────────────────┘
3. Physical Frames (Hardware RAM Organization)
Physical Memory Divided into Frames:
┌────────────────────────────────────────────┐
│ Frame 0: 0x00000000-0x00000fff (4KB) │ ← Used by kernel
│ Frame 1: 0x00001000-0x00001fff (4KB) │ ← Page table data
│ Frame 2: 0x00002000-0x00002fff (4KB) │ ← User process page
│ ... │
│ Frame N: Physical RAM limit │
└────────────────────────────────────────────┘
Frame Management:
├── Location: Physical RAM chips
├── Size: 4KB for regular pages, 2MB/1GB for huge pages
├── Tracked by: struct page (Linux kernel)
├── Swappable: Depends on usage
│ ├── ✅ User data pages: Can be swapped
│ ├── ❌ Kernel code/data: Never swapped
│ ├── ❌ Page tables: Never swapped
│ └── ❌ Slab allocations: Usually not swapped
Frame Descriptor (struct page) per frame:
struct page {
unsigned long flags; // Page status flags
atomic_t _refcount; // Reference count
atomic_t _mapcount; // Mapping count
struct list_head lru; // LRU list linkage
void *virtual; // Virtual address (if mapped)
// ... many other fields
};
These descriptors: Stored in mem_map[] array in RAM, never swapped
4. Slab Allocator (Kernel Memory Management)
Purpose: Efficient allocation of kernel objects
Location: Physical RAM (kernel space)
Swappable: ❌ NEVER swapped
Structure:
Cache → Slab → Objects
Example - dentry cache (directory entries):
┌─────────────────────────────────────────┐
│ dentry_cache (kmem_cache) │
├─────────────────────────────────────────┤
│ Slab 1: [obj][obj][obj][free][free] │ ← 4KB or larger
│ Slab 2: [obj][obj][obj][obj][obj] │
│ Slab 3: [free][free][free][free] │
└─────────────────────────────────────────┘
Common Slab Caches:
├── kmalloc-* (general kernel allocations)
├── vm_area_struct (VMA objects)
├── task_struct (process descriptors)
├── dentry (directory entries)
├── inode_cache (filesystem inodes)
├── buffer_head (block device buffers)
└── skbuff_head_cache (network packet headers)
Slab Types:
1. Regular slabs: Normal kernel objects
2. DMA slabs: DMA-capable memory
3. NUMA slabs: NUMA-aware allocations
Physical Layout:
┌─────────────┐ ← Physical frame 100
│ Slab Header │
├─────────────┤
│ Object 1 │ ← e.g., task_struct
├─────────────┤
│ Object 2 │ ← e.g., task_struct
├─────────────┤
│ Object 3 │ ← e.g., task_struct
└─────────────┘
View slab info: cat /proc/slabinfo
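Inside the kernel, creating and using one of these caches looks roughly like the untested module-style sketch below; the cache name and object type are invented for illustration, but kmem_cache_create/alloc/free are the real API:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

struct my_obj {                 /* hypothetical kernel object type */
    int id;
    char name[32];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
    struct my_obj *obj;

    /* One cache per object type; objects are packed into slabs (whole pages). */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(my_cache, GFP_KERNEL);   /* grab one object */
    if (obj) {
        obj->id = 1;
        kmem_cache_free(my_cache, obj);             /* back to the slab free list */
    }
    return 0;
}

static void __exit my_exit(void)
{
    kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");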
5. Page Cache (Software in RAM)
Purpose: Cache file contents in RAM
Location: Part of main memory, managed by kernel
Size: Dynamic, uses available RAM
Swappable: ✅ Reclaimable (clean pages are dropped, dirty pages are written back to their file, not to swap)
Structure:
File: /home/user/data.txt
├── Page 0 (offset 0-4095): [Cached in RAM at frame 1000]
├── Page 1 (offset 4096-8191): [Cached in RAM at frame 1001]
└── Page 2 (offset 8192-...): [Not in cache, on disk]
Under memory pressure:
├── Clean pages: Simply freed (can re-read from file)
├── Dirty pages: Written back to file, then freed
└── Anonymous pages: Written to swap space
Managed by: Linux page cache subsystem
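Whether a given file page is currently resident in the page cache can be checked from user space with mincore(2). A minimal sketch (error handling trimmed) that maps a file and prints per-page residency, mirroring the picture above:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    size_t len = st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t pages = (len + pagesz - 1) / pagesz;
    unsigned char vec[pages];                 /* one byte per page; bit 0 = resident */

    mincore(addr, len, vec);
    for (size_t i = 0; i < pages; i++)
        printf("page %zu: %s\n", i, (vec[i] & 1) ? "in page cache" : "on disk only");

    munmap(addr, len);
    close(fd);
    return 0;
}
Running it on a file you just read and on one untouched since boot shows the cached/uncached split directly.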
6. Swap Space (Software on Disk)
Swap Partition:
/dev/sda2 → Raw disk partition used as swap
Swap File:
/swapfile → Regular file used as swap
Swap Areas:
├── Swap header (metadata)
├── Swap slots (4KB each for regular pages)
└── Swap out/in algorithms (LRU, etc.)
Purpose: Extend virtual memory when RAM is full
How Everything is Tied Together
Memory Access Flow
1. CPU Instruction Execution
// C code:
int *ptr = (int*)0x7f0000001000;
int value = *ptr; // Memory read instruction
// Assembly (simplified):
mov rax, 0x7f0000001000 // Load virtual address into register
mov ebx, [rax] // Read from memory at virtual address
2. Virtual Address Translation Pipeline
Step 1: CPU checks TLB
├── TLB Hit: Get physical address immediately (1-2 cycles)
└── TLB Miss: Continue to page table walk
Step 2: Page Table Walk (if TLB miss)
├── Hardware walks page tables in RAM
├── Loads translation into TLB
├── Takes 100+ cycles
└── May cause multiple cache misses
Step 3: Physical Address Generated
├── Virtual 0x7f0000001000 → Physical 0x12345000
└── Ready for memory access
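The walk is driven entirely by bit fields of the virtual address: 9 bits of table index per level plus a 12-bit page offset. A small sketch that splits our example address the same way the hardware does (4-level paging, 4KB pages):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x7f0000001000ULL;         /* example virtual address */

    unsigned pml4 = (va >> 39) & 0x1FF;      /* bits 39-47: PML4 index  */
    unsigned pdpt = (va >> 30) & 0x1FF;      /* bits 30-38: PDPT index  */
    unsigned pd   = (va >> 21) & 0x1FF;      /* bits 21-29: PD index    */
    unsigned pt   = (va >> 12) & 0x1FF;      /* bits 12-20: PT index    */
    unsigned off  =  va        & 0xFFF;      /* bits  0-11: page offset */

    printf("PML4[%u] -> PDPT[%u] -> PD[%u] -> PT[%u], offset 0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}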
3. Physical Memory Access Pipeline
Step 1: CPU checks L1 Cache
├── L1 Hit: Return data (2-4 cycles)
└── L1 Miss: Check L2
Step 2: CPU checks L2 Cache
├── L2 Hit: Return data, update L1 (10-20 cycles)
└── L2 Miss: Check L3
Step 3: CPU checks L3 Cache
├── L3 Hit: Return data, update L2 & L1 (30-40 cycles)
└── L3 Miss: Access main memory
Step 4: Main Memory Access
├── Read from DRAM (100-300 cycles)
├── Update all cache levels
└── Return data to CPU
4. Page Fault Handling (Software)
Page Not Present Scenarios:
Scenario 1: First Access to mmap'd Region
├── Page fault interrupt
├── Kernel allocates physical page
├── Updates page table
├── Restarts instruction
└── Normal cache/TLB flow continues
Scenario 2: Swapped Out Page
├── Page fault interrupt
├── Kernel identifies page in swap
├── Allocates new physical page
├── Reads data from swap device
├── Updates page table
└── Restarts instruction
Scenario 3: File-Backed Page Not in Cache
├── Page fault interrupt
├── Kernel checks page cache
├── If not cached: Read from file into page cache
├── Maps page cache page into process
└── Updates page table
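Scenario 1 is easy to demonstrate: mmap() only reserves virtual space, and physical frames appear one page fault at a time as pages are first touched. A minimal sketch that watches its own resident set grow via /proc/self/statm (the second field is resident pages):
#include <stdio.h>
#include <sys/mman.h>

/* Read the second field of /proc/self/statm: resident pages (RSS). */
static long rss_pages(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
    }
    return resident;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;    /* 64MB of anonymous memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    printf("RSS after mmap : %ld pages\n", rss_pages());   /* barely changes     */

    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;                   /* first touch -> page fault -> frame */

    printf("RSS after touch: %ld pages\n", rss_pages());   /* ~16384 pages more  */
    return 0;
}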
Complete Memory Access Example
Initial State
Virtual Address: 0x7f0000001000
TLB: Empty
Page Table: Points to swapped page
Physical RAM: Page not present
Swap: Page stored in /dev/sda2 slot 1234
Access Sequence
1. CPU executes: mov eax, [0x7f0000001000]
2. TLB Lookup:
├── Miss (no entry for 0x7f0000001000)
└── Continue to page table walk
3. Page Table Walk:
├── PML4[entry] → PDPT physical address
├── PDPT[entry] → PD physical address
├── PD[entry] → PT physical address
├── PT[entry] → Shows page swapped (not present)
└── Generate page fault interrupt
4. Page Fault Handler (Kernel):
├── Identify: Page in swap slot 1234
├── Allocate: New physical page at 0x87654000
├── I/O: Read from swap device to physical page
├── Update: Page table entry to point to 0x87654000
├── TLB: Invalidate old entries
└── Return: Resume instruction
5. Instruction Restart:
├── TLB lookup: Still miss
├── Page table walk: Now finds physical 0x87654000
├── Update TLB: 0x7f0000001000 → 0x87654000
└── Physical access: Check caches
6. Cache Hierarchy:
├── L1 miss (first access)
├── L2 miss (first access)
├── L3 miss (first access)
├── DRAM access: Read from 0x87654000
├── Populate: L3, L2, L1 caches
└── Return: Data to CPU register
Total: hundreds of thousands of CPU cycles for this first access (dominated by the swap-device I/O)
Subsequent: 2-4 cycles (L1 cache hit)
Detailed Residency and Swappability Analysis
What Lives Where and Can Be Swapped
Always in RAM (Never Swapped)
1. Kernel Code & Data:
├── Location: Low physical memory (loaded near the start of RAM at boot)
├── Size: 10-50MB
├── Why: Kernel needs immediate access for interrupts, syscalls
└── Includes: System call handlers, interrupt handlers, core data structures
2. Page Tables:
├── Location: Kernel physical memory
├── Size: ~1MB per active process
├── Why: Hardware MMU needs immediate access for translation
└── Swapping would cause infinite recursion (need page tables to swap!)
3. Slab Allocator Objects:
├── Location: Kernel physical memory
├── Size: Variable (MBs to GBs)
├── Why: Critical kernel data structures need fast access
└── Examples: task_struct, dentry, inode, skb
4. DMA Buffers:
├── Location: Physically contiguous RAM
├── Size: Variable
├── Why: Hardware devices need stable physical addresses
└── Used for: Network cards, disk controllers, graphics
5. Huge Pages (Traditional):
├── Location: Physical RAM
├── Size: 2MB or 1GB each
├── Why: Performance optimization, swapping defeats purpose
└── Our test case uses these!
6. Frame Descriptors (struct page):
├── Location: mem_map[] array in RAM
├── Size: ~64 bytes × number of physical frames
├── Why: Needed to manage physical memory itself
└── Example: 32GB RAM = ~8 million frames ≈ 512MB for frame descriptors
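That figure is just frames × descriptor size; a quick back-of-the-envelope calculation (assuming 4KB frames and a 64-byte struct page, the common x86-64 size):
#include <stdio.h>

int main(void)
{
    unsigned long long ram   = 32ULL << 30;        /* 32GB of physical RAM */
    unsigned long long frame = 4096;               /* 4KB per frame        */
    unsigned long long descr = 64;                 /* sizeof(struct page)  */

    unsigned long long frames   = ram / frame;     /* 8,388,608 frames     */
    unsigned long long overhead = frames * descr;  /* bytes of mem_map[]   */

    printf("%llu frames -> %llu MB of frame descriptors\n",
           frames, overhead >> 20);                /* 8388608 frames -> 512 MB */
    return 0;
}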
Can Be Swapped to Disk
1. User Process Pages (Anonymous):
├── Location: Physical RAM → Swap space when swapped
├── Examples: malloc() memory, stack pages, heap
├── Swap destination: /dev/sda2 or /swapfile
└── Retrieved via: Page fault handling
2. User Process Pages (File-backed):
├── Location: Physical RAM → Original file when swapped
├── Examples: mmap'd files, program code, shared libraries
├── Swap destination: Original file (not swap space)
└── Retrieved via: Page fault + file I/O
3. Page Cache (Clean):
├── Location: Physical RAM → Simply freed (not written anywhere)
├── Why: Can re-read from original file
├── Examples: Recently read file contents
└── Retrieved via: File I/O when accessed again
4. Page Cache (Dirty):
├── Location: Physical RAM → Written to file, then freed
├── Why: Must preserve modifications
├── Examples: Modified file contents not yet saved
└── Process: Writeback to file, then page can be freed
Memory Layout in Physical RAM
Typical Physical Memory Organization
Physical Address Range: 0x00000000 - 0x7FFFFFFFF (32GB example)
0x00000000 ┌─────────────────────────────────────┐
│ BIOS/UEFI Reserved │
0x00100000 ├─────────────────────────────────────┤
│ Kernel Code & Data │ ← Never swapped
0x01000000 ├─────────────────────────────────────┤
│ Frame Descriptors (mem_map) │ ← Never swapped
0x05000000 ├─────────────────────────────────────┤
│ Slab Allocator Areas │ ← Never swapped
│ ├── task_struct cache │
│ ├── dentry cache │
│ ├── inode cache │
│ └── many others... │
0x10000000 ├─────────────────────────────────────┤
│ Page Tables (all processes) │ ← Never swapped
0x20000000 ├─────────────────────────────────────┤
│ DMA Buffers │ ← Never swapped
0x30000000 ├─────────────────────────────────────┤
│ Page Cache │ ← Can be freed/swapped
│ (file contents cached in RAM) │
0x50000000 ├─────────────────────────────────────┤
│ User Process Pages │ ← Can be swapped
│ ├── Anonymous pages │
│ ├── File-backed pages │
│ └── Shared memory pages │
0x7FFFFFFFF └─────────────────────────────────────┘
Typical proportions in a running system:
├── Kernel + metadata: 10-20% (never swapped)
├── Page cache: 40-60% (can be freed)
├── User pages: 20-40% (can be swapped)
└── Free: 5-10% (available)
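These proportions can be read off a live system from /proc/meminfo (the free command is essentially a formatter for this file). A small sketch that prints the fields most relevant to this breakdown:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Keys of interest: total RAM, free RAM, page cache, slab, swap usage. */
    const char *keys[] = { "MemTotal:", "MemFree:", "Cached:",
                           "Slab:", "SwapTotal:", "SwapFree:" };
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
            if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                fputs(line, stdout);
    }
    fclose(f);
    return 0;
}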
Enhanced Storage Component Summary
Component | Type | Size | Speed | Purpose | Location | Swappable |
---|---|---|---|---|---|---|
Registers | Hardware | 64-bit × 16 | 0.3ns | Active computation | CPU die | ❌ Never |
L1 Cache | Hardware | 64KB | 1ns | Recent instructions/data | CPU die | ❌ Never |
L2 Cache | Hardware | 512KB | 5ns | Cache backup | CPU die | ❌ Never |
L3 Cache | Hardware | 16MB | 10ns | Last level cache | CPU die | ❌ Never |
TLB | Hardware | 1024 entries | 1ns | Address translation | CPU MMU | ❌ Never |
Page Tables | Software | ~1MB/process | 100ns | Virtual→Physical mapping | Kernel RAM | ❌ Never |
Frame Descriptors | Software | ~64B/frame | 100ns | Track physical pages | Kernel RAM | ❌ Never |
Slab Objects | Software | Variable | 100ns | Kernel data structures | Kernel RAM | ❌ Never |
Main RAM (User) | Hardware | ~20GB | 100ns | User process pages | DRAM | ✅ To swap |
Page Cache | Software | Dynamic | 100ns | File caching | DRAM | ✅ To file |
Swap Space | Software | 8GB | 100μs | Virtual memory extension | Disk | N/A (is swap) |
Disk Storage | Hardware | 1TB | 10ms | Persistent storage | SSD/HDD | N/A (persistent) |
Memory Management Decision Tree
When Memory Pressure Occurs
Kernel Memory Reclaim Algorithm:
1. Check Page Cache:
├── Clean pages: Free immediately (can re-read from file)
├── Dirty pages: Write to file, then free
└── Result: Fast memory recovery
2. Check User Pages:
├── Anonymous pages: Write to swap space
├── File-backed pages: Write to original file (if dirty)
└── Result: Slower but recovers memory
3. Never Touch:
├── Kernel code/data
├── Page tables
├── Slab objects
├── DMA buffers
└── Hardware structures
4. Last Resort:
├── OOM (Out of Memory) killer
├── Kill processes to free memory
└── System remains stable
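The OOM killer picks its victim by a per-process badness score that the kernel already exposes under /proc. A minimal sketch that prints the calling process's own score and its user-tunable adjustment (higher score means more likely to be killed):
#include <stdio.h>

/* Print one small /proc file, e.g. /proc/self/oom_score. */
static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%-28s %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/self/oom_score");      /* current badness score            */
    show("/proc/self/oom_score_adj");  /* user-tunable bias: -1000..1000   */
    return 0;
}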
Real-World Example: Our Hugemmap06 Test
Memory Allocation Breakdown
// Our test allocates:
addr = mmap(NULL, 51 * 2MB, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
What gets allocated where:
1. Huge Pages (102MB):
├── Location: Physical RAM frames
├── Swappable: ❌ Never (huge pages can't swap)
├── Usage: User data pages
└── COW: Creates private copies per thread
2. Page Table Entries:
├── Location: Kernel RAM (page tables)
├── Size: ~4KB for mapping 51 huge pages
├── Swappable: ❌ Never
└── Content: Virtual→Physical mappings
3. VMA Structures:
├── Location: Slab cache (vm_area_struct)
├── Size: ~200 bytes per mmap() call
├── Swappable: ❌ Never (kernel object)
└── Purpose: Track memory regions per process
4. Thread Stacks (50 threads):
├── Location: Regular RAM pages
├── Size: 8MB × 50 = 400MB
├── Swappable: ✅ Yes (can swap to disk)
└── Content: Thread execution stacks
Total memory usage:
├── Never swappable: 102MB (huge pages) + ~4KB (page tables) + ~10KB (VMAs)
├── Swappable: up to 400MB of thread stacks (virtual; only touched pages are resident)
└── System impact: ~102MB that must stay in RAM for the duration of the test
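For completeness, here is a standalone sketch of the same allocation pattern; it is not the actual LTP hugemmap06 source, and it assumes huge pages have already been reserved (for example: echo 60 > /proc/sys/vm/nr_hugepages) and that you compile with -pthread:
#include <stdio.h>
#include <pthread.h>
#include <sys/mman.h>

#define HUGE_SZ   (2UL * 1024 * 1024)   /* 2MB huge page                        */
#define NR_HUGE   51                    /* 51 pages = 102MB                     */
#define NR_THREAD 8                     /* fewer than the test's 50, for brevity */

static char *region;

/* Each thread writes one byte into every huge page; whichever thread
 * touches a page first triggers the hugetlb fault that allocates it. */
static void *toucher(void *arg)
{
    for (unsigned long i = 0; i < NR_HUGE; i++)
        region[i * HUGE_SZ] = (char)(long)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid[NR_THREAD];

    region = mmap(NULL, NR_HUGE * HUGE_SZ, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (region == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    for (long t = 0; t < NR_THREAD; t++)
        pthread_create(&tid[t], NULL, toucher, (void *)t);
    for (long t = 0; t < NR_THREAD; t++)
        pthread_join(tid[t], NULL);

    munmap(region, NR_HUGE * HUGE_SZ);
    return 0;
}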
Key Relationships and Interactions
Hardware ↔ Software Interface
Hardware provides:
├── Raw storage (RAM, disk)
├── Address translation (TLB, MMU)
├── Caching (L1, L2, L3)
├── Interrupts (page faults)
└── Physical frames (the raw storage units the kernel then manages)
Software manages:
├── Virtual memory layout
├── Page table structures
├── Page fault handling
├── Swap algorithms
├── Page cache policies
├── Process memory mappings
├── Slab allocation policies
└── Memory reclaim strategies
Critical Dependencies
Page Tables depend on:
├── Physical frames (to store page table pages)
├── Never swapped (would create circular dependency)
└── Slab allocator (for dynamic page table allocation)
Slab Allocator depends on:
├── Physical frames (for slab pages)
├── Page tables (to map slab areas)
└── Never swapped (kernel needs immediate access)
Frame Descriptors depend on:
├── Physical RAM (mem_map[] array)
├── Bootstrap allocation (allocated at boot)
└── Never swapped (needed to manage swapping itself!)
TLB depends on:
├── Page tables (source of translations)
├── Hardware implementation
└── Software invalidation (when mappings change)
Memory Pressure Response Chain
1. Application requests memory → Page fault
2. Kernel checks available frames
3. If frames available: Allocate directly
4. If low on frames: Start reclaim
├── a) Free clean page cache pages
├── b) Write dirty page cache to files
├── c) Swap anonymous user pages
└── d) Never touch kernel structures
5. If still no memory: OOM killer
6. Update page tables and TLB
7. Resume application
This hierarchy ensures that:
- Critical kernel structures stay in fast RAM (never swapped)
- User data can be moved to slower storage when needed
- Hardware acceleration (TLB, caches) speeds up common operations
- Software policies manage the complexity transparently
The beauty is that applications see a simple flat virtual memory space, while the system orchestrates this complex multi-layered storage hierarchy automatically!