Complete Memory Hierarchy and Components
Memory Hierarchy Overview
The memory system is organized as a hierarchy from fastest/smallest to slowest/largest:
CPU Registers (Hardware) ←── Fastest, Smallest
↓
L1 Cache (Hardware)
↓
L2 Cache (Hardware)
↓
L3 Cache (Hardware)
↓
TLB (Hardware; caches address translations, not data)
↓
Main Memory/RAM (Hardware)
↓
Page Cache (Software in RAM)
↓
Swap Space (Software on Disk)
↓
Disk Storage (Hardware) ←── Slowest, Largest
Physical Components (Hardware)
1. CPU Registers
Location: Inside CPU die
Size: 64-bit × ~16 general purpose registers
Speed: 1 CPU cycle (~0.3ns at 3GHz)
Purpose: Hold currently executing instruction operands
Example:
RAX = 0x7f0000001000 (virtual address)
RBX = 0x12345678 (data value)
2. CPU Caches (L1, L2, L3)
L1 Cache:
├── Size: 32KB instruction + 32KB data per core
├── Speed: 2-4 CPU cycles (~1ns)
├── Location: Inside each CPU core
└── Purpose: Cache recently used instructions/data
L2 Cache:
├── Size: 256KB-512KB per core
├── Speed: 10-20 CPU cycles (~5ns)
├── Location: Inside CPU, per core or shared
└── Purpose: Backup for L1 cache misses
L3 Cache:
├── Size: 8MB-32MB shared
├── Speed: 30-40 CPU cycles (~10ns)
├── Location: Inside CPU, shared among cores
└── Purpose: Last level before main memory
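You can see these latency steps from user space with a rough timing experiment: do the same number of memory touches on a buffer that fits in L1 and on one far larger than L3. A minimal sketch (buffer sizes and iteration counts are illustrative; results vary by CPU and compiler):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch one byte per 64-byte cache line across 'size' bytes, 'iters' times,
 * and return the average nanoseconds per access. */
static double touch_ns(volatile char *buf, size_t size, size_t iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        for (size_t off = 0; off < size; off += 64)
            buf[off]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)iters * (size / 64));
}

int main(void)
{
    size_t small = 16 * 1024;            /* fits comfortably in L1 */
    size_t large = 256 * 1024 * 1024;    /* far larger than L3     */
    volatile char *a = malloc(small);
    volatile char *b = malloc(large);
    if (!a || !b) return 1;

    printf("L1-sized buffer : %.2f ns/access\n", touch_ns(a, small, 100000));
    printf("RAM-sized buffer: %.2f ns/access\n", touch_ns(b, large, 8));
    return 0;
}
The volatile accesses keep the loads and stores from being optimized away; the small buffer should report low single-digit nanoseconds per access, the large one noticeably more.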
3. Translation Lookaside Buffer (TLB)
TLB (Hardware):
├── Size: 64-1024 entries
├── Speed: 1-2 CPU cycles
├── Location: Inside CPU (part of MMU)
├── Purpose: Cache virtual → physical address translations
└── Structure: [Virtual Page → Physical Frame + Permissions]
Example TLB Entry:
Virtual Page: 0x7f0000001000 → Physical Frame: 0x12345000 + RWX
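One practical consequence of a fixed number of TLB entries is limited "TLB reach": the amount of memory that can be accessed without a page-table walk. A tiny back-of-the-envelope sketch (the entry count is an assumption, not a measured value) showing why 2MB huge pages help the hugemmap test discussed later:
#include <stdio.h>

/* "TLB reach" = entries x page size: how much memory can be accessed
 * without falling back to a page-table walk. Entry count is assumed. */
int main(void)
{
    unsigned long entries  = 1024;
    unsigned long reach_4k = entries * 4096UL;               /* 4KB pages      */
    unsigned long reach_2m = entries * 2UL * 1024 * 1024;    /* 2MB huge pages */

    printf("reach with 4KB pages: %4lu MB\n", reach_4k >> 20);  /*    4 MB */
    printf("reach with 2MB pages: %4lu MB\n", reach_2m >> 20);  /* 2048 MB */
    return 0;
}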
4. Main Memory (RAM)
DDR4/DDR5 DRAM:
├── Size: 8GB-128GB typical
├── Speed: 100-300 CPU cycles (~50-100ns)
├── Location: DIMM slots on motherboard
├── Purpose: Primary working storage
└── Organization: Banks, Rows, Columns
Physical addressing:
[Channel][DIMM][Rank][Bank][Row][Column]
5. Storage Devices
NVMe SSD:
├── Size: 512GB-8TB typical
├── Speed: 300,000+ CPU cycles (~100μs)
├── Location: M.2 or PCIe slot
└── Purpose: Persistent storage
SATA SSD:
├── Size: 256GB-4TB typical
├── Speed: 600,000+ CPU cycles (~200μs)
Traditional HDD:
├── Size: 1TB-18TB typical
├── Speed: 30,000,000+ CPU cycles (~10ms)
├── Location: SATA connection
└── Purpose: Bulk storage
Software Components (Kernel Managed)
1. Virtual Memory System
Per-Process Virtual Address Space:
┌─────────────────────────────────────────────┐
│ 0x00400000 - 0x00600000: Code (.text) │
│ 0x00600000 - 0x00800000: Data (.data) │
│ 0x00800000 - 0x00a00000: Heap │
│ 0x7f0000000000 - 0x7f8000000000: mmap area │ ← Our test
│ 0x7fffffffe000 - 0x7ffffffff000: Stack │
└─────────────────────────────────────────────┘
Managed by: the kernel's memory-management subsystem, which builds the mappings that the hardware MMU walks
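You can inspect this layout for any live process: the kernel exports every VMA through /proc/<pid>/maps. A minimal sketch that dumps its own table (addresses will differ from the simplified picture above, especially with ASLR):
#include <stdio.h>

int main(void)
{
    /* Each line is one VMA: start-end perms offset dev inode [path] */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}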
2. Page Tables (Software Data Structure in RAM)
Multi-Level Page Table (x86-64):
PML4 → PDPT → PD → PT → Physical Page
│ │ │ │
│ │ │ └─ Page Table Entry (PTE)
│ │ └────── Page Directory Entry (PDE)
│ └──────────── Page Directory Pointer Table Entry (PDPTE)
└─────────────────── Page Map Level 4 (PML4E)
Location: Stored in kernel's physical RAM
Size: ~1MB per process (typical)
Swappable: ❌ NEVER swapped (kernel needs immediate access)
Each entry contains:
├── Physical address (bits 12-51)
├── Present bit (P)
├── Read/Write bit (R/W)
├── User/Supervisor bit (U/S)
├── Accessed bit (A)
├── Dirty bit (D)
└── Other control bits
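As a concrete illustration of that bit layout, the sketch below decodes a raw x86-64 PTE. The value itself is made up for the example; real PTEs are only visible from kernel mode (user space gets a reduced view via /proc/self/pagemap):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t pte = 0x0000000012345067ULL;          /* hypothetical PTE value    */

    uint64_t frame = pte & 0x000FFFFFFFFFF000ULL;  /* bits 12-51: frame address */

    printf("physical frame : 0x%012llx\n", (unsigned long long)frame);
    printf("present  (P)   : %d\n", (int)((pte >> 0) & 1));
    printf("writable (R/W) : %d\n", (int)((pte >> 1) & 1));
    printf("user     (U/S) : %d\n", (int)((pte >> 2) & 1));
    printf("accessed (A)   : %d\n", (int)((pte >> 5) & 1));
    printf("dirty    (D)   : %d\n", (int)((pte >> 6) & 1));
    return 0;
}
For this example value the frame decodes to 0x12345000 with P, R/W, U/S, A, and D all set.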
Physical Layout in RAM:
┌─────────────────┐ ← PML4 table (4KB page)
│ 512 × 8-byte │
│ entries │
├─────────────────┤
│ PDPT table │ ← Referenced by PML4[0]
├─────────────────┤
│ Page Directory │ ← Referenced by PDPT[0]
├─────────────────┤
│ Page Table │ ← Referenced by PD[0]
└─────────────────┘
3. Physical Frames (Hardware RAM Organization)
Physical Memory Divided into Frames:
┌────────────────────────────────────────────┐
│ Frame 0: 0x00000000-0x00000fff (4KB) │ ← Used by kernel
│ Frame 1: 0x00001000-0x00001fff (4KB) │ ← Page table data
│ Frame 2: 0x00002000-0x00002fff (4KB) │ ← User process page
│ ... │
│ Frame N: Physical RAM limit │
└────────────────────────────────────────────┘
Frame Management:
├── Location: Physical RAM chips
├── Size: 4KB for regular pages, 2MB/1GB for huge pages
├── Tracked by: struct page (Linux kernel)
├── Swappable: Depends on usage
│ ├── ✅ User data pages: Can be swapped
│ ├── ❌ Kernel code/data: Never swapped
│ ├── ❌ Page tables: Never swapped
│ └── ❌ Slab allocations: Usually not swapped
Frame Descriptor (struct page) per frame:
struct page {
unsigned long flags; // Page status flags
atomic_t _refcount; // Reference count
atomic_t _mapcount; // Mapping count
struct list_head lru; // LRU list linkage
void *virtual; // Virtual address (if mapped)
// ... many other fields
};
These descriptors: Stored in mem_map[] array in RAM, never swapped
4. Slab Allocator (Kernel Memory Management)
Purpose: Efficient allocation of kernel objects
Location: Physical RAM (kernel space)
Swappable: ❌ NEVER swapped
Structure:
Cache → Slab → Objects
Example - dentry cache (directory entries):
┌─────────────────────────────────────────┐
│ dentry_cache (kmem_cache) │
├─────────────────────────────────────────┤
│ Slab 1: [obj][obj][obj][free][free] │ ← 4KB or larger
│ Slab 2: [obj][obj][obj][obj][obj] │
│ Slab 3: [free][free][free][free] │
└─────────────────────────────────────────┘
Common Slab Caches:
├── kmalloc-* (general kernel allocations)
├── vm_area_struct (VMA objects)
├── task_struct (process descriptors)
├── dentry (directory entries)
├── inode_cache (filesystem inodes)
├── buffer_head (block device buffers)
└── skbuff_head_cache (network packet headers)
Slab Types:
1. Regular slabs: Normal kernel objects
2. DMA slabs: DMA-capable memory
3. NUMA slabs: NUMA-aware allocations
Physical Layout:
┌─────────────┐ ← Physical frame 100
│ Slab Header │
├─────────────┤
│ Object 1 │ ← e.g., task_struct
├─────────────┤
│ Object 2 │ ← e.g., task_struct
├─────────────┤
│ Object 3 │ ← e.g., task_struct
└─────────────┘
View slab info: cat /proc/slabinfo
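Inside the kernel, creating and using one of these caches looks roughly like the untested module-style sketch below; the cache name and object type are invented for illustration, but kmem_cache_create/alloc/free are the real API:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

struct my_obj {                 /* hypothetical kernel object type */
    int id;
    char name[32];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
    struct my_obj *obj;

    /* One cache per object type; objects are packed into slabs (whole pages). */
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(my_cache, GFP_KERNEL);   /* grab one object */
    if (obj) {
        obj->id = 1;
        kmem_cache_free(my_cache, obj);             /* back to the slab free list */
    }
    return 0;
}

static void __exit my_exit(void)
{
    kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");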
5. Page Cache (Software in RAM)
Purpose: Cache file contents in RAM
Location: Part of main memory, managed by kernel
Size: Dynamic, uses available RAM
Swappable: ✅ Reclaimable (clean pages are dropped, dirty pages are written back to their file, not to swap)
Structure:
File: /home/user/data.txt
├── Page 0 (offset 0-4095): [Cached in RAM at frame 1000]
├── Page 1 (offset 4096-8191): [Cached in RAM at frame 1001]
└── Page 2 (offset 8192-...): [Not in cache, on disk]
Under memory pressure:
├── Clean pages: Simply freed (can re-read from file)
├── Dirty pages: Written back to file, then freed
└── Anonymous pages: Written to swap space
Managed by: Linux page cache subsystem
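Whether a given file page is currently resident in the page cache can be checked from user space with mincore(2). A minimal sketch (error handling trimmed) that maps a file and prints per-page residency, mirroring the picture above:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    size_t len = st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t pages = (len + pagesz - 1) / pagesz;
    unsigned char vec[pages];                 /* one byte per page; bit 0 = resident */

    mincore(addr, len, vec);
    for (size_t i = 0; i < pages; i++)
        printf("page %zu: %s\n", i, (vec[i] & 1) ? "in page cache" : "on disk only");

    munmap(addr, len);
    close(fd);
    return 0;
}
Running it on a file you just read and on one untouched since boot shows the cached/uncached split directly.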
6. Swap Space (Software on Disk)
Swap Partition:
/dev/sda2 → Raw disk partition used as swap
Swap File:
/swapfile → Regular file used as swap
Swap Areas:
├── Swap header (metadata)
├── Swap slots (4KB each for regular pages)
└── Swap out/in algorithms (LRU, etc.)
Purpose: Extend virtual memory when RAM is full
How Everything is Tied Together
Memory Access Flow
1. CPU Instruction Execution
// C code:
int *ptr = (int*)0x7f0000001000;
int value = *ptr; // Memory read instruction
// Assembly (simplified):
mov rax, 0x7f0000001000 // Load virtual address into register
mov ebx, [rax] // Read from memory at virtual address
2. Virtual Address Translation Pipeline
Step 1: CPU checks TLB
├── TLB Hit: Get physical address immediately (1-2 cycles)
└── TLB Miss: Continue to page table walk
Step 2: Page Table Walk (if TLB miss)
├── Hardware walks page tables in RAM
├── Loads translation into TLB
├── Takes 100+ cycles
└── May cause multiple cache misses
Step 3: Physical Address Generated
├── Virtual 0x7f0000001000 → Physical 0x12345000
└── Ready for memory access
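The walk is driven entirely by bit fields of the virtual address: 9 bits of table index per level plus a 12-bit page offset. A small sketch that splits our example address the same way the hardware does (4-level paging, 4KB pages):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x7f0000001000ULL;         /* example virtual address */

    unsigned pml4 = (va >> 39) & 0x1FF;      /* bits 39-47: PML4 index  */
    unsigned pdpt = (va >> 30) & 0x1FF;      /* bits 30-38: PDPT index  */
    unsigned pd   = (va >> 21) & 0x1FF;      /* bits 21-29: PD index    */
    unsigned pt   = (va >> 12) & 0x1FF;      /* bits 12-20: PT index    */
    unsigned off  =  va        & 0xFFF;      /* bits  0-11: page offset */

    printf("PML4[%u] -> PDPT[%u] -> PD[%u] -> PT[%u], offset 0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}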
3. Physical Memory Access Pipeline
Step 1: CPU checks L1 Cache
├── L1 Hit: Return data (2-4 cycles)
└── L1 Miss: Check L2
Step 2: CPU checks L2 Cache
├── L2 Hit: Return data, update L1 (10-20 cycles)
└── L2 Miss: Check L3
Step 3: CPU checks L3 Cache
├── L3 Hit: Return data, update L2 & L1 (30-40 cycles)
└── L3 Miss: Access main memory
Step 4: Main Memory Access
├── Read from DRAM (100-300 cycles)
├── Update all cache levels
└── Return data to CPU
4. Page Fault Handling (Software)
Page Not Present Scenarios:
Scenario 1: First Access to mmap'd Region
├── Page fault interrupt
├── Kernel allocates physical page
├── Updates page table
├── Restarts instruction
└── Normal cache/TLB flow continues
Scenario 2: Swapped Out Page
├── Page fault interrupt
├── Kernel identifies page in swap
├── Allocates new physical page
├── Reads data from swap device
├── Updates page table
└── Restarts instruction
Scenario 3: File-Backed Page Not in Cache
├── Page fault interrupt
├── Kernel checks page cache
├── If not cached: Read from file into page cache
├── Maps page cache page into process
└── Updates page table
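Scenario 1 is easy to demonstrate: mmap() only reserves virtual space, and physical frames appear one page fault at a time as pages are first touched. A minimal sketch that watches its own resident set grow via /proc/self/statm (the second field is resident pages):
#include <stdio.h>
#include <sys/mman.h>

/* Read the second field of /proc/self/statm: resident pages (RSS). */
static long rss_pages(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
    }
    return resident;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;    /* 64MB of anonymous memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    printf("RSS after mmap : %ld pages\n", rss_pages());   /* barely changes     */

    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;                   /* first touch -> page fault -> frame */

    printf("RSS after touch: %ld pages\n", rss_pages());   /* ~16384 pages more  */
    return 0;
}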
Complete Memory Access Example
Initial State
Virtual Address: 0x7f0000001000
TLB: Empty
Page Table: Points to swapped page
Physical RAM: Page not present
Swap: Page stored in /dev/sda2 slot 1234
Access Sequence
1. CPU executes: mov eax, [0x7f0000001000]
2. TLB Lookup:
├── Miss (no entry for 0x7f0000001000)
└── Continue to page table walk
3. Page Table Walk:
├── PML4[entry] → PDPT physical address
├── PDPT[entry] → PD physical address
├── PD[entry] → PT physical address
├── PT[entry] → Shows page swapped (not present)
└── Generate page fault interrupt
4. Page Fault Handler (Kernel):
├── Identify: Page in swap slot 1234
├── Allocate: New physical page at 0x87654000
├── I/O: Read from swap device to physical page
├── Update: Page table entry to point to 0x87654000
├── TLB: Invalidate old entries
└── Return: Resume instruction
5. Instruction Restart:
├── TLB lookup: Still miss
├── Page table walk: Now finds physical 0x87654000
├── Update TLB: 0x7f0000001000 → 0x87654000
└── Physical access: Check caches
6. Cache Hierarchy:
├── L1 miss (first access)
├── L2 miss (first access)
├── L3 miss (first access)
├── DRAM access: Read from 0x87654000
├── Populate: L3, L2, L1 caches
└── Return: Data to CPU register
Total: hundreds of thousands of CPU cycles for this first access (dominated by the swap-device I/O)
Subsequent: 2-4 cycles (L1 cache hit)
Detailed Residency and Swappability Analysis
What Lives Where and Can Be Swapped
Always in RAM (Never Swapped)
1. Kernel Code & Data:
├── Location: Low physical memory (loaded near the start of RAM at boot)
├── Size: 10-50MB
├── Why: Kernel needs immediate access for interrupts, syscalls
└── Includes: System call handlers, interrupt handlers, core data structures
2. Page Tables:
├── Location: Kernel physical memory
├── Size: ~1MB per active process
├── Why: Hardware MMU needs immediate access for translation
└── Swapping would cause infinite recursion (need page tables to swap!)
3. Slab Allocator Objects:
├── Location: Kernel physical memory
├── Size: Variable (MBs to GBs)
├── Why: Critical kernel data structures need fast access
└── Examples: task_struct, dentry, inode, skb
4. DMA Buffers:
├── Location: Physically contiguous RAM
├── Size: Variable
├── Why: Hardware devices need stable physical addresses
└── Used for: Network cards, disk controllers, graphics
5. Huge Pages (Traditional):
├── Location: Physical RAM
├── Size: 2MB or 1GB each
├── Why: Performance optimization, swapping defeats purpose
└── Our test case uses these!
6. Frame Descriptors (struct page):
├── Location: mem_map[] array in RAM
├── Size: ~64 bytes × number of physical frames
├── Why: Needed to manage physical memory itself
└── Example: 32GB RAM = ~8 million frames ≈ 512MB for frame descriptors
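That figure is just frames × descriptor size; a quick back-of-the-envelope calculation (assuming 4KB frames and a 64-byte struct page, the common x86-64 size):
#include <stdio.h>

int main(void)
{
    unsigned long long ram   = 32ULL << 30;        /* 32GB of physical RAM */
    unsigned long long frame = 4096;               /* 4KB per frame        */
    unsigned long long descr = 64;                 /* sizeof(struct page)  */

    unsigned long long frames   = ram / frame;     /* 8,388,608 frames     */
    unsigned long long overhead = frames * descr;  /* bytes of mem_map[]   */

    printf("%llu frames -> %llu MB of frame descriptors\n",
           frames, overhead >> 20);                /* 8388608 frames -> 512 MB */
    return 0;
}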
Can Be Swapped to Disk
1. User Process Pages (Anonymous):
├── Location: Physical RAM → Swap space when swapped
├── Examples: malloc() memory, stack pages, heap
├── Swap destination: /dev/sda2 or /swapfile
└── Retrieved via: Page fault handling
2. User Process Pages (File-backed):
├── Location: Physical RAM → Original file when swapped
├── Examples: mmap'd files, program code, shared libraries
├── Swap destination: Original file (not swap space)
└── Retrieved via: Page fault + file I/O
3. Page Cache (Clean):
├── Location: Physical RAM → Simply freed (not written anywhere)
├── Why: Can re-read from original file
├── Examples: Recently read file contents
└── Retrieved via: File I/O when accessed again
4. Page Cache (Dirty):
├── Location: Physical RAM → Written to file, then freed
├── Why: Must preserve modifications
├── Examples: Modified file contents not yet saved
└── Process: Writeback to file, then page can be freed
Memory Layout in Physical RAM
Typical Physical Memory Organization
Physical Address Range: 0x00000000 - 0x7FFFFFFFF (32GB example)
0x00000000 ┌─────────────────────────────────────┐
│ BIOS/UEFI Reserved │
0x00100000 ├─────────────────────────────────────┤
│ Kernel Code & Data │ ← Never swapped
0x01000000 ├─────────────────────────────────────┤
│ Frame Descriptors (mem_map) │ ← Never swapped
0x05000000 ├─────────────────────────────────────┤
│ Slab Allocator Areas │ ← Never swapped
│ ├── task_struct cache │
│ ├── dentry cache │
│ ├── inode cache │
│ └── many others... │
0x10000000 ├─────────────────────────────────────┤
│ Page Tables (all processes) │ ← Never swapped
0x20000000 ├─────────────────────────────────────┤
│ DMA Buffers │ ← Never swapped
0x30000000 ├─────────────────────────────────────┤
│ Page Cache │ ← Can be freed/swapped
│ (file contents cached in RAM) │
0x50000000 ├─────────────────────────────────────┤
│ User Process Pages │ ← Can be swapped
│ ├── Anonymous pages │
│ ├── File-backed pages │
│ └── Shared memory pages │
0x7FFFFFFFF └─────────────────────────────────────┘
Typical proportions in a running system:
├── Kernel + metadata: 10-20% (never swapped)
├── Page cache: 40-60% (can be freed)
├── User pages: 20-40% (can be swapped)
└── Free: 5-10% (available)
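These proportions can be read off a live system from /proc/meminfo (the free command is essentially a formatter for this file). A small sketch that prints the fields most relevant to this breakdown:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Keys of interest: total RAM, free RAM, page cache, slab, swap usage. */
    const char *keys[] = { "MemTotal:", "MemFree:", "Cached:",
                           "Slab:", "SwapTotal:", "SwapFree:" };
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
            if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                fputs(line, stdout);
    }
    fclose(f);
    return 0;
}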
Enhanced Storage Component Summary
Component | Type | Size | Speed | Purpose | Location | Swappable |
---|---|---|---|---|---|---|
Registers | Hardware | 64-bit × 16 | 0.3ns | Active computation | CPU die | ❌ Never |
L1 Cache | Hardware | 64KB | 1ns | Recent instructions/data | CPU die | ❌ Never |
L2 Cache | Hardware | 512KB | 5ns | Cache backup | CPU die | ❌ Never |
L3 Cache | Hardware | 16MB | 10ns | Last level cache | CPU die | ❌ Never |
TLB | Hardware | 1024 entries | 1ns | Address translation | CPU MMU | ❌ Never |
Page Tables | Software | ~1MB/process | 100ns | Virtual→Physical mapping | Kernel RAM | ❌ Never |
Frame Descriptors | Software | ~64B/frame | 100ns | Track physical pages | Kernel RAM | ❌ Never |
Slab Objects | Software | Variable | 100ns | Kernel data structures | Kernel RAM | ❌ Never |
Main RAM (User) | Hardware | ~20GB | 100ns | User process pages | DRAM | ✅ To swap |
Page Cache | Software | Dynamic | 100ns | File caching | DRAM | ✅ To file |
Swap Space | Software | 8GB | 100μs | Virtual memory extension | Disk | N/A (is swap) |
Disk Storage | Hardware | 1TB | 10ms | Persistent storage | SSD/HDD | N/A (persistent) |
Memory Management Decision Tree
When Memory Pressure Occurs
Kernel Memory Reclaim Algorithm:
1. Check Page Cache:
├── Clean pages: Free immediately (can re-read from file)
├── Dirty pages: Write to file, then free
└── Result: Fast memory recovery
2. Check User Pages:
├── Anonymous pages: Write to swap space
├── File-backed pages: Write to original file (if dirty)
└── Result: Slower but recovers memory
3. Never Touch:
├── Kernel code/data
├── Page tables
├── Slab objects
├── DMA buffers
└── Hardware structures
4. Last Resort:
├── OOM (Out of Memory) killer
├── Kill processes to free memory
└── System remains stable
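The OOM killer picks its victim by a per-process badness score that the kernel already exposes under /proc. A minimal sketch that prints the calling process's own score and its user-tunable adjustment (higher score means more likely to be killed):
#include <stdio.h>

/* Print one small /proc file, e.g. /proc/self/oom_score. */
static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%-28s %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/self/oom_score");      /* current badness score            */
    show("/proc/self/oom_score_adj");  /* user-tunable bias: -1000..1000   */
    return 0;
}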
Real-World Example: Our Hugemmap06 Test
Memory Allocation Breakdown
// Our test allocates:
addr = mmap(NULL, 51 * 2MB, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
What gets allocated where:
1. Huge Pages (102MB):
├── Location: Physical RAM frames
├── Swappable: ❌ Never (huge pages can't swap)
├── Usage: User data pages
└── COW: Creates private copies per thread
2. Page Table Entries:
├── Location: Kernel RAM (page tables)
├── Size: ~4KB for mapping 51 huge pages
├── Swappable: ❌ Never
└── Content: Virtual→Physical mappings
3. VMA Structures:
├── Location: Slab cache (vm_area_struct)
├── Size: ~200 bytes per mmap() call
├── Swappable: ❌ Never (kernel object)
└── Purpose: Track memory regions per process
4. Thread Stacks (50 threads):
├── Location: Regular RAM pages
├── Size: 8MB × 50 = 400MB
├── Swappable: ✅ Yes (can swap to disk)
└── Content: Thread execution stacks
Total memory usage:
├── Never swappable: 102MB (huge pages) + ~4KB (page tables) + ~10KB (VMAs)
├── Swappable: up to 400MB of thread stacks (virtual; only touched pages are resident)
└── System impact: ~102MB that must stay in RAM for the duration of the test
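For completeness, here is a standalone sketch of the same allocation pattern; it is not the actual LTP hugemmap06 source, and it assumes huge pages have already been reserved (for example: echo 60 > /proc/sys/vm/nr_hugepages) and that you compile with -pthread:
#include <stdio.h>
#include <pthread.h>
#include <sys/mman.h>

#define HUGE_SZ   (2UL * 1024 * 1024)   /* 2MB huge page                        */
#define NR_HUGE   51                    /* 51 pages = 102MB                     */
#define NR_THREAD 8                     /* fewer than the test's 50, for brevity */

static char *region;

/* Each thread writes one byte into every huge page; whichever thread
 * touches a page first triggers the hugetlb fault that allocates it. */
static void *toucher(void *arg)
{
    for (unsigned long i = 0; i < NR_HUGE; i++)
        region[i * HUGE_SZ] = (char)(long)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid[NR_THREAD];

    region = mmap(NULL, NR_HUGE * HUGE_SZ, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (region == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

    for (long t = 0; t < NR_THREAD; t++)
        pthread_create(&tid[t], NULL, toucher, (void *)t);
    for (long t = 0; t < NR_THREAD; t++)
        pthread_join(tid[t], NULL);

    munmap(region, NR_HUGE * HUGE_SZ);
    return 0;
}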
Key Relationships and Interactions
Hardware ↔ Software Interface
Hardware provides:
├── Raw storage (RAM, disk)
├── Address translation (TLB, MMU)
├── Caching (L1, L2, L3)
├── Interrupts (page faults)
└── Physical frames (the raw storage units the kernel then manages)
Software manages:
├── Virtual memory layout
├── Page table structures
├── Page fault handling
├── Swap algorithms
├── Page cache policies
├── Process memory mappings
├── Slab allocation policies
└── Memory reclaim strategies
Critical Dependencies
Page Tables depend on:
├── Physical frames (to store page table pages)
├── Never swapped (would create circular dependency)
└── Slab allocator (for dynamic page table allocation)
Slab Allocator depends on:
├── Physical frames (for slab pages)
├── Page tables (to map slab areas)
└── Never swapped (kernel needs immediate access)
Frame Descriptors depend on:
├── Physical RAM (mem_map[] array)
├── Bootstrap allocation (allocated at boot)
└── Never swapped (needed to manage swapping itself!)
TLB depends on:
├── Page tables (source of translations)
├── Hardware implementation
└── Software invalidation (when mappings change)
Memory Pressure Response Chain
1. Application requests memory → Page fault
2. Kernel checks available frames
3. If frames available: Allocate directly
4. If low on frames: Start reclaim
├── a) Free clean page cache pages
├── b) Write dirty page cache to files
├── c) Swap anonymous user pages
└── d) Never touch kernel structures
5. If still no memory: OOM killer
6. Update page tables and TLB
7. Resume application
This hierarchy ensures that:
- Critical kernel structures stay in fast RAM (never swapped)
- User data can be moved to slower storage when needed
- Hardware acceleration (TLB, caches) speeds up common operations
- Software policies manage the complexity transparently
The beauty is that applications see a simple flat virtual memory space, while the system orchestrates this complex multi-layered storage hierarchy automatically!