Lesson 3 – Memory Management Deep Dive
Linux Kernel Mastery: Design Principles & Case Studies

Complete Memory Hierarchy and Components

Memory Hierarchy Overview

The memory system is organized as a hierarchy from fastest/smallest to slowest/largest:

CPU Registers (Hardware) ←── Fastest, Smallest
    ↓
L1 Cache (Hardware)
    ↓
L2 Cache (Hardware)  
    ↓
L3 Cache (Hardware)
    ↓
TLB (Hardware)
    ↓
Main Memory/RAM (Hardware)
    ↓
Page Cache (Software in RAM)
    ↓
Swap Space (Software on Disk)
    ↓
Disk Storage (Hardware) ←── Slowest, Largest

Note: the TLB caches virtual→physical address translations rather than data, so strictly it sits alongside the data caches rather than below L3; it appears in the chain above for completeness.

Physical Components (Hardware)

1. CPU Registers

Location: Inside CPU die
Size: 16 × 64-bit general-purpose registers (x86-64)
Speed: 1 CPU cycle (~0.3ns at 3GHz)
Purpose: Hold currently executing instruction operands

Example:
RAX = 0x7f0000001000  (virtual address)
RBX = 0x12345678      (data value)

2. CPU Caches (L1, L2, L3)

L1 Cache:
├── Size: 32KB instruction + 32KB data per core
├── Speed: 2-4 CPU cycles (~1ns)  
├── Location: Inside each CPU core
└── Purpose: Cache recently used instructions/data

L2 Cache:
├── Size: 256KB-512KB per core
├── Speed: 10-20 CPU cycles (~5ns)
├── Location: Inside CPU, per core or shared
└── Purpose: Backup for L1 cache misses

L3 Cache:
├── Size: 8MB-32MB shared
├── Speed: 30-40 CPU cycles (~10ns)
├── Location: Inside CPU, shared among cores
└── Purpose: Last level before main memory

3. Translation Lookaside Buffer (TLB)

TLB (Hardware):
├── Size: 64-1024 entries
├── Speed: 1-2 CPU cycles
├── Location: Inside CPU (part of MMU)
├── Purpose: Cache virtual → physical address translations
└── Structure: [Virtual Page → Physical Frame + Permissions]

Example TLB Entry:
Virtual Page: 0x7f0000001000 → Physical Frame: 0x12345000 + RWX

4. Main Memory (RAM)

DDR4/DDR5 DRAM:
├── Size: 8GB-128GB typical
├── Speed: 100-300 CPU cycles (~50-100ns)
├── Location: DIMM slots on motherboard  
├── Purpose: Primary working storage
└── Organization: Banks, Rows, Columns

Physical addressing:
[Channel][DIMM][Rank][Bank][Row][Column]

5. Storage Devices

NVMe SSD:
├── Size: 512GB-8TB typical
├── Speed: 300,000+ CPU cycles (~100μs)
├── Location: M.2 or PCIe slot
└── Purpose: Persistent storage

SATA SSD:
├── Size: 256GB-4TB typical
├── Speed: 600,000+ CPU cycles (~200μs)
├── Location: SATA connection
└── Purpose: Persistent storage

Traditional HDD:
├── Size: 1TB-18TB typical
├── Speed: 30,000,000+ CPU cycles (~10ms)
├── Location: SATA connection
└── Purpose: Bulk storage

Software Components (Kernel Managed)

1. Virtual Memory System

Per-Process Virtual Address Space:
┌─────────────────────────────────────────────┐
│ 0x00400000 - 0x00600000: Code (.text)      │
│ 0x00600000 - 0x00800000: Data (.data/.bss) │
│ 0x00800000 - ...:        Heap (grows up)   │
│ 0x7f0000000000 - 0x7f8000000000: mmap area │ ← Our test
│ 0x7ffffffde000 - 0x7ffffffff000: Stack     │
└─────────────────────────────────────────────┘

Managed by: the kernel's memory-management subsystem, in cooperation with the hardware MMU
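
The simplified diagram above can be compared with a process's real layout, which the kernel exposes in /proc/self/maps. A minimal sketch that just dumps it:

// Sketch: print this process's actual virtual memory layout.
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);   // address range, permissions, offset, backing file

    fclose(f);
    return 0;
}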

2. Page Tables (Software Data Structure in RAM)

Multi-Level Page Table (x86-64):
PML4 → PDPT → PD → PT → Physical Page
 │      │     │    │
 │      │     │    └─ Page Table Entry (PTE)
 │      │     └────── Page Directory Entry (PDE)  
 │      └──────────── Page Directory Pointer Table Entry (PDPTE)
 └─────────────────── Page Map Level 4 Entry (PML4E)

Location: Stored in kernel's physical RAM
Size: ~1MB per process (typical)
Swappable: ❌ NEVER swapped (kernel needs immediate access)

Each entry contains:
├── Physical address (bits 12-51)
├── Present bit (P)
├── Read/Write bit (R/W)
├── User/Supervisor bit (U/S)
├── Accessed bit (A)
├── Dirty bit (D)
└── Other control bits

Physical Layout in RAM:
┌─────────────────┐ ← PML4 table (4KB page)
│ 512 × 8-byte    │
│ entries         │
├─────────────────┤
│ PDPT table      │ ← Referenced by PML4[0]
├─────────────────┤
│ Page Directory  │ ← Referenced by PDPT[0]
├─────────────────┤
│ Page Table      │ ← Referenced by PD[0]
└─────────────────┘

3. Physical Frames (Hardware RAM Organization)

Physical Memory Divided into Frames:
┌────────────────────────────────────────────┐
│ Frame 0: 0x00000000-0x00000fff (4KB)      │ ← Used by kernel
│ Frame 1: 0x00001000-0x00001fff (4KB)      │ ← Page table data
│ Frame 2: 0x00002000-0x00002fff (4KB)      │ ← User process page
│ ...                                        │
│ Frame N: Physical RAM limit                │
└────────────────────────────────────────────┘

Frame Management:
├── Location: Physical RAM chips
├── Size: 4KB for regular pages, 2MB/1GB for huge pages
├── Tracked by: struct page (Linux kernel)
└── Swappable: Depends on usage
    ├── ✅ User data pages: Can be swapped
    ├── ❌ Kernel code/data: Never swapped
    ├── ❌ Page tables: Never swapped
    └── ❌ Slab allocations: Usually not swapped

Frame Descriptor (struct page) per frame:
struct page {
    unsigned long flags;        // Page status flags
    atomic_t _refcount;        // Reference count
    atomic_t _mapcount;        // Mapping count
    struct list_head lru;      // LRU list linkage
    void *virtual;             // Virtual address (if mapped)
    // ... many other fields
};

These descriptors: Stored in mem_map[] array in RAM, never swapped

4. Slab Allocator (Kernel Memory Management)

Purpose: Efficient allocation of kernel objects
Location: Physical RAM (kernel space)
Swappable: ❌ NEVER swapped

Structure:
Cache → Slab → Objects

Example - dentry cache (directory entries):
┌─────────────────────────────────────────┐
│ dentry_cache (kmem_cache)               │
├─────────────────────────────────────────┤
│ Slab 1: [obj][obj][obj][free][free]     │ ← 4KB or larger
│ Slab 2: [obj][obj][obj][obj][obj]       │
│ Slab 3: [free][free][free][free]        │
└─────────────────────────────────────────┘

Common Slab Caches:
├── kmalloc-* (general kernel allocations)
├── vm_area_struct (VMA objects)
├── task_struct (process descriptors)
├── dentry (directory entries)
├── inode_cache (filesystem inodes)
├── buffer_head (block device buffers)
└── skbuff_head_cache (network packet headers)

Slab Types:
1. Regular slabs: Normal kernel objects
2. DMA slabs: DMA-capable memory
3. NUMA slabs: NUMA-aware allocations

Physical Layout:
┌─────────────┐ ← Physical frame 100
│ Slab Header │
├─────────────┤
│ Object 1    │ ← e.g., task_struct
├─────────────┤
│ Object 2    │ ← e.g., task_struct  
├─────────────┤
│ Object 3    │ ← e.g., task_struct
└─────────────┘

View slab info: cat /proc/slabinfo
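
As a concrete sketch of the API behind these caches (not taken from any particular driver, and the my_obj type is hypothetical), kernel code creates and uses a slab cache roughly like this:

// Hypothetical kernel-module sketch: one slab cache per object type.
#include <linux/module.h>
#include <linux/slab.h>

struct my_obj {                     // hypothetical kernel object
    int id;
    char name[32];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
    struct my_obj *obj;

    // One cache per object type; the allocator packs objects into slabs
    my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(my_cache, GFP_KERNEL);  // carve one object from a slab
    if (obj)
        kmem_cache_free(my_cache, obj);            // return it to its slab
    return 0;
}

static void __exit my_exit(void)
{
    kmem_cache_destroy(my_cache);   // frees the cache's slab pages
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");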

5. Page Cache (Software in RAM)

Purpose: Cache file contents in RAM
Location: Part of main memory, managed by kernel
Size: Dynamic, uses available RAM
Swappable: ✅ Reclaimable — clean pages are simply dropped, dirty pages are written back to the original file (page-cache pages do not go to swap)

Structure:
File: /home/user/data.txt
├── Page 0 (offset 0-4095):    [Cached in RAM at frame 1000]
├── Page 1 (offset 4096-8191): [Cached in RAM at frame 1001]  
└── Page 2 (offset 8192-...):  [Not in cache, on disk]

Under memory pressure:
├── Clean pages: Simply freed (can re-read from file)
├── Dirty pages: Written back to file, then freed
└── Anonymous pages (not part of the page cache): Written to swap space instead

Managed by: Linux page cache subsystem
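
Page-cache residency is observable from user space via mincore(2). A sketch, reusing the example path above (substitute any readable file):

// Sketch: report which pages of a file are resident in the page cache.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/home/user/data.txt", O_RDONLY);  // example path from above
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);

    if (mincore(map, st.st_size, vec) == 0)          // low bit = resident
        for (size_t i = 0; i < npages; i++)
            printf("page %zu: %s\n", i,
                   (vec[i] & 1) ? "in page cache" : "on disk only");

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}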

6. Swap Space (Software on Disk)

Swap Partition:
/dev/sda2 → Raw disk partition used as swap

Swap File:
/swapfile → Regular file used as swap

Swap Areas:
├── Swap header (metadata)
├── Swap slots (4KB each for regular pages)
└── Swap map tracking slot usage (reclaim picks victims with LRU-style algorithms)

Purpose: Extend virtual memory when RAM is full
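
The kernel lists active swap areas in /proc/swaps, which is also what swapon --show reads. A trivial sketch that prints it:

// Sketch: list active swap areas by dumping /proc/swaps.
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/swaps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);   // Filename  Type  Size  Used  Priority

    fclose(f);
    return 0;
}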

How Everything is Tied Together

Memory Access Flow

1. CPU Instruction Execution

// C code:
int *ptr = (int*)0x7f0000001000;
int value = *ptr;  // Memory read instruction

// Assembly (simplified):
mov rax, 0x7f0000001000  // Load virtual address into register
mov ebx, [rax]           // Read from memory at virtual address

2. Virtual Address Translation Pipeline

Step 1: CPU checks TLB
├── TLB Hit: Get physical address immediately (1-2 cycles)
└── TLB Miss: Continue to page table walk

Step 2: Page Table Walk (if TLB miss)
├── Hardware walks page tables in RAM
├── Loads translation into TLB  
├── Takes 100+ cycles
└── May cause multiple cache misses

Step 3: Physical Address Generated
├── Virtual 0x7f0000001000 → Physical 0x12345000
└── Ready for memory access
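
The indices used in that walk come directly from the bits of the virtual address. A small sketch of the x86-64 4-level, 4KB-page split (9 bits per level plus a 12-bit offset):

// Sketch: split a 48-bit x86-64 virtual address into the four
// page-table indices and the page offset used by a 4KB-page walk.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x7f0000001000ULL;     // example address from above

    unsigned pml4 = (va >> 39) & 0x1ff;  // bits 47-39 -> PML4 index
    unsigned pdpt = (va >> 30) & 0x1ff;  // bits 38-30 -> PDPT index
    unsigned pd   = (va >> 21) & 0x1ff;  // bits 29-21 -> PD index
    unsigned pt   = (va >> 12) & 0x1ff;  // bits 20-12 -> PT index
    unsigned off  = va & 0xfff;          // bits 11-0  -> byte within page

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}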

3. Physical Memory Access Pipeline

Step 1: CPU checks L1 Cache
├── L1 Hit: Return data (2-4 cycles)
└── L1 Miss: Check L2

Step 2: CPU checks L2 Cache  
├── L2 Hit: Return data, update L1 (10-20 cycles)
└── L2 Miss: Check L3

Step 3: CPU checks L3 Cache
├── L3 Hit: Return data, update L2 & L1 (30-40 cycles)  
└── L3 Miss: Access main memory

Step 4: Main Memory Access
├── Read from DRAM (100-300 cycles)
├── Update all cache levels
└── Return data to CPU

4. Page Fault Handling (Software)

Page Not Present Scenarios:

Scenario 1: First Access to mmap'd Region (see the sketch after Scenario 3)
├── Page fault interrupt
├── Kernel allocates physical page
├── Updates page table
├── Restarts instruction
└── Normal cache/TLB flow continues

Scenario 2: Swapped Out Page
├── Page fault interrupt  
├── Kernel identifies page in swap
├── Allocates new physical page
├── Reads data from swap device
├── Updates page table
└── Restarts instruction

Scenario 3: File-Backed Page Not in Cache
├── Page fault interrupt
├── Kernel checks page cache
├── If not cached: Read from file into page cache
├── Maps page cache page into process
└── Updates page table
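
Scenario 1 is easy to reproduce deliberately: mmap() allocates no frames by itself, so each first touch of a page takes a minor fault. A minimal sketch that counts them via getrusage():

// Sketch of Scenario 1: mmap() itself allocates no frames; each first
// touch of a page triggers a minor page fault that the kernel services
// by allocating a frame and filling in the page table.
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 16 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;                          // first touch -> minor fault
    printf("minor faults for 16 pages: %ld\n", minor_faults() - before);

    munmap(p, len);
    return 0;
}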

Complete Memory Access Example

Initial State

Virtual Address: 0x7f0000001000
TLB: Empty
Page Table: Points to swapped page
Physical RAM: Page not present
Swap: Page stored in /dev/sda2 slot 1234

Access Sequence

1. CPU executes: mov eax, [0x7f0000001000]

2. TLB Lookup:
   ├── Miss (no entry for 0x7f0000001000)
   └── Continue to page table walk

3. Page Table Walk:
   ├── PML4[entry] → PDPT physical address
   ├── PDPT[entry] → PD physical address  
   ├── PD[entry] → PT physical address
   ├── PT[entry] → Shows page swapped (not present)
   └── Generate page fault interrupt

4. Page Fault Handler (Kernel):
   ├── Identify: Page in swap slot 1234
   ├── Allocate: New physical page at 0x87654000
   ├── I/O: Read from swap device to physical page
   ├── Update: Page table entry to point to 0x87654000
   ├── TLB: Invalidate old entries
   └── Return: Resume instruction

5. Instruction Restart:
   ├── TLB lookup: Still miss
   ├── Page table walk: Now finds physical 0x87654000
   ├── Update TLB: 0x7f0000001000 → 0x87654000
   └── Physical access: Check caches

6. Cache Hierarchy:
   ├── L1 miss (first access)
   ├── L2 miss (first access)
   ├── L3 miss (first access)  
   ├── DRAM access: Read from 0x87654000
   ├── Populate: L3, L2, L1 caches
   └── Return: Data to CPU register

Total: ~300,000+ CPU cycles for this first access (dominated by the swap-device I/O — see the NVMe latency above; far more on a spinning disk)
Subsequent: 2-4 cycles (L1 cache hit)

Detailed Residency and Swappability Analysis

What Lives Where and Can Be Swapped

Always in RAM (Never Swapped)

1. Kernel Code & Data:
   ├── Location: Low physical memory (0x0-0x400000 typical)
   ├── Size: 10-50MB
   ├── Why: Kernel needs immediate access for interrupts, syscalls
   └── Includes: System call handlers, interrupt handlers, core data structures

2. Page Tables:
   ├── Location: Kernel physical memory  
   ├── Size: ~1MB per active process
   ├── Why: Hardware MMU needs immediate access for translation
   └── Swapping would cause infinite recursion (need page tables to swap!)

3. Slab Allocator Objects:
   ├── Location: Kernel physical memory
   ├── Size: Variable (MBs to GBs)
   ├── Why: Critical kernel data structures need fast access
   └── Examples: task_struct, dentry, inode, skb

4. DMA Buffers:
   ├── Location: Physically contiguous RAM
   ├── Size: Variable  
   ├── Why: Hardware devices need stable physical addresses
   └── Used for: Network cards, disk controllers, graphics

5. Huge Pages (Traditional):
   ├── Location: Physical RAM
   ├── Size: 2MB or 1GB each
   ├── Why: Performance optimization, swapping defeats purpose
   └── Our test case uses these!

6. Frame Descriptors (struct page):
   ├── Location: mem_map[] array in RAM
   ├── Size: ~40 bytes × number of physical frames
   ├── Why: Needed to manage physical memory itself
   └── Example: 32GB RAM = ~320MB for frame descriptors
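
That 32GB figure is plain arithmetic, assuming the ~40-byte descriptor size used above (the real struct page size depends on architecture and kernel config):

// Quick arithmetic check of struct page overhead (assumed 40B/frame).
#include <stdio.h>

int main(void)
{
    unsigned long ram    = 32UL << 30;   // 32GB of physical RAM
    unsigned long frames = ram / 4096;   // number of 4KB frames
    unsigned long bytes  = frames * 40;  // assumed descriptor size
    printf("%lu frames -> ~%lu MB of mem_map\n", frames, bytes >> 20);
    return 0;                            // prints ~320 MB
}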

Can Be Swapped to Disk

1. User Process Pages (Anonymous):
   ├── Location: Physical RAM → Swap space when swapped
   ├── Examples: malloc() memory, stack pages, heap
   ├── Swap destination: /dev/sda2 or /swapfile
   └── Retrieved via: Page fault handling

2. User Process Pages (File-backed):
   ├── Location: Physical RAM → Original file when swapped
   ├── Examples: mmap'd files, program code, shared libraries
   ├── Swap destination: Original file (not swap space)
   └── Retrieved via: Page fault + file I/O

3. Page Cache (Clean):
   ├── Location: Physical RAM → Simply freed (not written anywhere)
   ├── Why: Can re-read from original file
   ├── Examples: Recently read file contents
   └── Retrieved via: File I/O when accessed again

4. Page Cache (Dirty):
   ├── Location: Physical RAM → Written to file, then freed
   ├── Why: Must preserve modifications
   ├── Examples: Modified file contents not yet saved
   └── Process: Writeback to file, then page can be freed

Memory Layout in Physical RAM

Typical Physical Memory Organization

Physical Address Range: 0x00000000 - 0x7FFFFFFFF (32GB example)

0x00000000 ┌─────────────────────────────────────┐
           │ BIOS/UEFI Reserved                  │
0x00100000 ├─────────────────────────────────────┤
           │ Kernel Code & Data                  │ ← Never swapped
0x01000000 ├─────────────────────────────────────┤
           │ Frame Descriptors (mem_map)         │ ← Never swapped
0x05000000 ├─────────────────────────────────────┤
           │ Slab Allocator Areas                │ ← Never swapped
           │ ├── task_struct cache               │
           │ ├── dentry cache                    │
           │ ├── inode cache                     │
           │ └── many others...                  │
0x10000000 ├─────────────────────────────────────┤
           │ Page Tables (all processes)         │ ← Never swapped
0x20000000 ├─────────────────────────────────────┤
           │ DMA Buffers                         │ ← Never swapped
0x30000000 ├─────────────────────────────────────┤
           │ Page Cache                          │ ← Can be freed/swapped
           │ (file contents cached in RAM)       │
0x50000000 ├─────────────────────────────────────┤
           │ User Process Pages                  │ ← Can be swapped
           │ ├── Anonymous pages                 │
           │ ├── File-backed pages               │
           │ └── Shared memory pages             │
0x7FFFFFFFF└─────────────────────────────────────┘

Typical proportions in a running system:
├── Kernel + metadata: 10-20% (never swapped)
├── Page cache: 40-60% (can be freed)
├── User pages: 20-40% (can be swapped)
└── Free: 5-10% (available)
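
These proportions can be sanity-checked on a live system from /proc/meminfo. A sketch that extracts the relevant lines:

// Sketch: read /proc/meminfo to see how RAM is split in practice.
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[128];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "MemTotal:", 9) == 0 ||
            strncmp(line, "MemFree:", 8) == 0 ||
            strncmp(line, "Cached:", 7) == 0 ||     // page cache
            strncmp(line, "Slab:", 5) == 0 ||       // kernel objects
            strncmp(line, "SwapTotal:", 10) == 0)
            fputs(line, stdout);

    fclose(f);
    return 0;
}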

Enhanced Storage Component Summary

Component         | Type     | Size         | Speed  | Purpose                  | Location   | Swappable
------------------|----------|--------------|--------|--------------------------|------------|----------------
Registers         | Hardware | 16 × 64-bit  | ~0.3ns | Active computation       | CPU die    | ❌ Never
L1 Cache          | Hardware | 64KB         | ~1ns   | Recent instructions/data | CPU die    | ❌ Never
L2 Cache          | Hardware | 512KB        | ~5ns   | Cache backup             | CPU die    | ❌ Never
L3 Cache          | Hardware | 16MB         | ~10ns  | Last-level cache         | CPU die    | ❌ Never
TLB               | Hardware | 1024 entries | ~1ns   | Address translation      | CPU MMU    | ❌ Never
Page Tables       | Software | ~1MB/process | ~100ns | Virtual→physical mapping | Kernel RAM | ❌ Never
Frame Descriptors | Software | ~40B/frame   | ~100ns | Track physical pages     | Kernel RAM | ❌ Never
Slab Objects      | Software | Variable     | ~100ns | Kernel data structures   | Kernel RAM | ❌ Never
Main RAM (User)   | Hardware | ~20GB        | ~100ns | User process pages       | DRAM       | ✅ To swap
Page Cache        | Software | Dynamic      | ~100ns | File caching             | DRAM       | ✅ To file
Swap Space        | Software | 8GB          | ~100μs | Virtual memory extension | Disk       | N/A (is swap)
Disk Storage      | Hardware | 1TB          | ~10ms  | Persistent storage       | SSD/HDD    | N/A (persistent)

Memory Management Decision Tree

When Memory Pressure Occurs

Kernel Memory Reclaim Algorithm:

1. Check Page Cache:
   ├── Clean pages: Free immediately (can re-read from file)
   ├── Dirty pages: Write to file, then free
   └── Result: Fast memory recovery

2. Check User Pages:
   ├── Anonymous pages: Write to swap space
   ├── File-backed pages: Write to original file (if dirty)
   └── Result: Slower but recovers memory

3. Never Touch:
   ├── Kernel code/data
   ├── Page tables
   ├── Slab objects
   ├── DMA buffers
   └── Hardware structures

4. Last Resort:
   ├── OOM (Out of Memory) killer
   ├── Kill processes to free memory
   └── System remains stable

Real-World Example: Our Hugemmap06 Test

Memory Allocation Breakdown

// Our test allocates 51 huge pages (2MB each = 102MB):
addr = mmap(NULL, 51UL * 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
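
For experimentation, here is a self-contained sketch of that allocation (not the actual LTP hugemmap06 source; MAP_HUGETLB fails unless enough huge pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages):

// Sketch: map 51 anonymous huge pages, assuming a 2MB huge page size.
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE (2UL * 1024 * 1024)   // assumed huge page size

int main(void)
{
    void *addr = mmap(NULL, 51 * HPAGE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   // typically ENOMEM if pool too small
        return 1;
    }
    printf("mapped 51 huge pages at %p\n", addr);
    munmap(addr, 51 * HPAGE);
    return 0;
}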

What gets allocated where:

1. Huge Pages (102MB):
   ├── Location: Physical RAM frames
   ├── Swappable: ❌ Never (huge pages can't swap)
   ├── Usage: User data pages
   └── COW: MAP_PRIVATE pages get private copies on first write (copy-on-write)

2. Page Table Entries:
   ├── Location: Kernel RAM (page tables)
   ├── Size: ~4KB for mapping 51 huge pages
   ├── Swappable: ❌ Never
   └── Content: Virtual→Physical mappings

3. VMA Structures:
   ├── Location: Slab cache (vm_area_struct)
   ├── Size: ~200 bytes per mmap() call
   ├── Swappable: ❌ Never (kernel object)
   └── Purpose: Track memory regions per process

4. Thread Stacks (50 threads):
   ├── Location: Regular RAM pages  
   ├── Size: 8MB × 50 = 400MB
   ├── Swappable: ✅ Yes (can swap to disk)
   └── Content: Thread execution stacks

Total memory usage:
├── Never swappable: 102MB (huge pages) + 4KB (page tables) + 10KB (VMAs)
├── Swappable: 400MB (thread stacks)
└── System impact: ~502MB total footprint, of which ~102MB must stay in RAM during the test

Key Relationships and Interactions

Hardware ↔ Software Interface

Hardware provides:
├── Raw storage (RAM, disk)
├── Address translation (TLB, MMU)
├── Caching (L1, L2, L3)
├── Interrupts (page faults)
└── Physical frame management

Software manages:
├── Virtual memory layout
├── Page table structures  
├── Page fault handling
├── Swap algorithms
├── Page cache policies
├── Process memory mappings
├── Slab allocation policies
└── Memory reclaim strategies

Critical Dependencies

Page Tables depend on:
├── Physical frames (to store page table pages)
├── Never swapped (would create circular dependency)
└── Slab allocator (for dynamic page table allocation)

Slab Allocator depends on:
├── Physical frames (for slab pages)
├── Page tables (to map slab areas)
└── Never swapped (kernel needs immediate access)

Frame Descriptors depend on:
├── Physical RAM (mem_map[] array)
├── Bootstrap allocation (allocated at boot)
└── Never swapped (needed to manage swapping itself!)

TLB depends on:
├── Page tables (source of translations)
├── Hardware implementation
└── Software invalidation (when mappings change)

Memory Pressure Response Chain

1. Application requests memory → Page fault
2. Kernel checks available frames
3. If frames available: Allocate directly
4. If low on frames: Start reclaim
   ├── a) Free clean page cache pages
   ├── b) Write dirty page cache to files
   ├── c) Swap anonymous user pages
   ├── d) Never touch kernel structures
5. If still no memory: OOM killer
6. Update page tables and TLB
7. Resume application

This hierarchy ensures that:

  • Critical kernel structures stay in fast RAM (never swapped)
  • User data can be moved to slower storage when needed
  • Hardware acceleration (TLB, caches) speeds up common operations
  • Software policies manage the complexity transparently

The beauty is that applications see a simple flat virtual memory space, while the system orchestrates this complex multi-layered storage hierarchy automatically!