Stage 07

Kernel Subsystems

Explore the major Linux kernel subsystems — VFS, memory management, scheduler, network stack, and the device model — understanding how they work and how to inspect them at runtime.


The Linux Kernel Architecture

{:.gc-basic}

Basic

The Linux kernel is a monolithic kernel — all kernel subsystems run in the same address space (kernel space) with direct access to each other’s functions and data. This is in contrast to a microkernel (like Mach or L4) where each kernel service runs in its own isolated process, communicating via message passing.

User Space
┌─────────────┬──────────────┬─────────────┬──────────────┐
│   Shell /   │  Web Server  │   Database  │  Media Player│
│   Init      │  (nginx)     │  (sqlite)   │  (gstreamer) │
└──────┬──────┴──────┬───────┴──────┬──────┴──────┬───────┘
       │ System Calls (open, read, write, fork, ioctl...)
═══════╪═════════════╪══════════════╪═════════════╪═══════════
Kernel Space
┌──────┴──────────────┴──────────────┴─────────────┴───────┐
│                   System Call Interface                    │
├─────────┬──────────┬──────────────┬──────────────────────┤
│   VFS   │    MM    │   Process    │    Network Stack      │
│(Virtual │(Memory   │  Scheduler   │ (TCP/IP, Netfilter,   │
│File Sys)│ Mgmt)    │    (CFS)     │  sockets, drivers)   │
├─────────┴──────────┴──────────────┴──────────────────────┤
│              Device Driver Framework                       │
│  (platform bus, PCI, I2C, SPI, USB, block, char, net)    │
├──────────────────────────────────────────────────────────┤
│              Architecture-Specific Code (arch/)            │
│    (CPU init, interrupt handling, memory mapping, DMA)    │
└──────────────────────────────────────────────────────────┘
Hardware: CPUs, RAM, NAND, UART, Ethernet, I2C sensors, USB...

Key Subsystems Overview

| Subsystem | Source Dir | Purpose |
|-----------|------------|---------|
| Virtual Filesystem (VFS) | `fs/` | Unified file operation API for all filesystems |
| Memory Management (MM) | `mm/` | Physical pages, virtual memory, allocators |
| Process Scheduler | `kernel/sched/` | CPU time allocation between tasks |
| Network Stack | `net/` | TCP/IP, sockets, Netfilter, wireless |
| Device Drivers | `drivers/` | Hardware abstraction for all peripherals |
| Interrupt Subsystem | `kernel/irq/` | IRQ routing, interrupt controllers |
| Timers | `kernel/time/` | High-resolution timers, timekeeping |
| Block Layer | `block/` | Storage I/O scheduling and queuing |

/proc and /sys as Windows Into the Kernel

The two most important virtual filesystems for inspecting kernel state:

# /proc — process information and kernel state
$ ls /proc/
1/    buddyinfo   cpuinfo    filesystems  iomem    kpagecgroup  meminfo
2/    bus/        crypto     fs/          ioports  kpagecount   misc
...

# Useful /proc files:
$ cat /proc/cpuinfo | head -20
processor	: 0
model name	: ARMv7 Processor rev 0 (v7l)
BogoMIPS	: 463.07
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xc09
CPU revision	: 0

$ cat /proc/meminfo | head -10
MemTotal:         500480 kB
MemFree:          412304 kB
MemAvailable:     394084 kB
Buffers:            2312 kB
Cached:            43780 kB
SwapCached:            0 kB

$ cat /proc/interrupts
           CPU0       CPU1
  17:     10234       8921  GIC-0  29 Level     arch_timer
  44:       512          0  GIC-0  44 Level     eth0
  57:      1024          0  GIC-0  57 Level     mmc0

# /sys — device model and kernel object hierarchy
$ ls /sys/
block/  bus/  class/  dev/  devices/  firmware/  fs/  kernel/  module/  power/

# Navigate the device hierarchy
$ ls /sys/bus/i2c/devices/
0-0050  0-0068  1-0018  2-0040

# View a device's properties
$ cat /sys/bus/i2c/devices/0-0050/name
24c256

The Linux Driver Model: Buses, Devices, and Drivers

The Linux driver model is built on three concepts:

Bus (e.g., I2C bus)
├── Device (e.g., i2c-0-0050: AT24 EEPROM at address 0x50)
│   └── Matched to Driver (e.g., at24 driver)
├── Device (e.g., i2c-0-0068: RTC at address 0x68)
│   └── Matched to Driver (e.g., ds1307 driver)
└── ...

Matching happens automatically: when a device is registered on a bus, the kernel searches all registered drivers for that bus to find one whose id_table or of_match_table matches the device’s identifier. When a match is found, the driver’s probe() function is called.

# See what drivers are bound to which devices
$ ls /sys/bus/i2c/drivers/
at24/  ds1307/  bmp280/  lis3lv02d/

# Which devices are bound to the at24 driver
$ ls /sys/bus/i2c/drivers/at24/
0-0050 -> ../../../../devices/platform/soc/i2c@12080000/i2c-0/0-0050/
bind  uevent  unbind

# Manually unbind a driver from a device
$ echo "0-0050" > /sys/bus/i2c/drivers/at24/unbind

# Manually bind a driver to a device
$ echo "0-0050" > /sys/bus/i2c/drivers/at24/bind

VFS, MM, and Scheduler Internals

{:.gc-mid}

Intermediate

Virtual Filesystem (VFS)

VFS is an abstraction layer that provides a uniform interface to all filesystems. When a userspace program calls open("/etc/passwd", O_RDONLY), it goes through VFS which routes the call to the appropriate filesystem (ext4, NFS, tmpfs, etc.).

Key VFS data structures:

/* superblock: represents a mounted filesystem */
struct super_block {
    struct list_head    s_list;         /* list of all superblocks */
    dev_t               s_dev;          /* device identifier */
    unsigned long       s_blocksize;    /* block size in bytes */
    const struct super_operations *s_op; /* filesystem operations */
    struct dentry       *s_root;        /* root dentry */
    /* ... */
};

/* inode: represents a file or directory (unique per filesystem object) */
struct inode {
    umode_t             i_mode;         /* file type and permissions */
    unsigned short      i_opflags;
    kuid_t              i_uid;          /* owner UID */
    kgid_t              i_gid;          /* owner GID */
    loff_t              i_size;         /* file size in bytes */
    struct timespec64   i_atime;        /* access time */
    struct timespec64   i_mtime;        /* modification time */
    const struct inode_operations *i_op; /* inode operations */
    const struct file_operations  *i_fop; /* file operations */
    /* ... */
};

/* dentry: represents a directory entry (maps name to inode) */
struct dentry {
    struct inode        *d_inode;       /* associated inode */
    struct dentry       *d_parent;      /* parent directory */
    struct qstr         d_name;         /* name of this entry */
    const struct dentry_operations *d_op;
    /* ... */
};

/* file: represents an open file (one per open() call) */
struct file {
    struct path         f_path;         /* dentry + vfsmount of this file */
    struct inode        *f_inode;
    const struct file_operations *f_op;
    loff_t              f_pos;          /* current file position */
    unsigned int        f_flags;        /* O_RDONLY, O_WRONLY, etc. */
    /* ... */
};

The call chain for read():

User: read(fd, buf, count)
  → sys_read() [fs/read_write.c]
    → vfs_read()
      → file->f_op->read_iter()   [dispatch to filesystem]
        → ext4_file_read_iter()   [ext4 implementation]
          → generic_file_read_iter()
            → ext4_readpage()     [actual disk I/O]

Memory Management

The MM subsystem manages all of the system’s physical and virtual memory:

# Physical memory view
$ cat /proc/meminfo
MemTotal:         500480 kB     # Total physical RAM
MemFree:          412304 kB     # Unused RAM
MemAvailable:     394084 kB     # Actually available (includes reclaimable)
Buffers:            2312 kB     # Block device buffers
Cached:            43780 kB     # Page cache (file data)
Slab:              12840 kB     # Kernel slab allocator (kmalloc objects)
PageTables:         1024 kB     # Page table memory
AnonPages:         28416 kB     # Anonymous (heap/stack) pages
Mapped:            18432 kB     # Memory-mapped files (mmap)

# Per-process memory map
$ cat /proc/1234/maps
00400000-00452000 r-xp 00000000 08:01 123456  /usr/bin/myapp  [text segment]
00651000-00652000 r--p 00051000 08:01 123456  /usr/bin/myapp  [read-only data]
00652000-00654000 rw-p 00052000 08:01 123456  /usr/bin/myapp  [data segment]
b6e00000-b6e21000 r-xp 00000000 08:01 789012  /lib/libc.so.6
b7200000-b7204000 rw-p 00000000 00:00 0       [heap]
bff00000-bff21000 rwxp 00000000 00:00 0       [stack]

Key MM concepts:

| Concept | Description |
|---------|-------------|
| Physical page | The basic unit of physical memory (typically 4KB) |
| VMA (`vm_area_struct`) | Describes a contiguous virtual memory region |
| `mm_struct` | Per-process memory descriptor (page tables, VMAs) |
| Page tables | Hardware structures mapping virtual to physical addresses |
| `kmalloc` | Kernel heap allocator — physically contiguous, max ~4MB |
| `vmalloc` | Kernel virtual allocator — virtually contiguous, can be larger |

/* kmalloc: physically contiguous — required for DMA */
void *buf = kmalloc(4096, GFP_KERNEL);
kfree(buf);

/* vmalloc: virtually contiguous — for large kernel buffers */
void *large_buf = vmalloc(1 * 1024 * 1024);  /* 1MB */
vfree(large_buf);

/* The key difference:
 * kmalloc: physical addresses are contiguous — DMA hardware can use it
 * vmalloc: only virtual addresses are contiguous — DO NOT use for DMA
 */

Completely Fair Scheduler (CFS)

The Linux scheduler is based on the CFS algorithm. Key concepts:

# View scheduler statistics
$ cat /proc/schedstat
# Per-CPU scheduling statistics

# See process scheduling info
$ cat /proc/1234/sched
myapp (1234, #threads: 1)
---
se.exec_start                      :       1234567.890123
se.sum_exec_runtime                :         12345.678901
se.nr_migrations                   :                    0
nr_switches                        :                  567
nr_voluntary_switches              :                  234
nr_involuntary_switches            :                  333
se.load.weight                     :                 1024
se.runnable_weight                 :                 1024
policy                             :                    0    # SCHED_NORMAL
prio                               :                  120   # default nice=0
clock-delta                        :                   45

# Real-time priority
$ chrt -p 1234   # Show scheduling policy
pid 1234's current scheduling policy: SCHED_OTHER
pid 1234's current scheduling priority: 0

CFS maintains a red-black tree of runnable tasks sorted by vruntime (virtual runtime). The task with the smallest vruntime (has run the least) is always scheduled next. Nice values adjust how fast vruntime accumulates — a nice=-20 (highest priority) process’s vruntime grows more slowly, so it gets to run more often.


fork() and exec() Internally

/* How fork() works (simplified) */
/* 1. Allocate new task_struct for child */
/* 2. Copy parent's memory descriptors (mm_struct) */
/*    With COW (Copy-on-Write): both parent and child share pages */
/*    marked read-only; page fault triggers actual copy */
/* 3. Copy parent's file descriptor table */
/* 4. Copy parent's signal handlers */
/* 5. Add child to scheduler run queue */

/* How exec() works (simplified) */
/* 1. Load executable format (ELF parser in fs/binfmt_elf.c) */
/* 2. Parse ELF headers: LOAD segments, entry point, interpreter */
/* 3. Map ELF segments into new mm_struct */
/* 4. Set up stack with argv, envp */
/* 5. If dynamically linked, map ld.so and hand control to it */
/* 6. Jump to entry point (or ld.so) */

# COW in action: fork() is implemented via clone() under the hood
$ strace -e clone ./fork_test 2>&1 | grep clone
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
      child_tidptr=0xb6f45a90) = 1235

kobjects and the Device Model Hierarchy

kobject is the fundamental kernel object that provides reference counting and integration with sysfs:

/* Every device in the kernel is represented as a kobject */
struct kobject {
    const char          *name;
    struct list_head    entry;
    struct kobject      *parent;    /* parent in /sys hierarchy */
    struct kset         *kset;
    const struct kobj_type *ktype;
    struct kernfs_node  *sd;        /* sysfs directory */
    struct kref         kref;       /* reference count */
};

/* struct device wraps kobject */
struct device {
    struct kobject kobj;
    struct device *parent;
    const char *init_name;
    const struct device_type *type;
    struct bus_type *bus;
    struct device_driver *driver;
    void *driver_data;
    /* ... */
};
# The /sys tree mirrors the kobject hierarchy
$ tree /sys/devices/platform/ | head -20
/sys/devices/platform/
├── Fixed MDIO bus.0
│   ├── mdio_bus
│   │   └── Fixed MDIO Bus
│   │       └── fixed-0:00
│   ├── modalias
│   └── uevent
├── i2c@12080000
│   ├── driver -> ../../../bus/platform/drivers/i2c-s3c2410
│   ├── i2c-0
│   │   └── 0-0050
│   │       ├── driver -> ../../../../bus/i2c/drivers/at24
│   │       └── modalias

Advanced Subsystem Internals

{:.gc-adv}

Advanced

Kernel Locking: Spinlocks, Mutexes, and RCU

Linux has multiple synchronization primitives, each suited to different contexts:

#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>

/* Spinlock: busy-waits, can be used in interrupt context */
static DEFINE_SPINLOCK(my_lock);

void my_irq_handler(int irq, void *data)
{
    unsigned long flags;
    spin_lock_irqsave(&my_lock, flags);  /* disable local IRQs + lock */
    /* critical section */
    spin_unlock_irqrestore(&my_lock, flags);
}

/* Mutex: sleeps when contended, CANNOT be used in interrupt context */
static DEFINE_MUTEX(my_mutex);

int my_open(struct inode *inode, struct file *filp)
{
    mutex_lock(&my_mutex);      /* may sleep */
    /* critical section */
    mutex_unlock(&my_mutex);
    return 0;
}

/* RCU (Read-Copy-Update): lockless reads, copy-modify-replace writes */
struct my_data {
    struct rcu_head rcu;
    int value;
};
static struct my_data __rcu *global_data;

/* Reader: no lock needed, just RCU read-side critical section */
void reader(void)
{
    struct my_data *data;
    rcu_read_lock();
    data = rcu_dereference(global_data);
    if (data)
        pr_info("value: %d\n", data->value);
    rcu_read_unlock();
}

/* Writer: allocate new copy, update, publish, free old after readers.
 * Concurrent writers must be serialized — here by holding my_mutex. */
void writer(int new_value)
{
    struct my_data *new, *old;

    new = kmalloc(sizeof(*new), GFP_KERNEL);
    if (!new)
        return;
    new->value = new_value;

    mutex_lock(&my_mutex);
    old = rcu_dereference_protected(global_data,
                                    lockdep_is_held(&my_mutex));
    rcu_assign_pointer(global_data, new);
    mutex_unlock(&my_mutex);

    if (old)
        kfree_rcu(old, rcu);  /* free after all readers are done */
}

RCU is the kernel’s most scalable synchronization mechanism. It allows an unlimited number of concurrent readers with near-zero overhead (on most configurations rcu_read_lock() compiles to little more than disabling preemption), while writers create a new copy of the data, atomically switch the pointer, and free the old copy only after a “grace period” in which all pre-existing readers have finished.


Memory Allocators: SLAB/SLUB/SLOB

# View slab allocator statistics
$ cat /proc/slabinfo | head -20
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
kmalloc-8192          68        68       8192        4    8
kmalloc-4096         124       124       4096        8    8
kmalloc-2048         298       340       2048       16    8
kmalloc-1024         688       720       1024       32    8
kmalloc-512          896      1008        512       32    4
kmalloc-256         1424      1792        256       32    2
kmalloc-192         1890      2016        192       21    1
kmalloc-128         2124      2368        128       32    1
kmalloc-96           756       756         96       42    1
kmalloc-64          4328      4992         64       64    1
kmalloc-32          2560      3072         32      128    1
task_struct          512       512       3840        8    8
mm_struct            234       234       1216       26    8
files_cache          156       156        832       49    1

# Total slab memory (the per-cache slabinfo file has no summary line)
$ grep Slab /proc/meminfo
Slab:              12840 kB

SLUB (the default since 2.6.23) improves over SLAB by reducing per-slab overhead and simplifying the allocator. SLOB was a tiny allocator for systems with very limited RAM. Both alternatives have since been removed from the kernel (SLOB in 6.4, SLAB in 6.8), leaving SLUB as the only slab allocator in current releases.


DMA Subsystem

#include <linux/dma-mapping.h>

/* Coherent DMA: CPU and device see the same data (cache coherent) */
/* Use for control structures, small buffers */
dma_addr_t dma_handle;
void *cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
/* cpu_addr: CPU virtual address */
/* dma_handle: physical address to give to the device */
dma_free_coherent(dev, size, cpu_addr, dma_handle);

/* Streaming DMA: for large data transfers (e.g., network packets, video frames) */
dma_addr_t dma_addr = dma_map_single(dev, cpu_buf, size, DMA_TO_DEVICE);
/* device performs DMA transfer... */
dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);

/* Scatter-Gather: for non-contiguous memory regions */
struct scatterlist sg[10];
sg_init_table(sg, 10);
sg_set_page(&sg[0], page0, PAGE_SIZE, 0);
sg_set_page(&sg[1], page1, PAGE_SIZE, 0);
int n = dma_map_sg(dev, sg, 10, DMA_FROM_DEVICE);
/* device fills multiple non-contiguous buffers in one DMA operation */
dma_unmap_sg(dev, sg, 10, DMA_FROM_DEVICE);

Interrupt Subsystem: IRQ Domains and GIC

# View interrupt assignments
$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 17:     134821     134821     134821     134821  GIC-0  29 Level     arch_timer
 44:       8192          0          0          0  GIC-0  44 Level     eth0
 57:       2048          0          0          0  GIC-0  57 Level     mmc0
 83:        512          0          0          0  GIC-0  83 Level     i2c-0
IPI0:         0          0          0          0  Rescheduling interrupts
IPI1:      1024        512        256        128  Function call interrupts

# IRQ domain hierarchy for ARM GIC (Generic Interrupt Controller)
# Hardware IRQ -> GIC irq_domain -> Linux IRQ number -> handler
# Device Tree specifies: interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>;
# This maps hardware SPI 44 to a Linux IRQ number (e.g., 76)

# View IRQ affinities (which CPUs handle which IRQs)
$ cat /proc/irq/44/smp_affinity
f   # binary 1111 = CPUs 0,1,2,3

# Pin an IRQ to a specific CPU
$ echo 4 > /proc/irq/44/smp_affinity   # binary 0100 = CPU2 only

Workqueue Internals

/* Workqueues allow deferred work from interrupt context */
#include <linux/workqueue.h>

struct my_driver_data {
    struct work_struct  work;
    /* ... */
};

/* Work function executes in process context (can sleep) */
static void my_work_handler(struct work_struct *work)
{
    struct my_driver_data *data =
        container_of(work, struct my_driver_data, work);
    /* Safe to sleep here, allocate memory, call blocking APIs */
    msleep(10);
    pr_info("work executed in process context\n");
}

/* From interrupt context: schedule the work */
static irqreturn_t my_irq_handler(int irq, void *devid)
{
    struct my_driver_data *data = devid;
    schedule_work(&data->work);    /* deferred to kernel worker thread */
    return IRQ_HANDLED;
}
# View kernel worker threads (kworker)
$ ps aux | grep kworker | head -10
root         5  0.0  0.0      0     0 ?        S    09:00   0:00 [kworker/0:0H-kblockd]
root         6  0.0  0.0      0     0 ?        I    09:00   0:00 [kworker/0:0-events]
root        10  0.0  0.0      0     0 ?        I    09:00   0:00 [kworker/1:0-events]
root        14  0.0  0.0      0     0 ?        I    09:00   0:00 [kworker/2:0-events]
# kworker/N:M — N = CPU number, M = thread index
# kworker/0:0H — high-priority bound kworker on CPU 0
# unbound kworkers (kworker/u8:0) are not CPU-pinned

Network Stack: sk_buff (skb)

The sk_buff (socket buffer) is the fundamental data structure that carries network packets through the kernel’s network stack:

Packet flow through the network stack (receiving):

NIC hardware DMA → sk_buff allocated
                      │
                  netif_receive_skb()        [Device driver → net core]
                      │
                  ip_rcv()                   [L3: IP layer]
                      │
               ┌──────┴──────┐
               │             │
           tcp_v4_rcv()   udp_rcv()          [L4: TCP/UDP]
               │
           sk->sk_receive_queue              [Socket receive buffer]
               │
           userspace: recv()/read()          [Copy to user]

# Transmit path is the reverse
# View network stack statistics
$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs
    lo: 12345   100    0    0    0     0          0         0     12345   100    0
  eth0: 1048576  8192   0    0    0     0          0         0    524288  4096   0

# Socket buffer sizes
$ sysctl net.core.rmem_max
net.core.rmem_max = 212992
$ sysctl net.core.wmem_max
net.core.wmem_max = 212992

# View active connections with kernel buffer info
$ ss -tmn | head -10
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       192.168.1.100:22    192.168.1.1:54321

Exploring Subsystems at Runtime

# Memory: buddyinfo shows free pages per order in each zone
$ cat /proc/buddyinfo
Node 0, zone   Normal    124     53     21     10      4      2      1      1      0      0      0
# Column N = count of free blocks of 2^N contiguous pages (order 0 through 10)

# Slab info
$ cat /proc/slabinfo | grep "^kmalloc"

# Interrupt controller details
$ cat /proc/interrupts
$ cat /sys/kernel/irq/*/chip_name

# Kernel debug filesystem (requires CONFIG_DEBUG_FS)
$ mount -t debugfs none /sys/kernel/debug
$ ls /sys/kernel/debug/
bdi  block  clk  dma_buf  gpio  hid  iommu  mmc0  regmap  regulator  tracing

# GPIO state
$ cat /sys/kernel/debug/gpio
gpiochip0: GPIOs 0-31, parent: platform/gpio@12000000, gpio-myboard:
 gpio-0   (                    ) in  hi
 gpio-1   (LED                 ) out lo
 gpio-2   (BUTTON              ) in  hi IRQ
 gpio-3   (                    ) in  lo

# Clock tree
$ cat /sys/kernel/debug/clk/clk_summary | head -20
                                 enable  prepare  protect                    rate  accuracy  phase
                                  count    count    count                        (Hz)      (ppb) (degrees)
------------------------------------------------------------------------------
osc24M                                 1        1        0                24000000         0          0
   pll-cpux                            1        1        0               816000000         0          0
      cpux                             1        1        0               816000000         0          0
   pll-ddr                             1        1        0              1200000000         0          0
      ddr                              1        1        0               600000000         0          0

Interview Questions

{:.gc-iq}

Q: What is the Virtual Filesystem (VFS) and why does it exist?

VFS is an abstraction layer in the kernel that provides a single, unified API for file operations (open, read, write, close, stat, etc.) regardless of the underlying filesystem implementation. It exists because Linux supports many filesystems (ext4, btrfs, NFS, tmpfs, FAT32, etc.) and applications should not need to know which one is in use. VFS defines a set of operations (through file_operations, inode_operations, and super_operations structures with function pointers) that each filesystem implements. When a program calls read(), VFS dispatches to ext4_file_read_iter() or nfs_file_read() as appropriate.

Q: What is the difference between kmalloc and vmalloc?

kmalloc allocates physically contiguous memory from the kernel’s SLAB/SLUB allocator. The allocated memory has a direct linear mapping to physical memory. It is limited in size (typically up to ~4MB) but is the correct choice for DMA buffers since DMA hardware needs physically contiguous memory. vmalloc allocates virtually contiguous memory that may be physically fragmented — it uses the kernel’s vmalloc area with its own page tables. It can allocate much larger buffers but introduces TLB pressure and cannot be used for DMA. Use kmalloc by default; use vmalloc only when you need a large buffer that does not need to be DMA-able.

Q: How does the Linux driver model work — what is the relationship between bus, device, and driver?

The Linux driver model has three components: a bus_type (e.g., I2C bus, PCI bus, platform bus) that knows how to match devices to drivers; device structures representing hardware instances (added when hardware is discovered via Device Tree, ACPI, or enumeration); and device_driver structures representing drivers (registered when kernel modules are loaded). When a new device is registered on a bus, or when a new driver registers for a bus, the bus’s match() function is called for all driver-device pairs. When a match is found, the driver’s probe() function is called. This decoupling allows drivers to work with any matching device and devices to be hot-plugged.

Q: What is RCU (Read-Copy-Update) and what problem does it solve?

RCU solves the problem of efficiently protecting data that is read very frequently but written rarely. Traditional locking forces readers to acquire a lock, which creates contention on busy multi-core systems. RCU allows readers to proceed completely lock-free with only memory barriers. Writers create a new copy of the data structure, make their modifications, then atomically publish the new version by updating a pointer. After all pre-existing readers finish their read-side critical sections (the “grace period”), the old version is freed. RCU is used extensively for routing tables, security policy lists, file system mounts, and process credential lookups — all of which are read millions of times per second.

Q: How does the scheduler work in Linux at a high level (CFS)?

The Completely Fair Scheduler (CFS) aims to give every runnable process an equal share of the CPU. It maintains a red-black tree of runnable processes sorted by vruntime (virtual runtime — how much CPU time the process has consumed, weighted by priority). The scheduler always picks the process with the smallest vruntime (the one that has received the least CPU time). When a process runs, its vruntime increases. Nice values (-20 to +19) affect how fast vruntime grows: a high-priority (nice=-20) process’s vruntime grows slowly so it gets selected more often. Real-time tasks use separate scheduling classes (SCHED_FIFO, SCHED_RR) that preempt all CFS tasks.

Q: What is an sk_buff and what is its role in the network stack?

sk_buff (socket buffer, commonly skb) is the kernel’s data structure that carries network packets through the entire network stack. It contains pointers to the packet data along with metadata: protocol type, interface it arrived on, timestamps, checksum offload status, and pointers to various protocol headers. As a packet travels up the receiving path (driver → IP layer → TCP layer → socket), header pointers are advanced and the skb is passed between layers without copying the data. The same structure is used for transmitting packets. The sk_buff design avoids memory copies and allows in-place header manipulation, which is critical for network performance.

Q: How do you explore kernel subsystem state at runtime?

The primary interfaces are: /proc/ for traditional kernel statistics (meminfo, interrupts, net/dev, slabinfo, buddyinfo); /sys/ for the device model hierarchy (buses, devices, drivers) and kobject-based subsystem state; /sys/kernel/debug/ (debugfs) for detailed debugging information (GPIO state, clock tree, DMA, regulator, ftrace); and /proc/<pid>/ for per-process state (maps, status, sched, fdinfo). Tool examples: cat /proc/meminfo, cat /proc/interrupts, ls /sys/bus/i2c/devices/, cat /sys/kernel/debug/gpio, cat /sys/kernel/debug/clk/clk_summary.


References

{:.gc-ref}

  • “Linux Kernel Development” by Robert Love (3rd edition) — the most accessible book covering all major subsystems including scheduler, MM, VFS, and the device model
  • “Understanding the Linux Kernel” by Bovet and Cesati (3rd edition) — deep technical detail on VFS, MM, and process management
  • “Linux Device Drivers” by Corbet, Rubini, and Kroah-Hartman (3rd edition, free online) — the definitive reference for the driver model, kobjects, and hardware interfaces: https://lwn.net/Kernel/LDD3/
  • Linux kernel online documentation: https://www.kernel.org/doc/html/latest/ — subsystem-specific documentation
  • LWN.net kernel internals articles: https://lwn.net/Kernel/Index/ — the best ongoing source for deep kernel internals coverage including RCU, memory management, and scheduler articles