Why DMA? {:.gc-basic}
Direct Memory Access (DMA) allows a peripheral to transfer data to or from main memory without CPU involvement. The CPU programs the DMA controller with a source address, destination address, and length, then continues other work while the transfer completes asynchronously.
CPU copy vs DMA comparison

```text
CPU-driven copy (PIO):
  CPU reads word from device FIFO → writes to RAM → repeat × N
  CPU utilisation: 100% during transfer; L1/L2 cache polluted

DMA transfer:
  CPU writes a descriptor to the DMA controller
  DMA moves N bytes directly Device ↔ RAM via the bus
  CPU gets an interrupt when done
  CPU utilisation during transfer: ~0%
```
Address types in a DMA context
| Address type | Description |
|---|---|
| Virtual address (`void *`) | Kernel virtual address — what the driver uses to access the buffer |
| Physical address (`phys_addr_t`) | CPU physical address — may not equal the bus address on systems with an IOMMU |
| DMA address (`dma_addr_t`) | Bus/device address — what you program into the hardware DMA register |
On systems without an IOMMU, physical == DMA address. On systems with an IOMMU (most modern SoCs and x86 servers), the IOMMU maps device-visible DMA addresses to physical memory, providing isolation between devices.
Setting the DMA mask
Before using DMA, a driver must declare which address bits the device can use:
```c
#include <linux/dma-mapping.h>

static int mydrv_probe(struct platform_device *pdev)
{
    struct device *dev = &pdev->dev;
    int ret;

    /* Device can address 32-bit DMA addresses */
    ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
    if (ret) {
        dev_err(dev, "cannot set 32-bit DMA mask\n");
        return ret;
    }

    /* For 64-bit capable devices: */
    /* dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)); */

    return 0;
}
```
Coherent DMA Mapping {:.gc-basic}
Coherent DMA (also called consistent DMA) allocates memory that is simultaneously visible to both the CPU and the device without any explicit cache synchronisation. This is achieved by mapping the buffer as uncached (or using hardware cache coherency).
Use coherent DMA for: descriptor rings, command queues, status blocks — long-lived structures that both CPU and device read/write frequently.
```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/slab.h>

#define DESC_RING_SIZE (4096)   /* 4 KB ring */

/* Example register offsets — device-specific in a real driver */
#define REG_DESC_BASE_LO 0x00
#define REG_DESC_BASE_HI 0x04

struct mydrv_priv {
    void __iomem *base;        /* MMIO register window */
    void *desc_ring;           /* kernel virtual address */
    dma_addr_t desc_ring_dma;  /* device-visible DMA address */
    size_t desc_ring_size;
};

static int mydrv_probe(struct platform_device *pdev)
{
    struct mydrv_priv *priv;
    int ret;

    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
    if (!priv)
        return -ENOMEM;

    priv->base = devm_platform_ioremap_resource(pdev, 0);
    if (IS_ERR(priv->base))
        return PTR_ERR(priv->base);

    ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
    if (ret)
        return ret;

    /* Allocate coherent DMA buffer */
    priv->desc_ring_size = DESC_RING_SIZE;
    priv->desc_ring = dma_alloc_coherent(&pdev->dev,
                                         priv->desc_ring_size,
                                         &priv->desc_ring_dma,
                                         GFP_KERNEL);
    if (!priv->desc_ring) {
        dev_err(&pdev->dev, "failed to allocate DMA ring\n");
        return -ENOMEM;
    }

    /* Zero the descriptor ring */
    memset(priv->desc_ring, 0, priv->desc_ring_size);

    /* Program hardware with the DMA address */
    writel(lower_32_bits(priv->desc_ring_dma),
           priv->base + REG_DESC_BASE_LO);
    writel(upper_32_bits(priv->desc_ring_dma),
           priv->base + REG_DESC_BASE_HI);

    dev_info(&pdev->dev, "DMA ring: virt=%p dma=0x%pad size=%zu\n",
             priv->desc_ring, &priv->desc_ring_dma, priv->desc_ring_size);

    platform_set_drvdata(pdev, priv);
    return 0;
}

static int mydrv_remove(struct platform_device *pdev)
{
    struct mydrv_priv *priv = platform_get_drvdata(pdev);

    dma_free_coherent(&pdev->dev, priv->desc_ring_size,
                      priv->desc_ring, priv->desc_ring_dma);
    return 0;
}
```
dma_alloc_coherent returns the kernel virtual address and fills in dma_addr_t. The dma_addr_t is what you write to the hardware DMA base register — never use the virtual address for that.
Streaming DMA Mappings {:.gc-mid}
Streaming DMA maps an existing kernel buffer for a single (or bounded) DMA operation, then unmaps it. The mapping may involve cache flush/invalidate operations to ensure coherency. Use streaming DMA for: packet buffers, block I/O buffers, one-shot transfers.
```c
#include <linux/completion.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

#define XFER_SIZE (2048)

/* priv is assumed to hold: struct device *dev, void __iomem *base,
 * and completions tx_done / rx_done signalled from the IRQ handler.
 * REG_TX_* and REG_RX_* are device-specific register offsets. */

static int mydrv_dma_transmit(struct mydrv_priv *priv,
                              const void *data, size_t len)
{
    dma_addr_t dma_handle;
    int ret = 0;

    /* Map the buffer for DMA — flushes CPU cache (write-back) */
    dma_handle = dma_map_single(priv->dev, (void *)data, len,
                                DMA_TO_DEVICE);
    if (dma_mapping_error(priv->dev, dma_handle)) {
        dev_err(priv->dev, "dma_map_single failed\n");
        return -ENOMEM;
    }

    /* Program hardware: tell it where to read data from */
    writel(lower_32_bits(dma_handle), priv->base + REG_TX_ADDR);
    writel(len, priv->base + REG_TX_LEN);
    writel(BIT(0), priv->base + REG_TX_START);

    /* Wait for completion (simplified — tx_done is signalled by the IRQ handler) */
    if (!wait_for_completion_timeout(&priv->tx_done,
                                     msecs_to_jiffies(100))) {
        dev_err(priv->dev, "DMA TX timeout\n");
        ret = -ETIMEDOUT;
    }

    /* Unmap BEFORE accessing the buffer from the CPU again */
    dma_unmap_single(priv->dev, dma_handle, len, DMA_TO_DEVICE);
    return ret;
}

/* Receive: buffer filled by the device, read by the CPU after unmap */
static int mydrv_dma_receive(struct mydrv_priv *priv, void *buf, size_t len)
{
    dma_addr_t dma_handle;

    dma_handle = dma_map_single(priv->dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(priv->dev, dma_handle))
        return -ENOMEM;

    /* Program hardware: tell it where to write received data */
    writel(lower_32_bits(dma_handle), priv->base + REG_RX_ADDR);
    writel(len, priv->base + REG_RX_LEN);
    writel(BIT(0), priv->base + REG_RX_START);

    wait_for_completion(&priv->rx_done);

    /* Unmap: invalidates CPU cache lines so the CPU reads fresh data from RAM */
    dma_unmap_single(priv->dev, dma_handle, len, DMA_FROM_DEVICE);

    /* Safe to read buf now */
    return 0;
}
```
DMA sync calls are used when the CPU needs to access a buffer that is still mapped (e.g., to inspect partially transferred data):
```c
/* Give the CPU access to a DMA_FROM_DEVICE buffer temporarily */
dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);

/* ... read from buf ... */

/* Return ownership to the device */
dma_sync_single_for_device(dev, dma_handle, len, DMA_FROM_DEVICE);
```
Streaming DMA directions
| Direction | Who writes | Cache action on map | Cache action on unmap |
|---|---|---|---|
| DMA_TO_DEVICE | CPU | Flush (writeback) | Nothing |
| DMA_FROM_DEVICE | Device | Invalidate | Invalidate |
| DMA_BIDIRECTIONAL | Both | Flush + invalidate | Flush + invalidate |
Scatter-Gather DMA {:.gc-mid}
Real data (network packets, block I/O, user pages) is rarely contiguous in physical memory. A scatter-gather list allows the DMA controller to transfer non-contiguous memory regions in a single programmed operation.
```c
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

#define SG_ENTRIES 8

static int mydrv_sg_transfer(struct mydrv_priv *priv,
                             struct page **pages, int npages,
                             size_t len)
{
    struct scatterlist sg[SG_ENTRIES];
    struct scatterlist *s;
    int nents, i;

    if (npages > SG_ENTRIES)
        return -EINVAL;

    /* Build scatter-gather list from pages
     * (len is assumed to be npages * PAGE_SIZE here) */
    sg_init_table(sg, npages);
    for (i = 0; i < npages; i++)
        sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

    /* Map all entries — coalesces adjacent entries when possible */
    nents = dma_map_sg(priv->dev, sg, npages, DMA_FROM_DEVICE);
    if (!nents) {
        dev_err(priv->dev, "dma_map_sg failed\n");
        return -ENOMEM;
    }

    /* Iterate the MAPPED entries (nents, not npages) and program
     * hardware descriptors */
    for_each_sg(sg, s, nents, i) {
        dma_addr_t addr = sg_dma_address(s);
        unsigned int entry_len = sg_dma_len(s);

        dev_dbg(priv->dev, "SG[%d]: dma=0x%pad len=%u\n",
                i, &addr, entry_len);

        /* Program hardware DMA descriptor (driver-specific helper) */
        mydrv_fill_descriptor(priv, i, addr, entry_len,
                              (i == nents - 1) ? DESC_FLAG_LAST : 0);
    }

    /* Start DMA engine */
    writel(BIT(0), priv->base + REG_DMA_START);
    wait_for_completion(&priv->dma_done);

    /* Unmap with the ORIGINAL npages, not the returned nents */
    dma_unmap_sg(priv->dev, sg, npages, DMA_FROM_DEVICE);
    return 0;
}
```
dma_map_sg returns the number of mapped entries, which may be less than npages if the kernel coalesced adjacent physical pages. Always use the returned nents value, not the original npages, when iterating the mapped list. dma_unmap_sg, by contrast, must be called with the original npages that was passed to dma_map_sg.
DMA Engine API {:.gc-adv}
The DMA Engine framework (drivers/dma/) provides a unified API for using DMA controllers. Instead of programming hardware registers directly, drivers use an abstract channel and operation model. This is used heavily in ASoC audio, SPI/I2C DMA mode, and UART DMA.
```c
#include <linux/completion.h>
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/platform_device.h>

struct mydrv_priv {
    struct dma_chan *dma_chan;
    struct completion dma_complete;
    dma_addr_t rx_dma_addr;
    void *rx_buf;
    size_t rx_size;
};

/* Completion callback — called in tasklet (softirq) context */
static void mydrv_dma_callback(void *data)
{
    struct mydrv_priv *priv = data;

    complete(&priv->dma_complete);
}

static int mydrv_probe(struct platform_device *pdev)
{
    struct mydrv_priv *priv;
    int ret;

    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
    if (!priv)
        return -ENOMEM;

    init_completion(&priv->dma_complete);

    /* Request a DMA channel — "rx" matches the DT dma-names property */
    priv->dma_chan = dma_request_chan(&pdev->dev, "rx");
    if (IS_ERR(priv->dma_chan)) {
        ret = PTR_ERR(priv->dma_chan);
        if (ret != -EPROBE_DEFER)
            dev_err(&pdev->dev, "cannot get DMA channel: %d\n", ret);
        return ret;
    }

    /* Allocate coherent RX buffer */
    priv->rx_size = 4096;
    priv->rx_buf = dma_alloc_coherent(&pdev->dev, priv->rx_size,
                                      &priv->rx_dma_addr, GFP_KERNEL);
    if (!priv->rx_buf) {
        ret = -ENOMEM;
        goto err_buf;
    }

    platform_set_drvdata(pdev, priv);
    return 0;

err_buf:
    dma_release_channel(priv->dma_chan);
    return ret;
}

/* Initiate a DMA receive transfer. A real driver would first call
 * dmaengine_slave_config() to set the device-side FIFO address and
 * burst size; that step is omitted here for brevity. */
static int mydrv_start_rx_dma(struct mydrv_priv *priv, size_t len)
{
    struct dma_async_tx_descriptor *desc;
    dma_cookie_t cookie;

    /* Prepare the transfer: device → memory */
    desc = dmaengine_prep_slave_single(priv->dma_chan,
                                       priv->rx_dma_addr,
                                       len,
                                       DMA_DEV_TO_MEM,
                                       DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
    if (!desc)
        return -ENOMEM;

    desc->callback = mydrv_dma_callback;
    desc->callback_param = priv;
    reinit_completion(&priv->dma_complete);

    /* Submit and issue */
    cookie = dmaengine_submit(desc);
    if (dma_submit_error(cookie))
        return -EIO;
    dma_async_issue_pending(priv->dma_chan);

    /* Wait for completion */
    if (!wait_for_completion_timeout(&priv->dma_complete,
                                     msecs_to_jiffies(500)))
        return -ETIMEDOUT;

    return 0;
}

static int mydrv_remove(struct platform_device *pdev)
{
    struct mydrv_priv *priv = platform_get_drvdata(pdev);

    /* Stop DMA and wait for callbacks before freeing anything */
    dmaengine_terminate_sync(priv->dma_chan);
    dma_free_coherent(&pdev->dev, priv->rx_size,
                      priv->rx_buf, priv->rx_dma_addr);
    dma_release_channel(priv->dma_chan);
    return 0;
}
```
DTS: linking device to DMA controller
```dts
mydevice@40020000 {
    compatible = "vendor,mydevice";
    reg = <0x40020000 0x100>;
    dmas = <&dma0 3>, <&dma0 4>;  /* RX channel 3, TX channel 4 */
    dma-names = "rx", "tx";       /* names used in dma_request_chan() */
};
```
Interview Q&A {:.gc-iq}
Q1: When should you choose coherent DMA over streaming DMA?
Choose coherent when the buffer is long-lived and both the CPU and device access it regularly without predictable access patterns — descriptor rings, status registers, command queues. The uncached mapping avoids the overhead of cache maintenance on every access. Choose streaming when you have a preallocated buffer used for a single transfer (or burst), after which the CPU needs full cached access. Streaming DMA maintains the CPU cache but requires explicit map/unmap to maintain coherency.
Q2: Why must you call dma_sync_single_for_cpu before reading a streaming-mapped buffer?
When a buffer is mapped with DMA_FROM_DEVICE, the CPU cache lines covering that buffer may be stale — the DMA controller wrote to RAM directly, bypassing the cache. dma_sync_single_for_cpu invalidates those cache lines, ensuring the CPU reads the freshly DMA-written data from RAM rather than the old cached values. Failing to do this results in the CPU reading stale data — a subtle, hard-to-debug bug.
Q3: What is the role of the IOMMU in DMA?
The IOMMU (Input-Output Memory Management Unit) sits between DMA-capable devices and RAM. It translates device-visible (DMA) addresses to physical memory addresses, similar to how the MMU translates CPU virtual addresses. This provides: (1) memory isolation — a buggy driver cannot direct a device to corrupt arbitrary RAM, (2) scatter-gather flattening — the IOMMU can map discontiguous physical pages as a contiguous DMA address range, (3) address range extension — a 32-bit device can access more than 4 GB of RAM via IOMMU remapping.
Q4: What is the implication of a 32-bit DMA mask on a system with more than 4 GB of RAM?
If the device can only generate 32-bit DMA addresses, it can only access the first 4 GB of physical memory. The kernel is aware of this: dma_alloc_coherent will allocate from the DMA32 zone (below 4 GB). For streaming mappings, if the buffer is above 4 GB, the kernel may allocate a bounce buffer below 4 GB, copy the data, perform the DMA, then copy back. This adds latency and CPU overhead — a strong reason to use 64-bit DMA capable hardware.
Q5: What is the difference between scatter-gather DMA and a single-buffer DMA transfer?
A single-buffer DMA transfer (dma_map_single) operates on one physically contiguous buffer. Scatter-gather (dma_map_sg) handles multiple disjoint memory regions — e.g., a network packet built from fragments in different skb pages, or user pages from a read() call mapped in via get_user_pages. Without scatter-gather, the driver would need to allocate a contiguous bounce buffer and copy data into it before each DMA transfer, wasting memory and CPU cycles.
Q6: What does dmaengine_terminate_sync do and when must you call it?
dmaengine_terminate_sync stops any in-progress or queued DMA transfers on the channel and waits until all callbacks have completed. You must call it before releasing resources (DMA buffer, channel) in the driver’s remove() function. Without it, a DMA transfer could still be running or a callback could fire after the buffer is freed, causing a use-after-free crash. The _sync variant (vs _async) guarantees no callbacks are running when it returns.
References {:.gc-ref}
| Resource | Link |
|---|---|
| kernel.org: DMA API documentation | https://www.kernel.org/doc/html/latest/core-api/dma-api.html |
| kernel.org: DMA API HOWTO | https://www.kernel.org/doc/html/latest/core-api/dma-api-howto.html |
| kernel.org: DMA Engine API | https://www.kernel.org/doc/html/latest/driver-api/dmaengine/index.html |
| kernel.org: DMA Engine provider guide | https://www.kernel.org/doc/html/latest/driver-api/dmaengine/provider.html |
| DMA mapping design (LWN) | https://lwn.net/Articles/688508/ |
| IOMMU and DMA (LWN) | https://lwn.net/Articles/741003/ |
| Linux Device Drivers 3rd ed — DMA | https://lwn.net/Kernel/LDD3/ |