How a Request Becomes Memory
The journey from TCP segment to heap allocation — and where, exactly, a packet becomes memory you can touch.
The short version, before we go deep. If you only read the next five sentences, read these. The packet has already landed in your machine’s memory before a single line of your program has run, because the network card writes it there itself, without asking the CPU for permission. The call you think of as “reading the request,” the read system call, is not where the data arrives; it is where the data is copied across a wall, and you pay for that copy in proportion to how many bytes cross it. C forces you to make every system call by hand, Go hides the part where you would otherwise block, and Rust hides nothing about the system calls themselves but stations its compiler at the door of the buffer so that nobody can misuse it. The tidy request object that shows up as the argument to your handler is, somewhat counterintuitively, the very last thing to come into existence along this whole path, and it is also the most expensive. Once you can see the path as a sequence of stages, the things that usually feel like dark magic (garbage collector pauses, mysterious tail latency, the marketing word “zero-copy”) all snap into focus, because each of them lives at an identifiable stage rather than everywhere at once.
Most of us begin our mental model of a server somewhere around here:
func handler(req Request) Response {
    // ... your code ...
}

The request is simply there. It arrives fully formed, like a parcel already sitting on the doorstep when you open the door, and so you read its fields, you do your work, and you hand back a response without ever wondering who carried the parcel up the path. This issue is about that carrying. Before the request was a struct with named fields you could reach into, it was a flat run of bytes living in your process’s address space; before it was that, it was a queue of bytes sitting inside the kernel where your code could not see it; before that, it was an ordered procession of TCP segments; and before even that, it was nothing more dignified than voltage changing on a wire. Somewhere along that chain a packet became memory you are permitted to touch, and the exact location of that transformation, together with the question of who pays for the copy that makes it happen, is the difference between a server that comfortably handles five thousand requests a second and one that handles five hundred thousand.
This is the anchor issue of the series, which means everything that comes later, the parsers and the allocators and the schedulers and the garbage-collector pauses you will eventually spend an entire weekend hunting down, refers back to the single path drawn in this figure.
1. The packet arrives before your code exists
A TCP segment shows up at your network interface card. The first fact to absorb, and it is genuinely the one that reorganizes everything else, is that the CPU does not go and fetch it. We tend to imagine the processor as the thing that does all the work, reaching out and pulling data in, and yet that is not what happens at all.
A note on what the hardware actually does. When the frame arrives, the network card writes it directly into a region of main memory that the driver set aside in advance, a circular structure usually called the receive ring, and it does this using Direct Memory Access, which is the mechanism by which a peripheral device moves bytes into RAM over the system bus without routing them through the processor. By the time anything else happens, the bytes are already resident in memory. The CPU was not involved in placing them there, and it only discovers that they exist after the fact, when the card raises a hardware interrupt to get its attention.
That interrupt is handled in two deliberately unequal halves, and understanding why the work is split this way explains a great deal about how Linux behaves under load. The first half, the part that runs the instant the interrupt fires, does almost nothing on purpose. It runs in a context where other interrupts may be disabled and where sleeping or blocking is forbidden, so if it tried to do anything substantial it would starve every other device on the machine and bring the system to its knees. Consequently it simply acknowledges the interrupt, notes that there is work to be done, and schedules that work to run slightly later in a softer context. That later context is the softirq, specifically the network-receive softirq, and it is where the real processing happens. The softirq drains the receive ring, wraps each packet in the kernel’s universal packet-carrying structure, the socket buffer that the kernel source calls sk_buff, and carries it upward through the layers of the network stack.
There is a refinement here that you will eventually meet in production whether you want to or not. If a machine is taking one interrupt for every single packet while packets are pouring in at line rate, the cost of constantly entering and leaving interrupt context can consume the whole processor, a failure mode with the evocative name of interrupt livelock. To avoid it, modern drivers use a scheme called NAPI, under which the kernel, once it notices that packets are arriving in a flood, stops taking an interrupt per packet and instead switches into a tight polling loop that repeatedly drains the ring until the flood subsides, at which point it re-enables interrupts and goes back to sleep.
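The ring and the budgeted polling loop can be pictured with a toy model. The sketch below is an illustration only: the names ring, dmaWrite, and poll are invented here, nothing in it touches real hardware, and a real driver is far more involved. It shows a fixed-size circular buffer being filled by a producer standing in for the card and drained in bounded batches by a consumer standing in for the softirq.

```go
package main

import "fmt"

// ring is a toy stand-in for a receive ring: a fixed-size circular buffer
// with a write position (the "card") and a read position (the "driver").
// All names here are hypothetical, chosen for illustration.
type ring struct {
	slots      [][]byte
	head, tail int // head: next slot to read; tail: next slot to write
	count      int // number of occupied slots
}

func newRing(n int) *ring { return &ring{slots: make([][]byte, n)} }

// dmaWrite models the card depositing one frame. It reports false when the
// ring is full, which is the toy equivalent of a dropped packet.
func (r *ring) dmaWrite(frame []byte) bool {
	if r.count == len(r.slots) {
		return false
	}
	r.slots[r.tail] = frame
	r.tail = (r.tail + 1) % len(r.slots)
	r.count++
	return true
}

// poll drains at most budget frames, in the spirit of a NAPI poll. The second
// result reports whether the ring is now empty, which in the real scheme is
// the cue to re-enable interrupts.
func (r *ring) poll(budget int) ([][]byte, bool) {
	var frames [][]byte
	for len(frames) < budget && r.count > 0 {
		frames = append(frames, r.slots[r.head])
		r.slots[r.head] = nil
		r.head = (r.head + 1) % len(r.slots)
		r.count--
	}
	return frames, r.count == 0
}

func main() {
	r := newRing(4)
	for i := 0; i < 6; i++ {
		fmt.Printf("frame %d accepted=%v\n", i, r.dmaWrite([]byte{byte(i)}))
	}
	for {
		frames, drained := r.poll(3)
		fmt.Printf("polled %d frames, ring empty=%v\n", len(frames), drained)
		if drained {
			break
		}
	}
}
```

The budget is the design point worth noticing: draining in bounded batches lets the consumer make steady progress during a flood without monopolizing the processor indefinitely, and a full ring is where packets start to be lost.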
Why this matters when you are on call. The overwhelming majority of tail-latency problems at the network edge turn out to be queueing problems rather than computation problems, which is to say the work is waiting in line somewhere rather than being slow to execute. The first time you watch a kernel thread named ksoftirqd consume an entire core in your monitoring, what you are seeing is precisely this polling loop struggling to keep up with the packet rate, and no amount of optimizing your request handler will move that number, because the bottleneck sits several stages upstream of any code you wrote.
Once the softirq has carried the packet up, the link layer strips away the Ethernet header, the IP layer checks the destination address to confirm the packet really is meant for this machine, and the TCP layer is where the loose individual segment finally becomes part of a coherent stream. TCP looks up which connection this segment belongs to by matching the four values that uniquely identify a connection: the source address, the source port, the destination address, and the destination port. Having found the right socket, it checks the sequence number, sends an acknowledgement back to the sender, throws away anything that is a duplicate, holds onto anything that has arrived out of order until its predecessors show up, and reassembles everything into a single in-order run of bytes. Those bytes are then appended to the socket’s receive buffer, and the socket is marked readable, so that if any process happens to be sleeping while it waits for data on this connection, the kernel now wakes it.
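The lookup and reassembly just described can be sketched in a few dozen lines. This is a deliberately simplified model, not the kernel's algorithm: the types fourTuple and conn and the functions receive and deliver are invented for illustration, and the sketch ignores sequence-number wraparound, partially overlapping segments, acknowledgements, and buffer limits.

```go
package main

import "fmt"

// fourTuple identifies one connection. Because it is a comparable struct,
// it can serve directly as a map key.
type fourTuple struct {
	srcAddr string
	srcPort uint16
	dstAddr string
	dstPort uint16
}

// conn is a toy socket (hypothetical, not a kernel structure).
type conn struct {
	nextSeq uint32            // sequence number of the next byte we expect
	pending map[uint32][]byte // early segments, keyed by starting sequence number
	recvBuf []byte            // in-order bytes, ready for a reader
}

func newConn(initialSeq uint32) *conn {
	return &conn{nextSeq: initialSeq, pending: make(map[uint32][]byte)}
}

// receive applies one segment to the connection.
func (c *conn) receive(seq uint32, payload []byte) {
	switch {
	case seq < c.nextSeq:
		return // duplicate of bytes already delivered: discard
	case seq > c.nextSeq:
		c.pending[seq] = payload // arrived early: hold until the gap is filled
		return
	}
	c.recvBuf = append(c.recvBuf, payload...)
	c.nextSeq += uint32(len(payload))
	// Filling a gap may release segments that were waiting behind it.
	for {
		next, ok := c.pending[c.nextSeq]
		if !ok {
			return
		}
		delete(c.pending, c.nextSeq)
		c.recvBuf = append(c.recvBuf, next...)
		c.nextSeq += uint32(len(next))
	}
}

// deliver finds the connection for a segment by its four-tuple. It reports
// false when no connection matches.
func deliver(table map[fourTuple]*conn, key fourTuple, seq uint32, payload []byte) bool {
	c, ok := table[key]
	if !ok {
		return false
	}
	c.receive(seq, payload)
	return true
}

func main() {
	key := fourTuple{srcAddr: "203.0.113.7", srcPort: 51000, dstAddr: "192.0.2.1", dstPort: 8080}
	table := map[fourTuple]*conn{key: newConn(1000)}

	deliver(table, key, 1005, []byte("index")) // early
	deliver(table, key, 1010, []byte(".html")) // early
	deliver(table, key, 1000, []byte("GET /")) // fills the gap
	deliver(table, key, 1000, []byte("GET /")) // duplicate, discarded

	fmt.Printf("%q\n", table[key].recvBuf)
}
```

Notice what the receive buffer holds at the end: a flat run of bytes, with no record that it arrived as three segments or in what order.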
This is also the moment to notice something that will haunt the rest of the series, which is that TCP gives you a stream of bytes and not a sequence of messages. The kernel does not remember, and could not remember even if it wanted to, where one of your logical requests ended and the next began, because that information was never part of TCP in the first place. What you get is an ordered river of bytes with no markers in it, and the entire apparatus of HTTP framing, with the Content-Length header as its best-known tool, exists to draw the boundaries back in by hand. That is the subject of the next issue.
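A small loopback experiment makes the missing boundaries visible. In the sketch below, the helper names sendSeparately and receiveAll are invented for illustration; the sender issues one Write per logical message, and the receiver simply reads until the connection closes. How many reads that takes is up to the kernel, but the receiver only ever sees the concatenated bytes.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// receiveAll accepts one connection and returns every byte read from it
// until the peer closes.
func receiveAll(ln net.Listener) ([]byte, error) {
	conn, err := ln.Accept()
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	return io.ReadAll(conn)
}

type result struct {
	data []byte
	err  error
}

// sendSeparately performs one Write per message over loopback TCP and
// returns what the receiving side saw.
func sendSeparately(messages ...string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	done := make(chan result, 1)
	go func() {
		data, err := receiveAll(ln)
		done <- result{data, err}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	for _, m := range messages {
		if _, err := conn.Write([]byte(m)); err != nil {
			conn.Close()
			return "", err
		}
	}
	conn.Close() // closing is the only signal the receiver gets

	r := <-done
	return string(r.data), r.err
}

func main() {
	got, err := sendSeparately("GET /a", "GET /b", "GET /c")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", got)
}
```

Three writes went in, and nothing in the received bytes says so. Any boundary the application needs has to be carried inside the bytes themselves.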
The takeaway for this stage. By the time your process is even eligible to be scheduled and run, the bytes already sit in RAM and have already been reassembled into the correct order. Your code has not executed. The thing you call “the request” does not yet exist in any form your program can use; there is only an ordered stream of bytes waiting patiently in a kernel buffer for someone to come and ask for it.
2. The read call, and the copy that nobody mentions
Now we reach the wall. The memory that belongs to the kernel and the memory that belongs to your process are separate address spaces, kept apart by the hardware itself through the memory management unit, and this separation is not an accident or an inconvenience but the foundation of the entire security and stability model of the operating system, since it is what stops one process from reading another’s secrets and what stops a buggy program from corrupting the kernel. The only sanctioned way through that wall is a system call, which is a controlled, deliberate transition from the privilege level your program runs at, called ring three on x86, down into the privileged level the kernel runs at, ring zero, where the kernel will perform some action on your behalf and then hand control back.
The system call we care about looks, in C, like this:
ssize_t n = read(conn_fd, buf, sizeof(buf));

The manual page for read describes it with admirable economy, saying that read attempts to read up to a given count of bytes from a file descriptor into a buffer that you supply. What that single calm sentence quietly omits is everything that has to happen to satisfy it. First the read call traps into the kernel, performing the privilege transition just described. Then the kernel finds the data that has been waiting in the socket’s receive buffer. Then, and this is the step the whole article has been walking toward, the kernel performs a routine called copy_to_user, which is a literal, byte-by-byte copy of the data out of the kernel’s buffer and into the buffer that you, in user space, provided as the second argument. Finally the call returns, and the value it returns tells you how the read went, because a positive number is the count of bytes that were actually copied into your buffer, a return of zero means the other end has closed the connection cleanly and you have reached the end of the stream, and a return of negative one signals an error whose nature you must look up in errno.
On the phrase “zero-copy,” which is mostly a sales pitch. That copy_to_user step is pure overhead in the sense that it does no useful transformation; it only moves bytes from one place in memory to another so they can sit on your side of the wall. When you are handling a few kilobytes per request the cost is invisible and not worth a moment’s thought, but when you are pushing gigabits per second, the bandwidth and cache disruption of copying every one of those bytes becomes a genuine and measurable fraction of your processor budget. This single copy is the entire reason that more exotic mechanisms exist, the sendfile call, the splice call, the registered buffers in io_uring, and every one of them is, at heart, a scheme to delete this one step so the bytes never have to make the round trip into user space and back. When somebody markets a system as “zero-copy,” what they almost always mean, stripped of the glamour, is simply that they found a way to avoid this userspace round trip, and nothing grander than that.
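To make the sendfile idea concrete, here is a sketch of serving a file over TCP without an explicit user-space buffer. It relies on a behavior of Go's standard library on Linux: when io.Copy is given a TCP connection as the destination and an os.File as the source, the connection can satisfy the copy with sendfile rather than a read and write loop. The helper names serveFile and fetchViaLoopback are invented for illustration, and whether the fast path is actually taken depends on the platform and Go version, so treat this as a demonstration of the shape rather than a guarantee.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

// serveFile accepts one connection and streams the file at path into it.
func serveFile(ln net.Listener, path string) error {
	conn, err := ln.Accept()
	if err != nil {
		return err
	}
	defer conn.Close()

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// No buffer of ours appears here. io.Copy hands the file to the
	// connection, which may move the bytes inside the kernel.
	_, err = io.Copy(conn, f)
	return err
}

// fetchViaLoopback writes payload to a temporary file, serves it once over
// loopback TCP, and returns what the client received.
func fetchViaLoopback(payload string) (string, error) {
	tmp, err := os.CreateTemp("", "payload-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString(payload); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	served := make(chan error, 1)
	go func() { served <- serveFile(ln, tmp.Name()) }()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer client.Close()

	got, err := io.ReadAll(client)
	if err != nil {
		return "", err
	}
	if err := <-served; err != nil {
		return "", err
	}
	return string(got), nil
}

func main() {
	got, err := fetchViaLoopback("static file contents\n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", got)
}
```

The client side still pays for its own read, of course. What the server avoided is hauling the file's bytes into user space only to push them straight back down.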
Here is the sentence to hold onto. This copy is the exact moment at which a packet becomes your memory, and not one instant before it. The bytes had been sitting in RAM the entire time, placed there by the network card through Direct Memory Access while your program was still asleep, but they lived on the kernel’s side of the wall the whole while, untouchable by your code. The read call is the doorway, and copy_to_user is the toll you pay to walk through it. In the figure above it is the band drawn in orange, and it is the only stage in the entire pipeline that charges you for a copy you did not explicitly request.
The takeaway for this stage. The read call is not “getting the data,” because the data already existed and was already in memory. The read call is copying that data across a protection boundary, and that copy is the very first thing you should put on the budget when throughput, rather than mere correctness, is what you are chasing.
3. The same server, written three times, so you can watch the abstraction shrink
This is the demonstration that makes the whole mental model click into place. We are going to write the simplest server that still exercises the full path, an echo server that reads whatever bytes arrive and writes those same bytes straight back, and we are going to write it three times in three languages. Underneath all three sit the same handful of system calls, the ones for creating a socket, binding it to an address, marking it as listening, accepting a connection, and then reading and writing on that connection, and the interesting thing is to watch how each successive language hides more of that machinery than the last. Every line of comfortable high-level code you have ever written is sitting on top of exactly this.
The C version, where nothing is hidden
In C there is no runtime standing between you and the kernel, so what you write is very nearly what the machine does, and the two versions that follow are best understood as polite wrappers around this one.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 8080
#define BUFSIZE 4096

int main(void) {
    // Step one. Ask the kernel for a TCP endpoint. What comes back is a file
    // descriptor, which is just a small integer that indexes into a table the
    // kernel keeps for this process.
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); exit(1); }

    // Let the port be rebound promptly after a restart.
    int opt = 1;
    if (setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0) {
        perror("setsockopt"); exit(1);
    }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY; // listen on every interface
    addr.sin_port = htons(PORT);       // htons puts the port in network byte order

    // Step two. Pin that endpoint to a concrete address and port.
    if (bind(listen_fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }

    // Step three. Flip the endpoint into passive mode, after which the kernel
    // begins quietly queueing up incoming connections on our behalf.
    if (listen(listen_fd, 128) < 0) { perror("listen"); exit(1); }

    for (;;) {
        // Step four. Take one finished connection off that queue. The kernel
        // hands back a brand new file descriptor that refers only to this client.
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) { perror("accept"); continue; }

        char buf[BUFSIZE]; // this is the landing zone in user space
        ssize_t n;

        // Step five. read runs copy_to_user, moving bytes from the kernel
        // socket buffer into buf, and returns how many it moved.
        while ((n = read(conn_fd, buf, sizeof(buf))) > 0) {
            // Step six. write does the mirror image, copying from buf back
            // into the kernel's send buffer for this connection. It may accept
            // fewer bytes than we offered, so keep going until all n are sent.
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(conn_fd, buf + off, (size_t)(n - off));
                if (w < 0) { perror("write"); break; }
                off += w;
            }
            if (off < n) break; // the write failed, so give up on this client
        }
        close(conn_fd);
    }
}

Six system calls hold the entire thing upright: socket, bind, listen, accept, read, and write, with setsockopt and close in supporting roles. The buffer declared as buf is the very same “buffer starting at buf” that the manual page speaks of, the concrete spot in your address space where the kernel’s bytes come to rest. There is no magic anywhere in it, and there is also no concurrency anywhere in it, because this server can attend to exactly one client at a time and will sit blocked, doing nothing, while a slow client dawdles. Every real-world server you have ever used is, in one way or another, a strategy for solving that single limitation, and the next two versions are two such strategies.
The Go version, where the system calls remain but the blocking does not
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// net.Listen folds socket, bind, and listen into one call, and crucially
	// it sets the underlying descriptor to non-blocking so the runtime's own
	// network poller can take responsibility for it.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	for {
		// Accept wraps the accept system call. If no connection is waiting,
		// the goroutine parks itself and the underlying operating-system thread
		// is freed to go and run some other goroutine in the meantime.
		conn, err := ln.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go handle(conn) // one goroutine per connection, which is cheap to create
	}
}

func handle(conn net.Conn) {
	defer conn.Close()
	// An explicit loop keeps the system calls visible: each Read bottoms out
	// in the read system call and each Write in the write system call, exactly
	// as the C version did. (io.Copy(conn, conn) would be shorter, but on Linux
	// the standard library may service it with splice rather than read and write.)
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			if _, werr := conn.Write(buf[:n]); werr != nil {
				return
			}
		}
		if err != nil {
			if err != io.EOF {
				log.Println(err)
			}
			return
		}
	}
}

Look closely and you will see that the system calls have not gone anywhere. The net.Listen call is doing the work of socket and bind and listen. The Accept method is doing the work of accept. The read on the connection is doing the work of the read system call. So if the calls are identical, the natural question is what Go has actually added, and the answer is worth dwelling on.
What the line that reads from a connection is really doing. A call to read from a connection in Go looks, to the eye, exactly like a call that blocks until data arrives. Underneath, though, Go has set the descriptor to non-blocking and has registered it with the kernel’s readiness-notification machinery, which on Linux is epoll. When you call read and there is in fact no data ready yet, the Go runtime does not block the operating-system thread the way the C program would. Instead it parks the goroutine, which is a lightweight thread of execution managed entirely by the Go runtime rather than by the kernel, and it lets the freed operating-system thread go and run a different goroutine, and then later, when epoll reports that the descriptor has become readable, the runtime’s network poller wakes your goroutine back up and the read finally proceeds. The result is that you get to write plain, sequential, blocking-looking code, while the runtime quietly executes it as efficient event-driven, non-blocking input and output beneath you. That illusion, synchronous-looking code running on asynchronous machinery, is the single most valuable thing the Go runtime does for a network server, and now you know that the trick is really just epoll wearing a goroutine for a costume.
The Rust version, where the compiler stands guard over the buffer
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    // bind performs socket, bind, and listen beneath the surface.
    let listener = TcpListener::bind("0.0.0.0:8080")?;

    for stream in listener.incoming() { // one TcpStream for each accept
        let mut stream = stream?;
        thread::spawn(move || {
            let mut buf = [0u8; 4096]; // a stack buffer owned by this closure
            loop {
                // read runs copy_to_user; Ok(0) is end of stream
                match stream.read(&mut buf) {
                    Ok(0) => break,
                    Ok(n) => {
                        if stream.write_all(&buf[..n]).is_err() { break; }
                    }
                    Err(_) => break,
                }
            }
            // When the closure ends, stream goes out of scope and is dropped,
            // and dropping it closes the descriptor, deterministically, with no
            // garbage collector deciding the timing.
        });
    }
    Ok(())
}

The same six system calls are here once again. What Rust contributes is not a runtime, because the standard library’s networking blocks just as plainly as the C version does, but rather a type system that posts itself as a sentry at the door of the buffer. When you call read and pass it a mutable reference to buf, the compiler proves, before the program is ever allowed to run, that no other part of the code is simultaneously holding a reference to those same bytes, which is how Rust rules out an entire category of data races and aliasing bugs at compile time rather than discovering them in production. And when the closure finishes and the stream value falls out of scope, the language runs its destructor automatically and closes the descriptor at a precisely determined moment, without any garbage collector and without you having to remember to do it. In other words you get the same lean cost model that C gives you, but with a compiler that refuses to let you alias the buffer and that closes the descriptor for you. If you needed real throughput rather than a teaching example you would reach for the tokio library, whose asynchronous reactor is built on a crate called mio, which on Linux wraps epoll, which is to say it is the very same idea as the Go network poller, only here it is made explicit and opt-in rather than woven invisibly into the language.
The takeaway for this stage. Three languages, and beneath them one identical set of system calls. C requires you to make every call yourself. Go leaves the calls in place but spirits away the blocking. Rust hides nothing about the calls at all and instead polices the buffer they read into. Not one of the three, however clever, can hide the copy.
4. From a buffer of bytes to a struct, which is where the heap finally appears
The echo server has been cheating this whole time, in a way that was convenient for the lesson but dishonest about real life, because it never once interpreted the bytes it handled; it merely bounced them. A server you would actually deploy has to take that buffer of raw octets and turn it into something with named fields you can reason about, and it does so in two distinct steps that are worth keeping separate in your mind.
The first step is parsing. An HTTP parser scans across the buffer hunting for structure, identifying where the request line sits, then walking through the headers, which are delimited from one another by carriage-return and line-feed pairs, then finding the blank line that signals the end of the headers, and finally locating the body. All of this is, at bottom, arithmetic over positions in the buffer, and a parser built for speed, such as the httparse crate in Rust, will try very hard to hand you slices that point back into the original buffer rather than copying anything, precisely because copying is the expense it is trying to avoid. Go’s standard library takes a more conventional route and copies header names and values into strings, though it works to keep the number of allocations small.
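Parsing as arithmetic over positions can be shown in miniature. The sketch below is a toy, not a conforming HTTP parser, and parseHead and its types are names invented for illustration. It finds the blank line, splits the head on line endings, and returns fields that are sub-slices of the input buffer. The only memory it allocates is the small bookkeeping that records where the pieces are; the bytes of the method, target, and header values are never copied.

```go
package main

import (
	"bytes"
	"fmt"
)

type header struct{ name, value []byte }

// head holds views into the original buffer, not copies of it.
type head struct {
	method, target, version []byte
	headers                 []header
}

var (
	crlf     = []byte("\r\n")
	crlfcrlf = []byte("\r\n\r\n")
)

// parseHead locates the request line, headers, and body within buf. It
// reports ok=false when the blank line has not arrived yet or the head is
// malformed.
func parseHead(buf []byte) (h head, body []byte, ok bool) {
	end := bytes.Index(buf, crlfcrlf)
	if end < 0 {
		return head{}, nil, false // need more bytes
	}
	lines := bytes.Split(buf[:end], crlf) // each line is a sub-slice of buf
	parts := bytes.SplitN(lines[0], []byte(" "), 3)
	if len(parts) != 3 {
		return head{}, nil, false
	}
	h = head{method: parts[0], target: parts[1], version: parts[2]}
	for _, line := range lines[1:] {
		colon := bytes.IndexByte(line, ':')
		if colon < 0 {
			return head{}, nil, false
		}
		h.headers = append(h.headers, header{
			name:  line[:colon],
			value: bytes.TrimSpace(line[colon+1:]),
		})
	}
	return h, buf[end+len(crlfcrlf):], true
}

func main() {
	buf := []byte("GET /index.html HTTP/1.1\r\nHost: example.com\r\nContent-Length: 2\r\n\r\nhi")
	h, body, ok := parseHead(buf)
	if !ok {
		panic("incomplete or malformed head")
	}
	fmt.Printf("%s %s %s, %d headers, body %q\n",
		h.method, h.target, h.version, len(h.headers), body)

	// The fields alias buf, so overwriting the buffer changes what they show.
	buf[0] = 'P'
	fmt.Printf("method after overwrite: %s\n", h.method)
}
```

The final two lines show the price of not copying. The parsed fields are only valid for as long as the buffer is left alone, which is exactly the lifetime problem that the next step solves by allocating.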
The second step is deserialization, and this is where the heap finally enters the story in earnest. When you take the body and turn it into a typed value, calling something like Unmarshal in Go or from_slice in Rust to produce a real struct, the deserializer generally has to allocate. It allocates every string field as a fresh copy of those bytes, because the struct has to own its data and outlive the temporary buffer the bytes first arrived in, and it allocates the slices and the map entries and everything else the shape of your data demands, often including the struct itself. The flat, anonymous river of bytes is thereby transformed into a sprawling tree of individually owned objects scattered across the heap, and that tree is what your handler finally receives as its tidy request argument.
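Both halves of that claim, the copying and the allocating, can be checked with the standard library. The sketch below decodes a small JSON body into a struct, then overwrites the source buffer to show that the struct is unaffected, and uses testing.AllocsPerRun to count heap allocations per decode. The order type and the decode helper are invented for illustration, and the exact allocation count depends on the Go version, so the program prints it rather than assuming a value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// order is a hypothetical request body.
type order struct {
	Customer string   `json:"customer"`
	Items    []string `json:"items"`
}

// decode turns raw bytes into an owned, typed value.
func decode(body []byte) (order, error) {
	var o order
	err := json.Unmarshal(body, &o)
	return o, err
}

func main() {
	body := []byte(`{"customer":"ada","items":["disk","cable"]}`)
	o, err := decode(body)
	if err != nil {
		panic(err)
	}

	// Scribble over the receive buffer, as the next read would.
	for i := range body {
		body[i] = 'x'
	}
	// The struct owns copies of its strings, so it is unaffected.
	fmt.Println(o.Customer, o.Items)

	fresh := []byte(`{"customer":"ada","items":["disk","cable"]}`)
	allocs := testing.AllocsPerRun(100, func() {
		if _, err := decode(fresh); err != nil {
			panic(err)
		}
	})
	fmt.Println("heap allocations per decode:", allocs)
}
```

Multiply that per-decode figure by your request rate and you have a rough sense of how much work this one stage hands to the garbage collector.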
Why seeing the pipeline reorganizes how you think about performance. Once the whole sequence of stages is laid out in front of you, the problems that ordinarily feel mysterious resolve themselves into specific locations. When your garbage collector is under pressure, the culprit is overwhelmingly this last stage, the deserialization allocations, far more than anything that came before it. When you are fighting tail latency under heavy load, the trouble is frequently back at the first stage, in the NAPI and softirq machinery, or in the cost of waking up from epoll at the stage where you accept and read. And when a framework advertises itself as “zero-copy,” what it is really attacking is the copy_to_user step in the middle and the string-copying in this final step. Nearly every performance optimization you will read about lives at an identifiable node in the first figure, which is exactly why being able to picture the figure is worth more than memorizing any individual trick.
The takeaway for this stage. The request that seemed to be “just there” at the top of your handler is in fact the finished product of a long assembly line: Direct Memory Access deposits the bytes into a ring, a softirq carries them up the stack, TCP reassembles them into order, copy_to_user hauls them across the wall, a parser slices them into pieces, and a deserializer at last allocates them into heap objects. The object is the final thing to come into being, and it is the most expensive.
The five things to carry away
First, the packets arrive before your handler exists at all, because Direct Memory Access deposits them in RAM with no involvement from the processor, and your code does not run until considerably later.
Second, TCP gives you a stream of bytes rather than a set of messages, which means the boundaries between your requests are your own responsibility to reconstruct, and reconstructing them is precisely what framing, the topic of the next issue, is for.
Third, the read call is the first major copy in the whole pipeline, since copy_to_user is the toll charged for crossing the wall between kernel space and user space, and it is charged per byte.
Fourth, every language rides on the very same system calls, and the only thing that distinguishes them is what each chooses to hide, with C hiding nothing, Go hiding the blocking, and Rust guarding the buffer.
Fifth, heap allocation happens considerably later than most engineers assume, arriving at the deserialization step rather than at the moment of receiving, and that step is where the bill for your garbage collector is quietly written.
Coming next, in issue 1.2: where does a request end?
TCP flatly refuses to tell you where one request stops and the next one begins. Next time we go hunting for the answer in message framing, looking at how HTTP carves a single request out of a boundaryless stream of bytes, and at why the choice between a Content-Length header, chunked transfer encoding, and simply reading until the connection closes is a decision the kernel forces upon you rather than a detail the specification dreamed up for its own amusement.
Sources and further reading
W. Richard Stevens, in the first volume of Unix Network Programming, gives the canonical treatment of the socket, bind, listen, and accept calls, and chapters four through six are essentially the C server above explained at far greater length.
The Linux manual pages for read, socket, accept, and the socket overview in section seven are the primary source for what these calls actually promise, and they are shorter and more reliable than most blog posts on the subject, this one included.
The Go standard library source is where the abstractions in the Go example resolve into real system calls, particularly the net package in the files net/dial.go and net/fd_unix.go, the internal poll package, and the runtime network poller in netpoll.go and netpoll_epoll.go.
For Rust, the standard library net module documents the blocking model, while the mio and tokio crates document the epoll-based asynchronous reactor that powers high-throughput servers.
And for the kernel itself, the softirq receive path lives in net/core/dev.c, in functions such as napi_poll and netif_receive_skb, while the TCP receive logic lives in net/ipv4/tcp_input.c, should you ever want to read the genuine article rather than a description of it.
