Data Fundamentals Primer

The minimum data plumbing every ML pipeline needs. Five short topics covering what a dataset actually is, the features-vs-labels split, the train / validation / test partition that keeps you honest, the bytes underneath every string (ASCII and UTF-8 — the format LLMs actually consume), and the standardize-and-clean steps that quietly run before any model sees a number. Math-light; intuition-heavy.

Dataset

A pile of examples — that's where every model's knowledge actually comes from.

A dataset is, mechanically, just a list. Each entry in the list is one example of the thing you want the model to learn about — an email, a photo, a sentence, a transaction, a CT scan. The list might be 50 entries or 50 billion; the principle is the same. Whatever a model "knows," it knows because it was shown enough examples for the pattern to be obvious in the data.

Vocabulary you'll see used interchangeably for one entry: sample, example, instance, row, record, data point, observation. They all mean the same thing — one self-contained unit the model will see during training. Pick whichever word your team uses and stop worrying about it.

The simplest mental model is a spreadsheet. One row per example, one column per piece of information about it. Here's a sketch of a "predict house price" dataset:

  sqft    bedrooms   age    zip      price
  ────────────────────────────────────────────
   850       1        12   94110     820,000
  1450       3         8   94110   1,300,000
  2100       4        15   94114   1,720,000
  3200       5         3   94114   2,650,000
   600       0        22   94103     480,000
  ...        ...      ...   ...        ...

Each row is one sample. The dataset grows by adding more rows.

That table is a dataset. Five examples shown, presumably many thousands more not shown. Every column is a fact about a house; every row is one house. As long as the spreadsheet analogy fits, this is what "dataset" means.
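The spreadsheet picture maps directly onto code. Below is a minimal sketch, assuming a plain Python list of dicts is an acceptable stand-in for a real table library; the variable names are illustrative, and storing zip as a string is a choice made here, not something the table dictates.

```python
# A dataset as a plain list: one dict per example (row), one key per column.
# Values are copied from the five rows shown in the table above.
dataset = [
    {"sqft": 850,  "bedrooms": 1, "age": 12, "zip": "94110", "price": 820_000},
    {"sqft": 1450, "bedrooms": 3, "age": 8,  "zip": "94110", "price": 1_300_000},
    {"sqft": 2100, "bedrooms": 4, "age": 15, "zip": "94114", "price": 1_720_000},
    {"sqft": 3200, "bedrooms": 5, "age": 3,  "zip": "94114", "price": 2_650_000},
    {"sqft": 600,  "bedrooms": 0, "age": 22, "zip": "94103", "price": 480_000},
]

num_rows = len(dataset)             # number of examples
columns = list(dataset[0].keys())   # the facts recorded about each house

# Sanity check: every row carries exactly the same columns.
same_shape = all(list(row.keys()) == columns for row in dataset)

print(num_rows, columns, same_shape)
```

Adding a sixth house is one more `append`; the structure of the dataset does not change.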

Three quick observations to set up the rest of this primer:

Size matters but isn't everything. Deep learning loves big data — ImageNet has 1.2 million images; modern LLMs train on trillions of tokens — but a small, carefully curated dataset can beat a huge sloppy one. "Garbage in, garbage out" is the oldest rule in ML, and it's still true.
The rows have to look alike. Every row in a dataset should be the same kind of thing, with the same columns, drawn from a population you care about. Mixing apartments and shipping containers in a "house price" dataset just gives the model a harder job than it needs.
Not all data is tabular. Images are 3-D arrays (height × width × channels), audio is a long sequence of samples, text is a string. The spreadsheet picture still works — each row is one image or one document — but each "cell" might itself be huge.

The two big questions a dataset has to answer, which the next two sections unpack: What information does each row carry? (features and labels) and How do we keep the model from cheating? (train / validation / test split). Everything else builds on those.

In a Transformer: the dataset for a modern LLM is "all the text we could get our hands on" — Common Crawl, GitHub, books, papers, code, conversations. Trillions of tokens. There's no labels file alongside it; the prompt itself is the question and the next token is the answer, billions of times per epoch. Everything else in this primer — features, labels, splits, encoding, cleaning — applies, just with a vocabulary tuned to sequences of bytes instead of spreadsheet rows.

Features & Labels

Split each row into "what the model sees" and "what the model has to predict."

Section 1's dataset is just a pile of rows. To turn that pile into a learning problem you split each row into two parts: features — the columns the model gets to look at — and the label — the column you're asking it to predict. That split, repeated across every row in the dataset, is what makes "training a model" a meaningful operation.

Conventional notation, used across almost every ML paper:

x — the features of one example. Usually a vector of numbers; sometimes an image, a string, a graph.
y — the label for that example. A single number, a category, or sometimes itself a structured thing.
One row of the dataset = one (x, y) pair. The whole dataset = a list of (x, y) pairs.

Back to the housing example from Section 1. To learn "given a house, predict its price," the price column is the label and everything else is features:

  features (x)                       label (y)
  ────────────────────────────────  ───────────
  sqft  bedrooms  age   zip          price
  ────────────────────────────────  ───────────
   850     1       12   94110         820,000
  1450     3        8   94110       1,300,000
  2100     4       15   94114       1,720,000
  ...

Same table, two roles: the columns the model sees, and the column it predicts.

Picking the right split is the entire framing of the problem. Same dataset, different choices of label, give you completely different models:

Label = price → a model that estimates house value.
Label = sold within 30 days? → a model that predicts how fast a listing will move.
Label = zip code (using sqft, bedrooms, age as features) → a model that guesses neighborhood from architecture.
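The reframing above can be sketched in a few lines. This is a minimal illustration, assuming rows are stored as dicts; `make_pairs` is a hypothetical helper name, not a library function.

```python
rows = [
    {"sqft": 850,  "bedrooms": 1, "age": 12, "zip": "94110", "price": 820_000},
    {"sqft": 1450, "bedrooms": 3, "age": 8,  "zip": "94110", "price": 1_300_000},
    {"sqft": 2100, "bedrooms": 4, "age": 15, "zip": "94114", "price": 1_720_000},
]

def make_pairs(rows, feature_columns, label_column):
    """Turn each row into one (x, y) pair: a feature list and a label."""
    return [
        ([row[name] for name in feature_columns], row[label_column])
        for row in rows
    ]

# Label = price: a house-value problem.
price_pairs = make_pairs(rows, ["sqft", "bedrooms", "age", "zip"], "price")

# Same rows, different label: a "guess the neighborhood" problem.
zip_pairs = make_pairs(rows, ["sqft", "bedrooms", "age"], "zip")

print(price_pairs[0])
print(zip_pairs[0])
```

Nothing about the data changed between the two calls; only the choice of which column plays the role of y.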

The label's type drives the choice of model and loss function:

Continuous label → regression. Price in dollars; temperature tomorrow; user click-through rate. Loss is usually mean squared error.
Discrete label, 2 choices → binary classification. Spam vs. not-spam; will-default vs. won't-default. Loss is usually binary cross-entropy.
Discrete label, many choices → multi-class classification. Which of 1,000 ImageNet categories is this photo? Loss is cross-entropy.
Structured label → object detection (label = a list of bounding boxes + classes), translation (label = a sentence in another language), generation (label = the next token).
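The three losses named above are short enough to write out by hand. The sketch below uses only the standard library and assumes predicted probabilities are strictly between 0 and 1; real frameworks add numerical safeguards that are omitted here.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Regression: average squared gap between prediction and label."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Binary classification: y_true is 0 or 1, p_pred is P(class = 1)."""
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, p_pred)
    )
    return -total / len(y_true)

def cross_entropy(true_class, probs):
    """Multi-class, one example: negative log-probability of the true class."""
    return -math.log(probs[true_class])

mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
bce = binary_cross_entropy([1], [0.5])
ce = cross_entropy(0, [0.5, 0.25, 0.25])
print(mse, bce, ce)
```

Binary cross-entropy is the two-class special case of cross-entropy, which is why the last two values agree when the true class is given probability 0.5.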

A subtlety worth flagging: not every dataset comes with a label. Unsupervised learning works on pure x, looking for structure on its own — clustering, dimensionality reduction, density estimation. And there's a beautiful middle ground called self-supervised learning where the data itself provides a label, no human annotation needed. Hide part of an image and ask the model to fill it in. Take a sentence, hide a word, ask the model to guess it. The "label" was already in the data; we just had to peek at it differently.

In a Transformer: an LLM is the world's most expensive self-supervised model. The dataset is "a giant pile of text." The features for one training example are "the first N tokens of some sequence"; the label is "the (N+1)-th token." That's it. No human ever wrote a label file. The model learns by trying to predict the next token, billions of times, and the trillions of tokens of training text contain trillions of perfectly aligned (x, y) pairs hiding in plain sight, roughly one per token.
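A rough sketch of how one sequence yields many (x, y) pairs. The token list is hypothetical and already split into words for readability; a real pipeline would use integer token IDs.

```python
def next_token_pairs(tokens):
    """Return (x, y) pairs where x is the first n tokens and y is token n+1."""
    return [(tokens[:n], tokens[n]) for n in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]   # hypothetical pre-tokenized text
pairs = next_token_pairs(tokens)

for x, y in pairs:
    print(x, "->", y)
```

A sequence of length L gives L − 1 training pairs, which is where the "labels for free" come from.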

Train / Validation / Test

Cut the dataset into three chunks so the model can't cheat at exam time.

Imagine a teacher who hands out a stack of practice problems, then gives the same problems on the exam. Every student gets 100%. We learn nothing about who actually understood the material. That's the problem the train / validation / test split solves. It carves the dataset into three disjoint chunks with three different jobs, so the model's reported performance reflects its real ability instead of its memory.

The three roles:

Training set — the rows the model sees during fitting. The optimizer looks at these and adjusts the weights. Typically 70–90% of the dataset.
Validation set — held-out rows the model never trains on, used to decide between candidate models. Try learning rate 1e-3 vs. 1e-4? A 12-layer model vs. a 24-layer model? Whichever has the lower validation loss wins. Often called the dev set. Typically 5–15%.
Test set — kept under tight lock and key, opened once at the end of the project to report the final number. Looking at it before you've frozen your final model defeats its purpose. Typically 10–20%.

A concrete split on a 10,000-row dataset:

  ┌────────────── 10,000 rows ───────────────┐
  │  train 8,000     val 1,000   test 1,000  │
  └──────────────────────────────────────────┘
         80%            10%          10%

Three roles, three disjoint chunks. The test set stays locked until the very end.
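The 80/10/10 cut above can be sketched as a shuffle followed by two slices. This assumes the rows are interchangeable; `shuffle_split` and its parameters are illustrative names, and the fixed seed is there only to make the split reproducible.

```python
import random

def shuffle_split(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then slice into three disjoint chunks."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]      # whatever is left over
    return train, val, test

train, val, test = shuffle_split(range(10_000))
print(len(train), len(val), len(test))
```

Because the three chunks are slices of one shuffled list, no row can land in two of them.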

Why three pieces and not two? Because the moment you use validation results to change something — a hyperparameter, a model architecture, a tokenizer — the validation set has started to leak its information into your model. After enough of these rounds, "validation accuracy" stops being a fair estimate of how the model will do on new data. The test set, untouched until the very end, exists exactly to give you a number you can trust at the finish line.

Two failure modes are worth memorizing because everyone hits them:

Data leakage. Rows from the test set sneak into training. Could be duplicates of the same row, or near-duplicates (the same news article rewritten), or subtle (using future information to predict the past). When test scores look "too good to be true," they usually are — leakage somewhere.
Distribution shift between split and reality. You split a dataset of 2018 emails 80/10/10, but you deploy in 2026. The test set was held out honestly, but it's also eight years stale. Test performance can look great and deployment performance can still be terrible.

How exactly do you split? For most datasets you just shuffle and slice — the rows are interchangeable, so a random 80/10/10 cut is fine. Three cases where shuffling is wrong:

Time-series. Predicting tomorrow's stock price using future data in training is cheating. Split by time: train on the past, validate on the recent past, test on the present.
Grouped data. 100 photos of 10 cats (10 per cat). Putting different photos of the same cat in train and test leaks; the model learns to recognize this cat, not "cats." Split by cat (the group), not by photo.
Imbalanced classes. 99% normal traffic, 1% fraud. A random split might accidentally leave the test set with no fraud cases at all. Use stratified sampling — fix the class ratio inside each split.
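The three exceptions above each replace "shuffle the rows" with a different rule. Below is a minimal standard-library sketch; the function names, the dict-based row format, and the example data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def time_split(rows, time_key, train_frac=0.8, val_frac=0.1):
    """Oldest rows train, newer rows validate, newest rows test. No shuffling."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_train = round(len(ordered) * train_frac)
    n_val = round(len(ordered) * val_frac)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

def group_split(rows, group_key, test_frac=0.2, seed=0):
    """Send whole groups to train or test so no group straddles the split."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

def stratified_split(rows, label_key, test_frac=0.2, seed=0):
    """Split each class separately so both sides keep the class ratio."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label_key]].append(r)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_frac))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Time-series: ten days, in order.
days = [{"t": t} for t in range(10)]
t_train, t_val, t_test = time_split(days, "t")

# Grouped: 100 photos of 10 cats, 10 photos per cat.
photos = [{"cat": c, "photo": p} for c in range(10) for p in range(10)]
g_train, g_test = group_split(photos, "cat")

# Imbalanced: 1% fraud, 99% normal.
traffic = [{"id": i, "label": "fraud" if i < 10 else "normal"} for i in range(1000)]
s_train, s_test = stratified_split(traffic, "label")
fraud_in_test = sum(r["label"] == "fraud" for r in s_test)
```

The `max(1, ...)` guard guarantees every class and at least one group reaches the test side; for a class with a single member that leaves nothing to train on, so very rare classes need separate handling.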

In a Transformer: the LLM equivalent of validation/test is a fixed set of held-out documents the model is never allowed to see during training — typically a slice of Common Crawl set aside before training begins, plus curated benchmarks (HellaSwag, MMLU, GSM8K, etc.). Loss on the held-out documents is what researchers report as "validation loss" or "eval loss"; the curated benchmarks are the public scoreboard. Leakage is a real risk: if benchmark questions ever leak into the training corpus, the benchmark stops measuring anything real. Detecting and removing such contamination is half the work of evaluating modern LLMs.
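One simple way to look for the contamination described above is to check whether word n-grams from a benchmark item also appear in a training document. The sketch below is an illustrative heuristic only, not the method any particular lab uses; the n-gram size, the whitespace tokenization, and the example strings are all assumptions.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(benchmark_item, training_doc, n=5):
    """Fraction of the benchmark item's n-grams found in the training doc."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(training_doc, n)) / len(item_grams)

question = "What is the capital of France and why"            # hypothetical item
leaked_doc = "quiz answers: what is the capital of france and why it matters"
clean_doc = "Paris is lovely in the spring and the cafes are busy"

leaked_score = overlap_fraction(question, leaked_doc)
clean_score = overlap_fraction(question, clean_doc)
print(leaked_score, clean_score)
```

A high score flags a document for closer inspection; exact-match n-grams will miss paraphrased leaks, which is part of why contamination is hard to rule out.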

Text Encoding

Strings don't exist in a computer. Bytes do.

A "string" like "hello" is a convenient lie your programming language tells you. Under the hood, every text file, every chat message, every line of source code is a sequence of bytes — integers between 0 and 255. A text encoding is the rulebook that maps between human-readable characters and those bytes. Two encodings matter for ML: ASCII (the simple one) and UTF-8 (the one the whole modern internet uses).

ASCII is the original 1963 standard. It assigns each of 128 characters a single byte. The Latin alphabet, digits, punctuation, a few control codes — that's it. Plenty for English text in 1963, woefully insufficient for anything outside it.

  "hello"  →  104  101  108  108  111
              'h'  'e'  'l'  'l'  'o'

  5 characters → 5 bytes (one byte each).

ASCII fails the moment you need a letter it didn't plan for. é? 中? 👋? None of them exist in 7-bit ASCII. The fix was Unicode: assign every character that has ever appeared in human writing a unique number called a code point. Over 150,000 of them now, covering scripts, symbols, math, emoji — basically everything. h is code point U+0068, 中 is U+4E2D, 👋 is U+1F44B.
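Python makes both halves of this story easy to check: the ASCII bytes behind "hello", the failure on a character ASCII never planned for, and the Unicode code points quoted above. A small sketch using only built-ins:

```python
text = "hello"
ascii_bytes = list(text.encode("ascii"))        # one integer per character
round_trip = bytes(ascii_bytes).decode("ascii")

# ASCII has no slot for an accented letter, so encoding raises an error.
try:
    "é".encode("ascii")
    accent_fits_in_ascii = True
except UnicodeEncodeError:
    accent_fits_in_ascii = False

# ord() returns the Unicode code point of a single character.
code_points = {ch: f"U+{ord(ch):04X}" for ch in ["h", "中", "👋"]}

print(ascii_bytes, round_trip, accent_fits_in_ascii, code_points)
```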

But code points are abstract numbers, not bytes. You still need to encode them into bytes to put them on disk or send them over a wire. That's where UTF-8 comes in. UTF-8 is a variable-length encoding:

ASCII characters (code points 0–127) → 1 byte. Identical to ASCII.
Latin accents, Greek, Cyrillic, Hebrew, Arabic (128–2047) → 2 bytes.
Chinese, Japanese, Korean, most other scripts (2048–65535) → 3 bytes.
Emoji, rare symbols, ancient scripts (65536+) → 4 bytes.
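The four ranges above translate into a tiny lookup. The sketch below defines a hypothetical helper, `utf8_length`, and cross-checks it against Python's own encoder for one character from each range.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a given Unicode code point."""
    if code_point < 0x80:       # 0 to 127
        return 1
    if code_point < 0x800:      # 128 to 2047
        return 2
    if code_point < 0x10000:    # 2048 to 65535
        return 3
    return 4                    # 65536 and above

# Compare the rule against the real encoder: (predicted, actual) per character.
checks = {ch: (utf8_length(ord(ch)), len(ch.encode("utf-8"))) for ch in "hé中👋"}
print(checks)
```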

A worked example showing why "string length" is treacherous:

  "hi 中👋"

  characters:     h       i       ' '     中         👋
  code points:    U+0068  U+0069  U+0020  U+4E2D     U+1F44B
  UTF-8 bytes:    68      69      20      E4 B8 AD   F0 9F 91 8B
  byte count:     1       1       1       3          4

  5 visible characters → 10 bytes

Five visible characters become ten bytes. UTF-8 spends one byte on ASCII, three on 中, four on the emoji.
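The breakdown above can be reproduced directly. This sketch builds the string from the five characters listed in the table, then asks Python for each code point and its UTF-8 bytes (`bytes.hex` with a separator needs Python 3.8 or later).

```python
chars = ["h", "i", " ", "中", "👋"]   # the five characters from the breakdown
text = "".join(chars)

per_char = [
    (ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ").upper())
    for ch in text
]
total_bytes = len(text.encode("utf-8"))

for row in per_char:
    print(row)
print(len(text), "characters ->", total_bytes, "bytes")
```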

Three properties of UTF-8 that make it the default:

Backward-compatible with ASCII. Any old ASCII file is already valid UTF-8 — the bytes are bit-for-bit identical. Forty years of English text just works.
Self-synchronizing. Bytes that start a character look different from bytes that continue one, so even if you start reading mid-stream you can recover quickly. Useful in networks where packets get lost or chunked weirdly.
Universal. One encoding covers every script in human history. Mismatched encodings used to be a daily source of pain ("mojibake": çåüå€ where Chinese should be). UTF-8 made that mostly extinct.
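Self-synchronization comes from a bit pattern: every continuation byte looks like 10xxxxxx, and no byte that starts a character does. A minimal sketch, with `resync` as a hypothetical helper name:

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

def resync(data, start):
    """Index of the first character boundary at or after `start`."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i

data = "中👋".encode("utf-8")     # E4 B8 AD F0 9F 91 8B
boundary = resync(data, 1)        # start reading in the middle of 中
recovered = data[boundary:].decode("utf-8")
print(boundary, recovered)
```

Starting one byte into 中, the reader skips at most two continuation bytes and is back on a clean boundary.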

One bear trap to remember: byte count ≠ character count ≠ display width. The string "中" has length 1 in Python (1 character), length 3 in raw bytes, and takes up roughly 2 columns in a terminal. "👋" is 1 user-visible thing, but in JavaScript its .length is 2 (it's a "surrogate pair" in UTF-16), and in UTF-8 it's 4 bytes, and in a terminal it might draw 1 or 2 columns wide depending on the font. Truncating strings naively to "100 characters" breaks in interesting ways the first time emoji shows up.
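The different "lengths" are easy to measure side by side. The sketch below also includes one possible way to truncate by byte budget without cutting a character in half; `truncate_utf8` is a hypothetical helper, and it relies on the decoder dropping the incomplete trailing bytes.

```python
import unicodedata

han = "中"
wave = "👋"

han_chars = len(han)                                 # Python counts code points
han_utf8 = len(han.encode("utf-8"))                  # bytes on disk
han_width_class = unicodedata.east_asian_width(han)  # "W" means wide

wave_utf16_units = len(wave.encode("utf-16-le")) // 2   # what JavaScript .length counts
wave_utf8 = len(wave.encode("utf-8"))

def truncate_utf8(text, max_bytes):
    """Cut to at most max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # A character cut in half leaves incomplete trailing bytes; drop them.
    return clipped.decode("utf-8", errors="ignore")

short = truncate_utf8("hi 中👋", 5)   # 5 bytes would end mid-中
exact = truncate_utf8("hi 中👋", 6)   # 6 bytes ends exactly after 中
```

Truncating by code points or bytes still ignores display width and combined emoji sequences, so user-facing limits usually need a dedicated text-segmentation library.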

In a Transformer: this matters more for LLMs than for most other ML systems, because LLMs process text. Many LLM tokenizers do NOT operate on Unicode characters; they operate on UTF-8 bytes. Byte-level BPE, used in GPT-style models, starts from the raw byte stream and learns to merge frequently co-occurring byte sequences into bigger units. SentencePiece-based tokenizers typically work on Unicode characters and can fall back to raw bytes for characters outside their vocabulary. That's why a single emoji like 👋 often takes multiple tokens: its 4 UTF-8 bytes may not all merge into one piece. It's why non-English text often costs more tokens per character than English. And it's why "the model has a 128k context window" means "128,000 tokens, which is roughly 80,000 English words but maybe only 40,000 Chinese characters." Much of what is confusing about LLM token counts traces back to UTF-8 sitting one layer below the tokenizer.
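To make "merge frequently co-occurring byte sequences" concrete, here is a toy sketch of a single byte-level BPE merge step. It is a simplification for illustration: the three-word corpus is made up, real tokenizers run many thousands of merges, and production implementations differ in details such as pre-tokenization.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Most common adjacent pair of token IDs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with the single ID `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = ["low", "lower", "lowest"]                      # made-up training text
sequences = [list(word.encode("utf-8")) for word in corpus]

pair = most_frequent_pair(sequences)
NEW_ID = 256                                             # first ID past the byte range 0..255
merged = [merge_pair(seq, pair, NEW_ID) for seq in sequences]

# With no merges learned for it, an emoji costs one token per UTF-8 byte.
emoji_tokens = list("👋".encode("utf-8"))
print(pair, merged, len(emoji_tokens))
```

Each merge shortens sequences that contain the chosen pair. Text that is rare in the training corpus earns few merges, which is why it ends up costing more tokens.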

Preprocessing

The quiet steps between raw data and the tensor the model actually sees.

Real-world data is never ready to train on. Some rows have missing fields. Some have wildly different units. Some have typos. Preprocessing is the catch-all term for the steps you run between "raw data" and "tensor of numbers fed to the optimizer." Most of it is unglamorous bookkeeping, but the quality of your preprocessing usually matters more than the choice of model architecture.

Two families of preprocessing show up over and over: cleaning (deal with the things that are wrong) and standardization (deal with the things that are right but inconveniently scaled).

Cleaning covers the data-hygiene checklist. Skip these and the model learns garbage:

Missing values. A row with age = NaN can't be fed to anything that expects a number. Options: drop the row, fill with the mean / median of the column, or treat "missing" as a category of its own. Each has tradeoffs; the worst choice is "pretend it isn't there."
Duplicates. Two identical rows skew the gradient toward whatever they encode. De-duplicate before splitting — and especially before training on internet text, where the same paragraph is republished thousands of times.
Outliers. A house listed at $1, a temperature of 999°C, an age of 300. Usually a data-entry mistake; sometimes a real but rare event. Either way, a single extreme row can dominate the gradient. Cap, clip, or remove — but document what you did.
Inconsistent formats. "USA" vs "United States" vs "U.S." for the same country. 2024-01-31 vs 31/01/2024 for the same date. The model sees three different strings; you wanted one. Normalize early.
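The checklist above can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the alias table, the cap of 120 years, the choice of median fill, and the order of the steps.

```python
import math
import statistics

raw_rows = [
    {"sqft": 850,  "age": 12.0,         "country": "USA"},
    {"sqft": 1450, "age": float("nan"), "country": "United States"},
    {"sqft": 2100, "age": 300.0,        "country": "U.S."},
    {"sqft": 850,  "age": 12.0,         "country": "USA"},   # duplicate of row 1
    {"sqft": 600,  "age": 8.0,          "country": "U.S."},
]

COUNTRY_ALIASES = {"USA": "US", "United States": "US", "U.S.": "US"}  # hypothetical
MAX_AGE = 120.0                                                       # hypothetical cap

def clean(rows):
    # 1. Inconsistent formats: map every spelling to one canonical value.
    rows = [{**r, "country": COUNTRY_ALIASES.get(r["country"], r["country"])}
            for r in rows]

    # 2. Duplicates: keep the first copy of each identical row.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Missing values: fill NaN ages with the median of the observed ages.
    observed = [r["age"] for r in unique if not math.isnan(r["age"])]
    median_age = statistics.median(observed)
    filled = [{**r, "age": median_age if math.isnan(r["age"]) else r["age"]}
              for r in unique]

    # 4. Outliers: clip implausible ages to the cap.
    return [{**r, "age": min(r["age"], MAX_AGE)} for r in filled]

cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

The median is used for the fill because, unlike the mean, it is not dragged upward by the age-300 outlier.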

Standardization handles the scale problem. Suppose your features include a house's square footage (range: 600–4000) and its age (range: 0–80). The numbers live on completely different scales. A neural network can technically learn this, but gradient descent has a much easier time when every feature lives in a roughly comparable range. The two recipes you'll see everywhere:

Z-score (standardization). Subtract the column mean, divide by the column standard deviation: x' = (x − μ) / σ. Every column ends up with mean 0 and standard deviation 1. The dominant choice for most ML.
Min-max (normalization). Squash every column into [0, 1]: x' = (x − min) / (max − min). Useful when you need bounded inputs, like for an image whose pixel values you want in [0, 1].
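Both recipes are one-liners once μ, σ, min, and max are known. The sketch below uses the standard library and assumes the population standard deviation; libraries differ on that choice, and a constant column (σ = 0 or max = min) would need special handling.

```python
import statistics

def z_score(column):
    """x' = (x - mu) / sigma for every value in the column."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)   # population standard deviation (assumed)
    return [(x - mu) / sigma for x in column]

def min_max(column):
    """x' = (x - min) / (max - min), which lands in [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

sqft = [850, 1450, 2100, 3200, 600]
sqft_z = z_score(sqft)
sqft_01 = min_max(sqft)
print([round(v, 2) for v in sqft_z])
print([round(v, 2) for v in sqft_01])
```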

Concrete example. House ages and square footages side by side, before and after z-scoring:

  raw                       standardized (z-score)
  ─────────────             ──────────────────────
       age    sqft               age     sqft
       ────   ──────            ──────  ──────
        12      850              0.01    -0.88
         8     1450             -0.56    -0.21
        15     2100              0.44     0.51
         3     3200             -1.27     1.73
        22      600              1.44    -1.16
       ...     ...               ...      ...

  μ    11.9    1640              ≈ 0      ≈ 0
  σ     7.0     900              ≈ 1      ≈ 1

Raw values live on wildly different scales. Z-score brings every column onto a shared, comparable range.

Two rules about standardization that everyone learns the hard way:

Fit on train, apply to all. Compute μ and σ from the training set only, then use those same numbers to transform train, validation, and test. Re-fitting on the test set means letting test data inform your preprocessing, which is the same leak as Section 3's warning, dressed up in a normalization disguise.
Same recipe at serving time. Whatever μ and σ you used during training have to be applied to every input at deployment too. Train on z-scored data, serve raw data, and the model effectively sees garbage — one of the more humiliating production bugs you can ship.
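Both rules fall out naturally if μ and σ live in an object that is fitted once and reused. A minimal sketch follows; the `Standardizer` class is hypothetical (it mirrors the fit/transform pattern common in ML libraries), and the held-out ages are made up.

```python
import statistics

class Standardizer:
    """Learns mu and sigma from one column, then applies them anywhere."""

    def fit(self, column):
        self.mu = statistics.fmean(column)
        self.sigma = statistics.pstdev(column)
        return self

    def transform(self, column):
        return [(x - self.mu) / self.sigma for x in column]

train_age = [12, 8, 15, 3, 22]
test_age = [30, 5]                          # hypothetical held-out rows

scaler = Standardizer().fit(train_age)      # fit on train only
train_z = scaler.transform(train_age)
test_z = scaler.transform(test_age)         # reuse the training mu and sigma

# At serving time the same fitted object transforms each incoming value.
served_z = scaler.transform([40])
```

The test column is never passed to `fit`, so nothing about it can leak into the preprocessing. Its z-scores will generally not have mean 0 and standard deviation 1, and that is expected.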

Images and audio get their own variants:

Images. Divide by 255 to push pixels into [0, 1], then subtract the per-channel mean (often ImageNet's familiar (0.485, 0.456, 0.406)) and divide by the per-channel standard deviation. Resize / crop / flip / color-jitter as data augmentation — synthetic extra rows that teach the model to be invariant to nuisances.
Audio. Convert raw waveform → spectrogram, often log-scale the magnitudes, then z-score. Same shape as images at that point — a 2-D tensor — and the same preprocessing logic applies.
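The image recipe, sketched on plain nested lists so it runs without any imaging library. The mean values come from the text above; the standard deviations (0.229, 0.224, 0.225) are the values commonly paired with them and are added here as an assumption. Real pipelines would use array libraries rather than Python lists.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)    # commonly paired values; not from the text

def normalize_pixel(rgb):
    """Scale 0..255 to [0, 1], then subtract the mean and divide by the std per channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

def normalize_image(image):
    """image is a nested list [height][width] of (r, g, b) tuples in 0..255."""
    return [[normalize_pixel(px) for px in row] for row in image]

def horizontal_flip(image):
    """A simple augmentation: mirror each row left to right."""
    return [list(reversed(row)) for row in image]

image = [[(255, 255, 255), (0, 0, 0)]]   # a 1x2 image: one white pixel, one black
normalized = normalize_image(image)
flipped = horizontal_flip(image)
```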

In a Transformer: a modern LLM's preprocessing pipeline runs in three stages. (1) Filtering and de-duplication: throw away spam, broken HTML, repeated text, and low-quality documents. This step alone can change downstream quality more than doubling the model size. (2) Tokenization: take a UTF-8 byte stream (Section 4) and slice it into integer token IDs via BPE or SentencePiece. (3) Sequence packing: concatenate tokenized documents into fixed-length sequences so the GPU never sees a half-empty batch. Standardization in the z-score sense doesn't apply to integer token IDs, because the model has its own embedding layer to put each token into a continuous vector. Inside the model, though, weights get careful initialization and activations get per-layer normalization (LayerNorm, RMSNorm) so the gradients stay tame, an idea closely analogous to the column-by-column standardization above.
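Stage (3) is the least familiar of the three, so here is a rough sketch of sequence packing. The end-of-document token ID, the toy token IDs, and the decision to drop the incomplete final chunk are all assumptions for illustration; real pipelines vary on each.

```python
EOS = 0   # hypothetical end-of-document token ID

def pack(documents, seq_len):
    """Concatenate tokenized documents, then cut into full fixed-length sequences."""
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(EOS)          # mark where one document ends
    n_full = len(stream) // seq_len
    # Any leftover tokens that do not fill a whole sequence are dropped here.
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]   # made-up token IDs
packed = pack(docs, seq_len=4)
print(packed)
```

Every packed sequence is exactly `seq_len` tokens long, so no batch position is wasted on padding. The cost is that a single sequence can contain the end of one document and the start of another.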