WebGraphMix — Pretraining Data Selection via Web Graph Centrality

01 — Abstract

Modern language model performance depends critically on pretraining data composition, yet existing selection methods lean on auxiliary classifiers or mixture optimization, adding compute and a dependence on labeled data. We propose WebGraphMix, a lightweight graph-based framework that computes structural centrality scores over the Common Crawl host-level web graph and varies the proportion of central versus peripheral documents in the pretraining mixture. Central hosts expose models to reusable abstractions; peripheral hosts encode specialized, long-tail knowledge. We integrate WebGraphMix into the DataComp-LM pipeline, train models at the 400M and 1B scales, and evaluate on 23 tasks. At 1B, a 1:1 central–peripheral mixture reaches 41.4% average accuracy versus 39.8% for uniform sampling, and combining structural scores with a quality classifier lifts this to 43.8%. Web graph topology is therefore a meaningful curation axis, largely orthogonal to content-based approaches.

02 — The Hypothesis

A document's position in the web graph shapes what a model learns from it.

The web is fundamentally a graph: hosts link to one another, encoding topical structure and information flow. Where a document sits in that network turns out to predict the kind of capability it contributes during pretraining.

Top-K · Central

Hubs & Bridges

Highly connected hosts (think major reference and platform sites) lie on many shortest paths and co-occur with heterogeneous contexts — surfacing broadly reusable, procedural patterns.

↗ Strongest gains on Symbolic & Algorithmic Reasoning (+1.4%)

Bottom-K · Peripheral

The Long Tail

Sparsely linked hosts — small organizations, niche communities, non-English content — carry specialized, long-tail knowledge that the dense core of the web simply doesn't repeat.

↗ Improves Science & Factual Knowledge and Commonsense

03 — How WebGraphMix Works

Three steps. One reusable preprocessing pass.

Centrality is computed once over the public host graph using GPU-parallel graph algorithms (cuGraph), and every document then inherits the score of its host. Katz takes under 3 hours on one H100; betweenness takes under 6 hours on four H100s.
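The inheritance step can be pictured as a plain lookup from a document's URL to its host's score. The sketch below is illustrative only: the `host_centrality` values, the helper names, and the choice of a 0.0 default for unseen hosts are all assumptions, and the key format used by the real host graph files is not covered here.

```python
from urllib.parse import urlsplit

# Hypothetical host-level scores; real values would come from the graph pass.
host_centrality = {"example.org": 0.82, "tiny-club.example": 0.03}


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def document_score(url: str, scores: dict, default: float = 0.0) -> float:
    """A document inherits the centrality score of the host that serves it.

    Hosts missing from the score table fall back to `default` (an assumption).
    """
    return scores.get(host_of(url), default)


docs = [
    "https://example.org/wiki/Graph_theory",
    "https://EXAMPLE.org/about",
    "http://tiny-club.example/newsletter/2019.html",
    "https://unknown.example/page",
]
for url in docs:
    print(url, document_score(url, host_centrality))
```

Because the score lives on the host, the expensive graph computation does not have to be repeated when the document pool changes.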

STEP 01

Score the graph

Build the directed host-level graph and compute Betweenness (cross-community bridging) and Katz (recursive influence) centrality for all 13.9M hosts.
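To make the two measures concrete, here is a minimal CPU-only sketch on a five-host toy graph, written with the Python standard library rather than cuGraph. It is not the pipeline used for the 13.9M-host graph: the toy edges, the Katz parameters `alpha` and `beta`, and the unnormalized betweenness are assumptions chosen for readability. Betweenness uses Brandes' algorithm for unweighted directed graphs; Katz uses fixed-point iteration over incoming links.

```python
from collections import deque


def katz_centrality(nodes, edges, alpha=0.1, beta=1.0, iters=100, tol=1e-9):
    """Iterate x[n] = alpha * sum(x[p] for p linking to n) + beta."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    x = {n: 0.0 for n in nodes}
    for _ in range(iters):
        new = {n: alpha * sum(x[p] for p in preds[n]) + beta for n in nodes}
        converged = max(abs(new[n] - x[n]) for n in nodes) < tol
        x = new
        if converged:
            break
    return x


def betweenness_centrality(nodes, edges):
    """Unnormalized betweenness via Brandes' algorithm (directed, unweighted)."""
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
    bc = {n: 0.0 for n in nodes}
    for s in nodes:
        sigma = {n: 0 for n in nodes}   # number of shortest paths from s
        dist = {n: -1 for n in nodes}   # BFS distance from s
        preds = {n: [] for n in nodes}  # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):       # accumulate dependencies back to s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Toy graph: hosts a and b reach hosts c and d only through "hub".
nodes = ["a", "b", "hub", "c", "d"]
edges = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("hub", "d"), ("a", "b")]

katz = katz_centrality(nodes, edges)
betw = betweenness_centrality(nodes, edges)
print(sorted(betw.items(), key=lambda kv: -kv[1]))
print(sorted(katz.items(), key=lambda kv: -kv[1]))
```

In this toy graph `hub` sits on every shortest path between the two sides, so it gets the only nonzero betweenness, and it also receives the most incoming influence under Katz.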

STEP 02

Sample by position

Draw documents from the Top-K (central) hosts, the Bottom-K (peripheral) hosts, or a mix of the two at an α : (100−α) ratio, all under a fixed token budget.
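One way to read this step is as a budgeted greedy selection from both ends of the centrality ranking. The sketch below assumes the α : (100−α) ratio is applied to tokens (the page does not say whether it is tokens or documents); the function name, the tuple layout, and the stop-at-first-overflow rule are illustrative choices, not the authors' implementation.

```python
def select_documents(docs, token_budget, alpha):
    """Pick document ids under a token budget.

    docs: list of (doc_id, centrality, n_tokens) tuples.
    alpha: percent of the budget spent on the most central documents;
           the remaining (100 - alpha) percent goes to the least central.
    alpha=100 gives pure Top-K, alpha=0 gives pure Bottom-K.
    """
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    central_budget = token_budget * alpha / 100
    peripheral_budget = token_budget - central_budget
    chosen, used = [], set()

    def fill(candidates, budget):
        spent = 0
        for doc_id, _score, n_tokens in candidates:
            if doc_id in used:
                continue          # already taken from the other end
            if spent + n_tokens > budget:
                break             # stop once this side's budget is exhausted
            chosen.append(doc_id)
            used.add(doc_id)
            spent += n_tokens

    fill(ranked, central_budget)               # most central first
    fill(reversed(ranked), peripheral_budget)  # least central first
    return chosen


# Six toy documents of 100 tokens each, ordered from most to least central.
toy_docs = [("d1", 0.9, 100), ("d2", 0.8, 100), ("d3", 0.6, 100),
            ("d4", 0.3, 100), ("d5", 0.2, 100), ("d6", 0.1, 100)]

print(select_documents(toy_docs, token_budget=400, alpha=50))
print(select_documents(toy_docs, token_budget=400, alpha=100))
print(select_documents(toy_docs, token_budget=400, alpha=0))
```

Keeping the budget fixed across settings means any accuracy difference comes from which documents were chosen, not from how many tokens were seen.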

STEP 03

Optionally fuse quality

Normalize centrality and combine it with DCLM-fasttext quality scores. Multiplicative and divisive combinations give the sharpest, most complementary signal.
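The page does not spell out the fusion formula, so the following is only one plausible reading: min-max normalize both scores, then multiply (favoring high-quality central documents) or divide quality by centrality (favoring high-quality peripheral documents). The normalization, the `eps` floor that avoids division by zero, and the function names are all assumptions.

```python
def min_max(values, eps=1e-6):
    """Rescale values to roughly [eps, 1]; eps keeps later division safe."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid 0/0 when all values are equal
    return [eps + (1 - eps) * (v - lo) / span for v in values]


def fuse(quality, centrality, mode):
    """Combine per-document quality and centrality into one ranking score."""
    q, c = min_max(quality), min_max(centrality)
    if mode == "multiply":   # assumed: rewards quality AND centrality
        return [qi * ci for qi, ci in zip(q, c)]
    if mode == "divide":     # assumed: rewards quality, penalizes centrality
        return [qi / ci for qi, ci in zip(q, c)]
    raise ValueError(f"unknown mode: {mode}")


# Four toy documents: (high quality, central), (high quality, peripheral),
# (low quality, central), (low quality, peripheral).
quality = [0.9, 0.9, 0.2, 0.2]
centrality = [10.0, 1.0, 10.0, 1.0]

mult = fuse(quality, centrality, "multiply")
div = fuse(quality, centrality, "divide")
print("multiply favors doc", mult.index(max(mult)))
print("divide favors doc", div.index(max(div)))
```

Under this reading the two modes pull in opposite structural directions while both still reward quality, which is one way the combined signal could be complementary to a quality-only ranking.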

04 — Main Results

Mixing beats either extreme — and stacks on top of quality.

DCLM CORE v2 accuracy at 1B scale (1.4B params, 28B tokens), averaged by capability category. WebGraphMix uses betweenness with a 50/50 Top-K/Bottom-K mixture; WebGraphMix+ additionally fuses the DCLM-fasttext quality score.

Table 1 — Per-category accuracy on the 23-task DCLM CORE v2 benchmark at 1B scale. WebGraphMix+ matches taxonomy-based WebOrganizer+ while requiring no proxy training, labeled targets, or benchmark-specific tuning.
Method	Commonsense	Comprehension	Knowledge	Reasoning	Language	Avg
Random	57.3	37.9	34.2	19.0	39.9	39.8
Quality	59.8	38.1	38.9	20.7	42.8	42.3
WebOrganizer	59.6	39.2	38.0	22.5	38.3	42.1
WebOrganizer+	61.9	41.4	39.1	21.9	38.8	43.4
PageRank	56.9	37.4	34.8	19.3	38.1	39.6
WebGraphMix	59.5	39.4	35.4	21.4	40.2	41.4
WebGraphMix+	60.8	42.6	39.7	22.6	41.9	43.8

05 — What We Found

Structural position is capability-dependent.

Complementary

Central and peripheral regions encode different knowledge. Bottom-K sampling consistently lifts factual and commonsense knowledge, while central hosts contribute more structured, procedural signal for symbolic reasoning.

Inverted-U

A roughly balanced mixture is best. For betweenness, the 50/50 split (41.4%) beats both 25/75 (40.5%) and 75/25 (41.0%): letting either central or peripheral documents dominate hurts.

Orthogonal

Centrality adds value on top of quality. All 18 quality-combination configurations exceed the quality-only baseline; the best (Multiply Betweenness 50% Top) reaches 43.8%, which is +1.5 points over quality alone and +4.0 points over random.
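The quoted lifts can be rechecked directly from the Avg column of Table 1. The snippet below only copies those published averages and subtracts them; the `lift` helper is a hypothetical convenience, and differences are in percentage points.

```python
# Avg column of Table 1 (1B scale), copied as published.
avg = {
    "Random": 39.8,
    "Quality": 42.3,
    "WebOrganizer": 42.1,
    "WebOrganizer+": 43.4,
    "PageRank": 39.6,
    "WebGraphMix": 41.4,
    "WebGraphMix+": 43.8,
}


def lift(method, baseline):
    """Difference in average accuracy, in percentage points (1 decimal)."""
    return round(avg[method] - avg[baseline], 1)


print("WebGraphMix+ vs Quality:", lift("WebGraphMix+", "Quality"))
print("WebGraphMix+ vs Random: ", lift("WebGraphMix+", "Random"))
print("WebGraphMix  vs Random: ", lift("WebGraphMix", "Random"))
```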

Scales up

The benefit grows with model size. The best-mixture gain rises from +0.1 points at 400M to +1.6 points at 1B. This is consistent with scaling behavior in other data-selection work and suggests the gain may keep growing at larger scales.

06 — Citation

Cite this work

@article{badoni2025webgraphmix,
  title   = {Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author  = {Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal = {arXiv preprint},
  year    = {2025}
}