Hubs or Fringes?

Pretraining data selection via web graph centrality: no classifiers, no labels, no proxy training.

Vedant Badoni, Danqi Chen, Xinyi Wang
Princeton Language and Intelligence
39.8 → 43.8
Avg accuracy: uniform → ours+quality (1B, 23 tasks)
13.9M
Host nodes & 439.6M edges in the web graph
< 9
GPU-hours to compute all centrality scores, once
0
Classifiers, labels, proxy models or taxonomies
01 · Abstract
Modern language model performance depends critically on pretraining data composition, yet existing selection methods lean on auxiliary classifiers or mixture optimization, adding compute and a dependence on labeled data. We propose WebGraphMix, a lightweight graph-based framework that computes structural centrality scores over the Common Crawl host-level web graph and varies the proportion of central versus peripheral documents in the pretraining mixture. Central hosts expose models to reusable abstractions; peripheral hosts encode specialized, long-tail knowledge. When integrated into the DataComp-LM pipeline and evaluated at 400M and 1B scales across 23 tasks, a 1:1 central–peripheral mixture reaches 41.4% average accuracy versus 39.8% for uniform sampling, and combining structural scores with a quality classifier lifts this to 43.8%. This shows that web graph topology is a meaningful curation axis, largely orthogonal to content-based approaches.
02 · The Hypothesis

A document's position in the web graph shapes what a model learns from it.

The web is fundamentally a graph: hosts link to one another, encoding topical structure and information flow. Where a document sits in that network turns out to predict the kind of capability it contributes during pretraining.

Top-K · Central

Hubs & Bridges

Highly connected hosts (think major reference and platform sites) lie on many shortest paths and co-occur with heterogeneous contexts, surfacing broadly reusable, procedural patterns.

↗ Strongest gains on Symbolic & Algorithmic Reasoning (+1.4%)
Bottom-K · Peripheral

The Long Tail

Sparsely linked hosts (small organizations, niche communities, non-English content) carry specialized, long-tail knowledge that the dense core of the web simply doesn't repeat.

↗ Improves Science & Factual Knowledge and Commonsense
03 · How WebGraphMix Works

Three steps. One reusable preprocessing pass.

Centrality is computed once over the public host graph using GPU-parallel graph algorithms (cuGraph), then inherited by every document from that host. Katz centrality takes under 3 hours on one H100; betweenness takes under 6 hours on four.

STEP 01

Score the graph

Build the directed host-level graph and compute Betweenness (cross-community bridging) and Katz (recursive influence) centrality for all 13.9M hosts.
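
To make the two scores concrete, here is a minimal pure-Python sketch on a five-host toy graph. It is not the authors' pipeline (the page reports using cuGraph on the full graph); the host names, the attenuation factor `alpha=0.1`, and the base score `beta=1.0` are illustrative assumptions. Katz is computed by fixed-point iteration over incoming links, and betweenness with Brandes' algorithm for unweighted directed graphs.

```python
from collections import deque


def katz_centrality(adj, alpha=0.1, beta=1.0, iters=100, tol=1e-9):
    """Iterate x[v] = beta + alpha * sum(x[u] for every edge u -> v)."""
    x = {v: 0.0 for v in adj}
    for _ in range(iters):
        new = {v: beta for v in adj}
        for u, outs in adj.items():
            for v in outs:
                new[v] += alpha * x[u]
        delta = sum(abs(new[v] - x[v]) for v in adj)
        x = new
        if delta < tol:
            break
    return x


def betweenness_centrality(adj):
    """Unnormalized betweenness on an unweighted directed graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        dep = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += sigma[v] / sigma[w] * (1.0 + dep[w])
            if w != s:
                bc[w] += dep[w]
    return bc


# Hypothetical toy graph: two source hosts reach two sink hosts only via a hub.
toy = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "hub.example": ["c.example", "d.example"],
    "c.example": [],
    "d.example": [],
}

betweenness = betweenness_centrality(toy)
katz = katz_centrality(toy)
print(betweenness["hub.example"], round(katz["hub.example"], 3))
```

In this toy graph the hub is the only host that sits on any shortest path between other hosts, so it alone gets a nonzero betweenness score, while Katz also rewards the sink hosts for being linked from the hub.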

STEP 02

Sample by position

Draw documents as Top-K (central), Bottom-K (peripheral), or a Mixed α : (100−α) ratio of the two, under a fixed token budget.
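
One way to picture this step is the sketch below. The page does not say how documents are chosen within the Top-K or Bottom-K pools, so the greedy fill from each end of the ranking is an assumption, as are the function names, the `url` and `n_tokens` fields, and the example hosts. Each document inherits its host's score; `alpha=100` gives pure Top-K and `alpha=0` pure Bottom-K.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a URL; centrality is attached at this level."""
    return (urlsplit(url).hostname or "").lower()


def select_documents(docs, host_score, alpha, token_budget):
    """Spend alpha% of the budget on the most central docs, the rest on the least central."""
    # Every document inherits the score of its host; unscored hosts are skipped.
    scored = [d for d in docs if host_of(d["url"]) in host_score]
    ranked = sorted(scored, key=lambda d: host_score[host_of(d["url"])], reverse=True)

    top_budget = token_budget * alpha / 100.0
    bottom_budget = token_budget - top_budget
    chosen, used = [], set()

    def fill(indices, budget):
        # Greedily take documents in the given order until the budget is hit.
        spent = 0
        for i in indices:
            if i in used:
                continue
            n = ranked[i]["n_tokens"]
            if spent + n > budget:
                break
            used.add(i)
            chosen.append(ranked[i])
            spent += n

    fill(range(len(ranked)), top_budget)                  # most central first
    fill(range(len(ranked) - 1, -1, -1), bottom_budget)   # least central first
    return chosen


# Hypothetical scores and documents for illustration only.
host_score = {"hub.example": 4.0, "mid.example": 1.0, "tail.example": 0.0}
docs = [
    {"url": "https://hub.example/a", "n_tokens": 100},
    {"url": "https://hub.example/b", "n_tokens": 100},
    {"url": "https://mid.example/a", "n_tokens": 100},
    {"url": "https://tail.example/a", "n_tokens": 100},
    {"url": "https://tail.example/b", "n_tokens": 100},
    {"url": "https://unknown.example/a", "n_tokens": 100},  # no score, skipped
]

mixed = select_documents(docs, host_score, alpha=50, token_budget=400)
print([d["url"] for d in mixed])
```

With a 400-token budget and `alpha=50`, the sketch returns the two hub documents and the two tail documents and leaves the mid-ranked host out, which is the behavior a 50/50 central/peripheral split implies.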

STEP 03

Optionally fuse quality

Normalize and combine centrality with DCLM-fasttext quality scores; multiplicative and divisive combinations give the sharpest, most complementary signal.
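
The fusion step might look like the following sketch. The page names multiplicative and divisive combination but not the exact normalization, so the min-max scaling, the `eps` floor that keeps division safe, and the function names are all assumptions. Multiplying favors documents that are both high-quality and central; dividing favors documents that are high-quality and peripheral.

```python
def min_max(values, eps=1e-6):
    """Rescale values to [eps, 1]; the eps floor avoids division by zero later."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # constant input maps every value to eps
    return [eps + (1.0 - eps) * (v - lo) / span for v in values]


def fuse(quality, centrality, mode="multiply"):
    """Combine per-document quality and centrality into one ranking score."""
    q = min_max(quality)
    c = min_max(centrality)
    if mode == "multiply":
        return [qi * ci for qi, ci in zip(q, c)]
    if mode == "divide":
        return [qi / ci for qi, ci in zip(q, c)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical scores for three documents:
# 0 = high quality, central; 1 = high quality, peripheral; 2 = low quality, central.
quality = [0.9, 0.9, 0.2]
centrality = [10.0, 0.0, 10.0]

multiplied = fuse(quality, centrality, mode="multiply")
divided = fuse(quality, centrality, mode="divide")
best_multiplied = max(range(3), key=multiplied.__getitem__)
best_divided = max(range(3), key=divided.__getitem__)
print(best_multiplied, best_divided)
```

The fused score can then drive the same top/bottom selection as the centrality score alone, which is one reading of configurations such as "Multiply Betweenness 50% Top".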

04 · Main Results

Mixing beats either extreme, and stacks on top of quality.

DCLM CORE v2 accuracy at 1B scale (1.4B params, 28B tokens), averaged by capability category. WebGraphMix uses betweenness with a 50/50 Top-K/Bottom-K mixture; WebGraphMix+ additionally fuses the DCLM-fasttext quality score.

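| Method        | Commonsense | Comprehension | Knowledge | Reasoning | Language | Avg  |
|---------------|-------------|---------------|-----------|-----------|----------|------|
| Random        | 57.3        | 37.9          | 34.2      | 19.0      | 39.9     | 39.8 |
| Quality       | 59.8        | 38.1          | 38.9      | 20.7      | 42.8     | 42.3 |
| WebOrganizer  | 59.6        | 39.2          | 38.0      | 22.5      | 38.3     | 42.1 |
| WebOrganizer+ | 61.9        | 41.4          | 39.1      | 21.9      | 38.8     | 43.4 |
| PageRank      | 56.9        | 37.4          | 34.8      | 19.3      | 38.1     | 39.6 |
| WebGraphMix   | 59.5        | 39.4          | 35.4      | 21.4      | 40.2     | 41.4 |
| WebGraphMix+  | 60.8        | 42.6          | 39.7      | 22.6      | 41.9     | 43.8 |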
Table 1: Per-category accuracy on the 23-task DCLM CORE v2 benchmark at 1B scale. WebGraphMix+ matches taxonomy-based WebOrganizer+ while requiring no proxy training, labeled targets, or benchmark-specific tuning.
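
The headline gains quoted on this page can be checked against the Avg column of Table 1. The short sketch below does that arithmetic; the differences are absolute, in percentage points, and the helper name `gain` is only illustrative.

```python
# Avg column of Table 1 (1B scale, 23 tasks), copied as reported.
avg = {
    "Random": 39.8,
    "Quality": 42.3,
    "WebOrganizer": 42.1,
    "WebOrganizer+": 43.4,
    "PageRank": 39.6,
    "WebGraphMix": 41.4,
    "WebGraphMix+": 43.8,
}


def gain(method, baseline):
    """Absolute difference in average accuracy, in percentage points."""
    return round(avg[method] - avg[baseline], 1)


print(gain("WebGraphMix", "Random"))          # structure-only mixture vs. uniform
print(gain("WebGraphMix+", "Quality"))        # fused score vs. quality alone
print(gain("WebGraphMix+", "Random"))         # fused score vs. uniform
print(gain("WebGraphMix+", "WebOrganizer+"))  # fused score vs. taxonomy-based method
```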
05 · What We Found

Structural position is capability-dependent.

Complementary

Central and peripheral regions encode different knowledge. Bottom-K sampling consistently lifts factual and commonsense knowledge, while central hosts contribute more structured, procedural signal for symbolic reasoning.

Inverted-U

A roughly balanced mixture is best. For betweenness, the 50/50 split (41.4%) beats both 25/75 (40.5%) and 75/25 (41.0%), so neither central nor peripheral documents should dominate.

Orthogonal

Centrality adds value on top of quality. All 18 quality-combination configurations exceed the quality-only baseline; the best (Multiply Betweenness 50% Top) reaches 43.8%, a lift of +1.5 points over quality alone (42.3%) and +4.0 points over random (39.8%).

Scales up

The benefit grows with model size. The best-mixture gain rises from +0.1 points at 400M to +1.6 points at 1B, consistent with scaling behavior in other data-selection work, which suggests that gains may grow further at larger scales.

06 · Citation

Cite this work

@article{badoni2025webgraphmix,
  title   = {Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author  = {Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal = {arXiv preprint},
  year    = {2025}
}