Pretraining data selection via web graph centrality: no classifiers, no labels, no proxy training.
The web is fundamentally a graph: hosts link to one another, encoding topical structure and information flow. Where a document sits in that network turns out to predict the kind of capability it contributes during pretraining.
Highly connected hosts (think major reference and platform sites) lie on many shortest paths and co-occur with heterogeneous contexts, surfacing broadly reusable, procedural patterns.
Sparsely linked hosts (small organizations, niche communities, non-English content) carry specialized, long-tail knowledge that the dense core of the web simply doesn't repeat.
Centrality is computed once over the public host graph using GPU-parallel graph algorithms (cuGraph), then inherited by every document from that host. Katz takes under 3 hours on one H100; betweenness takes under 6 hours on four H100s.
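As a rough illustration of the inheritance step, the sketch below attaches a precomputed host score to each document by parsing the host from its URL. The names `host_of` and `inherit_centrality`, the `url` field, and the default score for hosts missing from the graph are assumptions made for this example, not details taken from the pipeline.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the lowercased host of a URL, or '' if none can be parsed."""
    return (urlsplit(url).hostname or "").lower()

def inherit_centrality(docs, host_scores, default=0.0):
    """Copy each document and attach the centrality of its host.

    `default` is used for hosts absent from the graph (an assumption here).
    """
    return [
        {**doc, "centrality": host_scores.get(host_of(doc["url"]), default)}
        for doc in docs
    ]

# Hypothetical host scores and documents, for illustration only.
host_scores = {"example.org": 0.7}
docs = [{"url": "https://Example.org/a"}, {"url": "http://unknown.net/x"}]
for doc in inherit_centrality(docs, host_scores):
    print(doc["url"], doc["centrality"])
```

Because the score lives at the host level, the lookup is a single dictionary access per document, so the expensive graph computation never has to be repeated per page.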
Build the directed host-level graph and compute Betweenness (cross-community bridging) and Katz (recursive influence) centrality for all 13.9M hosts.
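To make the two measures concrete, here is a minimal pure-Python sketch on a toy directed graph. It is not the cuGraph code used at scale: the host names, the edge list, and the `alpha` and `beta` values are invented for illustration, Katz is computed by plain fixed-point iteration, and betweenness uses Brandes' algorithm without normalization.

```python
from collections import deque

def katz_centrality(edges, alpha=0.1, beta=1.0, iters=100, tol=1e-9):
    """Iterate x_i = alpha * sum of x_j over in-links (j -> i) + beta.

    Converges only when alpha is below 1 / (largest eigenvalue of the graph).
    """
    nodes = sorted({n for edge in edges for n in edge})
    in_nbrs = {n: [] for n in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)
    x = {n: beta for n in nodes}
    for _ in range(iters):
        new = {n: alpha * sum(x[j] for j in in_nbrs[n]) + beta for n in nodes}
        if max(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    return x

def betweenness_centrality(edges):
    """Unnormalized betweenness for an unweighted directed graph (Brandes)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_nbrs = {n: [] for n in nodes}
    for src, dst in edges:
        out_nbrs[src].append(dst)
    bc = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in out_nbrs[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {n: 0.0 for n in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Toy host graph: two hosts link to a hub, which links onward.
edges = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("hub", "d"), ("c", "niche")]
print(katz_centrality(edges))
print(betweenness_centrality(edges))
```

In this toy graph `hub` receives the highest score under both measures, while `niche` and the leaf hosts sit at the periphery with zero betweenness.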
Draw documents Top-K (central), Bottom-K (peripheral), or a Mixed α : (100−α) ratio under a fixed token budget.
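One way to read the sampling step is as a greedy fill of two token budgets. The sketch below is an assumption about the mechanics, not the authors' implementation: the function names and the `centrality` and `tokens` fields are hypothetical, `alpha` is the percentage of the budget given to the most central documents, and ties and partial documents are ignored.

```python
def take(order, docs, budget, taken):
    """Greedily pick documents in `order` until the next one would exceed `budget`."""
    picked, used = [], 0
    for i in order:
        if i in taken:
            continue
        if used + docs[i]["tokens"] > budget:
            break
        picked.append(i)
        taken.add(i)
        used += docs[i]["tokens"]
    return picked

def select_mixed(docs, token_budget, alpha):
    """Spend alpha% of the budget on top-centrality docs and the rest on bottom ones.

    alpha=100 reduces to Top-K and alpha=0 to Bottom-K.
    """
    order = sorted(range(len(docs)), key=lambda i: docs[i]["centrality"], reverse=True)
    taken = set()
    top = take(order, docs, token_budget * alpha / 100, taken)
    bottom = take(order[::-1], docs, token_budget * (100 - alpha) / 100, taken)
    return [docs[i] for i in top + bottom]

# Hypothetical corpus: six documents of 10 tokens each.
docs = [{"centrality": c, "tokens": 10} for c in range(1, 7)]
mixed = select_mixed(docs, token_budget=40, alpha=50)
print([d["centrality"] for d in mixed])
```

With a 40-token budget and `alpha=50`, this picks the two most central and the two most peripheral documents and skips the middle of the ranking.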
Normalize and combine centrality with DCLM-fasttext quality scores. Multiplicative and divisive combinations give the sharpest, most complementary signal.
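A minimal sketch of what such a combination could look like follows. The min-max normalization, the small `eps` floor that keeps division well defined, and the reading of "multiply" as favoring central documents and "divide" as favoring peripheral ones are all assumptions for illustration; the source does not spell out the exact formulas.

```python
def min_max(values, eps=1e-6):
    """Rescale values into [eps, 1] so they can be multiplied or divided safely."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [eps + (1 - eps) * (v - lo) / (hi - lo) for v in values]

def combine(quality, centrality, mode="multiply"):
    """Fuse per-document quality and centrality into one ranking score."""
    q, c = min_max(quality), min_max(centrality)
    if mode == "multiply":   # rewards documents that are high quality and central
        return [qi * ci for qi, ci in zip(q, c)]
    if mode == "divide":     # rewards documents that are high quality and peripheral
        return [qi / ci for qi, ci in zip(q, c)]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical scores for three documents.
quality = [0.2, 0.8, 0.5]
centrality = [10.0, 0.0, 5.0]
print(combine(quality, centrality, "multiply"))
print(combine(quality, centrality, "divide"))
```

The combined score can then be used in place of raw centrality when ranking documents for selection.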
DCLM CORE v2 accuracy at 1B scale (1.4B params, 28B tokens), averaged by capability category. WebGraphMix uses betweenness with a 50/50 Top-K/Bottom-K mixture; WebGraphMix+ additionally fuses the DCLM-fasttext quality score.
| Method | Commons. | Compreh. | Knowl. | Reason. | Lang. | Avg |
|---|---|---|---|---|---|---|
| Random | 57.3 | 37.9 | 34.2 | 19.0 | 39.9 | 39.8 |
| Quality | 59.8 | 38.1 | 38.9 | 20.7 | 42.8 | 42.3 |
| WebOrganizer | 59.6 | 39.2 | 38.0 | 22.5 | 38.3 | 42.1 |
| WebOrganizer+ | 61.9 | 41.4 | 39.1 | 21.9 | 38.8 | 43.4 |
| PageRank | 56.9 | 37.4 | 34.8 | 19.3 | 38.1 | 39.6 |
| WebGraphMix | 59.5 | 39.4 | 35.4 | 21.4 | 40.2 | 41.4 |
| WebGraphMix+ | 60.8 | 42.6 | 39.7 | 22.6 | 41.9 | 43.8 |
Central and peripheral regions encode different knowledge. Bottom-K sampling consistently lifts factual and commonsense knowledge, while central hosts contribute more structured, procedural signal for symbolic reasoning.
A roughly balanced mixture is best. For betweenness, the 50/50 split (41.4%) beats both 25/75 (40.5%) and 75/25 (41.0%): too much of either extreme hurts. Neither central nor peripheral documents should dominate.
Centrality adds value on top of quality. All 18 quality-combination configurations exceed the quality-only baseline; the best (Multiply Betweenness 50% Top) reaches 43.8%, a +1.5 point lift over quality alone (42.3%) and +4.0 points over random (39.8%).
The benefit grows with model size. The best-mixture gain rises from +0.1 points at 400M to +1.6 points at 1B, consistent with scaling behavior in other data-selection work and suggesting larger gains at larger scales.
@article{badoni2025webgraphmix,
  title   = {Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author  = {Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal = {arXiv preprint},
  year    = {2025}
}