About the Corpus
Common Crawl is a large-scale corpus of open web data. Maintained by the nonprofit organization Common Crawl, the corpus currently includes over 300 billion web pages collected from 2007 to the present and grows by 3–5 billion new pages each month, providing a continuously expanding record of publicly accessible web content.
Common Crawl includes raw web page text, metadata, and link structure information. It is widely used in computational research and has been cited in over 10,000 research papers.
Accessing the Corpus
Researchers can access Common Crawl through its online portal: https://commoncrawl.org/
The corpus can be accessed in two main ways:
- Researchers can access and analyze the corpus using cloud storage provided by Amazon Web Services (AWS). This allows users to work with the data without downloading large files to their own computers: https://commoncrawl.org/get-started
- Web crawl data can be downloaded in bulk as archive files (WARC, WET, WAT formats), which contain collections of web pages grouped by crawl date: https://ds5q9oxwqwsfj.cloudfront.net/
The corpus includes regularly updated crawl releases, and researchers can select specific crawls depending on their research needs. Documentation and file format standards are available through the Common Crawl website and associated technical resources.
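Selecting a specific crawl and archive format usually starts with finding that crawl's list of files. The sketch below builds the URL of such a listing in Python. It assumes that the bulk-download host mentioned above serves listings at crawl-data/<crawl-id>/<format>.paths.gz; that layout, the example crawl ID CC-MAIN-2024-10, and the function names are illustrative assumptions rather than details stated in this guide, so check them against the Common Crawl documentation before relying on them.

```python
# Assumption: the bulk-download host from the text above, plus a
# crawl-data/<crawl-id>/<format>.paths.gz layout for file listings.
BASE_URL = "https://ds5q9oxwqwsfj.cloudfront.net/"
ARCHIVE_FORMATS = ("warc", "wet", "wat")


def paths_listing_url(crawl_id: str, fmt: str, base_url: str = BASE_URL) -> str:
    """Return the assumed URL of the file listing for one crawl and format."""
    fmt = fmt.lower()
    if fmt not in ARCHIVE_FORMATS:
        raise ValueError(f"unknown archive format: {fmt!r}")
    return f"{base_url.rstrip('/')}/crawl-data/{crawl_id}/{fmt}.paths.gz"


def archive_url(relative_path: str, base_url: str = BASE_URL) -> str:
    """Turn one relative path taken from a listing into a full download URL."""
    return f"{base_url.rstrip('/')}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical example: the listing of plain-text (WET) files for one crawl.
    print(paths_listing_url("CC-MAIN-2024-10", "wet"))
```

Each line of a downloaded listing is expected to be a relative path to one archive file, which archive_url can turn into a full address. Fetching a single file this way keeps a pilot study small before committing to a full crawl.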
Analyzing the Corpus
Common Crawl is well-suited for large-scale computational studies of language use, discourse, and communication across the web.
- Whirlwind Tutorial for Common Crawl in Python: This tutorial demonstrates how to retrieve web pages, extract text from WARC files, and explore corpus content without needing to download the full dataset. It is especially useful for researchers who want to work with manageable subsets of web crawl data for linguistic or writing analysis. https://github.com/commoncrawl/whirlwind-python
- Web Graph Analysis Tools with Java: Provides tools for constructing and analyzing web graphs derived from Common Crawl data. These tools allow researchers to examine how web pages link to one another, identify influential sites, and analyze relationships between domains. https://github.com/commoncrawl/cc-webgraph
- Large-Scale HTML and Web Text Analysis with R and Spark: This tutorial shows how researchers can extract text, count linguistic features, and explore large web corpora using familiar R workflows. It provides an accessible entry point for humanities researchers interested in large-scale corpus analysis without needing to manually process raw archive files. https://rpubs.com/jluraschi/billion-tags
Researchers can browse additional tutorials on the Common Crawl website: https://commoncrawl.org/example
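To make the idea of extracting text and counting linguistic features concrete, here is a minimal Python sketch that reads WARC-style records (the record layout also used by WET plain-text files) from an in-memory sample and tallies word frequencies. It is not taken from any of the tutorials above: the function names and the two sample records are invented for illustration, and the parser handles only the simple case of a version line, headers, a blank line, and a payload of Content-Length bytes.

```python
import io
import re
from collections import Counter
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) pairs from an uncompressed WARC-style stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given in bytes by the Content-Length header.
        yield headers, stream.read(int(headers["Content-Length"]))


def corpus_word_counts(stream: BinaryIO) -> Counter:
    """Count lowercase word tokens across all plain-text (conversion) records."""
    counts: Counter = Counter()
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Type") == "conversion":
            text = payload.decode("utf-8", errors="replace")
            counts.update(re.findall(r"\w+", text.lower()))
    return counts


def make_record(uri: str, text: str) -> bytes:
    """Build one toy record for the demo (hypothetical data, not real crawl output)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\nWARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\nContent-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


SAMPLE = make_record("http://example.com/a", "The web is large. The web grows.") + make_record(
    "http://example.com/b", "Crawl the web."
)

if __name__ == "__main__":
    print(corpus_word_counts(io.BytesIO(SAMPLE)).most_common(2))
```

Real archive files are gzip-compressed, so a downloaded file would first need to be opened with something like gzip.open(path, "rb"). For anything beyond a quick experiment, a dedicated WARC library such as warcio is likely the more robust choice, since it handles record types and edge cases this sketch ignores.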
Selected Research
- Bruns, A., & Münch, F. (2025). Web crawl refusals: Insights from Common Crawl. In Web Information Systems Engineering – WISE 2024 (Lecture Notes in Computer Science). Springer. https://doi.org/10.1007/978-3-031-85960-1_9
- Knockel, J., Dalek, J., Aljizawi, N., Ahmed, M., Meletti, L., & Lau, J. (2024, November 25). Banned books: Analysis of censorship on Amazon.com. Citizen Lab, University of Toronto. https://citizenlab.ca/research/analysis-of-censorship-on-amazon-com/
- Liu, X., et al. (2024). Misinformation resilient search rankings with webgraph-based interventions. ACM Transactions on Intelligent Systems and Technology. https://arxiv.org/abs/2404.08869
- Schweter, S., et al. (2025). A geolocated dataset of German news articles. Scientific Data, 12, Article 4222. https://www.nature.com/articles/s41597-025-05422-w