HuggingFaceFW/fineweb

HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.We’re on a journey to advance and democratize artificial intelligence through open source and open science.

More Categories

Keywords

FineWebHuggingFaceDatasetCommonCrawlWeb DataLLMLanguage ModelsData ProcessingdatatroveMachine LearningNatural Language ProcessingLarge Language Models

Share