HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed with the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
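
As a quick, hedged illustration of how this data can be accessed, the sketch below streams a small slice of FineWeb with the Hugging Face `datasets` library. The `sample-10BT` configuration name is an assumption about the sampled subsets published on the dataset card and may need to be adjusted.

```python
# Minimal sketch: stream a FineWeb sample with the Hugging Face `datasets` library.
# The "sample-10BT" configuration is assumed from the sampled subsets on the
# dataset card; swap in the configuration (e.g. a specific CommonCrawl dump)
# you actually want.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # assumed sampled subset; the full dataset is far larger
    split="train",
    streaming=True,       # stream instead of downloading ~15T tokens to disk
)

# Peek at a few records; each record is expected to carry the cleaned web text
# in a "text" field.
for i, record in enumerate(fineweb):
    print(record["text"][:200])
    if i >= 2:
        break
```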

Datasets

Hugging Face Dataset - bfuzzy1/gunny_x

Every veteran knows and has had a 'Gunny': Semper Fidelis. This dataset is designed for conversational AI systems to assist veterans from various military branches, including U.S. and U.K. armed forces.

Chinese Psychological QA DataSet - GitHub Repository

The Chinese Psychological QA DataSet is a collection of 102,845 community Q&A pairs on psychological topics, providing a rich source of data for research and development in psychological counseling and AI applications. Each entry includes detailed question and answer information, making it a valuable resource for understanding user queries and generating appropriate responses.

HuggingFaceFW/fineweb-2

FineWeb-2 is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data of cleaned and deduplicated CommonCrawl web text to over 1,000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages used to guide its processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX, or HPLT) while being substantially larger, and in some cases it even performs better than datasets specifically curated for a single one of these languages, on a diverse set of carefully selected evaluation tasks: FineTasks.
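
As with FineWeb above, a minimal sketch of streaming a single language subset is shown below. The `fra_Latn` configuration name (ISO 639-3 language code plus script) is an assumption about how the per-language subsets are named and should be verified against the dataset card.

```python
# Minimal sketch: stream one language subset of FineWeb-2.
# The configuration name "fra_Latn" (French, Latin script) is an assumption
# about the per-language naming scheme; check the dataset card for the
# configurations that actually exist.
from datasets import load_dataset

fineweb2_fr = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",
    split="train",
    streaming=True,
)

# Inspect one record; the cleaned web text is expected in a "text" field.
sample = next(iter(fineweb2_fr))
print(sample["text"][:200])
```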

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models
