HuggingFaceFW/fineweb

FineWeb is a dataset of more than 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is processed with the datatrove library and optimized for LLM performance. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It contains more than 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl, processed with the datatrove library and optimized for LLM performance. On benchmark tasks, models trained on FineWeb outperform models trained on other commonly used web datasets.
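As a rough illustration of what "deduplicated" means at the smallest scale, the sketch below removes exact duplicates by hashing normalized text. It is a toy example under stated assumptions, not the FineWeb pipeline and not the datatrove API: the function names `normalize` and `deduplicate` and the sample pages are hypothetical, and a web-scale pipeline would need considerably more than exact matching.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text.

    Hypothetical helper for illustration only; not part of datatrove.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        # Hash the normalized text so only a fixed-size digest is stored per document.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Made-up sample pages: the first two differ only in case and spacing.
pages = [
    "Hello   World",
    "hello world",
    "A different page",
]
unique_pages = deduplicate(pages)
print(unique_pages)
```

Here the second page is dropped because it normalizes to the same text as the first, while the original form of the first occurrence is kept.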

More
Dataset

Mental Health Large Model Lingxin (SoulChat)

Lingxin (SoulChat) is a psychological health large model fine-tuned with millions of Chinese long-text instructions and multi-turn empathetic dialogue data in the field of psychological counseling.

Hugging Face Dataset - bfuzzy1/gunny_x

Every veteran knows and has had a 'Gunny': Semper Fidelis. This dataset is designed for conversational AI systems to assist veterans from various military branches, including U.S. and U.K. armed forces.

Ithaka 2006 Survey of US Higher Education Faculty Attitudes and Behaviors

This study surveys the attitudes and behaviors of US higher education faculty members regarding online resources, the library, and related topics. It covers a wide range of issues, including faculty dependence on electronic scholarly resources, the transition from print to electronic journals, publishing preferences, e-books, and the preservation of scholarly journals.

Website URL

https://huggingface.co/datasets/HuggingFaceFW/fineweb

Categories

Dataset

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models