HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is processed using the datatrove library, with a pipeline optimized for LLM performance. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
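Because the full dataset is very large, a reader may want to inspect a few records before committing to a download. The following is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` library is installed, that the repository exposes a sampled configuration named `sample-10BT`, and that each record carries a `text` field. The helper names `stream_fineweb` and `preview` are hypothetical and introduced only for illustration.

```python
from itertools import islice

# Repository id taken from the website URL listed on this page.
FINEWEB_REPO = "HuggingFaceFW/fineweb"


def stream_fineweb(config="sample-10BT"):
    """Open FineWeb in streaming mode (assumes `datasets` is installed and network access)."""
    # Imported lazily so that `preview` below works without the library.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading the whole dataset.
    # The config name is an assumption; check the dataset card for available subsets.
    return load_dataset(FINEWEB_REPO, name=config, split="train", streaming=True)


def preview(records, n=3, max_chars=80):
    """Return the first `n` text snippets from an iterable of dict-like records."""
    # Assumes each record has a "text" field; missing fields yield an empty string.
    return [str(r.get("text", ""))[:max_chars] for r in islice(records, n)]
```

Calling `preview(stream_fineweb())` would fetch only the first few records over the network, which keeps the check lightweight compared with downloading a full snapshot.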

More Datasets

Ithaka 2006 Survey of US Higher Education Faculty Attitudes and Behaviors

This study surveys the attitudes and behaviors of US higher education faculty members regarding online resources, the library, and related topics. It covers a wide range of issues, including faculty dependence on electronic scholarly resources, the transition from print to electronic journals, publishing preferences, e-books, and the preservation of scholarly journals.

ISSP: International Social Survey Programme

The ISSP is a cross-national collaboration program conducting annual surveys on diverse topics relevant to social sciences. Established in 1984, it includes members from various cultures around the globe. Over one million respondents have participated in ISSP surveys, and all collected data and documentation are available free of charge.

Psychology Therapy Dataset - halilxibrahim/psychology-therapy

Dataset Card for Psychology Therapy Dataset: This dataset card aims to provide information about a dataset focused on psychology therapy conversations. Language(s) (NLP): Turkish (tr)

Website URL

https://huggingface.co/datasets/HuggingFaceFW/fineweb

Categories

Dataset

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models