HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is processed with the datatrove library and optimized for LLM performance. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It contains over 15 trillion tokens of English web text from CommonCrawl that has been cleaned and deduplicated. The processing pipeline is built on the datatrove library and is tuned for LLM performance: in benchmark tasks, models trained on FineWeb outperform models trained on other commonly used web datasets.
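As a rough illustration of how the dataset might be inspected without downloading all of it, the sketch below streams records and previews their text. It is a minimal sketch under stated assumptions, not an official recipe: it assumes the Hugging Face `datasets` package is installed, that a smaller config named `sample-10BT` is available on the Hub, and that each record carries a `text` field. The helper names `stream_fineweb` and `preview` are hypothetical and exist only for this example.

```python
from itertools import islice
from typing import Dict, Iterable, List

# Dataset id taken from the website URL listed on this page.
REPO_ID = "HuggingFaceFW/fineweb"


def stream_fineweb(config: str = "sample-10BT"):
    """Return a streaming iterator over FineWeb records.

    Assumptions: the `datasets` package is installed, network access is
    available, and a config with the given name exists on the Hub.
    """
    # Imported lazily so that `preview` below can be used offline.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name=config, split="train", streaming=True)


def preview(records: Iterable[Dict], n: int = 3, width: int = 80) -> List[str]:
    """Return the first `width` characters of the `text` field of the first `n` records.

    Records without a `text` field (an assumed field name) yield an empty string.
    """
    return [str(r.get("text", ""))[:width] for r in islice(records, n)]
```

With network access, `for snippet in preview(stream_fineweb()): print(snippet)` would print the opening characters of a few documents. Streaming is used here because the full dataset is far too large to download for a quick look.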

More Datasets

CaiTI_dataset: Cognitive Behavioral Therapy Dataset - GitHub Repository

The CaiTI_dataset repository contains datasets for Motivational Interviewing and Cognitive Behavioral Therapy, curated by therapists to train CaiTI.

iBVP Dataset: RGB-Thermal rPPG Dataset

The iBVP dataset is a collection of synchronized RGB and thermal infrared videos with PPG ground-truth signals acquired from an ear. It includes manual signal quality labels and dense signal-quality assessment using the SQA-PhysMD model. The dataset is designed to induce real-world variations in psycho-physiological states and head movement.

Adolescent Mental Health: WHO Report

The WHO report on adolescent mental health describes actions undertaken by international development organizations to address adolescents’ mental health needs at the country level. It highlights the inadequacy of current efforts and the need for more coordinated and comprehensive interventions.

Website URL

https://huggingface.co/datasets/HuggingFaceFW/fineweb

More Categories

Dataset

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models