HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is processed with the datatrove library and optimized for LLM performance. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
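
As a rough illustration of how a dataset of this size might be inspected without downloading it in full, the sketch below streams rows from the Hugging Face Hub using the `datasets` library. It is a minimal sketch under stated assumptions, not an official recipe: the helper names `preview` and `stream_fineweb` are hypothetical, and the `sample-10BT` configuration name and the `text` field are assumptions about the repository layout that should be checked against the dataset card.

```python
from itertools import islice
from typing import Iterable, Iterator


def preview(records: Iterable[dict], n: int = 3, width: int = 80) -> list[str]:
    """Return the first `n` text snippets, each truncated to `width` characters.

    Assumes each record stores its document body under a "text" key;
    records without that key yield an empty string.
    """
    return [str(rec.get("text", ""))[:width] for rec in islice(records, n)]


def stream_fineweb(config: str = "sample-10BT") -> Iterator[dict]:
    """Lazily iterate over FineWeb rows (hypothetical helper; needs network access).

    The config name is an assumption; see the dataset card for available subsets.
    """
    # Imported here so `preview` works even when `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True reads rows on demand instead of downloading the data up front.
    dataset = load_dataset(
        "HuggingFaceFW/fineweb", name=config, split="train", streaming=True
    )
    return iter(dataset)
```

With network access and the `datasets` package installed, `preview(stream_fineweb())` would return the first three truncated documents. Keeping `preview` independent of the loader means it can also be tried on any in-memory list of dictionaries.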

More Datasets

Question-Level Feature Extraction on DAIC-WOZ Dataset

The DAIC-WOZ dataset contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. This repository provides code for extracting question-level features from the DAIC-WOZ dataset, which can be used for multimodal analysis of depression levels.

Psych-101: Human Psychological Experiment Transcripts Dataset

Psych-101 is a dataset of natural language transcripts from human psychological experiments. It comprises trial-by-trial data from 160 experiments in which 60,092 participants made 10,681,650 choices. It provides valuable insights into human decision-making processes and is available under the Apache License 2.0.

National Study of Mental Health and Wellbeing, 2020-2022

The National Study of Mental Health and Wellbeing provides key statistics on mental health issues in Australia, including the prevalence of mental disorders, consultations with health professionals, and the use of mental health-related medications. The study covers a wide range of mental health conditions and offers insights into the impact of mental health on individuals and society.

Website URL

https://huggingface.co/datasets/HuggingFaceFW/fineweb

Category

Dataset

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models