FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
The Weibo User Depression Detection Dataset is a large-scale dataset for detecting depression in Weibo users. It includes user profiles, tweets, and labels indicating whether the user is depressed. The dataset is useful for researchers working on mental health and social media analysis.
The iBVP dataset is a collection of synchronized RGB and thermal infrared videos with PPG ground-truth signals acquired from an ear. It includes manual signal quality labels and dense signal-quality assessment using the SQA-PhysMD model. The dataset is designed to induce real-world variations in psycho-physiological states and head movement.
HeartLink is an empathetic psychological model that uses a large language model fine-tuned on a large empathetic Q&A dataset. It can perceive users' emotions and experiences during conversations and provide empathetic responses using rich psychological knowledge, aiming to understand, comfort, and support users. The responses include emoji expressions to bridge the gap with users, offering psychological support and help during consultations.