FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. Its authors report that models trained on it outperform models trained on other commonly used web datasets on benchmark tasks.
The Weibo User Depression Detection Dataset is a large-scale dataset for detecting depression in Weibo users. It includes user profiles, tweets, and labels indicating whether the user is depressed. The dataset is useful for researchers working on mental health and social media analysis.
PsychData is an online platform for hosting and conducting surveys and experiments in psychology, supporting secure data collection for researchers and students.
The National Study of Mental Health and Wellbeing provides key statistics on mental health issues in Australia, including the prevalence of mental disorders, consultations with health professionals, and the use of mental health-related medications. The study covers a wide range of mental health conditions and offers insights into the impact of mental health on individuals and society.