FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
This dataset contains survey responses from individuals in the tech industry about their mental health, including questions about treatment, workplace resources, and attitudes towards discussing mental health in the workplace. Analyzing it can show how prevalent mental health issues are among people who work in the tech sector and which resources they rely on to find help, so that more can be done to create a healthier working environment for all.
PsychData is an online platform for hosting and conducting surveys and experiments in psychology, supporting secure data collection for researchers and students.
HeartLink is an empathetic psychological model built on a large language model fine-tuned on a large empathetic Q&A dataset. It can perceive users' emotions and experiences during conversations and draw on rich psychological knowledge to respond empathetically, aiming to understand, comfort, and support users. Its responses include emoji to feel more approachable, and it offers psychological support and help during consultations.