FineWeb is a large-scale dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl, intended as high-quality training data for large language models. It is processed with the datatrove library and optimized for LLM performance, and it is reported to outperform other commonly used web datasets on benchmark tasks.
SmartFlowAI/EmoLLM: a psychology LLM project (a large model for mental health). Keywords: fine-tuning, InternLM2, InternLM2.5, Qwen, ChatGLM, Baichuan, DeepSeek, Mixtral, Llama3, GLM4, Qwen2.
PsychData is an online platform for hosting and conducting surveys and experiments in psychology, supporting secure data collection for researchers and students.
The IC-AnnoMI repository contains source code and a synthetic dataset of contextual motivational interviewing (MI) dialogues, generated through in-context zero-shot prompting of large language models (LLMs). The project aims to address data scarcity and inherent bias in mental health and therapeutic counselling.