HuggingFaceFW/fineweb

FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is processed with the datatrove library and optimized for LLM performance. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.

Detailed Introduction

FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It contains over 15 trillion tokens of English web text from CommonCrawl that has been cleaned and deduplicated. The processing pipeline is built on the datatrove library and is tuned for downstream LLM performance. On benchmark tasks, it outperforms other commonly used web datasets.
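
A minimal sketch of how one might peek at a few records without downloading the whole corpus. It assumes the Hugging Face `datasets` library, a subset named `sample-10BT`, and a `text` field on each record; none of these are stated above, so verify them against the dataset card. The helper names `preview_texts` and `stream_fineweb` are hypothetical.

```python
from itertools import islice


def preview_texts(records, n=3, max_chars=80):
    """Return the first n 'text' fields, each truncated to max_chars."""
    return [record["text"][:max_chars] for record in islice(records, n)]


def stream_fineweb(subset="sample-10BT"):
    """Open FineWeb in streaming mode (needs `pip install datasets` and network access).

    The subset name is an assumption; check the dataset card for available configs.
    """
    from datasets import load_dataset  # imported lazily so preview_texts works offline

    return load_dataset(
        "HuggingFaceFW/fineweb", name=subset, split="train", streaming=True
    )


# Offline demonstration with stand-in records shaped like the assumed schema.
fake_records = [
    {"text": "first page"},
    {"text": "second page"},
    {"text": "third page"},
]
print(preview_texts(fake_records, n=2))
```

With network access, `preview_texts(stream_fineweb())` would return the first few documents. Streaming is used here because a corpus of this size is impractical to materialize locally just for inspection.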

More datasets

MentalManip: Mental Manipulation Detection Dataset

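The MentalManip dataset, introduced by Wang et al. (2024b), is a dialogue dataset built specifically for detecting and classifying mental manipulation. It contains 4,000 multi-turn fictional dialogues sourced from online movie scripts, annotated at multiple levels: the presence of manipulation, the manipulation techniques used, and the vulnerabilities targeted. The dataset was created with high-quality annotation to ensure consistent and accurate labels, in support of research on mental manipulation detection.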

Psy-Insight: Mental Health Counseling Dataset

Psy-Insight is a bilingual, interpretable multi-turn dataset for mental health counseling dialogues. It includes 6,208 rounds of multi-turn counseling dialogues in English and 5,776 rounds in Chinese, annotated with step-by-step reasoning labels and multi-task labels. This dataset is designed to support the application of large language models in mental health and is suitable for tasks such as emotion classification and psychological treatment interpretation.

IC-AnnoMI: In-Context MI Dialogues - GitHub Repository

The IC-AnnoMI repository contains source code and a synthetic dataset of contextual MI dialogues for mental health and therapeutic counselling. The dialogues are generated with large language models (LLMs) through in-context zero-shot prompting. The project aims to address data scarcity and inherent bias in mental health and therapeutic counselling data.

Keywords

FineWeb, HuggingFace, Dataset, CommonCrawl, Web Data, LLM, Language Models, Data Processing, datatrove, Machine Learning, Natural Language Processing, Large Language Models
