Trending repositories for topic datasets
Label Studio is a multi-type data labeling and annotation tool with standardized output format
AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library.
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
OSINT cheat sheet, list OSINT tools, wiki, dataset, article, book, red team OSINT for hackers and OSINT tips and OSINT branch. This repository will grow every time will research, there is a research,...
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
Techniques for deep learning with satellite & aerial imagery
[TMLR] A curated list of language modeling research for code (and other software engineering activities), plus related datasets.
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop....
A list of publicly available datasets with real-time data maintained by the team at bytewax.io
A list of awesome papers and resources on recommender systems with large language models (LLMs).
AI Audio Datasets (AI-ADS) 🎵, including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio app...
FL Chart is a highly customizable Flutter chart library that supports Line Chart, Bar Chart, Pie Chart, Scatter Chart, Radar Chart and Candlestick Chart.
Synthesizing High-quality Text-to-SQL Data at Scale. SynSQL-2.5M is the first million-scale cross-domain text-to-SQL dataset.
csghub-server is the backend server for CSGHub, which helps users manage datasets and models, and also run Model Inference, Finetune and Application Spaces.
Major Europe leagues data (England, Spain, Italy, Germany and France)
Data Engineering Pilipinas is a community for data engineers, data analysts, data scientists, developers, AI / ML engineers, and users of closed and open source data tools and methods / techniques in ...
Multimodal Question Answering in the Medical Domain: A summary of Existing Datasets and Systems
🌳 A curated list of ground-truth forest datasets for the machine learning and forestry community.
🚀🚀🚀 A collection of some awesome public projects about Large Language Model (LLM), Vision Language Model (VLM), Vision Language Action (VLA), AI Generated Content (AIGC), the related Datasets and Applic...
A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
[AAAI 2025]👔IMAGDressing👔: Interactive Modular Apparel Generation for Virtual Dressing. It enables customizable human image generation with flexible garment, pose, and scene control, ensuring high f...
A collection of awesome-prompt-datasets and awesome-instruction-dataset resources for training ChatLLMs such as ChatGPT. It gathers a wide variety of instruction datasets for training ChatLLM models.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL
TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
An open source multi-tool for exploring and publishing data
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
A powerful tool for creating high-quality training datasets for Large Language Models (LLMs) (a tool for quickly generating high-quality LLM training datasets)
Collection of Aesthetics Assessment Papers for Graphic Designs.
A list of public EMG datasets and their papers, with a focus on raw EMG signals.
🎉🎨 Papers, Code, Datasets for Neuroscience and Cognition Science
A collection of some awesome public object detection and recognition datasets.
A bunch of some 200 datasets. You can call it mini-kaggle :)
A curated set of references to useful UK Government datasets
Croissant is a high-level format for machine learning datasets that brings together four rich layers.
A curated list of amazingly awesome Cybersecurity datasets
Official repository for Aria-MIDI: a MIDI dataset of 1,186,253 transcribed solo-piano recordings.
🎨 IMAGGarment-1: Fine-Grained Garment Generation with Controllable Structure, Color, and Logo. It supports precise and customizable garment synthesis guided by multi-conditions (e.g., sketch, color,...
Healthcare and biomedical datasets, for AI/ML
A benchmark fault diagnosis dataset comprises vibration data collected from a gearbox under variable working conditions with intentionally induced faults, encompassing diverse fault severities and typ...
CESNET DataZoo: A toolset for large network traffic datasets
[AAAI 2025 Oral🚁] Game4Loc: A UAV Geo-Localization Benchmark from Game Data
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
A list of datasets, tools, papers and code related to Deepfakes.
Official implementation of "Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM"
A suite of tools designed to extract, compute and display data stored on a Bitcoin Core node
[ECCV2024] Towards Reliable Advertising Image Generation Using Human Feedback
Code and data for "ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM" (NeurIPS 2024 Track Datasets and Benchmarks)
Langtrace 🔍 is an open-source, Open Telemetry based end-to-end observability tool for LLM applications, providing real-time tracing, evaluations and metrics for popular LLMs, LLM frameworks, vectorD...
A repository of datasets paired with rich documentation, data essays, and teaching resources
Multiple datasets for ARC (Abstraction and Reasoning Corpus)
A comprehensive survey of datasets for research in host-based and/or network-based intrusion detection, with a focus on enterprise networks
An open source DevOps tool for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI artifact.
Resources about solar power systems for data science
VetDataHub is an open-source veterinary datasets repository dedicated to advancing veterinary medicine through the sharing and exchange of diverse datasets. The project aims to make open-source veterina...