Trending repositories for topic dataset
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
Large language models for mental health (LLM x Mental Health), Pre & Post-training & Dataset & Evaluation & Deploy & RAG, with InternLM / Qwen / Baichuan / DeepSeek / Mixtral / Llama / GLM series models
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes important open-source datasets and key papers in the field of surface defect research.
A quick guide (especially) for trending instruction finetuning datasets
A MNIST-like fashion product database. Benchmark 👇
This is a repository that includes papers, code, and datasets on domain generalization-based fault diagnosis and prognosis.
CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain fu...
[NeurIPS'24 Spotlight] Text2CAD: Generating Sequential CAD Designs from Beginner-to-Expert Level Text Prompts
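Chinese personal name corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.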
[IEEE JSTARS 2024] CV-Cities: Advancing Cross-view Geo-localization in Global Cities
MineStudio: A Streamlined Package for Minecraft AI Agent Development
[ICLR 2025] This is the official repository of our paper "MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine"
✨ A synthetic dataset generation framework that produces diverse coding questions and verifiable solutions, all in one framework
Python library for downloading, loading & working with sound datasets
[SIGGRAPH Asia 2022] Assemble Them All: Physics-Based Planning for Generalizable Assembly by Disassembly
[ACM Multimedia 2020] University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization 🚁. Annotates 1652 buildings in 72 universities around the world.
A large-scale multilingual dataset for Information Retrieval, with thorough human annotations across 18 diverse languages.
AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretra...
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!
This is a continuously updated handbook for readers to easily track the latest Text-to-SQL techniques in the literature and provide practical guidance for researchers and practitioners. If we missed a...
Transformer: PyTorch Implementation of "Attention Is All You Need"
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
Browser compatibility data for Web technologies as displayed on MDN
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
SEA is an automated paper review framework capable of generating comprehensive and high-quality review feedback with high consistency for papers, thereby assisting researchers in improving the quality...
Pre-rendered regularization images of men and women, mainly faces, seeking to generate more realistic images (without wax skin)
A published large-scale dataset - Weibo User Depression Detection Dataset.
🔥 Datasets and env wrappers for offline safe reinforcement learning
☠️ Ground-truth dataset for vulnerability prediction (known research datasets and data sources included such as NVD, CVE Details and OSV); tools to automatically update the data are provided.
Extract information about all games published on Steam using its Web API, and store it in JSON format.
[SCIS 2024] The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions fro...
A curated collection of public industrial datasets.
Techniques for deep learning with satellite & aerial imagery
Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖
[CVPR 2025] Science-T2I: Addressing Scientific Illusions in Image Synthesis
Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching"
The world's first roller coaster SLAM dataset
LoLI-Street is a low-light image enhancement dataset for training and testing low-light image enhancement models under urban street scenes.
🌴[CVPR 2024] OakInk2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery
The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the meta...
Marathi NLP is a repository dedicated to the development of tools and resources for the Marathi language.
Get 3D motion vectors / scene flow directly from Blender
A repo introducing websites that share data and datasets about Iran (useful for journalists and researchers)
[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
Dataset and code of GTSinger (NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Official implementation for "JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework"
[CVPR 2025] WildAvatar: Learning In-the-wild 3D Avatars from the Web
A taxonomy of industrial anomaly detection methods and datasets (updating).
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
History files recorded from human interaction while solving ARC tasks
Code repository for the ECCV paper "MSD: A Benchmark Dataset for Floor Plan of Building Complexes".
Data research, preparation, and manipulation nodes for model trainers and artists.
Create synthetic datasets for training and testing Large Language Models (LLMs) in a Question-Answering (QA) context.
[ECCV 2024] SDK for MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
A curated list of awesome smart contract datasets
Multiple datasets for ARC (Abstraction and Reasoning Corpus)