Statistics for topic dataset
RepositoryStats tracks 591,869 Github repositories, of these 1,155 are tagged with the dataset topic. The most common primary language for repositories using this topic is Python (609). Other languages include: Jupyter Notebook (155), C++ (25), JavaScript (23), HTML (19), MATLAB (15), R (14)
Stargazers over time for topic dataset
Most starred repositories for topic dataset (view more)
Trending repositories for topic dataset (view more)
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Transformer: PyTorch Implementation of "Attention Is All You Need"
An open-source dataset of malicious software packages found in the wild, 100% vetted by humans.
A fully-annotated, open-design dataset of autonomous and piloted high-speed flight
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
(TMI-2024) Source-Free Active Domain Adaptation (SFADA) for GTV Segmentation across Multiple Hospitals
These scripts are used to download RealEstate10K dataset.
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷为大模型提供更高质量、更丰富、更易”消化“的数据!
MIntRec2.0 is the first large-scale dataset for multimodal intent recognition and out-of-scope detection in multi-party conversations (ICLR 2024)
A taxonomy of industrial anomaly detection methods and datasets (updating).
(TMI-2024) Source-Free Active Domain Adaptation (SFADA) for GTV Segmentation across Multiple Hospitals
We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.
An open-source mechanical failure dataset is available, comprising 30+ categories including bearings, gears, pumps, and others.(30余个开源故障诊断和预测数据集,不断更新中)
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
This repository contains a reading list of papers on Time Series Segmentation. This repository is still being continuously improved.
Dataset and code of GTSinger(NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷为大模型提供更高质量、更丰富、更易”消化“的数据!
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
Dataset and code of GTSinger(NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
A Collection of 10.000 collected Windows Chrome Fingerprints. Usable with an easy-to-use API, available as a compressed (lzma) or full-size Json (view Releases). Its just 1.4mb in size in compressed f...
A comprehesive survey about foundation models for weather and cliamte data understanding.