Trending repositories for topic dataset
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Transformer: PyTorch Implementation of "Attention Is All You Need"
Techniques for deep learning with satellite & aerial imagery
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
Code and Data artifact for NeurIPS 2023 paper - "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is a lsp client library in Python intended to be used to bu...
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
A quick guide (especially) for trending instruction finetuning datasets
Documentation on how to access and use the Quick, Draw! Dataset.
Create synthetic datasets for training and testing Large Language Models (LLMs) in a Question-Answering (QA) context.
Code and Data artifact for NeurIPS 2023 paper - "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is a lsp client library in Python intended to be used to bu...
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
Small-Object Detection in Remote Sensing (satellite) Images with End-to-End Edge-Enhanced GAN and Object Detector Network
An open-source dataset of malicious software packages found in the wild, 100% vetted by humans.
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
This repository contains a reading list of papers on Time Series Segmentation. This repository is still being continuously improved.
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Techniques for deep learning with satellite & aerial imagery
Transformer: PyTorch Implementation of "Attention Is All You Need"
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
A quick guide (especially) for trending instruction finetuning datasets
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
Code and Data artifact for NeurIPS 2023 paper - "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is a lsp client library in Python intended to be used to bu...
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
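A compilation of public resources for Chinese medical NLP: terminologies, corpora, word embeddings, pretrained models, knowledge graphs, named entity recognition, QA, information extraction, models, papers, etc.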
The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the meta...
The World's Largest Decentralized AGI Multimodal Dataset
Create synthetic datasets for training and testing Large Language Models (LLMs) in a Question-Answering (QA) context.
The world's first roller coaster SLAM dataset
Code and Data artifact for NeurIPS 2023 paper - "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is a lsp client library in Python intended to be used to bu...
Open data sources of Türkiye | Curated list of open data platforms of Türkiye
Sportsbookreview.com scraper + complete 10Y games+odds data for NFL, NBA, NHL, MLB for bettors and sports analysts
A curated collection of public industrial datasets.
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Official Implementation of 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs
A world wines dataset with user ratings for recommendation systems and general use.
[ECCV 2024 Oral] PetFace: A Large-Scale Dataset and Benchmark for Animal Identification https://arxiv.org/abs/2407.13555
🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.
SUES-200: A Multi-height Multi-scene Cross-view Image Benchmark Across Drone and Satellite
This repository releases a bearing failure dataset that can support intelligent fault diagnosis research (an open-source bearing dataset collected in the lab, covering both constant and time-varying rotational speeds)
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Transformer: PyTorch Implementation of "Attention Is All You Need"
Techniques for deep learning with satellite & aerial imagery
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
A quick guide (especially) for trending instruction finetuning datasets
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes important open-source datasets and key papers in the field of surface defect research.
A MNIST-like fashion product database. Benchmark 👇
The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the meta...
The world's first roller coaster SLAM dataset
The World's Largest Decentralized AGI Multimodal Dataset
Create synthetic datasets for training and testing Large Language Models (LLMs) in a Question-Answering (QA) context.
(TMI-2024) Source-Free Active Domain Adaptation (SFADA) for GTV Segmentation across Multiple Hospitals
This repository releases a bearing failure dataset that can support intelligent fault diagnosis research (an open-source bearing dataset collected in the lab, covering both constant and time-varying rotational speeds)
A taxonomy of industrial anomaly detection methods and datasets (updating).
Official Implementation of 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs
The Enron-Spam dataset preprocessed in a single, clean csv file.
MIntRec2.0 is the first large-scale dataset for multimodal intent recognition and out-of-scope detection in multi-party conversations (ICLR 2024)
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
UniGen: A Unified Framework for Dataset Generation via Large Language Model
Time-Series Anomaly Detection Comprehensive Benchmark
Multiple datasets for ARC (Abstraction and Reasoning Corpus)
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
This repository contains a reading list of papers on Time Series Segmentation. This repository is still being continuously improved.
Dataset and code of GTSinger (NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
The human toll of Israel's ongoing genocide in names & numbers. Use the data from our APIs to tell their story.
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
The history files when recording human interaction while solving ARC tasks
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
[ArXiv 2024] WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation
[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Techniques for deep learning with satellite & aerial imagery
Transformer: PyTorch Implementation of "Attention Is All You Need"
A quick guide (especially) for trending instruction finetuning datasets
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes important open-source datasets and key papers in the field of surface defect research.
A MNIST-like fashion product database. Benchmark 👇
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
Dataset and code of GTSinger (NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
[ICLR 2024 Oral] Supervised Pre-Trained 3D Models for Medical Image Analysis (9,262 CT volumes + 25 annotated classes)
This is a repository that includes papers, code, and datasets on domain generalization-based fault diagnosis and prognosis.
A collection of 10,000 Windows Chrome fingerprints. Usable with an easy-to-use API, available as compressed (lzma) or full-size JSON (view Releases). It's just 1.4 MB in size in compressed f...
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
MegaVul - The largest, high-quality, extensible, continuously updated, C/C++/Java vulnerability dataset
Code repository for the ECCV paper "MSD: A Benchmark Dataset for Floor Plan of Building Complexes".
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
🤖 Dataset for TextSLAM: Visual SLAM with Semantic Planar Text Features. (ICRA2020 & TPAMI2023)
WildlifeDatasets: An open-source toolkit for animal re-identification
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
Official Access to ICIP2024 "THQA: A Perceptual Quality Assessment Database for Talking Heads"
[NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM Instruction Tuning