Trending repositories for topic dataset
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Transformer: PyTorch Implementation of "Attention Is All You Need"
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
Techniques for deep learning with satellite & aerial imagery
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
A quick guide (especially) for trending instruction finetuning datasets
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
📈 Currently the largest industrial defect detection database and paper collection. A continuously updated summary of important open-source datasets and key papers in surface defect research.
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
An open-source dataset of malicious software packages found in the wild, 100% vetted by humans.
Cluster data collected from production clusters at Alibaba for cluster management research
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
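A compilation of public Chinese medical NLP resources: terminology sets, corpora, word vectors, pretrained models, knowledge graphs, named entity recognition, QA, information extraction, models, papers, etc.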
A fully-annotated, open-design dataset of autonomous and piloted high-speed flight
Scripts for downloading the RealEstate10K dataset.
Dataset and code of GTSinger(NeurIPS 2024 Spotlight): A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context l...
Code and data artifact for the NeurIPS 2023 paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is an LSP client library in Python intended to be used to bu...
Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets
[ICLR 2024 Oral] Supervised Pre-Trained 3D Models for Medical Image Analysis (9,262 CT volumes + 25 annotated classes)
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
(TMI-2024) Source-Free Active Domain Adaptation (SFADA) for GTV Segmentation across Multiple Hospitals
Official Implementation of 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs
A taxonomy of industrial anomaly detection methods and datasets (updating).
The world's first roller coaster SLAM dataset
WildlifeDatasets: An open-source toolkit for animal re-identification
Simple script to parallelize download and extract files for SA-1B Dataset.
UniGen: A Unified Framework for Dataset Generation via Large Language Model
[NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
(3DV 2021) A High-fidelity 128-channel LiDAR Dataset with Panoramic Ambient and Reflectivity Imagery for Multi-modal Autonomous Driving Applications
🏆 • 5050 most frequent words in 109 languages
Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)" (NAACL 2022).
MIntRec2.0 is the first large-scale dataset for multimodal intent recognition and out-of-scope detection in multi-party conversations (ICLR 2024)
We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.
An open-source mechanical failure dataset comprising 30+ categories, including bearings, gears, pumps, and others. (30+ open-source fault diagnosis and prognosis datasets, continuously updated.)
A complete list of IATA Airports including IATA code, ICAO code, Time zone, name, city code, two-letter ISO country code, URL, elevation above sea level in feet, coordinates in decimal degrees, geo en...
An Open-source Deep Learning Framework for Visual Place Recognition
🤖 Dataset for TextSLAM: Visual SLAM with Semantic Planar Text Features. (ICRA2020 & TPAMI2023)
(IEEE TITS 2024) WHU-Railway3D: A Diverse Dataset and Benchmark for Railway Point Cloud Semantic Segmentation
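This repository releases a bearing failure dataset that can support intelligent fault diagnosis research (an open-source bearing dataset collected in the authors' laboratory, covering both steady and time-varying rotational speeds).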
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
This repository contains a reading list of papers on Time Series Segmentation. This repository is still being continuously improved.
The human toll of Israel's ongoing genocide in names & numbers. Use the data from our APIs to tell their story.
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
History files recorded from human interaction while solving ARC tasks
[ArXiv 2024] WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation
An MNIST-like fashion product database. Benchmark 👇
esProc SPL is a scripting language for data processing, with well-designed rich library functions and powerful syntax, which can be executed in a Java program through JDBC interface and computing inde...
A collection of 10,000 Windows Chrome fingerprints. Usable with an easy-to-use API, available as compressed (LZMA) or full-size JSON (see Releases). It's just 1.4 MB in size in compressed f...
A comprehensive survey of foundation models for weather and climate data understanding.
A repository of papers, code, and datasets on domain generalization-based fault diagnosis and prognosis (continuously updated).
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
MegaVul - The largest, high-quality, extensible, continuously updated, C/C++/Java vulnerability dataset
Code repository for the ECCV paper "MSD: A Benchmark Dataset for Floor Plan of Building Complexes".
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
[NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM Instruction Tuning