Trending repositories for topic dataset
Label Studio is a multi-type data labeling and annotation tool with standardized output format
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Transformer: PyTorch Implementation of "Attention Is All You Need"
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
A quick guide (especially) for trending instruction finetuning datasets
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
An MNIST-like fashion product database. Benchmark 👇
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
This repository contains a reading list of papers on Time Series Segmentation. This repository is still being continuously improved.
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
[ICLR 2024] This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World"
nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset
[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
🫁 AeroPath: An airway segmentation benchmark dataset with challenging pathology
CircuitNet: An Open-Source Dataset for Machine Learning Applications in Electronic Design Automation (EDA)
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating foundation LLMs and exploring the technical boundaries of generative AI.
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
⚽️ Extract, prepare and publish Transfermarkt datasets.
Datasets for evaluating smart contract security analysis tools (continuously updated)
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images & labels and reduce your data oper...
Techniques for deep learning with satellite & aerial imagery
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, et...
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
An open-source mechanical failure dataset collection comprising 30+ categories, including bearings, gears, pumps, and others. (30+ open-source fault diagnosis and prognosis datasets, continuously updated.)
Air Pollution Image Dataset from India and Nepal
A large dataset of real-world WebAssembly binaries, collected from the Web, GitHub, NPM and other sources. Useful as test data, to study WebAssembly, for training machine learning models, and much mor...
[CVPR2024 Highlight] Official Code for "ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object"
The history files when recording human interaction while solving ARC tasks
The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.
A collection of 10,000 Windows Chrome fingerprints. Usable through an easy-to-use API, available as compressed (lzma) or full-size JSON (view Releases). It's just 1.4 MB in size in compressed f...
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
The human toll of Israel's ongoing genocide in names & numbers. Use the data from our APIs to tell their story.
MegaVul - The largest, high-quality, extensible, continuously updated, C/C++/Java vulnerability dataset
A comprehensive survey of foundation models for weather and climate data understanding.
AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose (ICCV 2023)
[ACL 2024] Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
Code and data artifact for the NeurIPS 2023 paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multispy` is an LSP client library in Python intended to be used to bu...
[ICLR 2024] Supervised Pre-Trained 3D Models for Medical Image Analysis
Doppelgangers: Learning to Disambiguate Images of Similar Structures
A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023).
esProc SPL is a scripting language for data processing, with well-designed rich library functions and powerful syntax, which can be executed in a Java program through JDBC interface and computing inde...
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes important open-source datasets and key papers in the field of surface defect research.
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI
Generate textbook-quality synthetic LLM pretraining data
Dataset helper program to automatically select, rescale, and tag datasets (composed of images and text) for machine learning training.
[NeurIPS 2023] AbdomenAtlas 1.0 (5,195 CT volumes plus nine classes)
[ICCV 2023] Code base for Revisiting Scene Text Recognition: A Data Perspective
The world's first roller coaster SLAM dataset
This is a repository that includes papers, code, and datasets on domain generalization-based fault diagnosis and prognosis (continuously updated).
A GPT-3.5 & GPT-4 Workload Trace to Optimize LLM Serving Systems
VisText is a benchmark dataset for semantically rich chart captioning.
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
[NeurIPS 2023] Official code for "Real3D-AD: A Dataset of Point Cloud Anomaly Detection". A 3D point cloud anomaly detection dataset and benchmark.