centic9 / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

Date Created 2015-04-22 (9 years ago)
Commits 290 (last one 11 days ago)
Stargazers 63 (0 this week)
Watchers 13 (0 this week)
Forks 20
License bsd-2-clause
Ranking

RepositoryStats indexes 584,777 repositories. Of these, centic9/CommonCrawlDocumentDownload is ranked #396,898 (32nd percentile) for total stargazers and #165,662 for total watchers. GitHub reports the primary language for this repository as Java; among repositories using this language, it is ranked #21,221/28,277.

centic9/CommonCrawlDocumentDownload is also tagged with popular topics, for which it is ranked: java (#6,027/7,690)

Other Information

There have been 6 releases; the latest, named 1.0.0.10, was published on 2023-01-15 (about a year ago).

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time, collection started in '23

Recent Commit History

83 commits on the default branch (master) since Jan '22

Yearly Commits

Commits to the default branch (master) per year

Issue History

Languages

The primary language is Java, but there are also others...

updated: 2024-11-10 @ 08:37pm, id: 34407138 / R_kgDOAg0C4g