centic9 / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.

Date Created 2015-04-22 (9 years ago)
Commits 296 (last one 2 months ago)
Stargazers 65 (0 this week)
Watchers 12 (0 this week)
Forks 18
License bsd-2-clause
Ranking

RepositoryStats indexes 628,836 repositories; of these, centic9/CommonCrawlDocumentDownload is ranked #409,803 (35th percentile) for total stargazers and #172,860 for total watchers. GitHub reports the primary language for this repository as Java; among repositories using this language it is ranked #21,353/29,374.

centic9/CommonCrawlDocumentDownload is also tagged with popular topics; for these it is ranked: java (#6,091/7,952)

Other Information

There have been 6 releases; the latest, named 1.0.0.10, was published on 2023-01-15 (2 years ago).

Star History

GitHub stargazers over time

[Chart: stargazer count (axis 0 to 70) by year, 2016 to 2025]

Watcher History

GitHub watchers over time; collection started in '23

[Chart: watcher count (axis 12 to 13) by month, Jun '23 to Feb '25]

Recent Commit History

89 commits on the default branch (master) since Jan '22

[Chart: commits (axis 0 to 90) over time, Jul '22 to 2025]

Yearly Commits

Commits to the default branch (master) per year

[Chart: commits per year (axis 0 to 70), 2015 to 2024]

Issue History

[Chart: total, open, and closed issues (axis 0 to 5) by year, 2018 to 2025]

Languages

The primary language is Java, but there are others as well.

Java, Shell

updated: 2025-03-14 @ 04:09am, id: 34407138 / R_kgDOAg0C4g