LanguageMachines / ucto

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenising Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
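
To make the description above concrete, here is a minimal sketch of the three steps it names: separating words from punctuation, splitting sentences, and changing case. It is an illustration only and does not use ucto or reproduce its rules. The regular expression, the set of sentence-final marks, and the function names `tokenize`, `split_sentences`, and `preprocess` are all assumptions made for this example; ucto's own rules are language-specific and considerably richer.

```python
import re

# Hypothetical, simplified pattern (not ucto's rules): a token is either
# a run of word characters or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumed sentence-final punctuation for this sketch.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Separate words from punctuation."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack an end mark
        sentences.append(current)
    return sentences


def preprocess(text, lowercase=False):
    """Optionally lowercase, then tokenise and split into sentences."""
    if lowercase:
        text = text.lower()
    return split_sentences(tokenize(text))


if __name__ == "__main__":
    for sentence in preprocess("Hello, world! How are you?", lowercase=True):
        print(" ".join(sentence))
```

A naive rule like this breaks on abbreviations ("Dr."), decimals ("3.14"), and URLs, which is the kind of case that per-language tokenisation rules are meant to handle.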

Date Created 2013-03-26 (11 years ago)
Commits 1,579 (last one about a month ago)
Stargazers 65 (0 this week)
Watchers 13 (0 this week)
Forks 13
License gpl-3.0
Ranking

RepositoryStats indexes 582,612 repositories. Of these, LanguageMachines/ucto is ranked #387,387 (34th percentile) for total stargazers and #165,538 for total watchers. GitHub reports the primary language for this repository as C++; among repositories using this language, it is ranked #21,608/31,193.
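
The relation between the stargazer rank and the quoted percentile can be checked with a short calculation. The formula below is an assumption (the share of indexed repositories ranked below this one); the page does not state how RepositoryStats derives its percentile, and `percentile_from_rank` is a hypothetical helper name.

```python
def percentile_from_rank(rank, total):
    """Percentage of repositories ranked below the given rank.

    Assumed definition for illustration only; the site's own method
    is not documented on this page.
    """
    return 100.0 * (total - rank) / total


if __name__ == "__main__":
    # Figures taken from the paragraph above.
    stargazer_pct = percentile_from_rank(387_387, 582_612)
    print(f"stargazers: {stargazer_pct:.1f}th percentile")
```

Under this assumption the stargazer rank works out to roughly 33.5, which rounds to the 34th percentile quoted above.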

LanguageMachines/ucto is also tagged with popular topics. For these, it is ranked: nlp (#1,766/2,397), natural-language-processing (#1,062/1,409), language (#558/738)

Other Information

LanguageMachines/ucto has GitHub issues enabled. There are 12 open issues and 81 closed issues.

There have been 46 releases. The latest one, named v0.34, was published on 2024-09-12 (2 months ago).

Homepage URL: https://languagemachines.github.io/ucto

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time (collection started in '23)

Recent Commit History

292 commits on the default branch (master) since Jan '22

Yearly Commits

Commits to the default branch (master) per year

Issue History

Languages

The primary language is C++, but there are also others...

updated: 2024-09-25 @ 07:31am, id: 9028617 / R_kgDOAInECQ