An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Jun 12, 2024 - Python
Linguistic search for large annotated text corpora, based on Apache Lucene
Thai News Dataset from Thai government website.
📑 Galician corpus for misogyny detection
We gather Malaysian datasets! https://malaysian-dataset.readthedocs.io/
A corpus and models for the automated legal assessment of clauses in German consumer contracts.
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
KH Coder: for Quantitative Content Analysis or Text Mining
HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated according to a binary class (offensive and non-offensive) and hate speech targets (race, gender and none).
The SentiAspect-pt comprises 180 product reviews annotated according to implicit and explicit fine-grained opinions, which were hierarchically organized for aspect-based sentiment analysis and opinion summarization applications.
粵文語料篩選器 Cantonese text filter
ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions, appropriate for use in training models to reverse distortions and recover the original denoised documents.
A very simple news crawler with a funny name
EZCAT: an Easy Conversation Annotation Tool
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
BlackLab Frontend, a feature-rich corpus search interface for BlackLab.