# corpus

Here are 853 public repositories matching this topic...

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

translation tokenizer corpus linguistics tagger literature dependency-parser corpus-linguistics lemmatizer corpus-tools corpus-processing corpus-search corpus-statistics stopword corpus-analysis

Updated Jun 12, 2024
Python

esteeschwarz / SPUND-LX

linguistics essais

corpus linguistics

Updated Jun 12, 2024
HTML

INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene

Updated Jun 12, 2024
Java

PyThaiNLP / thaigov-v2-corpus

Thai News Dataset from Thai government website.

corpus thai-language corpus-data thai-nlp pythainlp

Updated Jun 12, 2024
Jupyter Notebook

luciamariaalvarezcrespo / GalMisoCorpus2023

📑 Galician corpus for misogyny detection

nlp machine-learning corpus corpus-data nlp-machine-learning misogyny galician misogyny-detection

Updated Jun 12, 2024
Python

mesolitica / malaysian-dataset

We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/

text-mining corpus malaysia bahasa-melayu manglish malay-dataset

Updated Jun 12, 2024
Jupyter Notebook

DaBr01 / AGB-DE

A corpus and models for the automated legal assessment of clauses in German consumer contracts.

natural-language-processing corpus legaltech

Updated Jun 12, 2024
Python

luciusssss / mc2_corpus

[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)

multilingual natural-language-processing corpus mongolian tibetan tibetan-nlp uyghur kazakh low-resource-languages low-resource-nlp

Updated Jun 12, 2024
Python

ko-ichi-h / khcoder

KH Coder: for Quantitative Content Analysis or Text Mining

visualization text-mining corpus content-analysis kwic

Updated Jun 12, 2024
Perl

franciellevargas / HausaHate

HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated according to a binary class (offensive and non-offensive) and hate speech targets (race, gender and none).

benchmark machine-learning natural-language-processing corpus dataset nlp-machine-learning offensive-language hate-speech low-resource-languages hausa-nlp

Updated Jun 12, 2024

franciellevargas / SentiAspect-pt

SentiAspect-pt comprises 180 product reviews annotated according to implicit and explicit fine-grained opinions, which were hierarchically organized for aspect-based sentiment analysis and opinion summarization applications.

nlp machine-learning natural-language-processing sentiment-analysis corpus opinion-mining aspect-based-sentiment-analysis portuguese-brazilian

Updated Jun 12, 2024

CanCLID / canto-filter

粵文語料篩選器 Cantonese text filter

nlp data corpus cantonese corpus-data cantonese-language

Updated Jun 11, 2024
Python

vxern / tatoeba

📜 A complete, documented API wrapper for querying and retrieving sentences from the Tatoeba corpus.

api language wrapper erlang translation corpus clean sentence tested tatoeba documented gleam

Updated Jun 11, 2024
Gleam

sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions appropriate for use in training models to reverse distortions and recover to original denoised documents.

data-science computer-vision corpus dataset binarization denoising layout-detection born-digital

Updated Jun 12, 2024
Jupyter Notebook

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated Jun 11, 2024
Python

gguibon / ezcat

EZCAT: an Easy Conversation Annotation Tool

annotation corpus conversation whatsapp

Updated Jun 11, 2024
Vue

adbar / trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Updated Jun 12, 2024
Python

KorAP / KorAP-Docker

🐋 Single Command Installation for KorAP

Updated Jun 11, 2024

INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

Updated Jun 12, 2024
TypeScript

divvun / CorpusTools

Tools to manage and convert GiellaLT corpus files

xml corpus linguistics

Updated Jun 7, 2024
Python
