MDLText: A Lightweight and Robust Text Classifier

January 1, 2017

The Intelligent Systems and Data Science Laboratory (LaSID) presents MDLText, a text classifier designed for large-scale and online text classification tasks. MDLText is based on the Minimum Description Length (MDL) principle, which allows fast incremental learning while guarding against overfitting. Both properties matter in real-world applications, where data arrives continuously and models must be updated without full retraining.
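
To make the MDL idea concrete, the sketch below assigns a document to the class whose term statistics encode it in the fewest bits, and learns by updating counters only. It is a deliberately simplified illustration and not the published MDLText algorithm: the class name `ToyMDLClassifier`, its methods, and the additive smoothing scheme are all assumptions made for this example.

```python
import math
from collections import defaultdict


class ToyMDLClassifier:
    """Simplified MDL-style classifier (illustration only, not MDLText itself)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # additive smoothing, an assumption of this sketch
        self.term_counts = defaultdict(lambda: defaultdict(int))  # class -> term -> count
        self.total_counts = defaultdict(int)  # class -> total number of terms seen
        self.vocabulary = set()

    def partial_fit(self, tokens, label):
        """Incremental update: only counters change, nothing is retrained."""
        for tok in tokens:
            self.term_counts[label][tok] += 1
            self.total_counts[label] += 1
            self.vocabulary.add(tok)

    def description_length(self, tokens, label):
        """Bits needed to encode the tokens under the class's smoothed term model."""
        vocab_size = max(len(self.vocabulary), 1)
        denom = self.total_counts[label] + self.smoothing * vocab_size
        bits = 0.0
        for tok in tokens:
            count = self.term_counts[label].get(tok, 0)
            bits += -math.log2((count + self.smoothing) / denom)
        return bits

    def predict(self, tokens):
        """Return the class that yields the shortest description of the document."""
        classes = list(self.total_counts)
        return min(classes, key=lambda c: self.description_length(tokens, c))


clf = ToyMDLClassifier()
clf.partial_fit(["win", "free", "prize"], "spam")
clf.partial_fit(["free", "money"], "spam")
clf.partial_fit(["meeting", "at", "noon"], "ham")
clf.partial_fit(["lunch", "at", "noon"], "ham")
print(clf.predict(["free", "prize"]))
```

Because training only increments counters, new labeled documents can be folded in one at a time, which is the property that makes this family of methods suitable for online settings.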

Key features

Efficiency: MDLText is lightweight and scalable, so it remains practical on large datasets and in resource-constrained environments.
Versatility: MDLText accepts raw text as well as preprocessed input, including the LIBSVM format, and offers configurable preprocessing options such as text normalization, stop-word removal, and tokenizer selection.
Ease of Use: Models can be trained incrementally or in batch mode, and text is classified through a command-line interface, which makes the tool accessible to both researchers and developers.
Applications: MDLText suits domains that require accurate and fast text classification, from spam detection to sentiment analysis.
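
For readers unfamiliar with the two input styles mentioned above, the sketch below shows what each might look like in practice. It does not reproduce MDLText's own code or options: the function names `preprocess` and `parse_libsvm_line`, the tiny stop-word list, and the regular-expression tokenizer are hypothetical choices for illustration. The only external fact relied on is the standard LIBSVM line layout, `<label> <index>:<value> ...`.

```python
import re

# Tiny illustrative stop-word list (an assumption, not MDLText's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "of"}


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize to lowercase, tokenize on alphanumeric runs, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, {feature_index: value})."""
    parts = line.split()
    label = parts[0]  # kept as a string so no label encoding is assumed
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Raw text goes through preprocessing before it reaches a classifier.
print(preprocess("The prize is FREE, claim it to-day!"))

# Preprocessed input already carries numeric features.
print(parse_libsvm_line("1 3:0.5 10:2"))
```

Raw text leaves tokenization and normalization to the tool, while LIBSVM input assumes features were extracted beforehand, which is useful when the same feature vectors are shared across several classifiers.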

Publications

ML-MDLText: an efficient and lightweight multilabel text classifier with incremental learning
M.M. BITTENCOURT, R.M. SILVA, T.A. ALMEIDA
Applied Soft Computing, Elsevier, Volume 96, 1-15, 2020 [pdf]

ML-MDLText: A Multilabel Text Categorization Technique with Incremental Learning
M.M. BITTENCOURT, R.M. SILVA, T.A. ALMEIDA
Proceedings of the 8th Brazilian Conference on Intelligent Systems (BRACIS’19), 580-585, Salvador, Brazil, October, 2019 [pdf]

MDLText aplicado na Filtragem Automática de SPIM e SMS Spam
R.M. SILVA, T.A. ALMEIDA, A. YAMAKAMI
iSys – Revista Brasileira de Sistemas de Informação, 11(1), 1-30, 2018 [pdf]

MDLText: An Efficient and Lightweight Text Classifier
R.M. SILVA, T.A. ALMEIDA, A. YAMAKAMI
Knowledge-Based Systems, Elsevier, Volume 118, 152-164, 2017 [pdf]

MDLText e Indexação Semântica aplicados na Detecção de Spam nos Comentários do YouTube
R.M. SILVA, T.C. ALBERTO, T.A. ALMEIDA
iSys – Revista Brasileira de Sistemas de Informação, 10(3), 49-73, 2017 [pdf]

Code

The source code and documentation are publicly available in the MDLText repository.

Federal University of São Carlos (UFSCar), Sorocaba campus, João Leme dos Santos (SP-264) Highway, Km 110, Sorocaba – SP, 18052-780.