MDLText: A Lightweight and Robust Text Classifier

The Intelligent Systems and Data Science Laboratory (LaSID) developed MDLText, a text classifier for large-scale and online text classification tasks. MDLText is based on the Minimum Description Length (MDL) principle, which supports fast incremental learning and helps prevent overfitting, two properties that matter in real-world applications.
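
To make the MDL idea concrete, here is a minimal sketch of MDL-style classification with incremental updates. It is not the MDLText algorithm described in the publications; the class name `ToyMDLClassifier`, the Laplace-smoothed code length, and the whitespace tokenizer are assumptions made purely for illustration. A document is assigned to the class whose term statistics encode it in the fewest bits, and learning only updates counters, which is why incremental training can be cheap.

```python
import math
from collections import Counter, defaultdict


class ToyMDLClassifier:
    """Illustrative MDL-style classifier (hypothetical, not the MDLText implementation)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # class -> term -> count
        self.total_terms = Counter()             # class -> total number of terms seen
        self.vocabulary = set()                  # all terms seen in any class

    def partial_fit(self, text, label):
        """Incremental update: only counters change, no retraining pass."""
        tokens = text.lower().split()
        self.term_counts[label].update(tokens)
        self.total_terms[label] += len(tokens)
        self.vocabulary.update(tokens)

    def description_length(self, text, label):
        """Bits needed to encode the text under the class's smoothed term distribution."""
        vocab_size = len(self.vocabulary) + 1  # +1 reserves mass for unseen terms
        bits = 0.0
        for token in text.lower().split():
            count = self.term_counts[label][token]
            p = (count + 1) / (self.total_terms[label] + vocab_size)
            bits += -math.log2(p)
        return bits

    def predict(self, text):
        """Pick the class that yields the shortest description."""
        return min(self.term_counts,
                   key=lambda label: self.description_length(text, label))


clf = ToyMDLClassifier()
clf.partial_fit("win a free prize now", "spam")
clf.partial_fit("are we still meeting for lunch", "ham")
clf.partial_fit("claim your free prize", "spam")  # a later example, added without retraining
print(clf.predict("free prize inside"))
```

In this toy setup the message is labeled as spam because its terms are cheaper to encode with the spam counts than with the ham counts. A real classifier would also need a proper tokenizer and a more careful weighting of terms.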

Key features

  • Efficiency: MDLText is lightweight and scalable, so it remains practical on large datasets and in resource-constrained environments.
  • Versatility: Accepts raw text as well as preprocessed input, including the LIBSVM format, and offers configurable preprocessing options such as text normalization, stop-word removal, and tokenizer selection.
  • Ease of Use: Users can train models incrementally or in batch mode and classify text using a straightforward command-line interface, making it accessible to both researchers and developers.
  • Applications: MDLText suits domains that require fast and accurate text classification, such as spam detection and sentiment analysis.
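
The two input styles listed under Versatility can be illustrated with a short sketch. The LIBSVM line layout (`label index:value ...`) is the standard sparse format; everything else here, including the function names `parse_libsvm_line` and `preprocess`, the tiny stop-word set, and the regex tokenizer, is a hypothetical stand-in and does not reflect MDLText's actual options or command-line interface.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "to", "of", "and"}  # illustrative subset only


def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    line = line.split("#", 1)[0].strip()  # drop an optional trailing comment
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def preprocess(text, remove_stop_words=True):
    """Normalize raw text, tokenize on word characters, optionally drop stop words."""
    # One possible normalization: strip accents, then lowercase.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    tokens = re.findall(r"\w+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(parse_libsvm_line("1 3:0.5 10:2 42:1"))
print(preprocess("Claim the FREE prize, José!"))
```

The first call yields a label plus a sparse feature dictionary; the second yields a list of normalized tokens with the stop word removed. For the options the tool really exposes, consult the documentation in the MDLText repository.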

Publications

  • ML-MDLText: an efficient and lightweight multilabel text classifier with incremental learning
    M.M. BITTENCOURT, R.M. SILVA, T.A. ALMEIDA
    Applied Soft Computing, Elsevier, Volume 96, 1-15, 2020 [pdf]
  • ML-MDLText: A Multilabel Text Categorization Technique with Incremental Learning
    M.M. BITTENCOURT, R.M. SILVA, T.A. ALMEIDA
    Proceedings of the 8th Brazilian Conference on Intelligent Systems (BRACIS’19), 580-585, Salvador, Brazil, October, 2019 [pdf]
  • MDLText aplicado na Filtragem Automática de SPIM e SMS Spam
    R.M. SILVA, T.A. ALMEIDA, A. YAMAKAMI
    iSys – Revista Brasileira de Sistemas de Informação, 11(1), 1-30, 2018 [pdf]
  • MDLText: An Efficient and Lightweight Text Classifier
    R.M. SILVA, T.A. ALMEIDA, A. YAMAKAMI
    Knowledge-Based Systems, Elsevier, Volume 118, 152-164, 2017 [pdf]
  • MDLText e Indexação Semântica aplicados na Detecção de Spam nos Comentários do YouTube
    R.M. SILVA, T.C. ALBERTO, T.A. ALMEIDA
    iSys – Revista Brasileira de Sistemas de Informação, 10(3), 49-73, 2017 [pdf]

Code

The source code and documentation are publicly available in the MDLText repository.

Federal University of São Carlos (UFSCar), Sorocaba campus, João Leme dos Santos (SP-264) Highway, Km 110, Sorocaba – SP, 18052-780.
