WE4LKD: Word Embeddings For Latent Knowledge Discovery

WE4LKD (Word Embeddings for Latent Knowledge Discovery) is a project under the ML4LKD (Machine Learning for Latent Knowledge Discovery) initiative, which applies artificial intelligence to accelerate the discovery of medical knowledge. WE4LKD focused on developing machine learning models that extract latent insights from scientific articles on Acute Myeloid Leukemia (AML).

One of the prominent projects related to this initiative proposed training word vectors (dense vector representations of text) on a specialized corpus of medical articles. Inspired by previous work such as PubMedBERT and by studies that revealed latent chemical relationships in materials science, ML4LKD aimed to analyze whether latent knowledge in texts could help identify new diagnoses, prognoses, and treatments for AML.
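A common way to probe word vectors for latent relationships is to rank candidate terms by their cosine similarity to a target term. The sketch below illustrates that general idea only; it is not necessarily the procedure used in the project. The three-dimensional vectors, the term names (`aml`, `compound_a`, and so on), and the helper `rank_candidates` are all hypothetical and chosen for illustration. In practice the vectors would come from a model trained on the medical corpus.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_candidates(target, candidates, vectors):
    """Sort candidate terms by similarity to the target term, highest first.

    Hypothetical helper: candidates missing from `vectors` are skipped.
    """
    scored = [
        (term, cosine_similarity(vectors[target], vectors[term]))
        for term in candidates
        if term in vectors
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made-up values, not real model output).
toy_vectors = {
    "aml": [0.9, 0.1, 0.3],
    "compound_a": [0.8, 0.2, 0.4],
    "compound_b": [0.1, 0.9, 0.2],
    "compound_c": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(
    "aml", ["compound_a", "compound_b", "compound_c"], toy_vectors
)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

With these toy values, `compound_a` ranks first because its vector points in nearly the same direction as the `aml` vector. On real embeddings, a candidate that ranks highly without ever being explicitly linked to the target in the literature is the kind of signal a latent knowledge study would investigate further.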

Key features

  • Undergraduate and Master’s Contributions:
    The project started as an undergraduate initiative (FAPESP process: 21/13054-8) and evolved into a fully-funded international research internship (FAPESP process: 22/07236-9).
  • Transformative Results:
    The research showed the potential of transformers and bidirectional architectures to process domain-specific scientific texts effectively.
  • Collaborative Efforts:
    The project was led by Prof. Dr. Tiago A. de Almeida (UFSCar) and supervised by Prof. Dr. Carolina Evaristo Scarton (University of Sheffield), with active participation from Matheus Vargas Volpon Berto. Other researchers such as Breno Lima de Freitas and Prof. Dr. João A. Machado-Neto also participated.

The project explored natural language processing, artificial intelligence, and machine learning techniques to address critical challenges in AML prognostics. By focusing on latent knowledge discovery, WE4LKD demonstrated how AI could transform healthcare, offering tools for more accurate prognostic models to support personalized treatment decisions.

Publications

  • Accelerating Discoveries in Medicine using Distributed Vector Representations of Words
    M.V.V. BERTO, B.L. FREITAS, C. SCARTON, J.A. MACHADO-NETO, T.A. ALMEIDA
    Expert Systems with Applications, 2024 [pdf, preprint, data]
  • Impulsionando a descoberta de tratamentos na medicina através da representação distribuída de palavras
    M.V.V. BERTO, T.A. ALMEIDA
    IV Scientific Initiation Work Competition (CTIC’24). Proceedings of the 30th Brazilian Symposium on Multimedia and Web Systems (WebMedia’24), Juiz de Fora, Brazil, October, 2024 [pdf]

Code

The ML4LKD project reflects the power of interdisciplinary collaboration and the potential of AI to solve complex, real-world problems. You can check out the announcement of the paper published from this research on our news page and the code on our GitHub repository.

Federal University of São Carlos (UFSCar), Sorocaba campus, João Leme dos Santos (SP-264) Highway, Km 110, Sorocaba – SP, 18052-780.
