A Roadmap to Green(er) NLP for Topic Classification with Statistical Hypothesis Testing and Machine Learning
Date:
Topic classification is a core task in natural language processing (NLP) that assigns predefined categories to text based on its content. These algorithms quietly power systems that help people navigate, organize, and make sense of large volumes of textual data. Recent advances in topic classification have been driven by large language models (LLMs), yet their growing energy demands raise pressing concerns about environmental sustainability. This talk gives an overview of my current and planned research on topic classification within the emerging Green NLP paradigm. My hunch is that traditional machine learning classifiers, when furnished with textual features derived from the fresh perspective of statistical hypothesis testing, can approach LLM-level accuracy with a significantly smaller carbon footprint. I expect this to hold especially for long-form text characteristic of legal, medical, and insurance documents, since LLM energy costs scale sharply with input length. In the talk, I will paint in broad brushstrokes the three key components of my master plan: (1) designing features based on statistical hypothesis tests, (2) integrating them into traditional lightweight machine learning classifiers, and (3) conducting comparative studies on long-form text that evaluate the trade-offs between accuracy and carbon footprint. Making topic classification green(er) is intended to help individuals and institutions meet their information needs in a more environmentally sustainable manner.
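To make component (1) concrete, here is a minimal sketch of one possible hypothesis-test-based feature: a Pearson chi-squared test of independence between a term's presence and a document's topic. This is an illustrative assumption, not necessarily the test used in the research described above. The function names `chi2_term_topic` and `select_terms`, the toy corpus, and the choice of the chi-squared test itself are all hypothetical; the threshold 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
# Illustrative sketch only: the chi-squared test stands in for whatever
# hypothesis tests the actual research uses (an assumption, not the source's method).

CHI2_CRITICAL_05_DF1 = 3.841  # chi-squared critical value, alpha = 0.05, 1 degree of freedom


def chi2_term_topic(docs, labels, term, topic):
    """Pearson chi-squared statistic for the 2x2 table: term presence vs. topic membership."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_topic = label == topic
        if has_term and in_topic:
            a += 1  # term present, document in topic
        elif has_term:
            b += 1  # term present, document in another topic
        elif in_topic:
            c += 1  # term absent, document in topic
        else:
            d += 1  # term absent, document in another topic
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A margin is empty (e.g. the term occurs in every document):
        # the table carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_terms(docs, labels, topic, threshold=CHI2_CRITICAL_05_DF1):
    """Keep the terms whose association with `topic` exceeds the threshold."""
    vocabulary = set().union(*docs)
    return sorted(
        term for term in vocabulary
        if chi2_term_topic(docs, labels, term, topic) > threshold
    )


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"the", "court", "ruling"},
    {"the", "court", "contract"},
    {"the", "patient", "dose"},
    {"the", "patient", "trial"},
]
labels = ["legal", "legal", "medical", "medical"]

selected = select_terms(docs, labels, "legal")
```

The selected terms could then serve as input features for a lightweight classifier, as in component (2). With counts this small the chi-squared approximation is unreliable, so the toy corpus only illustrates the mechanics; a realistic study would need far more documents or an exact test.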
Presentation slides are available here.
