Posts by Collection

portfolio

Portfolio item number 1

This is an item in your portfolio. It can have images or nice text. If you name the file .md, it will be parsed as markdown. If you name the file .html, it will be parsed as HTML.

publications

talks

The Literary Theme Ontology for Media Annotation and Information Retrieval

Published:

Literary theme identification and interpretation is a focal point of literary studies scholarship. Classical forms of literary scholarship, such as close reading, have flourished with scarcely any need for commonly defined literary themes. However, the rise in popularity of collaborative and algorithmic analyses of literary themes in works of fiction, together with a requirement for computational searching and indexing facilities for large corpora, creates the need for a collection of shared literary themes to ensure common terminology and definitions. To address this need, we here introduce a first draft of the Literary Theme Ontology. Inspired by a traditional framing from literary theory, the ontology comprises literary themes drawn from the authors' own analyses, reference books, and online sources.

Sheridan, P., Onsjö, M., Hastings, J. (2019). “The Literary Theme Ontology for Media Annotation and Information Retrieval”. Proceedings of JOWO2019: The Joint Ontology Workshops, Graz, Austria, September 23-25. Full text available at Proceedings of the Joint Ontology Workshops 2019 Episode V: The Styrian Autumn of Ontology.

Presentation slides are available here.

A Fisher’s Exact Test Interpretation of the TF-IDF Term-weighting Scheme

Published:

Term frequency–inverse document frequency, or TF–IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF–IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF–IDF on a sound theoretical foundation. In this talk, I build on that tradition and motivate TF–IDF for the statistics community by deriving the famed expression from a significance testing perspective. I will sketch out how TF–IDF is, under some admittedly restrictive conditions, asymptotically equal to the negative logarithm of a one-tailed Fisher’s exact (significance) test p-value. The Fisher’s exact test interpretation of TF–IDF equips the working statistician with a justification for TF–IDF’s use together with a ready explanation of its long-established effectiveness.
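To make the comparison concrete, here is a minimal Python sketch that computes both quantities for one term in a toy corpus. The 2×2 table used here (term versus non-term tokens, inside versus outside the document) is an assumption chosen for illustration and is not necessarily the setup used in the talk; the corpus and the function names `tf_idf` and `fisher_upper_tail` are likewise hypothetical. The sketch only places the two numbers side by side and makes no claim about the conditions under which they converge.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Raw term frequency times the log inverse document frequency."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def fisher_upper_tail(k, n, K, N):
    """One-tailed Fisher's exact p-value, P(X >= k), X hypergeometric.

    Assumed table (illustrative only):
      N = tokens in the corpus, K = occurrences of the term in the corpus,
      n = tokens in the document, k = occurrences of the term in the document.
    """
    upper = min(n, K)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1))
    return tail / math.comb(N, n)


# Toy corpus: each document is a list of tokens.
corpus = [
    "whale whale whale sea ship".split(),
    "sea ship captain storm".split(),
    "captain storm ship sea".split(),
    "storm sea ship harbor".split(),
]
term, doc = "whale", corpus[0]

N = sum(len(d) for d in corpus)            # total tokens in the corpus
K = sum(d.count(term) for d in corpus)     # term occurrences in the corpus
n, k = len(doc), doc.count(term)           # document length and term count

weight = tf_idf(term, doc, corpus)
p_value = fisher_upper_tail(k, n, K, N)
print(f"TF-IDF = {weight:.4f}, -log(p) = {-math.log(p_value):.4f}")
```

The hypergeometric tail is summed directly with `math.comb`, which keeps the example dependency-free; for realistic corpus sizes a statistics library would be the more practical choice.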

Presentation slides are available here.

A Roadmap to Green(er) NLP for Topic Classification with Statistical Hypothesis Testing and Machine Learning

Published:

Topic classification is a core task in natural language processing (NLP) that assigns predefined categories to text based on its content. These algorithms quietly power systems that help people navigate, organize, and make sense of large volumes of textual data. Recent advances in topic classification have been driven by large language models (LLMs), yet their growing energy demands raise pressing concerns about environmental sustainability. This talk gives an overview of my current and planned research on topic classification in the emerging Green NLP paradigm. My hunch is that traditional machine learning classifiers, when furnished with textual features derived from the fresh perspective of statistical hypothesis testing, can approach LLM-level accuracy with a significantly smaller carbon footprint. I expect this to hold particularly for the long-form text characteristic of legal, medical, and insurance documents, as LLM energy costs scale sharply with input length. In the talk, I will paint in broad brushstrokes three key components of my master plan: (1) designing statistical-hypothesis-test-based features, (2) integrating them into traditional lightweight machine learning classifiers, and (3) conducting comparative studies on long-form text that evaluate trade-offs between accuracy and carbon footprint. Making topic classification green(er) is intended to help individuals and institutions meet their information needs in a more environmentally sustainable manner.
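Components (1) and (2) above can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the feature is the negative log of a one-tailed Fisher's exact p-value testing whether a term is over-represented in a class, the "classifier" is a bare linear scorer standing in for a traditional lightweight model, and the training data and the names `upper_tail`, `class_term_scores`, and `classify` are hypothetical. It is not the method presented in the talk.

```python
import math
from collections import Counter


def upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric X (one-tailed Fisher's exact test)."""
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / math.comb(N, n)


def class_term_scores(labeled_docs):
    """For each label, map term -> -log p, where p tests whether the term
    is over-represented among that label's tokens (assumed feature)."""
    per_class = {}
    for label, tokens in labeled_docs:
        per_class.setdefault(label, Counter()).update(tokens)
    overall = Counter()
    for counts in per_class.values():
        overall.update(counts)
    N = sum(overall.values())              # tokens in the training set
    scores = {}
    for label, counts in per_class.items():
        n = sum(counts.values())           # tokens carrying this label
        scores[label] = {
            term: -math.log(upper_tail(k, n, overall[term], N))
            for term, k in counts.items()
        }
    return scores


def classify(tokens, scores):
    """Return the label with the largest summed score; unseen terms add 0."""
    return max(sorted(scores),
               key=lambda label: sum(scores[label].get(t, 0.0) for t in tokens))


# Hypothetical training data: (label, tokens) pairs.
train = [
    ("legal", "court ruling appeal judge court".split()),
    ("legal", "judge contract clause court".split()),
    ("medical", "patient dose trial patient".split()),
    ("medical", "trial dose symptom patient clinic".split()),
]
scores = class_term_scores(train)
print(classify("the judge reviewed the contract".split(), scores))
```

In a fuller pipeline the same per-term scores could be supplied as feature weights to a standard classifier such as logistic regression or a linear SVM; the summed-score rule is used here only to keep the sketch free of dependencies.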

Presentation slides are available here.

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.