A Fisher’s Exact Test Interpretation of the TF-IDF Term-weighting Scheme
Talk, Dalhousie University, Halifax, Canada
Term frequency–inverse document frequency, or TF–IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic for quantifying how strongly a given term’s occurrences are concentrated in one document out of many, TF–IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF–IDF on a sound theoretical foundation. In this talk, I build on that tradition and motivate TF–IDF for the statistics community by deriving the famed expression from a significance-testing perspective. I will sketch out how TF–IDF is, under some admittedly restrictive conditions, asymptotically equal to the negative logarithm of the p-value of a one-tailed Fisher’s exact test. The Fisher’s exact test interpretation of TF–IDF equips the working statistician with a justification for TF–IDF’s use, together with a ready explanation of its long-established effectiveness. Presentation slides are available here.
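To make the two quantities in the abstract concrete, the following minimal Python sketch computes a classic TF–IDF weight alongside the negative logarithm of a one-tailed Fisher's exact test p-value. The abstract does not specify the contingency table or the restrictive conditions, so the setup here is an assumption made purely for illustration: every document has the same length `doc_len`, and the term occurs exactly `tf` times in each of the `df` documents that contain it. The function names and parameters are hypothetical, and the sketch is not a reproduction of the talk's derivation.

```python
import math


def tf_idf(tf, df, n_docs):
    """Classic TF-IDF weight tf * log(N / df), using the natural logarithm."""
    return tf * math.log(n_docs / df)


def fisher_right_tail(k, draws, successes, population):
    """Right-tailed Fisher's exact p-value, P(X >= k), for hypergeometric X.

    X counts how many of `draws` tokens sampled without replacement from
    `population` tokens are among the `successes` occurrences of the term.
    """
    upper = min(draws, successes)
    total = math.comb(population, draws)
    tail = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def neg_log_fisher_p(tf, df, n_docs, doc_len):
    """-log p-value under the ASSUMED setup (not taken from the talk):
    equal-length documents, and the term occurring tf times in each of
    the df documents that contain it."""
    population = n_docs * doc_len   # total tokens in the collection
    successes = df * tf             # total occurrences of the term
    p = fisher_right_tail(tf, doc_len, successes, population)
    return -math.log(p)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not from the source.
    tf, df, n_docs, doc_len = 5, 10, 1000, 100
    print("TF-IDF:          ", tf_idf(tf, df, n_docs))
    print("-log(p), Fisher: ", neg_log_fisher_p(tf, df, n_docs, doc_len))
```

Running the script prints both values side by side so they can be compared for different choices of `tf`, `df`, `n_docs`, and `doc_len`. How closely they agree depends on the assumed table and on the parameter regime; the abstract claims only asymptotic equality under restrictive conditions. The tail sum uses exact integer arithmetic from the standard library; a statistics package such as SciPy (`scipy.stats.fisher_exact` with `alternative="greater"`) could be used instead for the same kind of one-tailed p-value.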