Table of Contents

Overview

This site is about evaluation in information retrieval using the 20 newsgroups test corpus. You can find the data set(s) and more information at Jason Rennie's site. This site provides

Motivation

In recent years similarity search has become more prominent in the context of deep learning, starting with Salakhutdinov and Hinton's 2007 paper on semantic hashing. That paper also strikingly demonstrates the subtle intricacies and pitfalls of evaluation: it reports the precision at 100% recall for the RCV1-v2 corpus to be about 3%, whereas Karol Grzegorczyk, in his 2019 thesis Vector representations of text data in deep learning, reports about 12%. A lower bound for precision at 100% recall arises when we have to retrieve all documents from the corpus to achieve 100% recall. This lower bound can be easily calculated (number of relevant documents divided by corpus size, averaged over all queries) and is independent of the applied similarity search algorithm:

1/N * sum_(i=1)^N rel(q_i)/N

where N is the size of the test corpus (which equals the number of similarity queries) and rel(q_i) is the number of relevant documents for the i-th similarity query. My own calculations show that this lower bound is considerably higher than 3%, consistent with Karol Grzegorczyk's results. I am not documenting my RCV1-v2 evaluations here, because
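Under the assumption that all documents from the query's newsgroup count as relevant, the lower bound can be sketched from the group sizes alone. The following sketch uses the group sizes of the 20 newsgroups bydate version listed further below; for RCV1-v2 the same computation would apply with its topic sizes.

```python
# Sketch: lower bound for precision at 100% recall when every query
# forces retrieval of the entire corpus.
# Assumption: relevance = membership in the same newsgroup, with the
# query document itself counted among the relevant documents.
sizes = [799, 973, 985, 982, 963, 988, 975, 990, 996, 994,
         999, 991, 984, 990, 987, 997, 910, 940, 775, 628]
N = sum(sizes)  # 18846 documents = number of similarity queries

# Every document in a group of size n has rel(q_i) = n, so the
# average (1/N) * sum_i rel(q_i)/N collapses to sum(n^2) / N^2.
lower_bound = sum(n * n / N for n in sizes) / N
print(N, round(lower_bound, 4))  # 18846, ~0.0505 (about 5%)
```

The bound depends only on how evenly the categories are sized; for 20 equally sized groups it would be exactly 1/20 = 5%.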

Having said the above, my motivation for this site (20newsgroups.neurolab.de) is twofold.

Example

When you look at the query results for document 1528211 "PGP 2.0 vs. 2.2" you will notice that the first retrieved document, Intel memory board for sale (doc id 7473806), has semantically little in common with the query document 1528211. Why is that so?

Query results for document 1528211.

By looking at the respective messages we see that the similarity is driven largely by shared boilerplate, in particular the header lines, rather than by the actual message content.

This unsurprisingly leads us to the strong hypothesis that excluding boilerplate in the messages from indexing would substantially improve retrieval performance.

This hypothesis can also be confirmed experimentally. I created a second corpus in which I removed the header lines from each news message and reran the evaluation test. We observe a huge increase in precision and recall, as can be seen in the results shown below for the single example document from above and also at corpus level.
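The preprocessing step can be sketched minimally as follows, assuming the usual RFC 822-style message layout where the header block is separated from the body by the first blank line (the example message is made up):

```python
# Sketch: strip RFC 822-style header lines from a news message before
# indexing. Assumption: headers end at the first blank line, as in the
# 20 newsgroups message files.
def strip_headers(message: str) -> str:
    header, sep, body = message.partition("\n\n")
    return body if sep else message  # no blank line found: keep text as-is

msg = "From: someone@example.com\nSubject: PGP 2.0 vs 2.2\n\nBody text here."
print(strip_headers(msg))  # -> "Body text here."
```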

1528211-query-metrics.png
Evaluation results for 1528211 "PGP 2.0 vs 2.2" containing the header lines.
1528211-query-mtrics-nohead.png
Evaluation results for 1528211 "PGP 2.0 vs 2.2" after header lines have been removed.
plot_evaluate_20news.png
Evaluation results for 20newsgroups corpus containing the header lines.
plot_evaluate_20news_nohead.png
Evaluation results for 20newsgroups corpus after the header lines have been removed.
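For reference, the metric shown in these plots can be sketched for a single query like this (names are illustrative: `ranking` is assumed to be the full corpus ordered by decreasing similarity to the query, and `relevant` the set of documents from the query's newsgroup):

```python
# Sketch: precision at 100% recall for one ranked result list.
def precision_at_full_recall(ranking, relevant):
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found == len(relevant):      # 100% recall reached here
                return len(relevant) / rank  # precision at this cutoff
    return 0.0  # not all relevant documents were retrieved

print(precision_at_full_recall(["a", "b", "c", "d"], {"a", "c"}))  # -> 2/3
```

When the last relevant document only appears at the end of the ranking, the cutoff is the whole corpus and the value degenerates to the lower bound discussed in the Motivation section.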

Description of Dataset

The 20 newsgroups test corpus, collected by Ken Lang, is commonly used for evaluating text classification and similarity search tasks.

It consists of about 1000 articles from each of 20 Usenet newsgroups, posted in April 1993. For evaluation purposes the newsgroup name can be regarded as the category of a document. Here we concentrate on the bydate version, which consists of 18846 documents (news messages) and, compared to the original version, does not include cross-posts (duplicates) or newsgroup-identifying headers (Xref, Newsgroups, Path, Followup-To, Date); see Jason Rennie's 20 Newsgroups page.
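Counting the documents per newsgroup can be sketched as follows, assuming the archive has been unpacked so that each newsgroup is a directory containing one file per message (the local path is hypothetical):

```python
# Sketch: tally documents per newsgroup from an unpacked 20 newsgroups
# archive. Assumption: one top-level directory per newsgroup, one file
# per message.
from collections import Counter
from pathlib import Path

def group_sizes(root: str) -> Counter:
    counts = Counter()
    for group_dir in sorted(Path(root).iterdir()):
        if group_dir.is_dir():
            counts[group_dir.name] = sum(1 for f in group_dir.iterdir()
                                         if f.is_file())
    return counts

# Hypothetical usage, path depends on where the archive was unpacked:
# print(group_sizes("20news-bydate-train"))
```

Summing the counts over all 20 groups should reproduce the corpus size of 18846 documents and the per-group sizes in the table below.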

In contrast to the other widely used evaluation corpora, Reuters-21578 and, more recently, RCV1-v2 (see David Lewis' page for more information), the 20 Newsgroups test corpus is not protected by an NDA (non-disclosure agreement).

ID Group Size
0 alt.atheism 799
1 comp.graphics 973
2 comp.os.ms-windows.misc 985
3 comp.sys.ibm.pc.hardware 982
4 comp.sys.mac.hardware 963
5 comp.windows.x 988
6 misc.forsale 975
7 rec.autos 990
8 rec.motorcycles 996
9 rec.sport.baseball 994
10 rec.sport.hockey 999
11 sci.crypt 991
12 sci.electronics 984
13 sci.med 990
14 sci.space 987
15 soc.religion.christian 997
16 talk.politics.guns 910
17 talk.politics.mideast 940
18 talk.politics.misc 775
19 talk.religion.misc 628

Resources

These are the resources for the 20 Newsgroups data sets that I know of: