Advertisements
For the
labeled set of examples, we downloaded the classified
advertisements from one day in January 2001 from the
Courier Post at
http://www.southjerseyclassifieds.com.
The Courier Post online
advertisements are divided into 9 main categories:
Employment, Real
Estate for
Transportation, Pets, Employment, and Business
simply downloaded advertisements from the same paper,
from one day a
month later, taking approximately 1000 (25%) of the
examples for our
test set.
The background knowledge for the problem came from another online
newspaper -- The Daily Record
(http://classifieds.dailyrecord.com).
The Daily Record advertisements online are divided into 8 categories:
Announcements, Business and Service, Employment, Financial,
Instructions,
the union of the articles from each one of these
categories as a
separate piece of background knowledge.
ASRS
The Aviation Safety Reporting System (http://asrs.arc.nasa.gov/) is a
combined effort of the Federal Aviation Administration
(FAA) and
the National Aeronautics and Space Administration
(NASA). The purpose
of the system is to provide a forum by which airline
workers can
anonymously report any problems that arise pertaining
to flights and
aircraft. Analysts read, classify, and diagnose the
incident reports.
The analysts identify any emergencies,
detect situations that compromise safety, and provide
actions to
be taken. These
reports are then placed into a database for further
research on safety and other issues. We obtained the data from
http://nasdac.faa.gov/asp/ and our database contains the
incident reports from January 1990 through March 1999.
Since we are interested in text categorization tasks, there are two
parts of each incident report that we deal with, the narrative and
the synopsis. The narrative
is a long description of the
incident and the synopsis
is a much shorter
summary of the incident. It is interesting to note
that many of the words in these reports are abbreviated,
which makes the text classification
task even harder.
There are many different categorization problems that can be taken from
this data set.
A feature that is associated with each incident is the
consequence of the incident that the analyst adds to
the report. This
can take on the values: aircraft damaged, emotional
trauma, FAA
investigatory follow-up, FAA assigned or threatened
penalties, flight
control/aircraft review, injury, none, and other. If more than one
consequence was present we removed that incident from
our
training and test data. We also removed from the training and test
data all those incidents that had categories of none and
other. This
then became a six-class classification problem.
We chose the training and test sets to consist of the synopsis
part of each incident.
The test set consists of data from the year
1999 and the
training set consists of all data from
1997 and 1998.
For the
background knowledge, we chose all narratives
from 1990-1996.
In this
case we did not remove any examples; thus the background
knowledge contains those reports whose categories were other
and
none as well as the six that are found
in our training and test set.
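The selection rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program we used: the field names (`year`, `synopsis`, `narrative`, `consequences`) are hypothetical, and skipping incidents with no recorded consequence is an assumption, since the text only mentions incidents with more than one.

```python
from typing import Dict, List, Optional, Tuple

# Categories removed from the training and test data (but not from the background).
EXCLUDED = {"none", "other"}


def usable_label(consequences: List[str]) -> Optional[str]:
    """Return the single consequence label, or None if the incident is dropped."""
    if len(consequences) != 1:
        # More than one consequence is removed, as described in the text.
        # Assumption: incidents with no consequence at all are skipped too.
        return None
    label = consequences[0]
    return None if label in EXCLUDED else label


def build_asrs_sets(
    reports: List[Dict],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]], List[str]]:
    """Split reports into (train, test, background) by year."""
    train, test, background = [], [], []
    for report in reports:
        year = report["year"]
        if 1990 <= year <= 1996:
            # Background knowledge: narratives, kept regardless of category.
            background.append(report["narrative"])
            continue
        label = usable_label(report["consequences"])
        if label is None:
            continue
        if year in (1997, 1998):
            train.append((report["synopsis"], label))
        elif year == 1999:
            test.append((report["synopsis"], label))
    return train, test, background
```

Note that the category filter is applied only to the 1997-1999 synopses; the 1990-1996 narratives pass through unfiltered and unlabeled.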
NetVet
(can also be downloaded from: http://www.cs.cmu.edu/~wcohen/)
The NetVet site
(http://www.netvet.wustl.edu) includes the Web page headings for the
pages concerning cows, horses, cats, dogs, rodents,
birds and
primates. The
text categorization task is to place a web page title
into the appropriate class. For example, a training
example in the
class birds might have been: ``Wild Bird Center of
Walnut Creek''.
Each of these titles had a URL that linked the title to its associated
Web page. For
the training/test corpus, we randomly chose half of
these titles with their labels.
We
discarded the other half of the titles, with
their labels, and simply kept the URL to the
associated Web page. We
used these URLs to download the first 100 words from
each of these
pages, to be placed into a corpus for background
knowledge. Those
URLs that were not reachable were ignored by the program that created
the background knowledge.
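The split and the truncation step can be sketched as follows. This is a minimal illustration under stated assumptions: the `(title, label, url)` tuple layout, the function names, and the fixed random seed are hypothetical, and the actual download of each page is left out (the page text is assumed to be already fetched, with unreachable URLs simply skipped).

```python
import random
from typing import List, Tuple


def first_words(text: str, limit: int = 100) -> str:
    """Keep only the first `limit` whitespace-separated words of a page."""
    return " ".join(text.split()[:limit])


def split_titles(
    examples: List[Tuple[str, str, str]], seed: int = 0
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Randomly split (title, label, url) examples in half.

    Returns the labeled half as (title, label) pairs for training/testing,
    and only the URLs of the other half for building background knowledge.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    labeled = [(title, label) for title, label, _ in shuffled[:half]]
    background_urls = [url for _, _, url in shuffled[half:]]
    return labeled, background_urls
```

Because the titles and labels of the second half are discarded, the background corpus built from those URLs carries no class information.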
Business Names
(can also be downloaded from: http://www.cs.cmu.edu/~wcohen/)
Another
data set consisted of a training set of company names,
taken from the Hoover
Web site (http://www.hoovers.com) labeled with
one of 124 industry names. We created background knowledge from an
entirely different Web site, http://biz.yahoo.com.
We
downloaded the Web pages under each business
category in the Yahoo! business hierarchy to create
101 pieces
of background knowledge. The Yahoo! hierarchy had a different
number of classes and a different way of dividing the companies,
but
this was irrelevant to our purposes since we treated
it solely as a
source of unlabeled background text. Each piece of
background
knowledge consisted of the combination of Web pages
that were stored
under a sub-topic in the Yahoo! hierarchy.
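Combining pages into one piece of background knowledge per sub-topic might look like the sketch below. The input layout of `(subtopic, page_text)` pairs and the function name are assumptions made for illustration; the sub-topic name is used only as a grouping key and is not kept as a label.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combine_by_subtopic(pages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge all page texts stored under the same sub-topic into one document."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for subtopic, text in pages:
        grouped[subtopic].append(text)
    # One combined text per sub-topic; each becomes a single background item.
    return {subtopic: "\n".join(texts) for subtopic, texts in grouped.items()}


def background_pieces(pages: Iterable[Tuple[str, str]]) -> List[str]:
    """Drop the sub-topic names and keep only the unlabeled combined texts."""
    return list(combine_by_subtopic(pages).values())
```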
Clarinet News
Another
data set that we created was obtained from Clarinet news. We
downloaded all articles under the sports and banking
headings on
sets and the older ones for background knowledge. The
background knowledge in this problem consisted of the
first 100 words
of each of these articles. Informal studies showed us that including
the entire articles did not improve accuracy
substantially, probably
because the most informative part of an article is
usually the first
few paragraphs.
Physics Papers
One
common text categorization task is assigning topic labels to
technical papers.
We created a data set from the physics papers
archive (http://xxx.lanl.gov), where we downloaded the
titles for all
technical papers in the first two (or three) areas in
physics for
the month of March 1999. As background knowledge we downloaded the
abstracts of all papers in these same areas from the two
previous
months -- January and February 1999. These background
knowledge
abstracts were downloaded without their labels (i.e., without
knowledge of what
sub-discipline they were from) so that our learning
programs had no
access to the labels.
Thesaurus
Roget's
thesaurus places words in the English language into one of six
major categories: space, matter, abstract relations,
intellect,
volition, and affection. From http://thesaurus.reference.com/, we
created
a labeled set of 1000 words, with each word associated with one
category.
We obtained our background knowledge via http://www.thesaurus.com as
well, by downloading the dictionary definitions of all
1000 words in
the labeled set.
We cleaned up the dictionary definitions by removing
the sources that were returned (i.e., which dictionary
the information
was gleaned from) as well as other miscellaneous
information (such as
how many definitions were found). Each of these dictionary
definitions became an entry in our background
knowledge database. An
interesting point about this data set is that the
background knowledge
contains information directly about the test set, i.e., definitions of
the words in the test set. Since these definitions are not directly
related to the classification task at hand, this poses
no
contradiction.
As a matter of fact, we can look at new test examples
given to the system as follows: given a word and its
definition, place
the definition into the background knowledge
database, and then
categorize the word using the total background
knowledge.
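The procedure for a new test example can be sketched as below. This shows only the order of operations; `classify` is a hypothetical stand-in for whatever learner is used with the background knowledge, which this section does not specify, and the list-based background store is likewise an assumption.

```python
from typing import Callable, List


def categorize_new_word(
    word: str,
    definition: str,
    background: List[str],
    classify: Callable[[str, List[str]], str],
) -> str:
    """Add the word's definition to the background, then categorize the word."""
    # The definition carries no category label, so adding it does not
    # leak the answer for the test word.
    background.append(definition)
    return classify(word, background)
```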