Advertisements

 

For the labeled set of examples, we downloaded the classified
advertisements from one day in January 2001 from the Courier Post at
http://www.southjerseyclassifieds.com.  The Courier Post online
advertisements are divided into 9 main categories: Employment, Real
Estate for Sale, Real Estate for Rent, Dial-A-Pro, Announcements,
Transportation, Pets, Employment, and Business Opportunity.  For
testing, we simply downloaded advertisements from the same paper from
one day a month later, taking approximately 1000 (25%) of the examples
for our test set.

The background knowledge for this problem came from another online
newspaper -- The Daily Record (http://classifieds.dailyrecord.com).
The Daily Record advertisements online are divided into 8 categories:
Announcements, Business and Service, Employment, Financial,
Instructions, Merchandise, Real Estate, and Transportation.  We
treated the union of the articles from each of these categories as a
separate piece of background knowledge.

ASRS

The Aviation Safety Reporting System (http://asrs.arc.nasa.gov/) is a
combined effort of the Federal Aviation Administration (FAA) and
the National Aeronautics and Space Administration (NASA).  The purpose
of the system is to provide a forum by which airline workers can
anonymously report any problems that arise pertaining to flights and
aircraft.  The incident reports are read by analysts, who classify
and diagnose them: they identify any emergencies, detect situations
that compromise safety, and recommend actions to be taken.  These
reports are then placed into a database for further
research on safety and other issues.  We obtained the data  from
http://nasdac.faa.gov/asp/ and our database contains the
incident reports from January 1990 through March 1999.

Since we are interested in text categorization tasks, there are two
parts of each incident report that we deal with: the narrative and
the synopsis.  The narrative is a long description of the incident,
and the synopsis is a much shorter summary of the incident.  Many of
the words in these reports are abbreviated, which makes the text
classification task even harder.

There are many different categorization problems that can be derived
from this data set.  One feature associated with each incident is the
consequence of the incident, which the analyst adds to the report.
It can take on the values: aircraft damaged, emotional trauma, FAA
investigatory follow-up, FAA assigned or threatened penalties, flight
control/aircraft review, injury, none, and other.  If more than one
consequence was present, we removed that incident from our training
and test data.  We also removed from the training and test data all
those incidents whose category was none or other.  This left a
six-class classification problem.
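The filtering step above can be sketched as follows (a minimal illustration; the record layout and field names are assumptions, not the actual ASRS schema):

```python
# Keep only incidents with exactly one consequence, and drop the
# catch-all "none" and "other" categories, leaving a six-class problem.
# The dict field names ('synopsis', 'consequences') are hypothetical.
SIX_CLASSES = {
    "aircraft damaged",
    "emotional trauma",
    "FAA investigatory follow-up",
    "FAA assigned or threatened penalties",
    "flight control/aircraft review",
    "injury",
}

def filter_incidents(incidents):
    """incidents: list of dicts with 'synopsis' and 'consequences' (a list)."""
    kept = []
    for inc in incidents:
        if len(inc["consequences"]) != 1:
            continue  # more than one consequence: removed from train/test
        label = inc["consequences"][0]
        if label not in SIX_CLASSES:
            continue  # drops the "none" and "other" categories
        kept.append((inc["synopsis"], label))
    return kept
```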

We chose the training and test sets to consist of the synopsis
part of each incident.  The test set consists of data from the year
1999 and the training set consists of all data from 1997 and 1998.

For the background knowledge, we chose all narratives from 1990-1996.
In this case we did not remove any examples; thus the background
knowledge contains those reports whose categories were other and
none, as well as the six that are found in our training and test set.

 

NetVet

 (can also be downloaded from:  http://www.cs.cmu.edu/~wcohen/)

The NetVet site
(http://www.netvet.wustl.edu) includes the Web page headings for the
pages concerning cows, horses, cats, dogs, rodents, birds and
primates.  The text categorization task is to place a web page title
into the appropriate class. For example, a training example in the
class birds might have been: ``Wild Bird Center of Walnut Creek''.
Each of these titles had a URL that linked the title to its associated
Web page.  For the training/test corpus, we randomly chose half of
these titles with their labels.
 

We discarded the other half of the titles, with
their labels, and simply kept the URL to the associated Web page. We
used these URLs to download the first 100 words from each of these
pages, to be placed into a corpus for background knowledge.  Those
URLs that were not reachable were ignored by the program that created
the background knowledge. 
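The construction of this background corpus can be sketched roughly as follows (a minimal illustration; the fetching details and the exact word-truncation policy are assumptions about points the text leaves open):

```python
import urllib.request

def first_n_words(text, n=100):
    """Keep only the first n whitespace-separated words of a page's text."""
    return " ".join(text.split()[:n])

def build_background(urls, n_words=100):
    """Download each page, keep its first n_words, skip unreachable URLs."""
    background = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                text = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # URLs that are not reachable are simply ignored
        background.append(first_n_words(text, n_words))
    return background
```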

Business Names

 (can also be downloaded from: http://www.cs.cmu.edu/~wcohen/)

 

Another data set consisted of a training set of company names,
taken from the Hoover Web site (http://www.hoovers.com), labeled with
one of 124 industry names.  We created background knowledge from an
entirely different Web site, http://biz.yahoo.com.

 

We downloaded the Web pages under each business
category in the Yahoo! business hierarchy to create 101 pieces
of background knowledge.  The Yahoo! hierarchy had a different
number of classes and a different way of dividing the companies, but
this was irrelevant to our purposes since we treated it solely as a
source of unlabeled background text. Each piece of background
knowledge consisted of the combination of Web pages that were stored
under a sub-topic in the Yahoo! hierarchy.
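Assembling these pieces amounts to concatenating the pages stored under each sub-topic (a minimal sketch; the input layout of (sub-topic, page text) pairs is an assumption):

```python
from collections import defaultdict

def build_pieces(pages):
    """pages: iterable of (subtopic, page_text) pairs.

    Returns one piece of background knowledge per sub-topic: the
    concatenation of all pages filed under that sub-topic."""
    grouped = defaultdict(list)
    for subtopic, text in pages:
        grouped[subtopic].append(text)
    return {s: " ".join(texts) for s, texts in grouped.items()}
```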

Clarinet News

Another data set was obtained from Clarinet news.  We
downloaded all articles under the sports and banking headings on
November 17, 1999, using the most recent ones for training and test
sets and the older ones for background knowledge. The
background knowledge in this problem consisted of the first 100 words
of each of these articles.  Informal studies showed us that including
the entire articles did not improve accuracy substantially, probably
because the most informative part of an article is usually the first
few paragraphs.
 

Physics Papers

One common text categorization task is assigning topic labels to
technical papers.  We created a data set from the physics papers
archive (http://xxx.lanl.gov), where we downloaded the titles for all
technical papers in the first two (or three) areas in physics for
the month of March 1999.  As background knowledge we downloaded the
 abstracts of all papers in these same areas from the two previous
months -- January and February 1999.  These background knowledge
abstracts were downloaded without their labels (i.e., without
knowledge of which sub-discipline they were from), so our learning
programs had no access to the labels.

Thesaurus

Roget's thesaurus places words in the English language into one of six
major categories: space, matter, abstract relations, intellect,
volition, and affection.  From http://thesaurus.reference.com/, we
created a labeled set of 1000 words, with each word associated with
one category.

We obtained our background knowledge via http://www.thesaurus.com as
well, by downloading the dictionary definitions of all 1000 words in
the labeled set.  We cleaned up the dictionary definitions by removing
the sources that were returned (i.e., which dictionary the information
was gleaned from) as well as other miscellaneous information (such as
how many definitions were found).  Each of these dictionary
definitions became an entry in our background knowledge database.  An
interesting point about this data set is that the background knowledge
contains information directly about the test set, i.e., definitions of
the words in the test set.  Since these definitions are not directly
related to the classification task at hand, this poses no
contradiction.  In fact, we can handle new test examples given to the
system as follows: given a word and its definition, place the
definition into the background knowledge database, and then
categorize the word using the total background knowledge.
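This test-time procedure can be sketched as follows (a minimal illustration; the scoring rule below, a simple vocabulary-overlap score against per-category training text, is a hypothetical stand-in for the actual learning method, and all names are invented):

```python
def categorize_with_background(word, definition, background, category_texts):
    """Add the word's definition to the background knowledge, then categorize.

    background: mutable list of background-knowledge texts.
    category_texts: dict mapping category name -> concatenated training text.
    The overlap-based scoring is an assumed stand-in classifier, not the
    method used in the paper.
    """
    background.append(definition)  # definitions of test words join the KB
    # Gather the background text relevant to this word.
    relevant = " ".join(t for t in background if word in t or t == definition)
    vocab = set(relevant.lower().split())
    # Score each category by vocabulary overlap with its training text.
    scores = {c: len(vocab & set(t.lower().split()))
              for c, t in category_texts.items()}
    return max(scores, key=scores.get)
```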