iTM is a Java tool that supports the full text mining process. It was written to implement the machine learning algorithms and methods described in a master's thesis written by Krisztian Balog at Vrije Universiteit, Amsterdam, in 2004. Reading the thesis before using iTM is recommended.

Abstract

In this paper we introduce a generic text mining system, called iTM. iTM builds models and supports the classification process. It deals with different kinds of sources: text files, e-mail/newsgroup messages and HTML pages. iTM provides rich user feedback: the important words and phrases of the text that are really significant in decision making. We have implemented a few active learning methods that reduce the number of manually labeled examples needed.

We demonstrate the usefulness of iTM by using it to label 13,000 HTML pages of the www.cs.vu.nl domain. The full classification process took a few hours. With 550 manually labeled examples, the accuracy reached 74%.

Moreover, we ran a number of experiments on the 20-newsgroups dataset. We demonstrate that, by using our system, the number of manually labeled documents needed to achieve 80% accuracy can be decreased by 45%. Furthermore, with only 100 labeled examples our methods reach 42% accuracy.

Download thesis

Terms of use

iTM is released under the GNU General Public License, which means that it is free software and you are welcome to use and redistribute it under certain conditions. It also means that iTM comes with ABSOLUTELY NO WARRANTY!

Components

Source manager
iTM can handle several kinds of textual documents (text files, e-mails, HTML pages). This component is responsible for defining the data sets. Note that different kinds of data can be used at the same time (for example, some newsgroup or e-mail messages together with a directory of HTML pages).

 
Vocabulary manager
This component is responsible for managing the vocabulary. It displays the words of the vocabulary together with their frequencies. Words whose frequency is below or above a given threshold can be excluded here. The vocabulary can also be limited to a given number of words.
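The two pruning steps described above (frequency thresholds, then a size limit) can be illustrated with a minimal Java sketch. This is not iTM's actual code: the class name VocabularyPruner, the method prune and its parameters are hypothetical, and keeping the most frequent words when the size limit applies is an assumption, since the page does not say which words are retained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of iTM: prunes a word-frequency table.
final class VocabularyPruner {

    // Keeps words with minFreq <= frequency <= maxFreq, then retains at most
    // maxWords of them. Assumption: the most frequent words are preferred.
    static Map<String, Integer> prune(Map<String, Integer> frequencies,
                                      int minFreq, int maxFreq, int maxWords) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            int f = e.getValue();
            if (f >= minFreq && f <= maxFreq) {
                kept.add(e);
            }
        }
        // Most frequent first; ties are broken alphabetically so the
        // result does not depend on the input order.
        kept.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : kept) {
            if (result.size() >= maxWords) {
                break;
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("the", 950);
        freq.put("classifier", 40);
        freq.put("newsgroup", 25);
        freq.put("mining", 25);
        freq.put("typo", 1);
        // Drop words seen fewer than 2 or more than 500 times, keep 2 words.
        System.out.println(prune(freq, 2, 500, 2));
    }
}
```

In this sketch the very frequent word "the" and the rare word "typo" are dropped by the thresholds before the size limit is applied.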
Classification
The parameters of the text classifier can be set here. These are:
- classifier
- mode (interactive/simulation)
- initial sample selection
- intelligent sample selection
- use unlabeled data
- prior user knowledge
- continue classification
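
This page does not say how "intelligent sample selection" picks documents. As an illustration only, the sketch below shows margin-based uncertainty sampling, a common active learning strategy that is not necessarily the one iTM uses. The class UncertaintySampler and its methods are hypothetical, and the class probabilities are assumed to come from whatever classifier is in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of uncertainty sampling; not iTM's code.
final class UncertaintySampler {

    // Difference between the two highest class probabilities.
    // A small margin means the classifier is unsure about the document.
    static double margin(double[] probs) {
        double best = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (double p : probs) {
            if (p > best) {
                second = best;
                best = p;
            } else if (p > second) {
                second = p;
            }
        }
        return best - second;
    }

    // Returns the indices of the k unlabeled documents with the smallest
    // margin, i.e. the ones a user would be asked to label next.
    static List<Integer> selectMostUncertain(double[][] classProbs, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < classProbs.length; i++) {
            indices.add(i);
        }
        indices.sort((a, b) ->
                Double.compare(margin(classProbs[a]), margin(classProbs[b])));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        // Made-up class probabilities for four unlabeled documents.
        double[][] probs = {
            {0.90, 0.05, 0.05},
            {0.40, 0.35, 0.25},
            {0.50, 0.50, 0.00},
            {0.60, 0.30, 0.10}
        };
        System.out.println(selectMostUncertain(probs, 2));
    }
}
```

The sketch assumes at least two target classes per document; with a single class the margin is infinite and the document is never preferred.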
 
Target class manager
This component is responsible for maintaining the structure of the target classes. Empty categories are removed before each active learning iteration and restored afterwards. A visual tool helps the user add nodes to and remove nodes from the tree.
Model browser/Document browser
This component is responsible for displaying the classification model built by the iTM system.
The classified documents can be browsed with this tool. The following are displayed for each document:
- classification
- content
- source
- words list
- feedback

 
Batch mode
In many cases we are interested in the results of several runs rather than just one. Batch mode is designed to run an experiment a requested number of times with a given configuration. A loop can be defined over the size of the training set. The program records the accuracy after each iteration. At the end, a Matlab file is created with the detailed results.
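
The batch loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not iTM's implementation: BatchRunner, runBatch and toMatlab are hypothetical names, the experiment is passed in as a function from training-set size to accuracy, and the layout of the generated Matlab text (a "sizes" vector and an "accuracy" matrix with one row per size) is assumed, since the actual file format is not described here.

```java
import java.util.Locale;
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch of a batch experiment loop; not iTM's code.
final class BatchRunner {

    // Runs the experiment `runs` times for every training-set size and
    // returns accuracies[sizeIndex][runIndex].
    static double[][] runBatch(int[] trainingSizes, int runs,
                               IntToDoubleFunction experiment) {
        double[][] accuracies = new double[trainingSizes.length][runs];
        for (int i = 0; i < trainingSizes.length; i++) {
            for (int r = 0; r < runs; r++) {
                accuracies[i][r] = experiment.applyAsDouble(trainingSizes[i]);
            }
        }
        return accuracies;
    }

    // Renders the results as Matlab source text (assumed layout).
    static String toMatlab(int[] trainingSizes, double[][] accuracies) {
        StringBuilder sb = new StringBuilder("sizes = [");
        for (int i = 0; i < trainingSizes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(trainingSizes[i]);
        }
        sb.append("];\naccuracy = [");
        for (int i = 0; i < accuracies.length; i++) {
            if (i > 0) sb.append("; ");
            for (int r = 0; r < accuracies[i].length; r++) {
                if (r > 0) sb.append(' ');
                sb.append(String.format(Locale.ROOT, "%.4f", accuracies[i][r]));
            }
        }
        sb.append("];\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] sizes = {100, 200, 400};
        // Placeholder experiment returning dummy values, NOT real results.
        // A real run would train and evaluate a classifier here.
        double[][] acc = runBatch(sizes, 3, size -> size / 1000.0);
        System.out.print(toMatlab(sizes, acc));
    }
}
```

The text returned by toMatlab could be written to a .m file and loaded in Matlab to plot accuracy against training-set size.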

Download

v1.0 [09/25/2004]
compiled program
source code and documentation
list of known bugs

Links

20-newsgroups data set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
DIANA project
http://www.cs.vu.nl/ci/DataMine/DIANA/
DIANA: Data Interception ANd Analysis
The project aims to develop the scientific and technological expertise needed to design efficient solutions for data stream mining.
Wojtek Kowalczyk
http://www.cs.vu.nl/~wojtek/
The page of Dr. Wojtek Kowalczyk, the advisor of my master's thesis. I would like to thank him for his generous support and guidance of my work.

Contact

You can contact us by sending an e-mail to krisztian@balog.hu
copyright (c) 2004 Krisztian Balog