
**NAME**

autoclass - automatically discover classes in data

**SYNOPSIS**

**autoclass** **-search** __data_file__ __header_file__ __model_file__ __s_param_file__

**autoclass** **-report** __results_file__ __search_file__ __r_params_file__

**autoclass** **-predict** __results_file__ __search_file__ __results_file__

**DESCRIPTION**

**AutoClass** solves the problem of automatic discovery of classes in data (sometimes called

clustering, or unsupervised learning), as distinct from the generation of class

descriptions from labeled examples (called supervised learning). It aims to discover the

"natural" classes in the data.

**AutoClass** is applicable to observations of things that can

be described by a set of attributes, without referring to other things. The data values

corresponding to each attribute are limited to be either numbers or the elements of a

fixed set of symbols. With numeric data, a measurement error must be provided.

**AutoClass** is looking for the best classification(s) of the data it can find. A

classification is composed of:

1) A set of classes, each of which is described by a set of class parameters, which

specify how the class is distributed along the various attributes. For example,

"height normally distributed with mean 4.67 ft and standard deviation .32 ft",

2) A set of class weights, describing what percentage of cases are likely to be in

each class.

3) A probabilistic assignment of cases in the data to these classes. I.e. for each

case, the relative probability that it is a member of each class.

As a strictly Bayesian system (accept no substitutes!), the quality measure

**AutoClass** uses

is the total probability that, had you known nothing about your data or its domain, you

would have found this set of data generated by this underlying model. This includes the

prior probability that the "world" would have chosen this number of classes, this set of

relative class weights, and this set of parameters for each class, and the likelihood that

such a set of classes would have generated this set of values for the attributes in the

data cases.

These probabilities are typically very small, in the range of e^-30000, and so are usually

expressed in exponential notation.

When run with the **-search** command, **AutoClass** searches for a classification. The required

arguments are the paths to the four input files, which supply the data, the data format,

the desired classification model, and the search parameters, respectively.

By default, **AutoClass** writes intermediate results in a binary file. With the **-report**

command, **AutoClass** generates an ASCII report. The arguments are the full path names of

the .results, .search, and .r-params files.

When run with the **-predict** command, **AutoClass** predicts the class membership of a "test"

data set based on classes found in a "training" data set (see "PREDICTIONS" below).

**INPUT FILES**

An AutoClass data set resides in two files. There is a header file (file type "hd2") that

describes the specific data format and attribute definitions. The actual data values are

in a data file (file type "db2"). We use two files to allow editing of data descriptions

without having to deal with the entire data set. This makes it easy to experiment with

different descriptions of the database without having to reproduce the data set.

Internally, an AutoClass database structure is identified by its header and data files,

and the number of data loaded.

For more detailed information on the formats of these files, see

__/usr/share/doc/autoclass/preparation-c.text__.

**DATA FILE**

The data file contains a sequence of data objects (datum or case) terminated by the end of

the file. The number of values for each data object must be equal to the number of

attributes defined in the header file. Data objects must be groups of tokens delimited by

"new-line". Attributes are typed as REAL, DISCRETE, or DUMMY. Real attribute values are

numbers, either integer or floating point. Discrete attribute values can be strings,

symbols, or integers. A dummy attribute value can be any of these types. Dummies are read

in but otherwise ignored -- they will be set to zeros in the internal database. Thus

the actual values will not be available for use in report output. To have these attribute

values available, use either type REAL or type DISCRETE, and define their model type as

IGNORE in the .model file. Missing values for any attribute type may be represented by

either "?" or another token specified in the header file. All are translated to a special

unique value after being read, so this symbol is effectively reserved for unknown/missing

values.

For example:

white 38.991306 0.54248405 2 2 1

red 25.254923 0.5010235 9 2 1

yellow 32.407973 ? 8 2 1

all_white 28.953982 0.5267696 0 1 1
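As an illustration of how such a line breaks down (this is a hypothetical sketch, not AutoClass's own reader), a minimal parser might treat "?" as the unknown token and fall back from numbers to discrete symbols:

```python
# Sketch of reading one db2-style data line: whitespace-separated tokens,
# '?' (or another configured token) standing for unknown values.
UNKNOWN_TOKEN = "?"

def parse_datum(line, n_attributes):
    tokens = line.split()
    if len(tokens) != n_attributes:
        raise ValueError("token count does not match number_of_attributes")
    values = []
    for tok in tokens:
        if tok == UNKNOWN_TOKEN:
            values.append(None)            # stands in for the reserved unknown value
        else:
            try:
                values.append(float(tok))  # REAL attribute value
            except ValueError:
                values.append(tok)         # DISCRETE symbol
    return values

row = parse_datum("yellow 32.407973 ? 8 2 1", 6)
# row -> ['yellow', 32.407973, None, 8.0, 2.0, 1.0]
```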

**HEADER FILE**

The header file specifies the data file format, and the definitions of the data

attributes. The header file functional specification consists of two parts -- the data

set format definition specifications, and the attribute descriptors. ";" in column 1

identifies a comment.

A header file follows this general format:

;; num_db2_format_defs value (number of format def lines

;; that follow), range of n is 1 -> 5

num_db2_format_defs n

;; number_of_attributes token and value required

number_of_attributes <as required>

;; following are optional - default values are specified

separator_char ' '

comment_char ';'

unknown_token '?'

separator_char ','

;; attribute descriptors

;; <zero-based att#> <att_type> <att_sub_type> <att_description>

;; <att_param_pairs>

Each attribute descriptor is a line of:

Attribute index (zero based, beginning in column 1)

Attribute type. See below.

Attribute subtype. See below

Attribute description: symbol (no embedded blanks) or

string; <= 40 characters

Specific property and value pairs.

Currently available combinations:

type subtype property type(s)

---- -------- ---------------

dummy none/nil --

discrete nominal range

real location error

real scalar zero_point rel_error

The ERROR property should represent your best estimate of the average error expected in

the measurement and recording of that real attribute. Lacking better information, the

error can be taken as 1/2 the minimum possible difference between measured values. It can

be argued that real values are often truncated, so that smaller errors may be justified,

particularly for generated data. But AutoClass only sees the recorded values. So it

needs the error in the recorded values, rather than the actual measurement error. Setting

this error much smaller than the minimum expressible difference implies the possibility of

values that cannot be expressed in the data. Worse, it implies that two identical values

must represent measurements that were much closer than they might actually have been.

This leads to over-fitting of the classification.

The REL_ERROR property is used for SCALAR reals when the error is proportional to the

measured value. The ERROR property is not supported.

AutoClass uses the error as a lower bound on the width of the normal distribution. So

small error estimates tend to give narrower peaks and to increase both the number of

classes and the classification probability. Broad error estimates tend to limit the

number of classes.

The scalar ZERO_POINT property is the smallest value that the measurement process could

have produced. This is often 0.0, or less by some error range. Similarly, the bounded

real's min and max properties are exclusive bounds on the attribute's generating process.

For a calculated percentage these would be 0-e and 100+e, where e is an error value. The

discrete attribute's range is the number of possible values the attribute can take on.

This range must include unknown as a value when such values occur.

Header File Example:

!#; AutoClass C header file -- extension .hd2

!#; the following chars in column 1 make the line a comment:

!#; '!', '#', ';', ' ', and '\n' (empty line)

;#! num_db2_format_defs <num of def lines -- min 1, max 4>

num_db2_format_defs 2

;; required

number_of_attributes 7

;; optional - default values are specified

;; separator_char ' '

;; comment_char ';'

;; unknown_token '?'

separator_char ','

;; <zero-based att#> <att_type> <att_sub_type> <att_description>

;; <att_param_pairs>

0 dummy nil "True class, range = 1 - 3"

1 real location "X location, m. in range of 25.0 - 40.0" error .25

2 real location "Y location, m. in range of 0.5 - 0.7" error .05

3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0

rel_error .001

4 discrete nominal "Truth value, range = 1 - 2" range 2

5 discrete nominal "Color of foobar, 10 values" range 10

6 discrete nominal Spectral_color_group range 6

**MODEL FILE**

A classification of a data set is made with respect to a model which specifies the form of

the probability distribution function for classes in that data set. Normally the model

structure is defined in a model file (file type "model"), containing one or more models.

Internally, a model is defined relative to a particular database. Thus it is identified

by the corresponding database, the model's model file and its sequential position in the

file.

Each model is specified by one or more model group definition lines. Each model group

line associates attribute indices with a model term type.

Here is an example model file:

# AutoClass C model file -- extension .model

model_index 0 7

ignore 0

single_normal_cn 3

single_normal_cn 17 18 21

multi_normal_cn 1 2

multi_normal_cn 8 9 10

multi_normal_cn 11 12 13

single_multinomial default

Here, the first line is a comment. The following characters in column 1 make the line a

comment: `!', `#', ` ', `;', and `\n' (empty line).

The tokens "model_index __n__ __m__" must appear on the first non-comment line, and precede the

model term definition lines.

__n__ is the zero-based model index, typically 0 where there is

only one model -- the majority of search situations.

__m__ is the number of model term

definition lines that follow.

The last seven lines are model group lines. Each model group line consists of:

A model term type (one of

**single_multinomial**,

**single_normal_cm**,

**single_normal_cn**,

**multi_normal_cn**, or

**ignore**).

A list of attribute indices (the attribute set list), or the symbol

**default**. Attribute

indices are zero-based. Single model terms may have one or more attribute indices on

each line, while multi model terms require two or more attribute indices per line. An

attribute index must not appear more than once in a model list.

Notes:

1) At least one model definition is required (model_index token).

2) There may be multiple entries in a model for any model term type.

3) Model term types currently consist of:

**single_multinomial**

models discrete attributes as multinomials, with missing values.

**single_normal_cn**

models real valued attributes as normals; no missing values.

**single_normal_cm**

models real valued attributes with missing values.

**multi_normal_cn**

is a covariant normal model without missing values.

**ignore** allows the model to ignore one or more attributes.

**ignore** is not a valid

default model term type.

See the documentation in models-c.text for further information about specific model

terms.

4)

**Single_normal_cn**,

**single_normal_cm**, and

**multi_normal_cn** modeled data, whose subtype

is

**scalar** (value distribution is away from 0.0, and is thus not a "normal"

distribution) will be log transformed and modeled with the log-normal model. For

data whose subtype is

**location** (value distribution is around 0.0), no transform is

done, and the normal model is used.
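The attribute-index rules above (single terms take one or more indices, multi terms take two or more, and no index may repeat) can be sketched as a small checker. This is a hypothetical illustration, not AutoClass's own model-file parser:

```python
# Rules for model group lines, per the text above.
SINGLE_TERMS = {"single_multinomial", "single_normal_cn", "single_normal_cm"}
MULTI_TERMS = {"multi_normal_cn"}

def check_model_lines(lines):
    """lines: list of (term_type, [attribute indices]) pairs."""
    seen = set()
    for term, indices in lines:
        if term in MULTI_TERMS and len(indices) < 2:
            raise ValueError(f"{term} needs two or more attribute indices")
        if term in SINGLE_TERMS and len(indices) < 1:
            raise ValueError(f"{term} needs at least one attribute index")
        for i in indices:
            if i in seen:
                raise ValueError(f"attribute {i} listed more than once")
            seen.add(i)
    return True

check_model_lines([("single_normal_cn", [3]),
                   ("multi_normal_cn", [1, 2])])   # passes the rules
```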

**SEARCHING**

AutoClass, when invoked in the "search" mode, will check the validity of the set of data,

header, model, and search parameter files. Errors will stop the search from starting, and

warnings will ask the user whether to continue. A history of the error and warning

messages is saved, by default, in the log file.

Once you have succeeded in describing your data with a header file and model file that

pass the AUTOCLASS -SEARCH <...> input checks, you will have entered the search domain

where

**AutoClass** classifies your data. (At last!)

The main function to use in finding a good classification of your data is AUTOCLASS

-SEARCH, and using it will take most of the computation time. Searches are invoked with:

autoclass -search <.db2 file path> <.hd2 file path>

<.model file path> <.s-params file path>

All files must be specified as fully qualified relative or absolute pathnames. File name

extensions (file types) for all files are forced to canonical values required by the

AutoClass program:

data file ("ascii") db2

data file ("binary") db2-bin

header file hd2

model file model

search params file s-params

The sample run (__/usr/share/doc/autoclass/examples/__) that comes with

**AutoClass** shows some

sample searches, and browsing these is probably the fastest way to get familiar with how

to do searches. The test data sets located under

__/usr/share/doc/autoclass/examples/__ will

show you some other header (.hd2), model (.model), and search params (.s-params) file

setups. The remainder of this section describes how to do searches in somewhat more

detail.

The **bold-faced** tokens below are generally search params file parameters. For more

information on the s-params file, see

**SEARCH PARAMETERS** below, or

__/usr/share/doc/autoclass/search-c.text.gz__.

**WHAT RESULTS ARE**

**AutoClass** is looking for the best classification(s) of the data it can find. A

classification is composed of:

1) a set of classes, each of which is described by a set of class parameters, which

specify how the class is distributed along the various attributes. For example,

"height normally distributed with mean 4.67 ft and standard deviation .32 ft",

2) a set of class weights, describing what percentage of cases are likely to be in

each class.

3) a probabilistic assignment of cases in the data to these classes. I.e. for each

case, the relative probability that it is a member of each class.

As a strictly Bayesian system (accept no substitutes!), the quality measure

**AutoClass** uses

is the total probability that, had you known nothing about your data or its domain, you

would have found this set of data generated by this underlying model. This includes the

prior probability that the "world" would have chosen this number of classes, this set of

relative class weights, and this set of parameters for each class, and the likelihood that

such a set of classes would have generated this set of values for the attributes in the

data cases.

These probabilities are typically very small, in the range of e^-30000, and so are usually

expressed in exponential notation.

**WHAT RESULTS MEAN**

It is important to remember that all of these probabilities are GIVEN that the real model

is in the model family that

**AutoClass** has restricted its attention to. If

**AutoClass** is

looking for Gaussian classes and the real classes are Poisson, then the fact that

**AutoClass** found 5 Gaussian classes may not say much about how many Poisson classes there

really are.

The relative probability between different classifications found can be very large, like

e^1000, so the very best classification found is usually overwhelmingly more probable than

the rest (and overwhelmingly less probable than any better classifications as yet

undiscovered). If

**AutoClass** should manage to find two classifications that are within

about exp(5-10) of each other (i.e. within 100 to 10,000 times more probable) then you

should consider them to be about equally probable, as our computation is usually not more

accurate than this (and sometimes much less).
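Because the raw probabilities underflow ordinary floating point, such comparisons are made on the log marginals. A small sketch with hypothetical numbers:

```python
import math

# Hypothetical log marginal probabilities of the two best classifications
# (AutoClass reports these in exponential notation).
log_p_best = -30000.0
log_p_second = -30007.0

# math.exp(-30000) underflows to 0.0, so compare in log space instead.
delta = log_p_best - log_p_second      # 7.0
ratio = math.exp(delta)                # ~1097x more probable
about_equal = delta <= 10.0            # within exp(5-10): about equally probable
```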

**HOW IT WORKS**

**AutoClass** repeatedly creates a random classification and then tries to massage this into a

high probability classification through local changes, until it converges to some "local

maximum". It then remembers what it found and starts over again, continuing until you

tell it to stop. Each effort is called a "try", and the computed probability is intended

to cover the whole volume in parameter space around this maximum, rather than just the

peak.

The standard approach to massaging is to

1) Compute the probabilistic class memberships of cases using the class parameters and

the implied relative likelihoods.

2) Using the new class members, compute class statistics (like mean) and revise the

class parameters.

and repeat till they stop changing. There are three available convergence algorithms:

"converge_search_3" (the default), "converge_search_4" and "converge". Their

specification is controlled by search params file parameter

**try_fn_type**.
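The two massaging steps above can be sketched as one cycle of an EM-style update for a 1-D, two-class normal mixture. This is an illustration only, with made-up data and starting parameters, not AutoClass's converge code:

```python
import math

# Made-up data and starting class parameters for two 1-D normal classes.
data = [4.2, 4.5, 4.4, 6.1, 6.0, 5.9]
means, sigmas, weights = [4.0, 6.5], [0.5, 0.5], [0.5, 0.5]

def density(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Step 1: probabilistic class memberships from the current class parameters.
members = []
for x in data:
    likes = [w * density(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(likes)
    members.append([lk / total for lk in likes])

# Step 2: revise class weights and parameters from the new memberships
# (means shown; variances are revised the same way).
for j in range(2):
    wj = sum(m[j] for m in members)
    weights[j] = wj / len(data)
    means[j] = sum(m[j] * x for m, x in zip(members, data)) / wj

# Repeating steps 1 and 2 until the parameters stop changing is one "try".
```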

**WHEN TO STOP**

You can tell AUTOCLASS -SEARCH to stop by: 1) giving a

**max_duration** (in seconds) argument

at the beginning; 2) giving a

**max_n_tries** (an integer) argument at the beginning; or 3) by

typing a "q" and <return> after you have seen enough tries. The

**max_duration** and

**max_n_tries** arguments are useful if you desire to run AUTOCLASS -SEARCH in batch mode. If

you are restarting AUTOCLASS -SEARCH from a previous search, the value of

**max_n_tries** you

provide, for instance 3, will tell the program to compute 3 more tries in addition to

however many it has already done. The same incremental behavior is exhibited by

**max_duration**.

Deciding when to stop is a judgment call and it's up to you. Since the search includes a

random component, there's always the chance that if you let it keep going it will find

something better. So you need to trade off how much better it might be with how long it

might take to find it. The search status reports that are printed when a new best

classification is found are intended to provide you information to help you make this

tradeoff.

One clear sign that you should probably stop is if most of the classifications found are

duplicates of previous ones (flagged by "dup" as they are found). This should only happen

for very small sets of data or when fixing a very small number of classes, like two.

Our experience is that for moderately large to extremely large data sets (~200 to ~10,000

cases), it is necessary to run

**AutoClass** for at least 50 trials.

**WHAT GETS RETURNED**

Just before returning, AUTOCLASS -SEARCH will give short descriptions of the best

classifications found. How many will be described can be controlled with

**n_final_summary**.

By default AUTOCLASS -SEARCH will write out a number of files, both at the end and

periodically during the search (in case your system crashes before it finishes). These

files will all have the same name (taken from the search params pathname [<name>.s-params]),

and differ only in their file extensions. If your search runs are very long and

there is a possibility that your machine may crash, you can have intermediate "results"

files written out. These can be used to restart your search run with minimum loss of

search effort. See the documentation file

__/usr/share/doc/autoclass/checkpoint-c.text__.

A ".log" file will hold a listing of most of what was printed to the screen during the

run, unless you set

**log_file_p** to false to say you want no such foolishness. Unless

**results_file_p** is false, a binary ".results-bin" file (the default) or an ASCII ".results"

text file, will hold the best classifications that were returned, and unless

**search_file_p**

is false, a ".search" file will hold the record of the search tries.

**save_compact_p**

controls whether the "results" files are saved as binary or ASCII text.

If the C global variable "G_safe_file_writing_p" is defined as TRUE in

"autoclass-c/prog/globals.c", the names of "results" files (those that contain the saved

classifications) are modified internally to account for redundant file writing. If the

search params file name is "my_saved_clsfs" you will see the following "results" file

names (ignoring directories and pathnames for this example):

**save_compact_p** = true --

"my_saved_clsfs.results-bin" - completely written file

"my_saved_clsfs.results-tmp-bin" - partially written file, renamed

when complete

**save_compact_p** = false --

"my_saved_clsfs.results" - completely written file

"my_saved_clsfs.results-tmp" - partially written file, renamed

when complete

If checkpointing is being done, these additional names will appear:

**save_compact_p** = true --

"my_saved_clsfs.chkpt-bin" - completely written checkpoint file

"my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,

renamed when complete

**save_compact_p** = false --

"my_saved_clsfs.chkpt" - completely written checkpoint file

"my_saved_clsfs.chkpt-tmp" - partially written checkpoint file,

renamed when complete

**HOW TO GET STARTED**

The way to invoke AUTOCLASS -SEARCH is:

autoclass -search <.db2 file path> <.hd2 file path>

<.model file path> <.s-params file path>

To restart a previous search, specify that

**force_new_search_p** has the value false in the

search params file, since its default is true. Specifying false tells AUTOCLASS -SEARCH

to try to find a previous compatible search (<...>.results[-bin] & <...>.search) to

continue from, and will restart using it if found. To force a new search instead of

restarting an old one, give the parameter

**force_new_search_p** the value of true, or use the

default. If there is an existing search (<...>.results[-bin] & <...>.search), the user

will be asked to confirm, since forcing a new search will discard the existing search.

If a previous search is continued, the message "RESTARTING SEARCH" will be given instead

of the usual "BEGINNING SEARCH". It is generally better to continue a previous search

than to start a new one, unless you are trying a significantly different search method, in

which case statistics from the previous search may mislead the current one.

**STATUS REPORTS**

A running commentary on the search will be printed to the screen and to the log file

(unless

**log_file_p** is false). Note that the ".log" file will contain a listing of all

default search params values, and the values of all params that are overridden.

After each try a very short report (only a few characters long) is given. After each new

best classification, a longer report is given, but no more often than

**min_report_period**

(default is 30 seconds).

**SEARCH VARIATIONS**

AUTOCLASS -SEARCH by default uses a certain standard search method or "try function"

(**try_fn_type** = "converge_search_3"). Two others are also available: "converge_search_4"

and "converge". They are provided in case your problem is one that may happen to benefit

from them. In general the default method will result in finding better classifications at

the expense of a longer search time. The default was chosen so as to be robust, giving

even performance across many problems. The alternatives to the default may do better on

some problems, but may do substantially worse on others.

"converge_search_3" uses an absolute stopping criterion (**rel_delta_range**, default value of

0.0025) which tests, for each class, the delta of the log approximate-marginal-likelihood of

the class statistics with respect to the class hypothesis (class->log_a_w_s_h_j), divided by

the class weight (class->w_j), between successive convergence cycles. Increasing this value

loosens the convergence and reduces the number

of cycles. Decreasing this value tightens the convergence and increases the number of

cycles.

**n_average** (default value of 3) specifies how many successive cycles must meet the

stopping criterion before the trial terminates.

"converge_search_4" uses an absolute stopping criterion (**cs4_delta_range**, default value of

0.0025) which tests, for each class, the slope of the log approximate-marginal-likelihood of

the class statistics with respect to the class hypothesis (class->log_a_w_s_h_j), divided by

the class weight (class->w_j), over **sigma_beta_n_values** (default value 6) convergence

cycles. Increasing the value of

**cs4_delta_range** loosens the convergence and reduces the number of cycles. Decreasing this

value tightens the convergence and increases the number of cycles. Computationally, this

try function is more expensive than "converge_search_3", but may prove useful if the

computational "noise" is significant compared to the variations in the computed values.

Key calculations are done in double precision floating point, and for the largest data

base we have tested so far (5,420 cases of 93 attributes), computational noise has not

been a problem, although the value of

**max_cycles** needed to be increased to 400.

"converge" uses one of two absolute stopping criteria which test the variation of the

classification (clsf) log_marginal (clsf->log_a_x_h) delta between successive convergence

cycles. The largest of

**halt_range** (default value 0.5) and (**halt_factor** * current_clsf_log_marginal) is used

(the default value of **halt_factor** is 0.0001). Increasing

these values loosens the convergence and reduces the number of cycles. Decreasing these

values tightens the convergence and increases the number of cycles.

**n_average** (default

value of 3) specifies how many cycles must meet the stopping criteria before the trial

terminates. This is a very approximate stopping criterion, but will give you some feel

for the kind of classifications to expect. It would be useful for "exploratory" searches

of a data base.

The purpose of

**reconverge_type** = "chkpt" is to complete an interrupted classification by

continuing from its last checkpoint. The purpose of

**reconverge_type** = "results" is to

attempt further refinement of the best completed classification using a different value of

**try_fn_type** ("converge_search_3", "converge_search_4", "converge"). If

**max_n_tries** is

greater than 1, then in each case, after the reconvergence has completed,

**AutoClass** will

perform further search trials based on the parameter values in the <...>.s-params file.

With the use of

**reconverge_type** (default value ""), you may apply more than one try

function to a classification. Say you generate several exploratory trials using

**try_fn_type** = "converge", and quit the search saving .search and .results[-bin] files.

Then you can begin another search with

**try_fn_type** = "converge_search_3",

**reconverge_type**

= "results", and

**max_n_tries** = 1. This will result in the further convergence of the best

classification generated with

**try_fn_type** = "converge", with

**try_fn_type** =

"converge_search_3". When

**AutoClass** completes this search try, you will have an

additional refined classification.

A good way to verify that any of the alternate

**try_fn_type** settings is generating a well-converged classification is to run

**AutoClass** in prediction mode on the same data used for

generating the classification. Then generate and compare the corresponding case or class

cross reference files for the original classification and the prediction. Small

differences between these files are to be expected, while large differences indicate

incomplete convergence. Differences between such file pairs should, on average and modulo

class deletions, decrease monotonically with further convergence.

The standard way to create a random classification to begin a try is with the default

value of "random" for

**start_fn_type**. At this point there are no alternatives. Specifying

"block" for

**start_fn_type** produces repeatable non-random searches. That is how the

<..>.s-params files in the autoclass-c/data/.. sub-directories are specified. This is how

development testing is done.

**max_cycles** controls the maximum number of convergence cycles that will be performed in any

one trial by the convergence functions. Its default value is 200. The screen output

shows a period (".") for each cycle completed. If your search trials run for 200 cycles,

then either your data base is very complex (increase the value), or the

**try_fn_type** is not

adequate for the situation (try another of the available ones, and use

**converge_print_p** to get

more information on what is going on).

Specifying

**converge_print_p** to be true will generate a brief print-out for each cycle

which will provide information so that you can modify the default values of

**rel_delta_range** &

**n_average** for "converge_search_3";

**cs4_delta_range** &

**sigma_beta_n_values**

for "converge_search_4"; and

**halt_range**,

**halt_factor**, and

**n_average** for "converge". Their

default values are given in the <..>.s-params files in the autoclass-c/data/.. sub-directories.

**HOW MANY CLASSES?**

Each new try begins with a certain number of classes and may end up with a smaller number,

as some classes may drop out of the convergence. In general, you want to begin the try

with some number of classes that previous tries have indicated look promising, and you

want to be sure you are fishing around elsewhere in case you missed something before.

**n_classes_fn_type** = "random_ln_normal" is the default way to make this choice. It fits a

log normal to the number of classes (usually called "j" for short) of the 10 best

classifications found so far, and randomly selects from that. There is currently no

alternative.

To start the game off, the default is to go down

**start_j_list** for the first few tries, and

then switch to

**n_classes_fn_type**. If you believe that the probable number of classes in

your data base is say 75, then instead of using the default value of

**start_j_list** (2, 3,

5, 7, 10, 15, 25), specify something like 50, 60, 70, 80, 90, 100.
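For example, an .s-params fragment with such an override might look like this (the `token = value` syntax and comment convention are assumed here from the header-file examples above; see __/usr/share/doc/autoclass/search-c.text.gz__ for the authoritative format and full token list):

```
!#; sample .s-params overrides -- expecting roughly 75 classes
start_j_list = 50, 60, 70, 80, 90, 100
max_n_tries = 50
```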

If one wants to always look for, say, three classes, one can use

**fixed_j** and override the

above. Search status reports will describe what the current method for choosing j is.

**DO I HAVE ENOUGH MEMORY AND DISK SPACE?**

Internally, the storage requirements in the current system are of order n_classes_per_clsf

* (n_data + n_stored_clsfs * n_attributes * n_attribute_values). This depends on the

number of cases, the number of attributes, the values per attribute (use 2 if a real

value), and the number of classifications stored away for comparison to see if others are

duplicates -- controlled by

**max_n_store** (default value = 10). The search process does not

itself consume significant memory, but storage of the results may do so.
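Plugging hypothetical numbers into the storage formula above gives a quick feel for the footprint (units here are stored values, not bytes; scale by your cell size):

```python
# n_classes_per_clsf * (n_data + n_stored_clsfs * n_attributes * n_attribute_values)
n_classes = 20
n_data = 10_000
n_stored_clsfs = 10        # max_n_store default
n_attributes = 93
n_attribute_values = 2     # use 2 for real-valued attributes

storage_units = n_classes * (n_data + n_stored_clsfs * n_attributes * n_attribute_values)
# -> 237200 stored values
```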

**AutoClass C** is configured to handle a maximum of 999 attributes. If you attempt to run

with more than that you will get array bound violations. In that case, change these

configuration parameters in prog/autoclass.h and recompile

**AutoClass C**:

#define ALL_ATTRIBUTES 999

#define VERY_LONG_STRING_LENGTH 20000

#define VERY_LONG_TOKEN_LENGTH 500

For example, these values will handle several thousand attributes:

#define ALL_ATTRIBUTES 9999

#define VERY_LONG_STRING_LENGTH 50000

#define VERY_LONG_TOKEN_LENGTH 50000

Disk space taken up by the "log" file will of course depend on the duration of the search.

**n_save** (default value = 2) determines how many best classifications are saved into the

".results[-bin]" file.

**save_compact_p** controls whether the "results" and "checkpoint"

files are saved as binary. Binary files are faster and more compact, but are not

portable. The default value of

**save_compact_p**is true, which causes binary files to be

written.

If the time taken to save the "results" files is a problem, consider increasing

**min_save_period**(default value = 1800 seconds or 30 minutes). Files are saved to disk

this often if there is anything different to report.

**JUST HOW SLOW IS IT?**

Compute time is of order n_data * n_attributes * n_classes * n_tries * converge_cycles_per_try. The major uncertainties in this are the number of basic back-and-forth cycles until convergence in each try, and of course the number of tries. The number of cycles per try is typically 10-100 for **try_fn_type** "converge", and 10-200+ for "converge_search_3" and "converge_search_4". The maximum number of cycles is specified by **max_cycles** (default value = 200). The number of tries is up to you and your available computing resources.
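The scaling relation above can be turned into a back-of-the-envelope estimator. The per-unit cost below is a made-up placeholder; as the next paragraph advises, calibrate it with a few small test runs on your own machine:

```python
# Back-of-the-envelope compute-time scaling from the formula in the text.
# 'unit_cost' (seconds per datum * attribute * class * cycle) is a
# hypothetical constant that must be calibrated with a small test run.
def est_seconds(n_data, n_attributes, n_classes, n_tries,
                cycles_per_try, unit_cost=1e-7):
    return (n_data * n_attributes * n_classes
            * n_tries * cycles_per_try * unit_cost)

# 10000 cases, 20 attributes, 10 classes, 50 tries, ~50 cycles per try
print(est_seconds(10000, 20, 10, 50, 50))  # 500.0 (seconds)
```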

The running time of very large data sets will be quite uncertain. We advise that a few small-scale test runs be made on your system to determine a baseline. Specify **n_data** to limit how many data vectors are read. Given a very large quantity of data, **AutoClass** may find its most probable classifications at upwards of a hundred classes, and this will require that **start_j_list** be specified appropriately (see the section **HOW MANY CLASSES?** above). If you are quite certain that you only want a few classes, you can force **AutoClass** to search with a fixed number of classes specified by **fixed_j**. You will then need to run separate searches with each different fixed number of classes.

**CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE**

**AutoClass** caches the data, header, and model file pathnames in the saved classification structure of the binary (".results-bin") or ASCII (".results") "results" files. If the "results" and "search" files are moved to a different directory location, the search cannot be successfully restarted if you have used absolute pathnames. It is therefore advantageous to invoke **AutoClass** in a parent directory of the data, header, and model files, so that relative pathnames can be used. Since the cached pathnames will then be relative, the files can be moved to a different host or file system and restarted -- providing the same relative pathname hierarchy exists.

However, since the ".results" file is ASCII text, those pathnames can also be changed with a text editor (**save_compact_p** must be specified as false).

**SEARCH PARAMETERS**

The search is controlled by the ".s-params" file. In this file, an empty line or a line

starting with one of these characters is treated as a comment: "#", "!", or ";". The

parameter name and its value can be separated by an equal sign, a space, or a tab:

n_clsfs 1

n_clsfs = 1

n_clsfs<tab>1

Spaces are ignored if "=" or "<tab>" are used as separators. Note there are no trailing

semicolons.
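The line format just described (comment characters "#", "!", ";"; "=", space, or tab as separator) can be sketched as a minimal parser. This is an illustration of the format only, not AutoClass C's actual reader:

```python
# Minimal sketch of the ".s-params" line format described above:
# blank lines and lines starting with '#', '!' or ';' are comments;
# the name and value may be separated by '=', a space, or a tab.
def parse_params(text):
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!;":
            continue
        # normalize '=' and tab separators to spaces, then split once
        name, _, value = (line.replace("=", " ")
                              .replace("\t", " ")
                              .partition(" "))
        params[name.strip()] = value.strip()
    return params

print(parse_params("# comment\nn_clsfs 1\nmax_n_tries = 50\nfixed_j\t3"))
```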

The search parameters, with their default values, are as follows:

**rel_error** = 0.01

Specifies the relative difference measure used by clsf-DS-%=, when deciding if a

new clsf is a duplicate of an old one.

**start_j_list** = 2, 3, 5, 7, 10, 15, 25

Initially try these numbers of classes, so as not to narrow the search too quickly. The state of this list is saved in the <..>.search file and used on restarts, unless an override specification of **start_j_list** is made in the .s-params file for the restart run. This list should bracket your expected number of classes, and by a wide margin! "start_j_list = -999" specifies an empty list (allowed only on restarts).

**n_classes_fn_type** = "random_ln_normal"

Once **start_j_list** is exhausted, **AutoClass** will call this function to decide how many classes to start with on the next try, based on the 10 best classifications found so far. Currently only "random_ln_normal" is available.

**fixed_j** = 0

When **fixed_j** > 0, it overrides **start_j_list** and **n_classes_fn_type**, and **AutoClass** will always use this value for the initial number of classes.

**min_report_period** = 30

Wait at least this time (in seconds) since the last report before reporting verbosely again. Should be set longer than the expected run time when checking for repeatability of results. For repeatable results, also see **force_new_search_p**, **start_fn_type**, and **randomize_random_p**.

__NOTE__: At least one of "interactive_p", "max_duration", and "max_n_tries" must be active; otherwise **AutoClass** will run indefinitely. See below.

**interactive_p** = true

When false, allows run to continue until otherwise halted. When true, standard

input is queried on each cycle for the quit character "q", which, when detected,

triggers an immediate halt.

**max_duration** = 0

When = 0, allows run to continue until otherwise halted. When > 0, specifies the

maximum number of seconds to run.

**max_n_tries** = 0

When = 0, allows run to continue until otherwise halted. When > 0, specifies the

maximum number of tries to make.

**n_save** = 2

Save this many clsfs to disk in the .results[-bin] and .search files. If 0, don't save anything (no .search & .results[-bin] files).

**log_file_p** = true

If false, do not write a log file.

**search_file_p** = true

If false, do not write a search file.

**results_file_p** = true

If false, do not write a results file.

**min_save_period** = 1800

CPU crash protection. This specifies the maximum time, in seconds, that **AutoClass** will run before it saves the current results to disk. The default time is 30 minutes.

**max_n_store** = 10

Specifies the maximum number of classifications stored internally.

**n_final_summary** = 10

Specifies the number of trials to be printed out after search ends.

**start_fn_type** = "random"

One of {"random", "block"}. This specifies the type of class initialization. For a normal search, use "random", which randomly selects instances to be initial class means, and adds appropriate variances. For testing with a repeatable search, use "block", which partitions the database into successive blocks of near equal size. For repeatable results, also see **force_new_search_p**, **min_report_period**, and **randomize_random_p**.

**try_fn_type** = "converge_search_3"

One of {"converge_search_3", "converge_search_4", "converge"}. These specify alternate search stopping criteria. "converge" merely tests the rate of change of the log_marginal classification probability (clsf->log_a_x_h), without checking the rate of change of individual classes (see **halt_range** and **halt_factor**). "converge_search_3" and "converge_search_4" each monitor the ratio class->log_a_w_s_h_j/class->w_j for all classes, and continue convergence until all pass the quiescence criteria for **n_average** cycles. "converge_search_3" tests differences between successive convergence cycles (see **rel_delta_range**). This provides a reasonable, general-purpose stopping criterion. "converge_search_4" averages the ratio over "sigma_beta_n_values" cycles (see **cs4_delta_range**). This is preferred when "converge_search_3" produces many similar classes.

**initial_cycles_p** = true

If true, perform base_cycle in initialize_parameters. false is used only for

testing.

**save_compact_p** = true

true saves classifications as machine-dependent binary (.results-bin & .chkpt-bin); false saves as ASCII text (.results & .chkpt).

**read_compact_p** = true

true reads classifications as machine-dependent binary (.results-bin & .chkpt-bin); false reads as ASCII text (.results & .chkpt).

**randomize_random_p** = true

false seeds lrand48, the pseudo-random number function, with 1 to give repeatable test cases; true uses the universal time clock as the seed, giving semi-random searches. For repeatable results, also see **force_new_search_p**, **min_report_period**, and **start_fn_type**.

**n_data** = 0

With n_data = 0, the entire database is read from .db2. With n_data > 0, only this

number of data are read.

**halt_range** = 0.5

Passed to try_fn_type "converge". With the "converge" try_fn_type, convergence is

halted when the larger of halt_range and (halt_factor * current_log_marginal)

exceeds the difference between successive cycle values of the classification

log_marginal (clsf->log_a_x_h). Decreasing this value may tighten the convergence

and increase the number of cycles.

**halt_factor** = 0.0001

Passed to try_fn_type "converge". With the "converge" try_fn_type, convergence is

halted when the larger of halt_range and (halt_factor * current_log_marginal)

exceeds the difference between successive cycle values of the classification

log_marginal (clsf->log_a_x_h). Decreasing this value may tighten the convergence

and increase the number of cycles.

**rel_delta_range** = 0.0025

Passed to try function "converge_search_3", which monitors the ratio of the log approx-marginal-likelihood of class statistics with respect to the class hypothesis (class->log_a_w_s_h_j) divided by the class weight (class->w_j), for each class. "converge_search_3" halts convergence when the difference between cycles of this ratio, for every class, has been exceeded by "rel_delta_range" for "n_average" cycles. Decreasing "rel_delta_range" tightens the convergence and increases the number of cycles.

**cs4_delta_range** = 0.0025

Passed to try function "converge_search_4", which monitors the ratio of

(class->log_a_w_s_h_j)/(class->w_j), for each class, averaged over

"sigma_beta_n_values" convergence cycles. "converge_search_4" halts convergence

when the maximum difference in average values of this ratio falls below

"cs4_delta_range". Decreasing "cs4_delta_range" tightens the convergence and

increases the number of cycles.

**n_average** = 3

Passed to try functions "converge_search_3" and "converge". The number of cycles

for which the convergence criterion must be satisfied for the trial to terminate.

**sigma_beta_n_values** = 6

Passed to try_fn_type "converge_search_4". The number of past values to use in

computing sigma^2 (noise) and beta^2 (signal).

**max_cycles** = 200

This is the maximum number of cycles permitted for any one convergence of a classification, regardless of any other stopping criteria. This is very dependent upon your database and choice of model and convergence parameters, but should be about twice the average number of cycles reported in the screen dump and .log file.

**converge_print_p** = false

If true, the selected try function will print to the screen values useful in specifying non-default values for **halt_range**, **halt_factor**, **rel_delta_range**, **n_average**, **sigma_beta_n_values**, and **range_factor**.

**force_new_search_p** = true

If true, ignore any previous search results, discarding the existing .search and .results[-bin] files after confirmation by the user; if false, continue the search using the existing .search and .results[-bin] files. For repeatable results, also see **min_report_period**, **start_fn_type**, and **randomize_random_p**.

**checkpoint_p** = false

If true, checkpoints of the current classification will be written every "min_checkpoint_period" seconds, with file extension .chkpt[-bin]. This is only useful for very large classifications.

**min_checkpoint_period** = 10800

If checkpoint_p = true, the checkpointed classification will be written this often, in seconds (default = 3 hours).

**reconverge_type** = ""

Can be either "chkpt" or "results". If "checkpoint_p" = true and "reconverge_type" = "chkpt", then continue convergence of the classification contained in <...>.chkpt[-bin]. If "checkpoint_p" = false and "reconverge_type" = "results", continue convergence of the best classification contained in <...>.results[-bin].

**screen_output_p** = true

If false, no output is directed to the screen. Assuming log_file_p = true, output

will be directed to the log file only.

**break_on_warnings_p** = true

The default value asks the user whether or not to continue when data definition warnings are found. If specified as false, then **AutoClass** will continue despite warnings -- the warnings will still be output to the terminal and the log file.

**free_storage_p** = true

The default value tells **AutoClass** to free the majority of its allocated storage. This is not required, and in the case of the DEC Alpha causes a core dump [is this still true?]. If specified as false, **AutoClass** will not attempt to free storage.

**HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS**

In some situations, repeatable classifications are required: comparing basic **AutoClass C** integrity on different platforms, porting **AutoClass C** to a new platform, etc. To accomplish this, two things are necessary: 1) the same random number generator must be used, and 2) the search parameters must be specified properly.

Random Number Generator. This implementation of **AutoClass C** uses the Unix srand48/lrand48 random number generator, which generates pseudo-random numbers using the well-known linear congruential algorithm and 48-bit integer arithmetic. lrand48() returns non-negative long integers uniformly distributed over the interval [0, 2^31).

Search Parameters. The following .s-params file parameters should be specified:

force_new_search_p = true

start_fn_type "block"

randomize_random_p = false

;; specify the number of trials you wish to run

max_n_tries = 50

;; specify a time greater than duration of run

min_report_period = 30000

Note that no current best classification reports will be produced. Only a final

classification summary will be output.

**CHECKPOINTING**

With very large databases there is a significant probability of a system crash during any

one classification try. Under such circumstances it is advisable to take the time to

checkpoint the calculations for possible restart.

Checkpointing is initiated by specifying "**checkpoint_p** = true" in the ".s-params" file. This causes the inner convergence step to save a copy of the classification onto the checkpoint file each time the classification is updated, provided a certain period of time has elapsed. The file extension is ".chkpt[-bin]".

Each time AutoClass completes a cycle, a "." is output to the screen to provide you with information to be used in setting the **min_checkpoint_period** value (default 10800 seconds, or 3 hours). There is obviously a trade-off between the frequency of checkpointing and the probability that your machine may crash, since the repetitive writing of the checkpoint file will slow the search process.

Restarting AutoClass Search:

To recover the classification and continue the search after rebooting and reloading AutoClass, specify **reconverge_type** = "chkpt" in the ".s-params" file (and specify **force_new_search_p** as false).

AutoClass will reload the appropriate database and models, provided there has been no

change in their filenames since the time they were loaded for the checkpointed

classification run. The ".s-params" file contains any non-default arguments that were

provided to the original call.

In the beginning of a search, before **start_j_list** has been emptied, it will be necessary to trim the original list to what would have remained in the crashed search. This can be determined by looking at the ".log" file to see what values were already used. If the **start_j_list** has been emptied, then an empty **start_j_list** should be specified in the ".s-params" file. This is done either by

start_j_list =

or

start_j_list = -9999

Here is a set of scripts to demonstrate checkpointing:

autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \

data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

Run 1)

## glassc-chkpt.s-params

max_n_tries = 2

force_new_search_p = true

## --------------------

;; run to completion

Run 2)

## glassc-chkpt.s-params

force_new_search_p = false

max_n_tries = 10

checkpoint_p = true

min_checkpoint_period = 2

## --------------------

;; after 1 checkpoint, ctrl-C to simulate cpu crash

Run 3)

## glassc-chkpt.s-params

force_new_search_p = false

max_n_tries = 1

checkpoint_p = true

min_checkpoint_period = 1

reconverge_type = "chkpt"

## --------------------

;; checkpointed trial should finish

**OUTPUT FILES**

The standard reports are

1) Attribute influence values: presents the relative influence or significance of the

data's attributes both globally (averaged over all classes), and locally

(specifically for each class). A heuristic for relative class strength is also

listed;

2) Cross-reference by case (datum) number: lists the primary class probability for

each datum, ordered by case number. When report_mode = "data", additional lesser

class probabilities (greater than or equal to 0.001) are listed for each datum;

3) Cross-reference by class number: for each class the primary class probability and

any lesser class probabilities (greater than or equal to 0.001) are listed for each

datum in the class, ordered by case number. It is also possible to list, for each

datum, the values of attributes which you select.

The attribute influence values report attempts to provide relative measures of the "influence" of the data attributes on the classes found by the classification. The normalized class strengths, the normalized attribute influence values summed over all classes, and the individual influence values (I[jkl]) are all only relative measures; they should be given no more meaning than a rank ordering, and certainly not be read as anything approaching absolute values.

The reports are output to files whose names and pathnames are taken from the ".r-params"

file pathname. The report file types (extensions) are:

**influence values report**

"influ-o-text-__n__" or "influ-no-text-__n__"

**cross-reference by case**

"case-text-__n__"

**cross-reference by class**

"class-text-__n__"

or, if report_mode is overridden to "data":

**influence values report**

"influ-o-data-__n__" or "influ-no-data-__n__"

**cross-reference by case**

"case-data-__n__"

**cross-reference by class**

"class-data-__n__"

where __n__ is the classification number from the "results" file. The first or best classification is numbered 1, the next best 2, etc. The default is to generate reports only for the best classification in the "results" file. You can produce reports for other saved classifications by using the report params keywords **n_clsfs** and **clsf_n_list**. The "influ-o-text-__n__" file type is the default (**order_attributes_by_influence_p** = true), and lists each class's attributes in descending order of attribute influence value. If the value of **order_attributes_by_influence_p** is overridden to be false in the <...>.r-params file, then each class's attributes will be listed in ascending order by attribute number. The extension of the file generated will be "influ-no-text-__n__". This method of listing facilitates the visual comparison of attribute values between classes.

For example, this command:

autoclass -reports sample/imports-85c.results-bin

sample/imports-85c.search sample/imports-85c.r-params

with this line in the ".r-params" file:

xref_class_report_att_list = 2, 5, 6

will generate these output files:

imports-85.influ-o-text-1

imports-85.case-text-1

imports-85.class-text-1

The **AutoClass C** reports provide the capability to compute sigma class contour values for specified pairs of real-valued attributes, when generating the influence values report with the data option (report_mode = "data"). Note that sigma class contours are not generated for discrete type attributes.

The sigma contours are the two dimensional equivalent of n-sigma error bars in one

dimension. Specifically, for two independent attributes the n-sigma contour is defined as

the ellipse where

((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n

With covariant attributes, the n-sigma contours are defined identically, in the rotated

coordinate system of the distribution's principal axes. Thus independent attributes give

ellipses oriented parallel with the attribute axes, while the axes of sigma contours of

covariant attributes are rotated about the center determined by the means. In either case

the sigma contour represents a line where the class probability is constant, irrespective

of any other class probabilities.
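For two independent attributes, the n-sigma contour equation above can be traced parametrically. This is an illustrative sketch (the means and sigmas are taken from the temperature example later in this page, not from any real run):

```python
import math

# Sketch: points on the n-sigma contour for two independent real
# attributes, per ((x - xMean)/xSigma)^2 + ((y - yMean)/ySigma)^2 == n.
def sigma_contour(x_mean, x_sigma, y_mean, y_sigma, n, steps=100):
    r = math.sqrt(n)
    return [(x_mean + r * x_sigma * math.cos(t),
             y_mean + r * y_sigma * math.sin(t))
            for t in (2 * math.pi * k / steps for k in range(steps))]

# 1-sigma ellipse for hypothetical attributes with the stated means/sigmas
pts = sigma_contour(68.0, 16.3, 90.0, 2.5, n=1)
# every returned point satisfies the contour equation exactly
```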

With three or more attributes the n-sigma contours become k-dimensional ellipsoidal surfaces. This code takes advantage of the fact that the parallel projection of an n-dimensional ellipsoid onto any 2-dim plane is bounded by an ellipse. In this simplified case of projecting the single sigma ellipsoid onto the coordinate planes, it is also true that the 2-dim covariances of this ellipse are equal to the corresponding elements of the n-dim ellipsoid's covariances. The Eigen-system of the 2-dim covariance then gives the variances w.r.t. the principal components of the ellipse, and the rotation that aligns it with the data. This represents the best way to display a distribution in the marginal plane.
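The eigen-system step described above has a closed form for a symmetric 2x2 covariance matrix. The following sketch (with an invented covariance) recovers the principal-axis variances and the rotation angle; it illustrates the geometry only, not AutoClass C's internal code:

```python
import math

# Given the 2x2 covariance [[sxx, sxy], [sxy, syy]] of a covariant
# attribute pair, recover the variances along the principal axes and
# the rotation angle that aligns the marginal ellipse with the data.
def principal_ellipse(sxx, sxy, syy):
    # closed-form eigenvalues of a symmetric 2x2 matrix
    mean = (sxx + syy) / 2.0
    d = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    var_major, var_minor = mean + d, mean - d
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # radians
    return var_major, var_minor, angle

print(principal_ellipse(4.0, 0.0, 1.0))  # (4.0, 1.0, 0.0): axis-aligned case
```

With zero covariance the angle is zero and the ellipse is parallel to the attribute axes, matching the independent-attribute case above.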

To get contour values, set the keyword **sigma_contours_att_list** to a list of real-valued attribute indices (from the .hd2 file), and request an influence values report with the data option. For example,

report_mode = "data"

sigma_contours_att_list = 3, 4, 5, 8, 15

**OUTPUT REPORT PARAMETERS**

The contents of the output report are controlled by the ".r-params" file. In this file,

an empty line or a line starting with one of these characters is treated as a comment:

"#", "!", or ";". The parameter name and its value can be separated by an equal sign, a

space, or a tab:

n_clsfs 1

n_clsfs = 1

n_clsfs<tab>1

Spaces are ignored if "=" or "<tab>" are used as separators. Note there are no trailing

semicolons.

The following are the allowed parameters and their default values:

**n_clsfs** = 1

number of clsfs in the .results file for which to generate reports, starting with

the first or "best".

**clsf_n_list** =

If specified, this is a one-based index list of clsfs in the clsf sequence read from the .results file. It overrides "n_clsfs". For example:

clsf_n_list = 1, 2

will produce the same output as

n_clsfs = 2

but

clsf_n_list = 2

will only output the "second best" classification report.

**report_type** =

Type of reports to generate: "all", "influence_values", "xref_case", or "xref_class".

**report_mode** =

Mode of reports to generate: "text" is formatted text layout; "data" is numerical, suitable for further processing.

**comment_data_headers_p** = false

The default value does not insert # in column 1 of most report_mode = "data" header lines. If specified as true, the comment character will be inserted in most header lines.

**num_atts_to_list** =

If specified, the number of attributes to list in the influence values report. If not specified, __all__ attributes will be listed. (e.g. "num_atts_to_list = 5")

**xref_class_report_att_list** =

If specified, a list of attribute numbers (zero-based) whose values will be output in the "xref_class" report along with the case probabilities. If not specified, no attribute values will be output. (e.g. "xref_class_report_att_list = 1, 2, 3")

**order_attributes_by_influence_p** = true

The default value lists each class's attributes in descending order of attribute

influence value, and uses ".influ-o-text-n" as the influence values report file

type. If specified as false, then each class's attributes will be listed in

ascending order by attribute number. The extension of the file generated will be

"influ-no-text-n".

**break_on_warnings_p** = true

The default value asks the user whether or not to continue when data definition warnings are found. If specified as false, then **AutoClass** will continue despite warnings -- the warnings will still be output to the terminal.

**free_storage_p** = true

The default value tells **AutoClass** to free the majority of its allocated storage. This is not required, and in the case of the DEC Alpha causes a core dump [is this still true?]. If specified as false, **AutoClass** will not attempt to free storage.

**max_num_xref_class_probs** = 5

Determines how many lesser class probabilities will be printed for the case and class cross-reference reports. The default is to print the most probable class probability value and up to 4 lesser class probabilities. Note this is true for both the "text" and "data" class cross-reference reports, but only true for the "data" case cross-reference report. The "text" case cross-reference report only has the most probable class probability.

**sigma_contours_att_list** =

If specified, a list of real-valued attribute indices (from the .hd2 file) for which to compute sigma class contour values when generating the influence values report with the data option (report_mode = "data"). If not specified, there will be no sigma class contour output. (e.g. "sigma_contours_att_list = 3, 4, 5, 8, 15")

**INTERPRETATION OF AUTOCLASS RESULTS**

**WHAT HAVE YOU GOT?**

Now you have run **AutoClass** on your data set -- what have you got? Typically, the **AutoClass** search procedure finds many classifications, but only saves the few best. These are now available for inspection and interpretation. The most important indicator of the relative merits of these alternative classifications is the Log total posterior probability value. Note that since the probability lies between 1 and 0, the corresponding Log probability is negative and ranges from 0 to negative infinity. Raising e to the difference between these Log probability values gives the relative probability of the alternative classifications. So a difference of, say, 100 implies one classification is e^100 ~= 10^43 times more likely than the other. However, these numbers can be very misleading, since they give the relative probability of alternative classifications under the **AutoClass** __assumptions__.
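The comparison above is a one-liner to verify; the two log-probability values below are invented for illustration:

```python
import math

# The relative probability of two saved classifications is e raised to
# the difference of their Log total posterior probabilities (under the
# AutoClass assumptions).  These log values are made up for illustration.
log_p_best = -12345.0
log_p_next = -12445.0   # 100 worse in log probability
ratio = math.exp(log_p_best - log_p_next)
print(f"{ratio:.3g}")   # 2.69e+43, i.e. e^100 ~= 10^43
```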

**ASSUMPTIONS**

Specifically, the most important **AutoClass** assumptions are the use of normal models for real variables, and the assumption of independence of attributes within a class. Since these assumptions are often violated in practice, the difference in posterior probability of alternative classifications can be partly due to one classification being closer to satisfying the assumptions than another, rather than to a real difference in classification quality. Another source of uncertainty about the utility of Log probability values is that they do not take into account any specific prior knowledge the user may have about the domain. This means that it is often worth looking at alternative classifications to see if you can interpret them, but it is worth starting from the most probable first. Note that if the Log probability value is much greater than that for the one-class case, it is saying that there is overwhelming evidence for __some__ structure in the data, and part of this structure has been captured by the **AutoClass** classification.

**INFLUENCE REPORT**

So you have now picked a classification you want to examine, based on its Log probability value; how do you examine it? The first thing to do is to generate an "influence" report on the classification, using the report generation facilities documented in __/usr/share/doc/autoclass/reports-c.text__. An influence report is designed to summarize the important information buried in the **AutoClass** data structures.

The first part of this report gives the heuristic class "strengths". Class "strength" is here defined as the geometric mean probability that any instance "belonging to" the class would have been generated from the class probability model. It thus provides a heuristic measure of how strongly each class predicts "its" instances.
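The geometric-mean definition above can be sketched directly; the member probabilities below are invented for illustration:

```python
import math

# Sketch of the class-strength heuristic: the geometric mean, over the
# instances "belonging to" a class, of the probability that each was
# generated by the class model.  Probabilities here are invented.
def class_strength(member_probs):
    # geometric mean computed in log space for numerical stability
    return math.exp(sum(math.log(p) for p in member_probs)
                    / len(member_probs))

print(round(class_strength([0.9, 0.8, 0.95, 0.85]), 4))  # 0.8732
```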

The second part is a listing of the overall "influence" of each of the attributes used in

the classification. These give a rough heuristic measure of the relative importance of

each attribute in the classification. Attribute "influence values" are a class

probability weighted average of the "influence" of each attribute in the classes, as

described below.

The next part of the report is a summary description of each of the classes. The classes

are arbitrarily numbered from 0 up to n, in order of descending class weight. A class weight of, say, 34.1 means that the weighted sum of membership probabilities for the class is 34.1. Note that a class weight of 34 does not necessarily mean that 34 cases belong to

that class, since many cases may have only partial membership in that class. Within each

class, attributes or attribute sets are ordered by the "influence" of their model term.

**CROSS ENTROPY**

A commonly used measure of the divergence between two probability distributions is the

cross entropy: the sum over all possible values x, of P(x|c...)*log[P(x|c...)/P(x|g...)],

where c... and g... define the distributions. It ranges from zero, for identical distributions, to infinity for distributions placing probability 1 on differing values of

an attribute. With conditionally independent terms in the probability distributions, the

cross entropy can be factored to a sum over these terms. These factors provide a measure

of the corresponding modeled attribute's influence in differentiating the two

distributions.
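For a discrete attribute, the cross-entropy sum above can be computed directly; the distributions in the example are invented:

```python
import math

# Sketch of the cross-entropy measure described above, for one discrete
# attribute: sum over outcomes x of P(x|c) * log(P(x|c) / P(x|g)).
def cross_entropy(class_probs, global_probs):
    return sum(pc * math.log(pc / pg)
               for pc, pg in zip(class_probs, global_probs) if pc > 0)

# identical distributions diverge by zero
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))                # 0.0
# a class concentrated on one outcome diverges from a uniform global
print(round(cross_entropy([0.9, 0.1], [0.5, 0.5]), 4))      # 0.3681
```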

We define the modeled term's "influence" on a class to be the cross entropy term for the

class distribution w.r.t. the global class distribution of the single class

classification. "Influence" is thus a measure of how strongly the model term helps

differentiate the class from the whole data set. With independently modeled attributes,

the influence can legitimately be ascribed to the attribute itself. With correlated or

covariant attribute sets, the cross entropy factor is a function of the entire set, and

we distribute the influence value equally over the modeled attributes.

**ATTRIBUTE INFLUENCE VALUES**

In the "influence" report on each class, the attribute parameters for that class are given

in order of highest influence value for the model term attribute sets. Only the first few

attribute sets usually have significant influence values. If an influence value drops

below about 20% of the highest value, then it is probably not significant, but all

attribute sets are listed for completeness. In addition to the influence value for each

attribute set, the values of the attribute set parameters in that class are given along

with the corresponding "global" values. The global values are computed directly from the

data independent of the classification. For example, if the class mean of attribute

"temperature" is 90 with standard deviation of 2.5, but the global mean is 68 with a

standard deviation of 16.3, then this class has selected out cases with much higher than

average temperature, and a rather small spread in this high range. Similarly, for

discrete attribute sets, the probability of each outcome in that class is given, along

with the corresponding global probability -- ordered by its significance: the absolute

value of (log {<local-probability> / <global-probability>}). The sign of the significance

value shows the direction of change from the global class. This information gives an

overview of how each class differs from the average for all the data, in order of the most

significant differences.
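The significance ordering for discrete outcomes described above can be sketched as follows; the outcome names and probabilities are invented for illustration:

```python
import math

# Sketch of the significance ordering used for discrete attributes:
# outcomes are ranked by |log(local_prob / global_prob)|, and the sign
# shows the direction of change from the global class.  Invented data.
outcomes = {                 # name: (local class prob, global prob)
    "red":   (0.70, 0.30),
    "green": (0.05, 0.40),
    "blue":  (0.25, 0.30),
}
ranked = sorted(((name, math.log(local / glob))
                 for name, (local, glob) in outcomes.items()),
                key=lambda item: -abs(item[1]))
for name, sig in ranked:
    print(name, round(sig, 3))
# green -2.079   (strongly suppressed relative to the global class)
# red    0.847   (strongly enhanced)
# blue  -0.182   (nearly unchanged)
```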

**CLASS AND CASE REPORTS**

Having gained a description of the classes from the "influence" report, you may want to

follow-up to see which classes your favorite cases ended up in. Conversely, you may want

to see which cases belong to a particular class. For this kind of cross-reference

information two complementary reports can be generated. These are more fully documented

in

__/usr/share/doc/autoclass/reports-c.text__. The "class" report lists all the cases that

have significant membership in each class and the degree to which each such case belongs

to that class. Cases whose class membership is less than 90% in the current class have

their other class memberships listed as well. The cases within a class are listed in

order of increasing case number. The alternative "cases" report states which class (or classes) a

case belongs to, and the membership probability in the most probable class. These two

reports allow you to find which cases belong to which classes or the other way around. If

nearly every case has close to 99% membership in a single class, then it means that the

classes are well separated, while a high degree of cross-membership indicates that the

classes are heavily overlapped. Highly overlapped classes are an indication that the idea

of classification is breaking down and that groups of mutually highly overlapped classes,

a kind of meta-class, are probably a better way of understanding the data.
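
The separation test described above can be sketched as follows; the membership matrix is made up for illustration, standing in for the probabilities that the "case" and "class" reports list.

```python
# Each row is one case's membership probabilities over the classes.
memberships = [
    [0.99, 0.01],
    [0.98, 0.02],
    [0.55, 0.45],   # heavy cross-membership: the classes overlap here
    [0.03, 0.97],
]

# Count cases whose best class holds nearly all of the probability;
# a high fraction indicates well-separated classes.
well_assigned = sum(1 for row in memberships if max(row) >= 0.99)
print(f"{well_assigned}/{len(memberships)} cases have near-certain membership")
```

With real report output the same tally over all cases gives a quick sense of whether the classification is crisp or the classes are heavily overlapped.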

**COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS**

The class weight, given as the class probability parameter, is essentially the sum, over

all data instances, of the normalized probability that the instance is a member of the class.

It is probably an error on our part that we format this number as an integer in the

report, rather than emphasizing its real nature. You will find the actual real value

recorded as the w_j parameter in the class_DS structures on any .results[-bin] file.

The .case and .class reports give probabilities that cases are members of classes. Any

assignment of cases to classes requires some decision rule. The maximum probability

assignment rule is often implicitly assumed, but it cannot be expected that the resulting

partition sizes will equal the class weights unless nearly all class membership

probabilities are effectively one or zero. With non-1/0 membership probabilities,

matching the class weights requires summing the probabilities.
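
The point above can be illustrated with a small sketch; the membership probabilities are invented, but the two computations mirror the summed class weights versus a maximum-probability partition.

```python
# Each row is one case's membership probabilities over two classes.
memberships = [
    [0.6, 0.4],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.2, 0.8],
]
n_classes = 2

# Class weights: sum of membership probabilities over all cases.
weights = [round(sum(row[j] for row in memberships), 6) for j in range(n_classes)]

# Partition sizes under the maximum-probability assignment rule.
counts = [0] * n_classes
for row in memberships:
    counts[row.index(max(row))] += 1

print("weights:", weights)   # both classes carry weight 2.0
print("counts: ", counts)    # but the partition is 3 cases vs 1
```

Because no membership probability is near one or zero, the partition sizes (3 and 1) differ from the class weights (2 and 2), exactly as the text warns.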

In addition, there is the question of completeness of the EM (expectation maximization)

convergence. EM alternates between estimating class parameters and estimating class

membership probabilities. These estimates converge on each other, but never actually

meet.

**AutoClass** implements several convergence algorithms with alternate stopping

criteria using appropriate parameters in the .s-params file. Proper setting of these

parameters, to get reasonably complete and efficient convergence, may require

experimentation.

**ALTERNATIVE CLASSIFICATIONS**

In summary, the various reports that can be generated give you a way of viewing the

current classification. It is usually a good idea to look at alternative classifications

even though they do not have the best log probability values. These other

classifications usually have classes that correspond closely to strong classes in other

classifications, but can differ in the weak classes. The "strength" of a class within a

classification can usually be judged by how dramatically the highest influence value

attributes in the class differ from the corresponding global attributes. If none of the

classifications seem quite satisfactory, it is always possible to run

**AutoClass** again to

generate new classifications.

**WHAT NEXT?**

Finally, the question of what to do after you have found an insightful classification

arises. Usually, classification is a preliminary data analysis step for examining a set

of cases (things, examples, etc.) to see if they can be grouped so that members of the

group are "similar" to each other.

**AutoClass** gives such a grouping without the user

having to define a similarity measure. The built-in "similarity" measure is the mutual

predictiveness of the cases. The next step is to try to "explain" why some objects are

more like others than those in a different group. Usually, domain knowledge suggests an

answer. For example, a classification of people based on income, buying habits, location,

age, etc., may reveal particular social classes that were not obvious before the

classification analysis. Further information about such classes, such as number of cars

or what TV shows are watched, would reveal even more. Longitudinal studies would give

information about how social classes

arise and what influences their attitudes -- all of which is going way beyond the initial

classification.

**PREDICTIONS**

Classifications can be used to predict class membership for new cases. So in addition to

possibly giving you some insight into the structure behind your data, you can now use

**AutoClass** directly to make predictions, and compare

**AutoClass** to other learning systems.

This technique for predicting class probabilities is applicable to all attributes,

regardless of data type/sub_type or likelihood model term type.

In the event that the class membership of a data case does not exceed 0.0099999 for any of

the "training" classes, the following message will appear in the screen output for each

case:

xref_get_data: case_num xxx => class 9999

Class 9999 members will appear in the "case" and "class" cross-reference reports with a

class membership of 1.0.
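
The fallback rule just described can be sketched as below; the threshold and the dummy class label 9999 come from the text above, while the function and variable names are illustrative, not AutoClass internals.

```python
# If a test case's membership never exceeds 0.0099999 for any training
# class, it is reported under dummy class 9999 with membership 1.0.
THRESHOLD = 0.0099999
DUMMY_CLASS = 9999

def assign(case_memberships):
    """Return (class, membership) pairs, or the dummy class if none qualify."""
    significant = [(c, p) for c, p in case_memberships.items() if p > THRESHOLD]
    if not significant:
        return [(DUMMY_CLASS, 1.0)]
    return sorted(significant, key=lambda cp: cp[1], reverse=True)

print(assign({0: 0.005, 1: 0.004}))   # -> [(9999, 1.0)]
print(assign({0: 0.97, 1: 0.03}))     # -> [(0, 0.97), (1, 0.03)]
```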

Cautionary Points:

The usual way of using

**AutoClass** is to put all of your data in a data_file, describe that

data with model and header files, and run "autoclass -search". Now, instead of one

data_file you will have two, a training_data_file and a test_data_file.

It is most important that both databases have the same

**AutoClass** internal representation.

Should this not be true,

**AutoClass** will exit or, in some situations, crash.

The prediction mode is designed to help direct the user toward conforming to this

requirement.

Preparation:

Prediction requires having a training classification and a test database. The training

classification is generated by running "autoclass -search" on the training

data_file, for example "data/soybean/soyc.db2":

autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2

data/soybean/soyc.model data/soybean/soyc.s-params

This will produce "soyc.results-bin" and "soyc.search". Then create a "reports" parameter

file, such as "soyc.r-params" (see

__/usr/share/doc/autoclass/reports-c.text__), and run

**AutoClass** in "reports" mode, such as:

autoclass -reports data/soybean/soyc.results-bin

data/soybean/soyc.search data/soybean/soyc.r-params

This will generate class and case cross-reference files, and an influence values file.

The file names are based on the ".r-params" file name:

data/soybean/soyc.class-text-1

data/soybean/soyc.case-text-1

data/soybean/soyc.influ-text-1

These will describe the classes found in the training_data_file. Now this classification

can be used to predict the probabilistic class membership of the test_data_file cases

("data/soybean/soyc-predict.db2") in the training_data_file classes.

autoclass -predict data/soybean/soyc-predict.db2

data/soybean/soyc.results-bin data/soybean/soyc.search

data/soybean/soyc.r-params

This will generate class and case cross-reference files for the test_data_file cases

predicting their probabilistic class memberships in the training_data_file classes. The

file names are based on the ".db2" file name:

data/soybean/soyc-predict.class-text-1

data/soybean/soyc-predict.case-text-1
