SprGoFDecisionTreeApp.cc

//$Id: SprGoFDecisionTreeApp.cc,v 1.3 2007/10/07 21:03:02 narsky Exp $

/*
  This executable evaluates the consistency of two multivariate samples
  using a goodness-of-fit method described by Friedman in the proceedings
  of Phystat 2003.

  Two samples labeled as category 0 and category 1, of sizes N0 and N1,
  respectively, are mixed together. As a first step, the executable builds
  a decision tree to separate these two samples from each other, using
  the Gini index as the optimization criterion. The associated value of
  the Gini index is used as an estimate of the consistency between the
  two samples.
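
  The figure of merit used throughout is the negative Gini index,
  -1+p^2+q^2, where p and q are the weighted fractions of the two
  categories in a node (see the "Optimization criterion" line in the
  example output below). A minimal sketch of this FOM, inferred from the
  printed formula rather than from SprTwoClassGiniIndex itself:

    double giniFom(double w0, double w1)
    {
      const double p = w1/(w0+w1);   // weighted fraction of category 1
      const double q = w0/(w0+w1);   // weighted fraction of category 0
      return -1. + p*p + q*q;        // 0 for a pure node, -0.5 for an even mix
    }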

  To obtain the GOF value, one needs the distribution of Gini indices
  for toy samples. The toy samples are obtained by randomly relabeling
  the points of the two original samples in such a way that one always
  has N0 points in category 0 and N1 points in category 1. For each toy
  experiment, a decision tree with the same parameters is used to
  separate category 0 from category 1 and compute the associated Gini
  index. The GOF is estimated as the fraction of toy experiments in
  which the negative Gini index is greater than the one computed for the
  two original samples (or, equivalently, in which the conventional
  positive Gini index is less than the one for the two original samples).
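
  The relabeling logic can be sketched independently of the SPR classes.
  In the sketch below, fom is a caller-supplied stand-in for "train a
  decision tree and return its negative Gini index"; it is hypothetical
  and does not reflect the actual SprDecisionTree interface:

    #include <algorithm>
    #include <functional>
    #include <random>
    #include <vector>

    // labels holds N0 zeros and N1 ones; shuffling it relabels the points
    // while preserving the class sizes, exactly as the toy experiments do.
    double gofValue(std::vector<int> labels,
                    const std::function<double(const std::vector<int>&)>& fom,
                    unsigned nToys, unsigned seed)
    {
      const double origFom = fom(labels);
      std::mt19937 gen(seed);
      unsigned above = 0;
      for( unsigned ic=0; ic<nToys; ic++ ) {
        std::shuffle(labels.begin(), labels.end(), gen);
        if( fom(labels) > origFom ) ++above;
      }
      // Fraction of toys whose negative Gini index exceeds the original.
      return double(above)/double(nToys);
    }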

  In this approach, ideal consistency between the two samples is
  expressed by a GOF value of 50%. Small values of GOF indicate that the
  two samples are inconsistent. Values of GOF close to 100% generally
  indicate good agreement between the two samples but may be worth
  investigating further: somehow the good agreement of the two original
  samples cannot be reproduced by random relabeling of the input points.
  This may indicate a problem with how the method works for this
  particular dataset.

  To use the executable, the user has to choose two parameters:
    -n --- number of toy experiments
    -l --- minimal number of events in a decision tree leaf
  The minimal leaf size has to be chosen as a trade-off between bias and
  variance. One can think of it in terms of selecting a proper bin size
  for a multidimensional histogram. If the bin size is chosen small, the
  histogram will capture the data structure but suffer from large
  statistical fluctuations in the bins (low bias, high variance). If the
  bin size is chosen large, statistical fluctuations will be suppressed,
  but the histogram may not be able to capture the data structure (high
  bias, low variance). In principle, the optimal leaf size can be chosen
  using an independent experiment: generate an independent sample of
  category 0 and an independent sample of category 1, run this method
  for several values of the leaf size, and choose the value that gives
  the poorest GOF value; that way you maximize the sensitivity of the
  method (see the scan sketched below). In the absence of an independent
  sample, guess.
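
  Such a scan on an independent sample (the file name indep.pat is
  hypothetical) might look like:

  unix> SprGoFDecisionTreeApp -n 100 -l 100  indep.pat
  unix> SprGoFDecisionTreeApp -n 100 -l 300  indep.pat
  unix> SprGoFDecisionTreeApp -n 100 -l 1000 indep.pat

  keeping the leaf size that yields the poorest (smallest) GoF value.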

  If the method returns a poor GOF value, you can train a decision tree with
  the same leaf size and see for yourself what part of the input space causes
  the problem.

  Examples of usage:
  -----------------

  unix> SprGoFDecisionTreeApp -n 100 -l 1000 uniform_on_uniform2_d.pat

  Evaluates consistency of two 2D uniform distributions. You get:

24 experiments out of 100 have better GoF values than the data.
GoF=0.76

  unix> SprGoFDecisionTreeApp -n 100 -l 1000 gauss_uniform2_d_train.pat

  Evaluates consistency of a 2D Gaussian and a 2D uniform distribution.
  You get:

100 experiments out of 100 have better GoF values than the data.
GoF=0

  Let us look at this sample with a decision tree:

  unix> SprDecisionTreeApp -c 5 -n 1000 -f save.spr gauss_uniform2_d_train.pat

Read data from file gauss_uniform2_d_train.pat for variables "x0" "x1"
Total number of points read: 20000
Points in class 0: 10000 1: 10000
Optimization criterion set to Gini index  -1+p^2+q^2
Decision tree initialized with minimal number of events per node 1000
Included 9 nodes in category 1 with overall FOM=-0.725763    W1=8736 W0=2731    N1=8736 N0=2731

  unix> less save.spr

Trained Decision Tree: 9      nodes.    Overall FOM=-0.725763  W0=2731       W1=8736       N0=2731       N1=8736
-------------------------------------------------------
Node      0    Size 2       FOM=-0.385045  W0=172        W1=1422       N0=172        N1=1422
Variable                             x0    Limits         -1.04095        0.993995
Variable                             x1    Limits        -0.631275         0.30833
.......

  The first node of the decision tree points to the region in space with
  a large excess of events from category 1 over events from category 0.
*/

#include "PhysicsTools/StatPatternRecognition/interface/SprExperiment.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprAbsFilter.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprDecisionTree.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprEmptyFilter.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprAbsReader.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprRWFactory.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprStringParser.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprTwoClassGiniIndex.hh"
#include "PhysicsTools/StatPatternRecognition/interface/SprIntegerPermutator.hh"

#include <stdlib.h>
#include <unistd.h>
#include <iostream>
#include <vector>
#include <set>
#include <string>
#include <memory>
#include <algorithm>
#include <functional>
#include <utility>
#include <cassert>

using namespace std;


void help(const char* prog)
{
  cout << "Usage:  " << prog
       << " training_data_file" << endl;
  cout << "\t Options: " << endl;
  cout << "\t-h --- help                                        " << endl;
  cout << "\t-a input ascii file mode (see SprSimpleReader.hh)  " << endl;
  cout << "\t-s random seed for permutations (default=0)        " << endl;
  cout << "\t-n number of cycles for GoF evaluation             " << endl;
  cout << "\t-l minimal number of entries per tree leaf (def=1) " << endl;
  cout << "\t-v verbose level (0=silent default,1,2)            " << endl;
  cout << "\t-w scale all signal weights by this factor         " << endl;
  cout << "\t-V include only these input variables              " << endl;
  cout << "\t-z exclude input variables from the list           " << endl;
  cout << "\t\t Variables must be listed in quotes and separated by commas."
       << endl;
}
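
// Example invocation (the file name data.pat is hypothetical), showing the
// quoted, comma-separated variable list expected by -V and -z:
//   unix> SprGoFDecisionTreeApp -n 100 -l 1000 -V "x0,x1" data.pat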


int main(int argc, char ** argv)
{
  // check command line
  if( argc < 2 ) {
    help(argv[0]);
    return 1;
  }

  // init
  string tupleFile;
  int readMode = 0;
  unsigned cycles = 0;
  unsigned nmin = 1;
  int verbose = 0;
  string includeList, excludeList;
  int seed = 0;
  bool scaleWeights = false;
  double sW = 1.;

  // decode command line
  int c;
  extern char* optarg;
  //  extern int optind;
  while( (c = getopt(argc,argv,"ha:s:n:l:v:w:V:z:")) != EOF ) {
    switch( c )
      {
      case 'h' :
        help(argv[0]);
        return 1;
      case 'a' :
        readMode = (optarg==0 ? 0 : atoi(optarg));
        break;
      case 's' :
        seed = (optarg==0 ? 0 : atoi(optarg));
        break;
      case 'n' :
        cycles = (optarg==0 ? 1 : atoi(optarg));
        break;
      case 'l' :
        nmin = (optarg==0 ? 1 : atoi(optarg));
        break;
      case 'v' :
        verbose = (optarg==0 ? 0 : atoi(optarg));
        break;
      case 'w' :
        if( optarg != 0 ) {
          scaleWeights = true;
          sW = atof(optarg);
        }
        break;
      case 'V' :
        includeList = optarg;
        break;
      case 'z' :
        excludeList = optarg;
        break;
      }
  }

  // There has to be 1 argument after all options.
  string trFile = argv[argc-1];
  if( trFile.empty() ) {
    cerr << "No training file is specified." << endl;
    return 1;
  }

  // make reader
  SprRWFactory::DataType inputType
    = ( readMode==0 ? SprRWFactory::Root : SprRWFactory::Ascii );
  auto_ptr<SprAbsReader> reader(SprRWFactory::makeReader(inputType,readMode));

  // include variables
  set<string> includeSet;
  if( !includeList.empty() ) {
    vector<vector<string> > includeVars;
    SprStringParser::parseToStrings(includeList.c_str(),includeVars);
    assert( !includeVars.empty() );
    for( int i=0;i<includeVars[0].size();i++ )
      includeSet.insert(includeVars[0][i]);
    if( !reader->chooseVars(includeSet) ) {
      cerr << "Unable to include variables in training set." << endl;
      return 2;
    }
    else {
      cout << "Following variables have been included in optimization: ";
      for( set<string>::const_iterator
             i=includeSet.begin();i!=includeSet.end();i++ )
        cout << "\"" << *i << "\"" << " ";
      cout << endl;
    }
  }

  // exclude variables
  set<string> excludeSet;
  if( !excludeList.empty() ) {
    vector<vector<string> > excludeVars;
    SprStringParser::parseToStrings(excludeList.c_str(),excludeVars);
    assert( !excludeVars.empty() );
    for( int i=0;i<excludeVars[0].size();i++ )
      excludeSet.insert(excludeVars[0][i]);
    if( !reader->chooseAllBut(excludeSet) ) {
      cerr << "Unable to exclude variables from training set." << endl;
      return 2;
    }
    else {
      cout << "Following variables have been excluded from optimization: ";
      for( set<string>::const_iterator
             i=excludeSet.begin();i!=excludeSet.end();i++ )
        cout << "\"" << *i << "\"" << " ";
      cout << endl;
    }
  }

  // read training data from file
  auto_ptr<SprAbsFilter> filter(reader->read(trFile.c_str()));
  if( filter.get() == 0 ) {
    cerr << "Unable to read data from file " << trFile.c_str() << endl;
    return 2;
  }
  vector<string> vars;
  filter->vars(vars);
  cout << "Read data from file " << trFile.c_str()
       << " for variables";
  for( int i=0;i<vars.size();i++ )
    cout << " \"" << vars[i].c_str() << "\"";
  cout << endl;
  cout << "Total number of points read: " << filter->size() << endl;
  const unsigned n0 = filter->ptsInClass(0);
  const unsigned n1 = filter->ptsInClass(1);
  cout << "Points in class 0: " << n0 << " 1: " << n1 << endl;

  // scale weights
  vector<double> origWeights;
  if( scaleWeights ) {
    filter->weights(origWeights);
    cout << "Signal weights are multiplied by " << sW << endl;
    filter->scaleWeights(1,sW);
  }

  // make optimization criterion
  SprTwoClassGiniIndex crit;

  // make decision tree
  bool doMerge = false;
  SprDecisionTree tree(filter.get(),&crit,nmin,doMerge,true);

  // train the tree on the original sample and save the original fom
  if( !tree.train(verbose) ) {
    cerr << "Unable to train decision tree." << endl;
    return 3;
  }
  double origFom = tree.fom();

  // save original class labels
  vector<pair<SprPoint*,int> > origLabels(filter->size());
  for( int i=0;i<filter->size();i++ ) {
    SprPoint* p = (*filter.get())[i];
    origLabels[i] = pair<SprPoint*,int>(p,p->class_);
  }

  // train decision tree on permuted replicas of the data
  cout << "Will perform " << cycles
       << " toy experiments for GoF calculation." << endl;
  vector<double> fom;
  SprIntegerPermutator permu(filter->size(),seed);
  assert( (n0+n1) == filter->size() );
  for( int ic=0;ic<cycles;ic++ ) {
    // print out
    if( (ic%10) == 0 )
      cout << "Performing toy experiment " << ic << endl;

    // permute labels
    vector<unsigned> labels;
    permu.sequence(labels);
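    // Assigning class 0 to the first n0 entries of the permuted index
    // sequence and class 1 to the remaining n1 entries relabels the points
    // at random while preserving the original class sizes.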
    for( int i=0;i<n0;i++ ) {
      unsigned ip = labels[i];
      (*filter.get())[ip]->class_ = 0;
    }
    for( int i=n0;i<n0+n1;i++ ) {
      unsigned ip = labels[i];
      (*filter.get())[ip]->class_ = 1;
    }

    // reset weights
    if( scaleWeights ) {
      filter->setPermanentWeights(origWeights);
      filter->scaleWeights(1,sW);
    }

    // reset tree
    tree.reset();

    // train the tree and save FOM
    if( !tree.train(verbose) ) continue;
    fom.push_back(tree.fom());
  }
  if( fom.empty() ) {
    cerr << "Failed to compute FOMs for any experiments." << endl;
    return 4;
  }

  // restore original class labels and weights
  // Not necessary here - just to remember that this needs to be done
  //  if data will be used in the future.
  for( int i=0;i<origLabels.size();i++ )
    origLabels[i].first->class_ = origLabels[i].second;
  if( scaleWeights ) filter->setPermanentWeights(origWeights);

  // compute fraction of experiments with worse FOM's
  stable_sort(fom.begin(),fom.end());
  vector<double>::iterator iter = find_if(fom.begin(),fom.end(),
                                          bind2nd(greater<double>(),origFom));
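  // fom is sorted, so iter points at the first toy FOM that exceeds the
  // original one: "below" counts toys with FOM <= origFom, "above" the rest.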
  int below = iter - fom.begin();
  int above = fom.size() - below;
  cout << below << " experiments out of " << fom.size()
       << " have better GoF values than the data." << endl;
  cout << "GoF=" << double(above)/double(fom.size()) << endl;

  // exit
  return 0;
}
