A survey of discretization techniques: taxonomy and empirical analysis in supervised learning

thumbnail of 2013-Garcia-IEEETKDE.compressed Original by S. Garcia, J. Luengo, J. A. Sáez, V. López, F. Herrera, IEEE Computer Society, 2013, 16 pages 

This summary note was Posted on

Main characteristics of discretizers

  • Statics vs dynamic: static is independent of the learner and acts prior to the learning task. Most are static. Dynamic are ID3 discretizers, ITFP
  • Univariate vs multivariate: multivariate consider all attributes to define the initial set of cut points (or final ones). Univariate inly work with one attribute at the time
  • Supervised vs unsupervised: Supervised consider heuristic measures to determine the best cut points( entropy, interdependence etc..). Most are supervised. Unsupervised discretizers are equal width and equal frequency.
  • Splitting vs merging: Splitting methods establish a cut point among all possible boundaries and divide the domain into two intervals. Merging starts with a predefined partition and remove a candidate cut point to mix both intervals
  • Global vs local:  To make a decision a discretizer can either require all available data in the attribute or use only partial information
  • Direct vs incremental: direct discretizers divide the range into k intervals simultaneously. By contrast incremental pass through an improvement process (also called hierarchical discretizers)
  • Evaluation measure: metrics used by the discretizer to compare to candidate schemes: (information measure, statistical measure, rough sets, wrappers, binning)
  • Parametric vs nonparametric: parametric (ChiMerge, CADD) requires a  maximum number of intervals fixed by the user. Non parametric (MDLP, CAIM) computes the minimum number of intervals considering a tradeoff with loss of information
  • Top down vs bottom up: Top down starts with an empty discretization and adds a new cut point every time (MDLP). Bottom up starts with all possible cutpoints and merge two intervals (ChiMerge)
  • Stopping condition: Must be specified for non parametric approaches (minimum description length measure, confidence thresholds, inconsistency ratios)
  • Disjoint vs non disjoint: In a disjoint discretization intervals can not overlap
  • Ordinal vs nominal: ordinal discretization transforms quantitative data into ordered qualitative data (non common)

Comparison criteria

  • Number of intervals
  • Inconsistency
  • Accuracy: Use Cohen’ s kappa which compensates for random hits. Original purpose was to measure the degree of dis/agreement between two people observing the came phenomenon. Less expressive as than ROC curves when applied to binary classification but effective for multiclass problems
  • Predictive classification rate
  • Time

Results

Can not recommend one best performing method, it depends upon the problem tackled

  • FUSINTER, ChiMerge, CAIM and Modified CHi2 offer excellent performances over all
  • PKID, FFD are suitable methods for lazy d learning and CACC, Distance and MODL are good choices in the rule induction learning
  • FUSINTER, Distance, Chi2, MDLP and UCPD obtain satisfactory tradeoff between the number of intervals and accuracy
  • CAIM is one of the simplest discretizer and is pretty effective