Improving classification performance with discretization on biomedical datasets

Original by J. L. Lustgarten, V. Gopalakrishnan, H. Grover, S. Visweswaran, 2018, 5 pages


  • Discretization is typically used as a pre-processing step for machine learning
  • Supervised discretization methods will discretize a variable to a single interval if the variable has little to no correlation with the target variable
  • Support Vector Machines (SVM) and Random Forest (RF) are favored for their ability to handle high-dimensional data
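
The single-interval behavior noted above can be illustrated with the acceptance test from Fayyad and Irani's MDLPC, the baseline method these notes mention. What follows is a minimal sketch of one splitting step only (the full method recurses on each side of an accepted cut); the function names are illustrative and do not come from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_mdl_cut(values, labels):
    """Return the best cut point if it passes the MDL test, else None.

    None means the variable stays as a single interval at this step.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    ent_s = entropy(ys)
    k = len(set(ys))
    best = None  # (gain, cut, left_labels, right_labels)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between equal values
        left, right = ys[:i], ys[i:]
        gain = ent_s - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2, left, right)
    if best is None:
        return None
    gain, cut, left, right = best
    k1, k2 = len(set(left)), len(set(right))
    # Fayyad-Irani criterion: accept only if the gain exceeds the MDL cost.
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    threshold = (math.log2(n - 1) + delta) / n
    return cut if gain > threshold else None

xs = list(range(1, 11))
print(best_mdl_cut(xs, [0] * 5 + [1] * 5))  # 5.5: variable separates the classes
print(best_mdl_cut(xs, [0, 1] * 5))         # None: variable is unrelated to the class
```

A variable for which no cut is accepted carries a single interval and so cannot influence a downstream classifier.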

Discretization method

  • Boullé developed Minimum Optimal Description Length (MODL), a discretization method based on the minimum description length (MDL) principle
  • MODL examines all possible discretizations, so it runs in O(n^3) time
  • The paper introduces a new Efficient Bayesian Discretization (EBD) method, which uses a Bayesian score to evaluate a discretization model
  • EBD runs faster than MODL, in O(n^2) time
  • EBD performs better than the commonly used MDLPC discretization algorithm of Fayyad and Irani
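
These notes do not spell out EBD's score or its search procedure, so the sketch below is not EBD itself. It only shows one common way a score-based discretizer can reach quadratic time: dynamic programming over interval end points, with each interval scored by a Dirichlet-multinomial marginal likelihood. The function names and the fixed per-interval `log_penalty` are assumptions made for illustration, standing in for whatever structure prior the paper uses.

```python
import math

def log_interval_score(counts):
    """Log marginal likelihood of the class counts in one interval,
    assuming a uniform Dirichlet prior over class probabilities."""
    k = len(counts)
    n = sum(counts)
    return (math.lgamma(k) - math.lgamma(n + k)
            + sum(math.lgamma(c + 1) for c in counts))

def dp_discretize(values, labels, n_classes, log_penalty=-2.0):
    """Return the sorted cut points of the best-scoring discretization.

    labels must be integers in range(n_classes). log_penalty is a
    hypothetical per-interval cost, not taken from the paper.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # prefix[i][c] = number of class-c instances among the first i sorted values
    prefix = [[0] * n_classes]
    for _, y in pairs:
        row = prefix[-1][:]
        row[y] += 1
        prefix.append(row)
    best = [0.0] + [-math.inf] * n  # best[i]: best score covering the first i values
    back = [0] * (n + 1)            # back[i]: start index of the last interval
    for i in range(1, n + 1):
        if i < n and pairs[i][0] == pairs[i - 1][0]:
            continue  # an interval cannot end between equal values
        for j in range(i):  # O(n^2) candidate intervals in total
            if best[j] == -math.inf:
                continue
            counts = [prefix[i][c] - prefix[j][c] for c in range(n_classes)]
            score = best[j] + log_interval_score(counts) + log_penalty
            if score > best[i]:
                best[i], back[i] = score, j
    cuts = []
    i = n
    while back[i] > 0:  # walk back through the chosen interval boundaries
        j = back[i]
        cuts.append((pairs[j - 1][0] + pairs[j][0]) / 2)
        i = j
    return sorted(cuts)

xs = list(range(1, 11))
print(dp_discretize(xs, [0] * 5 + [1] * 5, n_classes=2))  # [5.5]
print(dp_discretize(xs, [0, 1] * 5, n_classes=2))         # []
```

An exhaustive search would rescore every full discretization; the dynamic program reuses the best prefix solution, which is where the saving over a cubic method comes from in this sketch.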

Classification Performance Measure

  • Relative Classifier Information (RCI) quantifies how much of the uncertainty of a decision problem the classifier removes, relative to using only the prior probabilities of each class
  • Similar to the area under the ROC curve (AUC), it measures the discriminatory power of the classifier while minimizing the effect of the class distribution
  • The Wilcoxon paired-sample signed-rank test is used to compare RCI values
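
As a rough illustration of the measure described above, RCI can be computed from a confusion matrix as the fraction of prior class entropy removed by knowing the classifier's output. This sketch assumes that entropy-based definition and returns a fraction rather than a percentage; the function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def relative_classifier_information(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j.

    Returns (H_prior - H_conditional) / H_prior. Assumes at least two
    classes are present, otherwise the prior entropy is zero.
    """
    total = sum(sum(row) for row in confusion)
    # Uncertainty when only the class priors are known.
    h_prior = entropy([sum(row) / total for row in confusion])
    # Remaining uncertainty about the true class given the predicted class.
    h_cond = 0.0
    for j in range(len(confusion[0])):
        column = [row[j] for row in confusion]
        col_total = sum(column)
        if col_total == 0:
            continue  # this class was never predicted
        h_cond += (col_total / total) * entropy([c / col_total for c in column])
    return (h_prior - h_cond) / h_prior

print(relative_classifier_information([[5, 0], [0, 5]]))  # 1.0: perfect classifier
print(relative_classifier_information([[5, 5], [5, 5]]))  # 0.0: no better than the priors
```

Paired RCI values from two pipelines over the same datasets could then be compared with a signed-rank test, for example SciPy's `scipy.stats.wilcoxon`.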

Result

  • EBD resulted in a substantial decrease in the number of selected variables
  • EBD improved the performance of all the algorithms tested: SVM, RF and NB (Naive Bayes)
  • Using discrete values instead of continuous values improved the performance of RF and NB but not SVM
  • NB benefits from the smoothing of the parameters that discretization provides
  • The performance gain from discretization accrues to a large extent from variable selection and to a smaller extent from the transformation of the variable from continuous to discrete
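
The two effects in the last point, variable selection and the continuous-to-discrete transformation, can be separated in a small sketch. It assumes cut points have already been produced by some supervised discretizer; the function name and the data are made up for illustration.

```python
from bisect import bisect_right

def select_and_discretize(rows, cuts_per_variable):
    """Drop single-interval variables, then map the rest to interval indices.

    rows: list of instances, each a list of continuous values.
    cuts_per_variable: one sorted list of cut points per variable; an empty
    list means the discretizer left that variable as a single interval.
    """
    # Variable selection: keep only variables with at least one cut point.
    kept = [v for v, cuts in enumerate(cuts_per_variable) if cuts]
    # Transformation: replace each value by the index of its interval.
    discrete = [[bisect_right(cuts_per_variable[v], row[v]) for v in kept]
                for row in rows]
    return kept, discrete

rows = [[1.0, 7.0], [6.0, 3.0]]
cuts = [[5.5], []]  # second variable was discretized to a single interval
kept, discrete = select_and_discretize(rows, cuts)
print(kept)      # [0]
print(discrete)  # [[0], [1]]
```

Feeding a classifier the kept variables in continuous form, versus the discrete table, is one way to see how much each effect contributes.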