July 9, 2010
The search for structures in real datasets, e.g. in the form of bumps, components, classes, or clusters, is important because these often reveal underlying phenomena that lead to scientific discoveries. One such task, known as bump hunting, is to locate domains of a multidimensional input space where the target function assumes local maxima, without pre-specifying their total number. A number of related methods already exist, yet they are challenged in the context of high-dimensional data. We introduce a novel supervised and multivariate bump hunting strategy for exploring modes or classes of a target function of many continuous variables. The strategy addresses the issues of correlation, interpretability, and high dimensionality (the p >> n case) while making minimal assumptions. The method is based on a divide-and-conquer strategy that combines a tree-based method, a dimension reduction technique, and the Patient Rule Induction Method (PRIM). Importantly for this task, we show how to estimate the PRIM meta-parameters. Using accuracy evaluation procedures such as cross-validation and ROC analysis, we show empirically that the method outperforms a naive PRIM as well as competitive non-parametric supervised and unsupervised methods in the problem of class discovery. The method has practical applications, especially in the case of noisy high-throughput data. It is applied to a class discovery problem in a colon cancer microarray dataset, with the aim of identifying tumor subtypes in the metastatic stage. Supplemental Materials are available online.
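For readers unfamiliar with PRIM, the following is a minimal sketch of its top-down peeling step only, written in plain Python. It is not the strategy introduced here: it omits the tree-based step, the dimension reduction, the pasting step, and any estimation of the meta-parameters. The function name `prim_peel`, the parameter names `alpha` and `min_support`, and their default values are assumptions chosen for illustration, and the sketch assumes continuous variables without ties.

```python
def prim_peel(X, y, alpha=0.1, min_support=0.1):
    """Illustrative PRIM-style top-down peeling (a sketch, not the paper's method).

    X: list of rows, each a list of floats; y: list of floats (target).
    alpha: fraction of in-box points removed per peel (assumed default).
    min_support: smallest allowed box size as a fraction of all points
                 (assumed default).
    Returns (box, inside): box is a list of [low, high] bounds per variable,
    inside is the list of row indices remaining in the box.
    """
    n, p = len(X), len(X[0])
    inside = list(range(n))
    # Start from the bounding box of all observations.
    box = [[min(row[j] for row in X), max(row[j] for row in X)]
           for j in range(p)]
    min_count = max(1, int(min_support * n))

    while True:
        k = max(1, int(alpha * len(inside)))  # points removed by one peel
        if len(inside) - k < min_count:
            break  # a further peel would violate the minimum support
        best = None  # (mean of y after the peel, variable, side, kept indices)
        for j in range(p):
            order = sorted(inside, key=lambda i: X[i][j])
            for side in ("low", "high"):
                kept = order[k:] if side == "low" else order[:-k]
                mean_kept = sum(y[i] for i in kept) / len(kept)
                if best is None or mean_kept > best[0]:
                    best = (mean_kept, j, side, kept)
        _, j, side, inside = best
        # Shrink the chosen face of the box to the remaining points.
        values = [X[i][j] for i in inside]
        if side == "low":
            box[j][0] = min(values)
        else:
            box[j][1] = max(values)
    return box, inside
```

Each iteration removes a fraction `alpha` of the points from whichever face of the box most increases the mean of `y` among the remaining points, and stops once another peel would leave fewer than `min_support` of the observations. In practice the sequence of boxes would be examined, for example by cross-validation, rather than simply taking the last one.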