Hierarchical k-means clustering using principal components to solve the unsupervised multi-class classification problem

CINF 19

James F. Rathman (1), rathman.1@osu.edu, Syed B. Mohiddin (1), and Chihae Yang (2), cyang@leadscope.com. (1) Department of Chemical and Biomolecular Engineering, The Ohio State University, Koffolt Laboratories, 140 West 19th Avenue, Columbus, OH 43210-1110; (2) Leadscope, Inc., 1393 Dublin Road, Columbus, OH 43215

Current classification techniques can be grouped as either supervised or unsupervised. In a supervised method, each observation in the training dataset is pre-assigned to a class based on prior knowledge, whereas an unsupervised method uses no prior knowledge of class membership. Numerous supervised techniques have been shown to work well for binary classification, and a few of these are reasonably good at multi-class prediction. Techniques for unsupervised binary and multi-class prediction, however, have not been fully developed. In this work, we present an analysis technique based on hierarchical k-means clustering of differentially weighted principal components to address unsupervised classification for both binary and multi-class problems. We demonstrate the methodology on biological datasets (the NCI 60 cancer cell lines dataset and an acute leukemia dataset) and on chemical datasets, with two objectives: predicting class membership and identifying the non-redundant features most responsible for differentiating the observed classes.
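
The general idea of clustering hierarchically on weighted principal components can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme or how the hierarchy is built, so the code below makes two assumptions that should not be read as the authors' method: each principal component score is weighted by its share of the explained variance, and the hierarchy is grown by splitting every cluster in two (k = 2) at each level. All function names and parameters (`weighted_pc_scores`, `kmeans`, `hierarchical_kmeans`, `depth`, `min_size`) are hypothetical.

```python
import numpy as np

def weighted_pc_scores(X, n_components):
    """Project X onto its leading principal components, then weight each
    component by its fraction of total variance (an assumed scheme)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    variances = s ** 2
    weights = variances[:n_components] / variances.sum()
    return scores * weights

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one integer label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def hierarchical_kmeans(Z, depth, min_size=4, seed=0):
    """Split every cluster in two at each level (assumed bisecting scheme).
    Clusters with fewer than min_size points are left unsplit."""
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(depth):
        new_labels = labels * 2
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            if len(idx) >= min_size:
                new_labels[idx] += kmeans(Z[idx], 2, seed=seed)
        labels = new_labels
    # Relabel to consecutive integers 0..n_clusters-1
    return np.unique(labels, return_inverse=True)[1]

# Synthetic demonstration: two well-separated groups of 20 observations
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 10)),
    rng.normal(10.0, 1.0, size=(20, 10)),
])
Z_demo = weighted_pc_scores(X_demo, n_components=3)
labels_demo = hierarchical_kmeans(Z_demo, depth=1)
```

Weighting the scores lets components that carry more variance dominate the distance calculation in k-means; other weightings (for example, ones tuned per split) would slot into `weighted_pc_scores` without changing the rest of the sketch.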