Mining of Microarray, Proteomics, and Clinical Data for Improved Identification of Chronic Fatigue Syndrome
Hongbo Xie, Zoran Obradovic, Slobodan Vucetic
Information Science and Technology Center, Temple University, 1805 N. Broad St., Philadelphia, PA 19122

Abstract. Chronic Fatigue Syndrome (CFS) is a recently recognized disease whose pathophysiology is insufficiently understood. The objective of this study was to explore whether the accuracy of CFS identification could be improved by using microarray and proteomics data, alone or integrated with clinical data. First, a two-step approach for selecting genetic CFS biomarkers from microarray data is proposed. The underlying assumption is that CFS is characterized by deviations in the expression of genes from a limited set of functions. The approach starts by selecting significantly differentially expressed genes with a standard statistical testing procedure. Using the Gene Ontology (GO) resource, the biological functions of the selected genes are then examined to find those that are highly overrepresented in the selection. Only the selected genes annotated with the most significant function are retained as biomarkers for identification of CFS. This approach yields a small set of biomarkers whose function is the most relevant to CFS. In our experiments, a Support Vector Machine (SVM) that used genes obtained by the two-step approach as attributes achieved higher accuracy than one that used genes obtained by the traditional one-step approach (e.g., 72% vs. 53% accuracy when selection was based on a p-value of 0.05). Moreover, the finding that mRNA processing is the most representative function is consistent with previously published results. In the second part of the study, the benefits of combining microarray and proteomics data in CFS identification were explored. Using the standard procedure for preprocessing ProteinChip data, we developed a proteomics-based predictor of CFS.
Our results on the 38 samples with both microarray and ProteinChip data indicate that predictor combination can provide improved CFS identification (79% accuracy for the combined predictor when the two approaches agree vs. 72% for microarray alone). However, an important observation is that the achieved CFS identification accuracy of less than 80% is relatively low compared with that for some other diseases, such as cancer. This suggests that identification of CFS biomarkers is a challenging task that requires significantly larger amounts of experimental data. Finally, we studied the clinical CFS data to discover factors that explain the sources of CFS identification mistakes. We discovered significant differences in mental health, physical fatigue, and general fatigue indicators among cases classified differently by the microarray and proteomics methods. This suggests that CFS identification could be improved by revising the definitions of certain clinical conditions.

Full text including tables at: http://tinyurl.com/qsre2 (i.e., http://www.camda.duke.edu/camda06/papers/days/thursday/obradovic/paper)
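
The two-step biomarker selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the statistical test used in either step, so the sketch assumes that per-gene p-values have already been computed, and it scores over-representation with a hypergeometric upper-tail test, which is one common choice. The function names, the gene identifiers, the term labels, and all numbers in the toy data are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, K of which carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def two_step_selection(p_values, annotations, alpha=0.05):
    """Return (best_term, enrichment_p, biomarkers).

    p_values:    gene -> differential-expression p-value (assumed precomputed)
    annotations: gene -> set of functional terms (e.g. GO terms)
    """
    # Step 1: keep genes that pass the significance threshold.
    selected = {g for g, p in p_values.items() if p < alpha}
    N, n = len(p_values), len(selected)

    # Step 2: find the term most over-represented among the selected genes.
    terms = set()
    for gene_terms in annotations.values():
        terms |= set(gene_terms)

    best_term, best_p = None, 1.0
    for term in sorted(terms):  # sorted for a deterministic tie-break
        K = sum(1 for g in p_values if term in annotations.get(g, ()))
        k = sum(1 for g in selected if term in annotations.get(g, ()))
        if k == 0:
            continue
        p = hypergeom_upper_tail(k, K, n, N)
        if best_term is None or p < best_p:
            best_term, best_p = term, p

    # Keep only the selected genes annotated with the winning term.
    biomarkers = sorted(
        g for g in selected if best_term in annotations.get(g, ())
    )
    return best_term, best_p, biomarkers


# Hypothetical toy data: 10 genes, 4 of which pass the 0.05 threshold.
toy_p = {f"g{i}": (0.01 if i < 4 else 0.5) for i in range(10)}
toy_ann = {
    "g0": {"term_A"}, "g1": {"term_A"}, "g2": {"term_A"}, "g4": {"term_A"},
    "g3": {"term_B"}, "g5": {"term_B"}, "g6": {"term_B"}, "g7": {"term_B"},
}
term, enrich_p, biomarkers = two_step_selection(toy_p, toy_ann)
print(term, round(enrich_p, 3), biomarkers)
```

In this sketch the one-step baseline corresponds to using every gene in `selected` as a classifier attribute, while the two-step variant passes only `biomarkers` to the classifier.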