Genetic risk variants for schizophrenia have been linked to many related clinical and biological phenotypes in the hope of delineating how individual variation across thousands of variants corresponds to the clinical and etiologic heterogeneity within schizophrenia. This has primarily been done using risk score profiling, which aggregates effects across all variants into a single predictor. While effective, this method lacks flexibility in two respects: risk scores cannot capture nonlinear effects and do not perform any variable selection. We used random forest, an algorithm that offers this flexibility and is designed to maximize predictive power, to predict 6 cognitive endophenotypes in a combined sample of psychiatric patients and controls (N = 739) using 77 genetic variants strongly associated with schizophrenia. Tenfold cross-validation was applied to the discovery sample, and models were externally validated in an independent sample of similar ancestry (N = 336). Linear approaches, including linear regression and task-specific polygenic risk scores, were employed for comparison. Random forest models for processing speed (P = .019) and visual memory (P = .036) and risk scores developed for verbal (P = .042) and working memory (P = .037) successfully generalized to an independent sample with similar predictive strength and error. As such, we suggest that both methods may be useful for mapping a limited set of predetermined, disease-associated SNPs to related phenotypes. Incorporating random forest and other more flexible algorithms into genotype–phenotype mapping studies could contribute to parsing heterogeneity within schizophrenia; such algorithms can perform as well as standard methods and can capture a more comprehensive set of potential relationships.
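
To make "aggregates effects across all variants into a single predictor" concrete, here is a minimal sketch of an additive risk score: a weighted sum of risk-allele counts. The function name `polygenic_risk_score`, the three variants, and their weights are hypothetical and chosen only for illustration; they are not the 77 variants or the effect sizes used in the study.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele counts (0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant effect sizes (made up for this example).
weights = [0.10, -0.05, 0.20]

# Two hypothetical individuals, each with allele counts at the three variants.
person_a = [2, 0, 1]
person_b = [0, 2, 1]

score_a = polygenic_risk_score(person_a, weights)
score_b = polygenic_risk_score(person_b, weights)
print(f"score_a={score_a:.2f}, score_b={score_b:.2f}")
```

Because every variant contributes independently and linearly, a score of this form cannot represent interactions between variants or drop uninformative ones, which is the limitation that motivates a more flexible learner such as random forest.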