Classifying Lung Adenocarcinoma and Squamous Cell Carcinoma using RNA-Seq Data

Authors

  • Zhengyan Huang
  • Li Chen
  • Chi Wang

Keywords:

LUAD, LUSC, Principal Components, K-Nearest Neighbors

Abstract

Background: Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the two primary subtypes of non-small cell lung carcinoma (NSCLC). Currently, the most widely used method to discriminate between LUAD and LUSC is hematoxylin-eosin (HE) staining. However, this method cannot always distinguish LUAD from LUSC precisely, and more accurate diagnostic approaches are highly desired.

Methods: We propose to use gene expression profiles to determine a patient’s NSCLC subtype. We used RNA-Seq data from The Cancer Genome Atlas (TCGA) and randomly split the data into training and testing subsets. To construct classifiers from the training data, we considered three methods: logistic regression on principal components (PCR), logistic regression with LASSO shrinkage (LASSO), and k-nearest neighbors (KNN). The performance of the classifiers was evaluated and compared on the testing data.

Results: All gene expression-based classifiers show high accuracy in discriminating between LUSC and LUAD. The classifier obtained by LASSO has the smallest overall misclassification rate, 3.42% (95% CI: 3.25%-3.60%), when 0.5 is used as the cutoff value for the predicted probability of belonging to a subtype, followed by the classifiers obtained by PCR (4.36%, 95% CI: 4.23%-4.49%) and KNN (8.70%, 95% CI: 8.57%-8.83%). The LASSO classifier also has the highest average area under the receiver operating characteristic curve (AUC), 0.993, compared with PCR (0.987) and KNN (0.965).

Conclusions: Our results suggest that mRNA expression levels are highly informative for classifying NSCLC subtypes and may potentially be used to assist clinical diagnosis.
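
The two evaluation measures named in the abstract, the misclassification rate at a 0.5 probability cutoff and the AUC, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the label coding (1 = LUSC, 0 = LUAD), and the toy probabilities are all hypothetical, chosen only to show how each measure is computed from predicted probabilities on a test set.

```python
def misclassification_rate(probs, labels, cutoff=0.5):
    """Fraction of samples whose predicted class differs from the true label.

    A sample is predicted as class 1 when its probability is >= cutoff.
    """
    preds = [1 if p >= cutoff else 0 for p in probs]
    errors = sum(1 for pred, y in zip(preds, labels) if pred != y)
    return errors / len(labels)


def auc(probs, labels):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen class-1 sample receives a
    higher predicted probability than a randomly chosen class-0 sample,
    with ties counted as one half.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy test set (assumed coding: 1 = LUSC, 0 = LUAD).
toy_probs = [0.95, 0.80, 0.40, 0.65, 0.10, 0.30, 0.05, 0.55]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(misclassification_rate(toy_probs, toy_labels))  # 0.25
print(auc(toy_probs, toy_labels))                     # 0.9375
```

In the toy data, two of eight samples fall on the wrong side of the 0.5 cutoff, and 15 of the 16 positive-negative pairs are ranked correctly. The AUC does not depend on the cutoff, which is why the two measures can rank classifiers differently.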

Published

2017-09-19