Karthik Devarajan, PhD

This Fox Chase professor participates in the Undergraduate Summer Research Fellowship
Learn more about Research Volunteering.

Associate Professor, Population Science

Adjunct Associate Professor, Department of Clinical Science, Lewis Katz School of Medicine, Temple University, Philadelphia, PA

Affiliated Faculty Member, Center for High-Dimensional Statistics, Big Data Institute, Temple University, Philadelphia, PA

Research Program

 

Educational Background

  • PhD, Northern Illinois University
  • MSc (Tech), Birla Institute of Technology & Science, India

Industry Experience

  • Statistical Scientist, Cancer Bioinformatics, AstraZeneca R&D Boston, Waltham, MA
  • Biostatistician, Bristol-Myers Squibb Pharmaceutical Research Institute, Bristol-Myers Squibb, Princeton, NJ

People

  • Matthew Smith, BS, MPH

Research Interests

Unsupervised learning methods

  • Unsupervised dimension reduction and model-based clustering for high-dimensional data with applications in molecular pattern discovery, biomedical informatics, imaging and neuroscience
  • Assessment of technical reproducibility and probability-based methods for outlier detection in large-scale biological data

Supervised and semi-supervised learning methods

  • Analysis of censored survival data
  • Feature selection and predictive modeling for large-scale genomic data in the presence of censored survival outcomes
  • Integrative genomics analysis investigating the association between digital gene expression, single nucleotide polymorphisms, copy number variation, methylation and censored survival outcomes
  • Statistical machine learning methods for biomarker discovery

Lab Overview

Advances in high-throughput genomic technologies over the past two decades have given rise to large-scale biological data measured on a variety of scales. Genome-wide studies enable the simultaneous measurement of the expression profiles of tens of thousands of genomic features from an ever-increasing number of biological samples that may represent phenotypes, experimental conditions or time points. Examples include studies of various types of gene and protein expression, methylation and copy number variation, and high-throughput compound screening assays, among others. Similarly, studies in biomedical imaging and computational neuroscience generate tens of thousands of signals from brain or muscle activity under a variety of experimental conditions across the time-frequency domain. These massive data sets offer tremendous potential for growth in our understanding of the pathophysiology of many diseases. My research spans the two major areas of statistical learning, unsupervised and supervised, as well as survival analysis, with applications in the aforementioned domains. Its principal focus is the development of statistical and computational approaches for high-dimensional data, including methods for dimension reduction and methods for correlating a quantitative or qualitative outcome variable (such as patient survival time, presence of disease or patient response to treatment) with a large number of covariates (genomic, clinical, laboratory and demographic variables). Our current research activities involve the development of methods for analyzing data from microbiome, radiomics and single-cell RNA-seq studies.

Unsupervised learning methods

Unsupervised dimension reduction

We have developed methods for unsupervised dimension reduction and model-based clustering of large-scale biological data and demonstrated their applications in high-throughput genomics, biomedical informatics, imaging and computational neuroscience using non-negative matrix factorization (NMF). An important, but often ignored, aspect of high-dimensional biological data is the signal-dependent and correlated nature of noise in the measurements. We addressed this problem by developing a variety of methods (i) using an information-theoretic approach, (ii) by extending NMF using the theory of generalized linear models and quasi-likelihood and (iii) by developing a statistical framework for NMF using generalized dual divergence. Our methods provide a unified framework for the modeling and analysis of data obtained on different scales and are broadly applicable to a variety of high-dimensional data. We have developed computational tools for dimension reduction and visualization using NMF that are freely available to the academic research community. These include hpcNMF, a C++ package that uses high-performance computing clusters (http://devarajan.fccc.edu/) and the R package gnmf (http://cran.r-project.org/web/packages/gnmf/index.html).
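
To make the idea of NMF under signal-dependent noise concrete, the following is a minimal sketch, not the hpcNMF or gnmf implementation. It assumes the standard multiplicative updates that minimize the Kullback-Leibler divergence between a non-negative data matrix V and its factorization WH, which corresponds to a Poisson likelihood in which the noise variance grows with the signal. The function names nmf_kl and kl_divergence, the parameter defaults and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def kl_divergence(V, WH, eps=1e-10):
    """Generalized Kullback-Leibler divergence D(V || WH)."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))


def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (n x m) as W (n x rank) @ H (rank x m).

    Uses multiplicative updates for the KL divergence; this is an
    illustrative assumption, not the algorithm of any package named above.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update H: rescale by the ratio of observed to fitted values.
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # Update W with the symmetric rule.
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


if __name__ == "__main__":
    # Simulated count data with Poisson (signal-dependent) noise.
    rng = np.random.default_rng(42)
    truth = rng.random((100, 4)) @ rng.random((4, 30)) * 50
    V = rng.poisson(truth).astype(float)
    W, H = nmf_kl(V, rank=4)
    print("KL divergence after fitting:", kl_divergence(V, W @ H))
```

In a genomic setting the rows of V would typically be features and the columns samples, so that the columns of W can be read as molecular patterns and the columns of H as sample-level weights on those patterns. The rank must be chosen by the analyst; this sketch does not address that choice.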

Outlier detection

A problem that arises frequently in high-throughput studies is the assessment of technical reproducibility of data obtained under homogeneous experimental conditions. This is an important problem considering the significant growth in the number of high-throughput technologies that have become available to researchers in the past two decades. Existing methods for determining data quality are typically graphical, lack statistical rigor and do not necessarily translate to data obtained across multiple technologies; there is also an inherent need for quantitative evaluation of reproducibility. To this end, we have developed empirical model-based methods as well as probability-based methods that account for the technical variability and potential asymmetry that arise naturally in replicate data. This data-driven approach borrows strength from the large volume of available data and is broadly applicable to a variety of high-throughput studies (such as next-generation sequencing, compound and siRNA screening and other modern “omics” studies) for assessing technical reproducibility and identifying outliers. The R package replicateOutliers implements five different methods for outlier detection and is available at https://github.com/matthew-seth-smith/replicateOutliers.

Supervised and semi-supervised learning methods

Analysis of censored survival data

In studies where information on an outcome variable such as time to an event (or survival time) is available, one of the goals of an investigator is to understand how the expression levels of genomic, clinical and demographic variables (covariates) relate to an individual’s survival in the course of a disease. The analysis of time to event (or survival) data arises in many fields of study such as biology, medicine and public health, and its role and significance in cancer research cannot be overstated. The Cox proportional hazards (PH) model is the most celebrated and widely used statistical model linking survival time to covariates. It is a multiplicative hazards model that implies a constant hazard ratio and assumes that the hazard and survival curves do not cross. While this model has proved to be very useful in practice due to its simplicity and interpretability, the assumption of a constant hazard ratio has been shown to be invalid in a variety of situations in medical studies. For example, non-proportional hazards are typical when the treatment effect increases or decreases over time, leading to converging or diverging hazards. This situation cannot be handled by the Cox PH model, and more general models that consider non-proportionality of hazards are required for modeling survival data. To this end, we have developed a class of non-proportional hazards models that embeds the Cox PH model as a special case. We proposed a theoretical and a computational framework for estimation using this generalized model that allows us to rigorously test the PH assumption. Furthermore, we have developed information-theoretic methods to test the effect of an individual covariate or a group of covariates in the PH model as well as in complex survival models that account for varying trends in hazard over time. By identifying different classes of probability link models with symmetric information divergence, we have proposed computationally efficient solutions to the problem of model averaging and model selection.
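
For readers less familiar with the PH assumption, the textbook form of the Cox model is sketched below. The notation is an assumption introduced here for illustration: h_0(t) is the baseline hazard, S_0(t) the baseline survival function, x a covariate vector and β the regression coefficients. The last line shows one generic way proportionality can fail, through a time-varying coefficient β(t); it is not the specific generalized model developed by the lab.

```latex
% Cox proportional hazards model
h(t \mid x) = h_0(t)\,\exp(\beta^\top x)

% Hazard ratio for two covariate vectors x_1 and x_2: free of t
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\{\beta^\top (x_1 - x_2)\}

% Survival functions are powers of a common baseline, so they cannot cross
S(t \mid x) = S_0(t)^{\exp(\beta^\top x)}

% A generic non-proportional alternative: the coefficient varies with time
h(t \mid x) = h_0(t)\,\exp\{\beta(t)^\top x\}
```

Because the hazard ratio in the second line does not depend on t, any data in which a treatment effect strengthens or fades over time violates the model, which is the situation described above.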

Feature selection and predictive modeling for large-scale genomic data with censored survival outcomes

Within the context of high-throughput genomic data, our preliminary work involved the development of a model for predicting patient survival by extracting genomic components that were strongly correlated with it. In this high-dimensional setting, it is unreasonable to expect the expression levels of the many thousands of genomic features to exhibit proportionality in hazards. Our current research interests in this area include the systematic comparison of several well-known models for correlating genomic feature expression with patient survival and the identification of features that demonstrate a time-varying effect using publicly available data from repositories such as Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA).

We have developed an array of measures, using information divergence, for quantifying explained randomness in different survival models that incorporate time-varying effects of features, and a generalized pseudo-R² index that covers a spectrum of such survival models. Indeed, our investigations involving the re-analysis or meta-analysis of existing data sets have revealed such a time-varying trend exhibited by several genomic features implicated in kidney, head and neck, ovarian and brain cancers. Furthermore, we have developed methods using continuum regression (CR), a unified framework for supervised dimension reduction, in conjunction with the accelerated failure time model for predictive modeling. CR embeds a spectrum of regression methods into a single framework that includes ordinary least squares, partial least squares and principal components regression as special cases, thereby enabling a powerful array of methods to be developed for this problem within the linear models framework. R packages implementing these methods are freely available to the research community at https://github.com/lburns27/Feature-Selection and at https://github.com/lburns27/ACPR-AFT.

Integrative genomics analysis

In collaboration with the Ragin laboratory, we are investigating the association between digital gene expression, single nucleotide polymorphisms (SNPs), copy number variation, methylation and survival in different cancers using data from TCGA and GEO. One such integrative genomic analysis identified several ancestry-related SNPs for the POLB gene and supported the association of genetic ancestry with survival disparity in head and neck cancer. A follow-up study analyzing genome-wide expression quantitative trait loci identified candidate genes associated with survival. In an ongoing study funded by the ACS (Molecular Modeling, Genomics and Racial Disparities in HNSCC, PI: Ragin), we are investigating (i) the genetic susceptibility of Blacks in developing HNSCC by aiming to identify distinctive polymorphic and metabolic profiles related to gene expression and function and (ii) the association between treatment and survival according to race by aiming to determine whether genetic variations related to relevant biological pathways modify this association.

Biomarker discovery

Another area of active interest is the development and novel application of modern statistical machine learning methods for detecting the presence of cancer in a cohort of patients based on biomarker measurements and clinical variables. In collaboration with colleagues at the Drexel University College of Medicine, we have systematically compared the performance of various methods and developed the Doylestown algorithm, which is better able to detect the presence of hepatocellular carcinoma in the background of cirrhosis using levels of established serum biomarkers and other relevant clinical characteristics of the patient. Our algorithm has been independently validated by the Early Detection Research Network as well as the National Cancer Institute, and provides a significant improvement in prediction accuracy of up to 20%.

Selected Publications

Spirko-Burns L, Devarajan K. Supervised dimension reduction for large-scale "omics" data with censored survival outcomes under possible non-proportional hazards. IEEE/ACM Trans Comput Biol Bioinform. 18(5): 2032-2044, 2021. https://www.ncbi.nlm.nih.gov/pubmed/31940547.

Spirko-Burns L, Devarajan K. Unified methods for feature selection in large-scale genomic studies with censored survival outcomes. Bioinformatics, Volume 36, Issue 11, June 2020, Pages 3409–3417, https://doi.org/10.1093/bioinformatics/btaa161. PMID: 32154833.

Devarajan K, Cheung VC. A Quasi-Likelihood Approach to Nonnegative Matrix Factorization. Neural Computation. 2016 Aug;28(8):1663-93. Epub 2016 Jun 27. PubMed PMID: 27348511; PubMed Central PMCID: PMC5549860.

Devarajan K, Wang G, Ebrahimi N. A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing. Machine Learning. 2015 Apr 1;99(1):137-163. PMID: 25821345; PMCID: PMC4371760.

Devarajan K, Cheung VC. On nonnegative matrix factorization algorithms for signal-dependent noise with application to electromyography data. Neural Computation. 2014 Jun;26(6):1128-68. PMID: 24684448; PMCID: PMC5548326.

Devarajan K, Ebrahimi N. A semi-parametric generalization of the Cox proportional hazards regression model: Inference and Applications. Computational Statistics and Data Analysis. 2011 Jan 1;55(1):667-676. PMID: 21076652; PMCID: PMC2976538.

Devarajan K. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology. 2008 Jul 25;4(7):e1000029. doi: 10.1371/journal.pcbi.1000029. Review. PMID: 18654623; PMCID: PMC2447881.

Chang WL, Jackson C, Riel S, Cooper HS, Devarajan K, Hensley HH, Zhou Y, Vanderveer LA, Nguyen MT, Clapper ML. Differential preventive activity of sulindac and atorvastatin in Apc(+/Min-FCCC) mice with or without colorectal adenomas. Gut. 2018 Jul;67(7):1290-1298. Epub 2017 Nov 9. PubMed PMID: 29122850; PubMed Central PMCID: PMC6031273.

Ramakodi MP, Devarajan K, Blackman E, Gibbs D, Luce D, Deloumeaux J, Duflo S, Liu JC, Mehra R, Kulathinal RJ, Ragin CC. Integrative genomic analysis identifies ancestry-related expression quantitative trait loci on DNA polymerase β and supports the association of genetic ancestry with survival disparities in head and neck squamous cell carcinoma. Cancer. 2017 Mar 1;123(5):849-860. PMID: 27906459; PMCID: PMC5319896.

Additional Publications

MyNCBI

Statistical and Computing Software

hpcNMF: C++-based software package for generalized non-negative matrix factorization using high-performance computing clusters, available in Linux, Windows and Mac OS versions (jointly with G. Wang). Available at devarajan.fccc.edu.

gnmf: an R package for generalized non-negative matrix factorization (jointly with J. Maisog and G. Wang). Available at cran.r-project.org/web/packages/gnmf/.

The Doylestown Algorithm: A program for evaluating the performance of biomarkers in the detection of hepatocellular carcinoma (jointly with M. Wang and A. Mehta).

replicateOutliers: an R package implementing probability-based outlier detection methods for replicated data (jointly with M. Smith). Available at github.com/matthew-seth-smith/replicateOutliers.

ACPR-AFT: Algorithms for supervised dimension reduction of large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns). Available at github.com/lburns27/ACPR-AFT.

Feature-Selection: Methods for feature selection in large-scale “omics” data with censored survival outcomes under possible non-proportional hazards (jointly with L. Spirko-Burns). Available at github.com/lburns27/Feature-Selection.

Pre-prints available online

Smith, M., Devarajan, K. Probability-based methods for outlier detection in replicated high-throughput biological data. bioRxiv 240473; doi: https://doi.org/10.1101/2020.08.07.240473.

Asadi, M., Devarajan, K., Ebrahimi, N., Soofi, E., Spirko-Burns, L. Probability link models with symmetric information divergence. arXiv: 2008.04387v1 [stat.ML] 10 Aug 2020. https://arxiv.org/abs/2008.04387.

Devarajan, K. (2019). Non-negative matrix factorization based on generalized dual divergence. arXiv: 1905.07034v1 [stat.ML] 16 May 2019. https://arxiv.org/abs/1905.07034.

Devarajan, K., Wang, G. (2016). hpcNMF – a high performance toolbox for non-negative matrix factorization. COBRA pre-print series, Article 115 (April 2016). http://biostats.bepress.com/cobra/art115.  
