BAKR: Bayesian Approximate Kernel Regression

BAKR is a software package that provides an effect size analog for each of the input features within Bayesian kernel regression models. Nonlinear kernel regression models are often used in statistics and machine learning due to greater accuracy than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. BAKR uses function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The software is distributed under the GNU General Public License.
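One standard way to build a linear space that captures the structure of a shift-invariant kernel is with random Fourier features. The sketch below is illustrative only and is not taken from the BAKR package: the function names, the Gaussian kernel choice, and the bandwidth parameterization are all assumptions. It shows that inner products of the random features approximate the exact kernel value.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def random_fourier_map(dim, n_features, bandwidth, seed=0):
    """Return a feature map z(.) with z(x) . z(y) ~= gaussian_kernel(x, y).

    Hypothetical helper for illustration; not part of the BAKR API.
    """
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, 1/bandwidth^2).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(n_features)]
    # Random phases are uniform on [0, 2*pi].
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w * a for w, a in zip(freq, x)) + b)
                for freq, b in zip(freqs, phases)]

    return feature_map


if __name__ == "__main__":
    fmap = random_fourier_map(dim=3, n_features=5000, bandwidth=1.5, seed=1)
    x, y = [0.2, -0.1, 0.4], [1.0, 0.3, -0.5]
    approx = sum(a * b for a, b in zip(fmap(x), fmap(y)))
    print("approximate:", approx, "exact:", gaussian_kernel(x, y, 1.5))
```

Because a model that is linear in these features is also linear in a finite set of coefficients, those coefficients can be mapped back onto the original explanatory variables, which is the role the effect size analog plays in the description above.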

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, K.C. Wood, X. Zhou, and S. Mukherjee (2018). Bayesian approximate kernel regression with variable selection. Journal of the American Statistical Association. 113(524): 1710-1721.

Contact:

Please contact Lorin Crawford with any comments or questions.


BANNs: Biologically Annotated Neural Networks

BANNs is a software package that implements a class of probabilistic feedforward Bayesian models with partially connected architectures that are guided by predefined SNP-set annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. Part of the key innovation in BANNs is to treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses scalable variational inference to provide fully interpretable posterior summaries which allow researchers to simultaneously perform (i) fine-mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. The software is distributed under the GNU General Public License.

Download:

We implement BANNs in three different software packages. The first two are implemented in Python using TensorFlow and NumPy, respectively. The third version is implemented in R. All software is currently available on GitHub.

Citations:

P. Demetci*, W. Cheng*, G. Darnell, X. Zhou, S. Ramachandran, and L. Crawford (2021). Multi-scale inference of genetic architecture using biologically annotated neural networks. PLOS Genetics. 17(8): e1009754.

Contact:

Please contact Pinar Demetci or Wei Cheng with any comments or questions.


ESNN: Ensemble of Single-Effect Neural Networks

ESNN is a software package that implements the “ensemble of single-effect neural networks” framework which generalizes the “sum of single-effects” regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes (e.g., case-control studies). The ESNN model uses scalable variational inference with an assumed grouped “single-effect” shrinkage prior on the input weights of neural networks which allows it to produce posterior inclusion probabilities and credible sets that can guide variable selection. While motivated by fine-mapping in genome-wide association (GWA) studies, this method is also applicable to other fields especially when data are correlated and sparse. The software is distributed under the MIT License.
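To make the idea of a credible set concrete, the sketch below uses the construction that is common in "sum of single-effects" style fine-mapping: rank variables by posterior inclusion probability (PIP) and keep the smallest set whose cumulative probability reaches a target level. The function name and the example probabilities are hypothetical and are not drawn from the ESNN code; the sketch assumes the PIPs for one single effect sum to one.

```python
def credible_set(pips, level=0.95):
    """Smallest set of variable indices whose PIPs sum to at least `level`.

    `pips` holds the posterior inclusion probabilities for one single effect
    (assumed to sum to one). Hypothetical helper, not the ESNN API.
    """
    # Visit variables from the highest PIP to the lowest.
    order = sorted(range(len(pips)), key=lambda j: pips[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += pips[j]
        if total >= level:
            break
    return sorted(chosen)


if __name__ == "__main__":
    example_pips = [0.02, 0.70, 0.25, 0.03]  # made-up values for illustration
    print(credible_set(example_pips, level=0.90))
```

A small credible set indicates that the signal is well localized, while a large one indicates that correlated variables cannot be told apart.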

Download:

Source code and tutorials for implementing the “ensemble of single-effect neural networks” (ESNN) framework are publicly available on GitHub.

Citations:

W. Cheng, S. Ramachandran, and L. Crawford. Uncertainty quantification in variable selection for genetic fine-mapping using Bayesian neural networks. bioRxiv. 2022.02.23.481675.

Contact:

Please contact Wei Cheng with any comments or questions.


gene-ε: Recalibrated Hypothesis Test for SNP-Level Summary Statistics

gene-ε (pronounced "genie") is software that implements a new empirical Bayesian approach for identifying statistical associations between sets of variants and quantitative traits. The central innovation of gene-ε is reformulating the genome-wide association null model to distinguish between (i) mutations that are statistically associated with the disease but are unlikely to directly influence it, and (ii) mutations that are most strongly associated with a disease of interest. With a reformulated SNP-level null hypothesis, gene-ε presents a powerful framework for enrichment methods and scales well for application to emerging biobank datasets. The software is distributed under the MIT License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, S. Ramachandran, and L. Crawford (2020). Estimation of non-null SNP effect size distributions enables the detection of enriched genes underlying complex traits. PLOS Genetics. 16(6): e1008855.

Contact:

Please contact Wei Cheng with any comments or questions.


Grid-LMM: Fast and Flexible Linear Mixed Models for Genetic Association Studies

Grid-LMM is a software package for fitting linear mixed models (LMMs) with multiple random effects. The fitting process is optimized for repeated evaluation of the random effect model with different sets of fixed effects (e.g., for genome-wide association studies or GWAS analyses). The fit is approximate because it uses a discrete grid of possible values for the random effect variance component proportions. Grid-LMM includes functions for both frequentist and Bayesian GWAS, (restricted) maximum likelihood (REML) evaluation, and Bayesian posterior inference of variance components. The software is distributed under the MIT License.
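The grid idea can be illustrated with a short sketch. The code below is not the Grid-LMM implementation: the function names, the even grid spacing, and the toy scoring function are assumptions. It enumerates candidate variance component proportions for several random effects, keeps only those that leave a positive residual proportion, and returns the candidate that maximizes a user-supplied score such as a log-likelihood.

```python
from itertools import product


def proportion_grid(n_random_effects, steps):
    """Grid of variance proportions with spacing 1/steps.

    Each point is a tuple (h_1, ..., h_m) whose entries sum to less than one,
    so the residual always keeps a positive share of the variance.
    """
    points = []
    for idx in product(range(steps), repeat=n_random_effects):
        if sum(idx) < steps:
            points.append(tuple(i / steps for i in idx))
    return points


def best_grid_point(grid, score):
    """Return the grid point with the highest score (e.g., a log-likelihood)."""
    return max(grid, key=score)


if __name__ == "__main__":
    grid = proportion_grid(n_random_effects=2, steps=4)
    print(len(grid), "candidate points")

    # Toy score peaked at (0.5, 0.25); a real analysis would evaluate the
    # (restricted) likelihood of the mixed model at each grid point.
    def toy_score(h):
        return -((h[0] - 0.5) ** 2 + (h[1] - 0.25) ** 2)

    print(best_grid_point(grid, toy_score))
```

Because the grid does not depend on the fixed effects, quantities computed at each grid point can be reused across every marker tested, which is what makes repeated evaluation cheap.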

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie# and L. Crawford (2019). Fast and general-purpose linear mixed models for genome-wide genetics. PLOS Genetics. 15(2): e1007978.

Contact:

Please contact Dan Runcie with any comments or questions.


HEBAE: Hierarchical Empirical Bayes Autoencoder

HEBAE is a software package that implements a computationally stable framework for probabilistic and Bayesian generative models. The contributions from HEBAE to the autoencoder literature are two-fold. First, HEBAE makes performance gains by placing a hierarchical prior over the encoding distribution, enabling us to adaptively balance the trade-off between minimizing the reconstruction loss function and avoiding over-regularization. Second, HEBAE assumes a general dependency structure between variables in the latent space which produces better convergence onto the mean-field assumption for improved posterior inference. Overall, HEBAE is more robust to a wide-range of hyperparameter initializations than an analogous (and more traditional) variational autoencoder or VAE. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

W. Cheng, G. Darnell, S. Ramachandran, and L. Crawford. Generalizing variational autoencoders with hierarchical empirical Bayes. arXiv. 2007.10389.

Contact:

Please contact Wei Cheng with any comments or questions.


MAPIT: MArginal ePIstasis Test

MAPIT is the software implementing the new strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, MAPIT focuses on mapping variants that have non-zero marginal epistatic effects --- the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, MAPIT can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. MAPIT is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation. The software is distributed under the GNU General Public License.
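The following sketch illustrates how the combined pairwise interactions between one variant and all others can be summarized by a single covariance matrix, which is what lets a variance component test avoid enumerating interaction partners. It is a minimal illustration and not the MAPIT code: the function names are hypothetical, and it assumes the interaction coefficients are independent with a common variance scaled by 1/(p - 1). Under that assumption, the covariance of the summed interaction terms for variant k is proportional to D K D, where K is the relatedness matrix built from all other variants and D is the diagonal matrix holding the genotypes of variant k.

```python
def relatedness_without(X, k):
    """K_k = X_{-k} X_{-k}^T / (p - 1), using every variant except column k."""
    n, p = len(X), len(X[0])
    others = [l for l in range(p) if l != k]
    return [[sum(X[i][l] * X[j][l] for l in others) / (p - 1)
             for j in range(n)] for i in range(n)]


def marginal_epistasis_kernel(X, k):
    """G_k = D_k K_k D_k with D_k = diag(x_k); entry (i, j) is x_ik * K_ij * x_jk."""
    K = relatedness_without(X, k)
    n = len(X)
    return [[X[i][k] * K[i][j] * X[j][k] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Toy genotype matrix: 2 individuals (rows) by 3 variants (columns).
    # Real analyses would typically standardize each column first.
    X = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 1.0]]
    print(marginal_epistasis_kernel(X, k=0))
```

Testing whether the variance component attached to this matrix is zero then amounts to testing whether variant k has any marginal epistatic effect.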

Download:

The software package for MAPIT is currently available on GitHub. The software package for MAPIT-R for enrichment of genomic regions and SNP-sets is currently available on CRAN.

Citations:

L. Crawford, P. Zeng, S. Mukherjee, and X. Zhou (2017). Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLOS Genetics. 13(7): e1006869.

M.C. Turchin, G. Darnell, L. Crawford, and S. Ramachandran. Pathway analysis within multiple human ancestries reveals novel signals for epistasis in complex traits. bioRxiv. 2020.09.24.312421.

Contact:

Please contact Lorin Crawford or Michael Turchin with any comments or questions.


MegaLMM: Mega-scale Linear Mixed Models for Multivariate Genomic Prediction

MegaLMM is a software package for fitting multi-trait linear mixed models (MvLMMs) with multiple random effects. Four aspects distinguish MegaLMM from other factor models. (1) Residuals of the phenotype after accounting for the factors are not assumed to be iid, but are modeled with independent (across traits) LMMs accounting for both fixed and random effects. (2) The factors themselves are also not assumed to be iid, but are modeled with the same LMMs. This highlights the parallel belief that these latent factors represent traits that we just didn't measure directly. (3) Each factor is shared by all modeled sources of variation (fixed effects, random effects, and residuals), rather than being unique to a particular source. (4) The factor loadings are strongly regularized to ensure that estimation is efficient. We accomplish this by ordering the factors from most-to-least important using a prior similar to that proposed by Bhattacharya and Dunson (2011). The software is distributed under the PolyForm Noncommercial License.

Download:

The software is currently available on GitHub.

Citations:

D.E. Runcie, J. Qu, H. Cheng, and L. Crawford (2021). Mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biology. 22: 213.

Contact:

Please contact Dan Runcie with any comments or questions.


MELD: Marginal Epistatic LD Score Regression

MELD is the software implementing marginal epistatic LD score regression. The inflation of test statistics in genome-wide association (GWA) studies due to confounding factors such as cryptic relatedness, population stratification, and spurious non-zero genetic effects driven by linkage disequilibrium (LD) has been well characterized in the literature. The key theoretical contribution of this work is that epistasis (i.e., the interaction between multiple loci and/or genes) can also lead to bias in GWA summary statistics. To address this challenge, we develop MELD: an extended framework which takes in GWA test statistics and accurately partitions true additive genetic variation from non-additive genetic variation, as well as other biases. MELD improves upon the estimation of narrow-sense heritability when genetic interactions are indeed present in the generative model for complex traits. More importantly, MELD has a calibrated type I error rate and does not overestimate non-additive genetic contribution to trait variation in simulated data when only additive effects are present. The software is distributed under the GNU General Public License.

Download:

The software package for MELD is currently available on GitHub.

Citations:

G. Darnell*, S.P. Smith*, D. Udwin, S. Ramachandran, and L. Crawford. Partitioning marginal epistasis distinguishes nonlinear effects from polygenicity and other biases in GWA summary statistics. bioRxiv. 2022.07.21.501001.

Contact:

Please contact Lorin Crawford with any comments or questions.


RATE: RelATive cEntrality Measures for Variable Prioritization

RATE is a software package that provides a novel approach for assessing input variable importance after having fit a nonlinear or nonparametric (Bayesian) model. By assessing entropy in the joint posterior distribution via Kullback-Leibler divergence (KLD), RATE can correctly prioritize candidate variables which are not just marginally important, but also those whose associations stem from a significant covarying relationship with other variables in the data. RATE is demonstrated in the context of statistical genetics, where the discovery of variants that are involved in nonlinear interactions is of particular interest. The software is distributed under the GNU General Public License.
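As a rough illustration of what "relative" centrality means, the sketch below takes one KLD value per variable and rescales the values so that they sum to one; a variable is then flagged when its share exceeds the uniform level 1/p that every variable would receive if all were equally important. The function names and the KLD values are hypothetical, and the step that computes each KLD from the fitted posterior is omitted here.

```python
def relative_centrality(kld_values):
    """Rescale nonnegative per-variable KLD values so that they sum to one."""
    total = sum(kld_values)
    return [k / total for k in kld_values]


def above_uniform(rates):
    """Indices whose share exceeds 1/p, the level expected under equal importance."""
    p = len(rates)
    return [j for j, r in enumerate(rates) if r > 1.0 / p]


if __name__ == "__main__":
    kld = [1.0, 3.0, 0.0, 4.0]  # made-up KLD values for four variables
    rates = relative_centrality(kld)
    print(rates)
    print(above_uniform(rates))
```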

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, S.R. Flaxman, D.E. Runcie, and M. West (2019). Variable prioritization in nonlinear black box methods: a genetic association case study. Annals of Applied Statistics. 13(2): 958-989.

J. Ish-Horowicz*, D. Udwin*, K. Scharfstein, S.R. Flaxman, L. Crawford, and S.L. Filippi. Interpreting deep neural networks through variable importance. arXiv. 1901.09839.

Contact:

Please contact Lorin Crawford with any comments or questions.


SECT: The Smooth Euler Characteristic Transform

This software package explores the use of a novel statistic, the smooth Euler characteristic transform (SECT), as an automated procedure to extract geometric or topological statistics from tumor images. More generally, the SECT is designed to integrate shape information into regression models by representing shapes and surfaces as a collection of curves. Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. We illustrate the utility of the SECT in a radiomics context by showing that the topological quantification of tumors, assayed by magnetic resonance imaging (MRI), is a better predictor of clinical outcomes in patients with glioblastoma multiforme (GBM). Using publicly available data from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), we show that SECT features alone explain more of the variance in patient survival than gene expression, volumetric features, and morphometric features. The software is distributed under the GNU General Public License.
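The way a shape becomes a curve can be sketched in a few lines. The code below is an illustrative simplification and not the SECT package: the function names, the mesh representation as lists of index tuples, and the Riemann-sum smoothing are assumptions. For one direction, it sweeps a height threshold across a mesh, records the Euler characteristic (vertices minus edges plus faces) of the part of the mesh at or below each threshold, and then smooths the resulting step function by subtracting its mean and integrating.

```python
def euler_curve(vertices, edges, faces, direction, thresholds):
    """Euler characteristic of the sublevel set at each height threshold."""
    # Height of each vertex is its projection onto the chosen direction.
    heights = [sum(c * d for c, d in zip(v, direction)) for v in vertices]
    curve = []
    for t in thresholds:
        n_v = sum(1 for h in heights if h <= t)
        # An edge or face enters once all of its vertices are at or below t.
        n_e = sum(1 for e in edges if max(heights[i] for i in e) <= t)
        n_f = sum(1 for f in faces if max(heights[i] for i in f) <= t)
        curve.append(n_v - n_e + n_f)
    return curve


def smooth_curve(curve, step):
    """Center the curve at its mean, then accumulate it (Riemann sum)."""
    mean = sum(curve) / len(curve)
    smoothed, running = [], 0.0
    for value in curve:
        running += (value - mean) * step
        smoothed.append(running)
    return smoothed


if __name__ == "__main__":
    # Hollow triangle in the plane: three vertices, three edges, no face.
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    edges = [(0, 1), (1, 2), (0, 2)]
    thresholds = [0.0, 0.5, 1.0]
    ec = euler_curve(vertices, edges, [], (1.0, 0.0), thresholds)
    print(ec)
    print(smooth_curve(ec, step=0.5))
```

Repeating this over many directions and concatenating the smoothed curves gives a vector representation of the shape that can be used as input to a regression model.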

Download:

The software is currently available on GitHub.

Citations:

L. Crawford, A. Monod, A.X. Chen, S. Mukherjee, and R. Rabadán (2020). Predicting clinical outcomes in glioblastoma: an application of topological and functional data analysis. Journal of the American Statistical Association. 115(531): 1139-1150.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.


SINATRA: Pipeline for Sub-Image Analysis and Feature Selection on 3D Shapes

The sub-image selection problem is to identify physical regions that most explain the variation between two classes of three-dimensional shapes. SINATRA is a software package that implements a statistical pipeline for carrying out sub-image analyses using topological summary statistics. The algorithm follows four key steps: (1) 3D shapes (represented as triangular meshes) are summarized by a collection of vectors (or curves) detailing their topology (e.g., Euler characteristics, persistence diagrams). (2) A statistical model is used to classify the shapes based on their topological summaries. Here, we make use of a Gaussian process classification model with a probit link function. (3) After fitting the model, an association measure is computed for each topological feature (e.g., centrality measures, posterior inclusion probabilities, p-values, etc). (4) Association measures are mapped back onto the original shapes via a reconstruction algorithm, thus highlighting evidence of the physical (spatial) locations that best explain the variation between the two groups. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

B. Wang*, T. Sudijono*, H. Kirveslahti*, T. Gao, D.M. Boyer, S. Mukherjee, and L. Crawford (2021). A statistical pipeline for identifying physical features that differentiate classes of 3D shapes. Annals of Applied Statistics. 15(2): 638-661.

Contact:

Please contact Bruce Wang or Timothy Sudijono with any comments or questions.


SINATRA Pro: Protein Conformation Analysis using Topological Summary Statistics

The sub-image selection problem is to identify physical regions that most explain the variation between two classes of three-dimensional shapes. SINATRA is a statistical pipeline for carrying out sub-image analyses using topological summary statistics (Wang et al. 2021, Ann Appl Stat). SINATRA Pro is an adaptation of the SINATRA framework for structure-based applications in protein dynamics. The general algorithm follows four key steps: (1) 3D shapes of protein structures (represented as triangular meshes) are summarized by a collection of vectors (or curves) detailing their topology (e.g., Euler characteristics, persistence diagrams, etc). (2) A statistical model is used to classify the shapes based on their topological summaries. Here, we make use of a Gaussian process classification model with a probit link function. (3) After fitting the model, an association measure is computed for each topological feature (e.g., centrality measures, posterior inclusion probabilities, p-values, etc). (4) Association measures are mapped back onto the original protein structures via a reconstruction algorithm, thus highlighting atomic or residue-level positions that best explain the variation between two ensembles. The software is distributed under the MIT License.

Download:

The software is currently available on GitHub.

Citations:

W.S. Tang*, G.M. da Silva*, H. Kirveslahti, E. Skeens, B. Feng, T. Sudijono, K.K. Yang, S. Mukherjee, B. Rubenstein, and L. Crawford. A topological data analytic approach for discovering biophysical signatures in protein dynamics. PLOS Computational Biology. In Press.

Contact:

Please contact Wai Shing Tang with any comments or questions.


Tropix: Tropical Sufficient Statistics for Persistent Homology

Tropix is a software package that uses an embedding in Euclidean space based on tropical geometry to generate stable sufficient statistics for barcodes --- multiscale summaries of topological characteristics that capture the “shape” of data, but have complex structures and are therefore difficult to use in statistical settings. This statistical sufficiency result allows for the assumption of classical probability distributions on Euclidean representations of barcodes. This in turn makes a variety of parametric inference methods amenable to barcodes, all while maintaining their initial interpretations. In particular, this work shows that exponential family distributions may be assumed, and that likelihoods for persistent homology may be constructed. In the citation below, we use Tropix to conceptually demonstrate sufficiency and illustrate its utility in persistent homology dimensions 0 and 1 with concrete parametric applications to HIV and avian influenza data. The software is distributed under the GNU General Public License.

Download:

The software is currently available on GitHub.

Citations:

A. Monod, S. Kališnik Verovšek, J.Á. Patiño-Galindo, and L. Crawford (2019). Tropical sufficient statistics for persistent homology. SIAM Journal on Applied Algebra and Geometry. 3(2): 337-371.

Contact:

Please contact Lorin Crawford or Anthea Monod with any comments or questions.