SOFTWARE DEVELOPED BY THE BFG
All of our work has led at various points to the generation of new algorithms and software. To ensure that our bioinformatics analyses are reproducible and available to fellow researchers, we have adopted an open-source philosophy and release our software through public repositories. Much of this work is described in publications as well as released as easily downloaded packages, which exist primarily in two forms: as a package on Bioconductor or as an installable, Bioconductor-compatible package on GitHub. The latter category exists largely because packages often take many months to years to make it through the Bioconductor vetting process, so in the meantime we make them available through GitHub for convenience.
What kind of software?
Most of our software and workflows are related to downstream analyses of next-generation sequencing (NGS) data. Data are processed in multiple stages. First, individual reads coming off the sequencer are matched to the human reference genome (alignment). Next, the aligned reads are mapped relative to genes or other features of interest. For some data (e.g. methylation) there are additional steps. Most of our software addresses the biological questions that are of interest after these steps have been completed. This includes integration with genetic data (funciSNP, funciVar), chromatin state identification, allelic imbalances, integration with expression data (HiCAGE, ELMER), and identification of potentially disruptive sequences (motifbreakR), to name a few.
GENAVi (Gene Expression Normalization Analysis and Visualization) is a GUI-based web application that provides a user-friendly platform to normalize gene expression data, cluster samples based on expression, perform differential expression analysis, and visualize results. It incorporates commonly used R packages that represent the “gold standard” for RNA-seq analysis. With GENAVi, users can run an entire analysis pipeline on the breast and ovarian cancer data provided within the application, or upload their own expression matrix.
ELMER (Enhancer Linking by Methylation Expression Relationships) is designed to combine DNA methylation and gene expression data from human tissues to infer multi-level cis-regulatory networks. It uses DNA methylation to identify enhancers, and correlates enhancer state with expression of nearby genes to identify one or more transcriptional targets. Transcription factor (TF) binding site analysis of enhancers is coupled with expression analysis of all TFs to infer upstream regulators. This package can be easily applied to publicly available TCGA cancer data sets and to custom DNA methylation and gene expression data.
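As a rough illustration of the idea of linking an enhancer to its likely target genes, the sketch below ranks nearby genes by how strongly their expression is anti-correlated with methylation at one enhancer probe across samples. It is a simplified stand-in that uses a plain Pearson correlation; it is not ELMER's actual statistical procedure, and the function names, gene names, and values are all hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_targets(probe_methylation, nearby_expression):
    """Rank nearby genes from most negative to most positive correlation
    with methylation at a single enhancer probe."""
    scores = {
        gene: pearson(probe_methylation, expression)
        for gene, expression in nearby_expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical toy data: one enhancer probe measured in five samples,
# and expression of two nearby genes in the same five samples.
probe_methylation = [0.9, 0.8, 0.2, 0.1, 0.5]
nearby_expression = {
    "GENE_A": [1.0, 2.0, 8.0, 9.0, 5.0],  # expression rises as methylation falls
    "GENE_B": [5.0, 5.5, 4.5, 5.2, 4.8],  # no clear relationship
}

ranked = rank_targets(probe_methylation, nearby_expression)
for gene, r in ranked:
    print(f"{gene}\t{r:.3f}")
```

In this toy example the gene at the top of the ranking is the one whose expression increases when the enhancer loses methylation, which is the pattern expected for a putative transcriptional target.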
We introduce motifbreakR, which allows the biologist to judge, first, whether the sequence surrounding a polymorphism is a good match to a motif and, second, how much information is gained or lost in one allele of the polymorphism relative to the other. MotifbreakR is more flexible and extensible than previous offerings, giving users a choice of three algorithms for interrogating genomes with motifs from public sources: 1) a weighted-sum probability matrix, 2) log-probabilities, and 3) weighting by relative entropy. MotifbreakR can predict effects for novel variants or for previously described variants in public databases, making it suitable for tasks beyond the scope of its original design. Lastly, it can be used to interrogate any genome curated within Bioconductor (currently there are 22).
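To make the log-probability idea concrete, here is a minimal sketch that scores the reference and alternate alleles of a variant against a small position probability matrix and reports the difference. It is not motifbreakR's implementation or API: the matrix, the uniform background, the sequences, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 4-position probability matrix (one dict per motif position).
PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency

def log_odds_score(window, ppm=PPM, background=BACKGROUND):
    """Sum of log2(probability / background) over the motif positions."""
    return sum(math.log2(col[base] / background) for col, base in zip(ppm, window))

def best_window_score(seq, snp_index, ppm=PPM):
    """Best score among all motif-width windows that overlap the variant."""
    width = len(ppm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    return max(log_odds_score(seq[s:s + width], ppm) for s in range(first, last + 1))

def allele_effect(ref_seq, snp_index, alt_base, ppm=PPM):
    """Return (reference score, alternate score, alternate minus reference)."""
    alt_seq = ref_seq[:snp_index] + alt_base + ref_seq[snp_index + 1:]
    ref_score = best_window_score(ref_seq, snp_index, ppm)
    alt_score = best_window_score(alt_seq, snp_index, ppm)
    return ref_score, alt_score, alt_score - ref_score

# Toy example: the reference carries a perfect match ("AGCT") and the
# variant replaces the G at index 3 with a T.
ref_score, alt_score, delta = allele_effect("TTAGCTTT", 3, "T")
print(f"ref={ref_score:.3f} alt={alt_score:.3f} delta={delta:.3f}")
```

A negative difference means the alternate allele matches the motif less well than the reference, which is the kind of signal that would flag a variant as potentially disruptive.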
StateHub-StatePaintR: Rules-Based Chromatin State Annotations. Genome annotation is critical to understanding the function of disease variants, especially for clinical applications. Public consortia provide segmentations that reflect varying unsupervised approaches to functional annotation based on epigenetic data, but there remains a need for transparent, reproducible, and easily interpreted genomic maps of the functional biology of chromatin. We introduce a new methodological framework for defining a combinatorial epigenomic model of chromatin state on a web database, StateHub.
HiCAGE: An R Package for Large-Scale Annotation and Visualization of 3C-Based Genomic Data. Chromatin interactions measured by the 3C-based family of next-generation technologies are becoming increasingly important for measuring the physical basis of regulatory interactions between different classes of functional domains in the genome. Software is needed to streamline analyses of these data and integrate them with custom genome annotations, RNA-seq, and gene ontologies. We introduce a new Bioconductor-compatible R package, Hi-C Annotation and Graphics Ensemble (HiCAGE), to perform these tasks with minimum effort. In addition, the package contains a Shiny web app interface to provide ready access to its functions.
BisSNP (A Bisulfite Space Genotyper & Methylation Caller) is a package based on the Genome Analysis Toolkit (GATK) map-reduce framework for genotyping and accurate DNA methylation calling in bisulfite-treated massively parallel sequencing data (Bisulfite-seq, NOMe-seq, RRBS, and any other bisulfite-treated sequencing) generated with the Illumina directional library protocol.
ECDP (Epigenome Center Data Portal). This scalable portal allows researchers to explore and download their datasets in a secure fashion. From the initial LIMS sample entry (currently using Genologics) through sequencing and downstream analysis on our supercomputing cluster, all characteristics of a sample are parsed and tracked, so that these metrics can be presented in a single integrated interface.