The home of bohra
Comprehensive sequence characterisation for microbial genomics
Introduction
bohra is a microbial genomics pipeline, designed predominantly for use in public health, although it may also be useful in research settings. It leverages existing high-quality bioinformatics tools to provide users with an easily accessible report of comprehensive analysis results for bacterial sequence data, whether for characterisation of single samples, outbreak investigations or population studies. Analyses include:
- Quality assessment of the input data
- Speciation and appropriate in silico serotyping (where applicable)
- MLST
- Species-relevant recovery of AMR mechanisms and inference of genomic AST/DST where available (S. enterica and M. tuberculosis)
- Plasmid information
- Comparative analysis using reference-free or reference-based approaches
- Pangenome analysis
The pipeline is designed to be flexible and modular, accepting paired-end fastq files or assemblies as input, with direct support for ONT coming soon.
Standalone HTML reports are generated for easy sharing and visualisation of the results.
Workflows
bohra is a flexible pipeline that allows users to customise the workflows used. Below is an overview of each workflow. More detail on tools and options for each workflow can be found here and here. Further explanations and detailed guides can be found here.
basic
This workflow will run on fastq and/or fasta (depending on the user-supplied input) and is the first step in all other workflows implemented by bohra. It can also be used alone as a simple quality control workflow.
flowchart LR
sequence --> sequence_assessment --> report
sequence --> speciation --> report
assembly
This workflow will simply generate assemblies from paired-end fastq, run basic genome annotation with prokka and assess the quality of both the input reads and the resulting assemblies. This workflow forms the basis for amr, typing and pangenome analysis.
flowchart LR
fastq --> assembly --> annotation --> sequence_assessment
assembly --> speciation
fastq --> sequence_assessment --> report
fastq --> speciation --> report
amr and typing
This workflow will use the user-supplied species, or the species detected in the sequence, to determine the appropriate typing and AMR pipeline to use. Additional inference of genomic DST/AST will be undertaken for S. enterica and M. tuberculosis.
If assembly is required and fastq files are used as input, the assembly workflow will be triggered.
Note that paired-end fastq files are required for AMR and gDST in M. tuberculosis. We recommend using the bohra run tb workflow for M. tuberculosis.
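The selection rules described for this workflow can be summarised in a short sketch. It only illustrates the logic stated in the text and is not bohra's implementation: the `Sample` class, the `plan_amr_typing` function, the input-type labels and the assumption that a user-supplied species takes precedence over the detected one are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Species for which the text describes additional genomic DST/AST inference.
AST_SPECIES = {"Salmonella enterica", "Mycobacterium tuberculosis"}


@dataclass
class Sample:
    """Hypothetical description of one input sample."""
    input_type: str                        # "fastq" (paired-end reads) or "fasta" (assembly)
    detected_species: Optional[str] = None
    user_species: Optional[str] = None


def plan_amr_typing(sample: Sample) -> dict:
    """Return the decisions the amr and typing workflow would need to make."""
    # Assumption: a user-supplied species overrides the detected one.
    species = sample.user_species or sample.detected_species
    is_tb = species == "Mycobacterium tuberculosis"
    return {
        "species": species,
        # Assumption: assembly is always required here, so reads trigger the
        # assembly workflow while supplied assemblies are used as they are.
        "trigger_assembly": sample.input_type == "fastq",
        "infer_genomic_ast": species in AST_SPECIES,
        # AMR and gDST for M. tuberculosis need paired-end fastq.
        "tb_input_ok": (not is_tb) or sample.input_type == "fastq",
    }


reads = Sample("fastq", detected_species="Escherichia coli",
               user_species="Salmonella enterica")
plan = plan_amr_typing(reads)
```

In this example `plan["species"]` is the user-supplied value, the assembly workflow is triggered because the input is fastq, and genomic AST inference is switched on for S. enterica.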
flowchart LR
fastq --> assembly --> annotation --> sequence_assessment
assembly --> speciation
fastq --> sequence_assessment --> report
fastq --> speciation --> report
speciation --> typing --> report
assembly --> typing
speciation --> AMR --> report
assembly --> AMR
comparative analysis
This workflow undertakes a comparative analysis of all the sequences included in the analysis. You can use reference-based alignments with snippy, or reference-free approaches with mash and ska2.
flowchart LR
sequence --> sequence_assessment --> report
sequence --> speciation --> report
sequence --> variant_detection --> distances --> cluster --> report
variant_detection --> alignment --> tree_generation --> report
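To make the distances and cluster steps in the flowchart concrete, here is a minimal sketch that counts pairwise SNP differences in a toy alignment and groups samples by single-linkage clustering at a threshold. It is a generic illustration, not the method used by bohra or its underlying tools: the function names, the toy sequences and the threshold of 1 SNP are all assumptions.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned sequences differ, skipping gaps and Ns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"-", "N"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in ignore and y not in ignore
    )


def single_linkage(alignment: dict, threshold: int) -> list:
    """Group samples so each is within `threshold` SNPs of another member."""
    parent = {name: name for name in alignment}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for s1, s2 in combinations(alignment, 2):
        if snp_distance(alignment[s1], alignment[s2]) <= threshold:
            parent[find(s1)] = find(s2)

    clusters = {}
    for name in alignment:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy alignment (hypothetical data).
alignment = {
    "sample_A": "ACGTACGTAC",
    "sample_B": "ACGTACGTAT",  # 1 SNP from A
    "sample_C": "ACGAACGTAT",  # 1 SNP from B, 2 SNPs from A
    "sample_D": "TTGTACCTNC",  # distant from the others
}
clusters = single_linkage(alignment, threshold=1)
```

With single linkage, sample_A and sample_C end up in the same cluster even though they are 2 SNPs apart, because each is within 1 SNP of sample_B. Other linkage rules would group them differently.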
full
The full workflow includes all the workflows outlined above, with the addition of pangenome analysis using panaroo.
flowchart LR
fastq --> assembly --> annotation --> sequence_assessment
assembly --> speciation
fastq --> sequence_assessment --> report
fastq --> speciation --> report
speciation --> typing --> report
assembly --> typing
speciation --> AMR --> report
assembly --> AMR
assembly --> pangenome --> report
assembly -- "only possible with reference free" --> variant_detection
fastq --> variant_detection --> distances --> cluster --> report
variant_detection --> alignment --> tree_generation --> report
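The flowchart for the full workflow can also be read as a dependency graph. The sketch below parses the same edge lines and derives one valid ordering of the steps with Python's standard-library `graphlib` (Python 3.9 or later). It only illustrates how the steps depend on each other and does not reflect how bohra schedules its processes; the helper names `parse_edges` and `execution_order` are hypothetical.

```python
import re
from graphlib import TopologicalSorter

FLOWCHART = """
fastq --> assembly --> annotation --> sequence_assessment
assembly --> speciation
fastq --> sequence_assessment --> report
fastq --> speciation --> report
speciation --> typing --> report
assembly --> typing
speciation --> AMR --> report
assembly --> AMR
assembly --> pangenome --> report
assembly -- "only possible with reference free" --> variant_detection
fastq --> variant_detection --> distances --> cluster --> report
variant_detection --> alignment --> tree_generation --> report
"""

# Matches a plain arrow (-->) or a labelled arrow (-- "text" -->).
ARROW = re.compile(r'\s*--(?:\s*"[^"]*"\s*--)?>\s*')


def parse_edges(text: str) -> list:
    """Turn chained lines such as 'a --> b --> c' into (upstream, downstream) pairs."""
    edges = []
    for line in text.strip().splitlines():
        nodes = [node.strip() for node in ARROW.split(line.strip())]
        edges.extend(zip(nodes, nodes[1:]))
    return edges


def execution_order(edges: list) -> list:
    """Return one ordering in which every step follows all of its inputs."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


edges = parse_edges(FLOWCHART)
order = execution_order(edges)
```

Because fastq is the only step without an input and report is the only step without an output, any valid ordering starts with fastq and ends with report. The relative order of independent steps such as typing and AMR is not fixed.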
tb
bohra now has an M. tuberculosis-specific workflow, which does not run MLST or other assembly-based tools but undertakes M. tuberculosis-relevant gDST. It uses the H37Rv reference genome with repetitive sites masked, and tbtAMR for generation of an inferred antibiogram.
flowchart LR
fastq --> sequence_assessment --> report
fastq --> speciation --> report
speciation --> lineage --> report
fastq --> AMR --> report
fastq --> variant_detection --> distances --> cluster --> report
variant_detection --> alignment --> tree_generation --> report
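As an illustration of what masking repetitive sites means in practice, the sketch below replaces bases inside a set of intervals with `N`, so that those positions can be ignored downstream. It is not taken from bohra: the function name, the toy sequence, the coordinates and the use of 0-based, half-open intervals are all assumptions, and the real masked regions for H37Rv are not reproduced here.

```python
def mask_regions(sequence: str, regions, mask_char: str = "N") -> str:
    """Replace every base inside the given 0-based, half-open intervals."""
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Overwrite the slice with the same number of mask characters.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy reference and repeat coordinates (hypothetical values).
toy_reference = "ACGTACGTACGTACGT"
toy_repeats = [(2, 5), (10, 12)]
masked_reference = mask_regions(toy_reference, toy_repeats)
```

The masked sequence keeps its original length, so coordinates of the remaining sites are unchanged.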
Etymology
The name 'bohra' is that of an extinct species of tree kangaroo that lived on the Nullarbor Plain in Australia. It was chosen to reflect the fact that the pipeline was originally developed to build trees, relies on snippy (named for a very famous kangaroo) and was inspired by nullarbor.