DeepTaxa Tutorials

Modified: April 7, 2026

DeepTaxa is a hybrid CNN-BERT model that classifies 16S rRNA gene sequences into a seven-rank taxonomic hierarchy (Domain through Species). These tutorials are intended for researchers in microbial ecology, clinical metagenomics, or machine learning who want to use the pre-trained model, train a custom model, or understand the design choices behind the architecture.
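The seven-rank hierarchy can be pictured as one label per rank for each sequence. The sketch below is illustrative only: it assumes lineage strings that use the rank prefixes d__ through s__ separated by semicolons, which this page does not confirm as DeepTaxa's own label format. `parse_lineage` is a hypothetical helper, not part of the DeepTaxa package, and the example lineage is hand-written.

```python
# The seven ranks named above, ordered from broadest to most specific.
RANKS = ("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
# Assumed rank prefixes (hypothetical for this sketch; check your own label files).
PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def parse_lineage(lineage):
    """Split a prefixed lineage string into a rank -> name mapping.

    Ranks that are missing or empty in the input map to None.
    """
    parsed = dict.fromkeys(RANKS)  # every rank starts as None
    for field in lineage.split(";"):
        field = field.strip()
        for rank, prefix in zip(RANKS, PREFIXES):
            if field.startswith(prefix):
                # An empty name such as "s__" is treated as unassigned.
                parsed[rank] = field[len(prefix):] or None
    return parsed


# Hand-written example with an unassigned species.
example = (
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Streptococcaceae; g__Streptococcus; s__"
)
labels = parse_lineage(example)
```

Keeping unassigned ranks as `None` rather than dropping them preserves the fixed seven-slot shape, which is convenient when each rank is evaluated separately.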

1 Tutorials

Each tutorial can be followed on its own, but later tutorials assume concepts introduced in earlier ones. New users should read them in this order: Prediction, Training, Analysis, Architecture.

Tutorial      Scope
Prediction    Classify 16S sequences with the pre-trained model, interpret per-rank accuracy, and examine confidence scores
Training      Train a model from scratch on the Greengenes 2 dataset, monitor learning curves, and evaluate the trained checkpoint
Analysis      Evaluate classification performance in depth: confusion patterns, sequence and embedding similarity, calibration, and novel taxa detection
Architecture  Understand the CNN-Transformer fusion, the focal loss formulation, and how to adapt the model to other marker genes

2 Prerequisites

  • Python 3.10 or later
  • A CUDA-capable GPU is strongly recommended for training; CPU training is possible but approximately 10x slower. Prediction and analysis can run on CPU, though at reduced throughput.

Install DeepTaxa and the packages required for plotting and evaluation:

pip install git+https://github.com/systems-genomics-lab/deeptaxa.git
pip install matplotlib scikit-learn
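After installation, the prerequisites listed above can be checked with a short script. This is a minimal sketch under one assumption: that GPU support is provided through PyTorch, which this page does not state. `check_environment` is a hypothetical helper name, not a DeepTaxa function. If PyTorch is not importable, the script simply reports that no CUDA device was detected.

```python
import sys


def check_environment(min_python=(3, 10)):
    """Report whether this interpreter meets the stated prerequisites."""
    report = {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        # Assumption: CUDA support is detected through PyTorch.
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False, prediction and analysis can still run on CPU, while training is expected to be much slower.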

3 Resources

References

Salah Khalel, R., Abdelaal, K., Ghonaim, L., Awe, O. I., & Moustafa, A. (2026). DeepTaxa: Deep learning framework for taxonomic classification.