DeepTaxa Tutorials
DeepTaxa is a hybrid CNN-BERT model that classifies 16S rRNA gene sequences into a seven-rank taxonomic hierarchy (Domain through Species). These tutorials are intended for researchers in microbial ecology, clinical metagenomics, or machine learning who want to use the pre-trained model, train a custom model, or understand the design choices behind the architecture.
1 Tutorials
The tutorials are independent but build on each other conceptually. For new users, the recommended reading order is Prediction, Training, Analysis, then Architecture.
| Tutorial | Scope |
|---|---|
| Prediction | Classify 16S sequences with the pre-trained model, interpret per-rank accuracy, and examine confidence scores |
| Training | Train a model from scratch on the Greengenes 2 dataset, monitor learning curves, and evaluate the trained checkpoint |
| Analysis | Evaluate classification performance in depth: confusion patterns, sequence and embedding similarity, calibration, and novel taxa detection |
| Architecture | Understand the CNN-Transformer fusion, the focal loss formulation, and how to adapt the model to other marker genes |
2 Prerequisites
- Python 3.10 or later
- A CUDA-capable GPU is strongly recommended for training; CPU training is possible but approximately 10x slower. Prediction and analysis can run on CPU, though at reduced throughput.
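The prerequisites above can be checked with a short script. This is a minimal sketch, not part of DeepTaxa: `check_environment` is a hypothetical helper, and it assumes that DeepTaxa runs on PyTorch, which this page does not state. If PyTorch is not installed yet, the script reports that and skips the GPU check.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether this interpreter meets the prerequisites listed above.

    Hypothetical helper for illustration; not part of the DeepTaxa package.
    """
    report = {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # Assumption: the deep learning backend is PyTorch.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            # A broken install is treated the same as a missing one.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False`, prediction and analysis can still run on CPU, but training will be considerably slower, as noted above.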
Install DeepTaxa and the packages required for plotting and evaluation:
pip install git+https://github.com/systems-genomics-lab/deeptaxa.git
pip install matplotlib scikit-learn
3 Resources
- Source code
- Pre-trained model
- Training data (Greengenes 2)
- DeepTaxa manuscript (Salah Khalel et al., 2026)
References
Salah Khalel, R., Abdelaal, K., Ghonaim, L., Awe, O. I., & Moustafa, A. (2026). DeepTaxa: Deep learning framework for taxonomic classification.