Awesome Bioinformatics Algorithms

Citation Format Specification

All references in this knowledge base's algorithm entries follow the IEEE citation format. Each reference records the authors, title, publication venue, volume, issue, page range, year, and DOI.

Format Example

[1] N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms. Cambridge, MA: MIT Press, 2004.

[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997. DOI:10.1093/nar/25.17.3389.
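As a rough illustration of how the recorded elements map onto the journal-article pattern of example [2], the sketch below assembles a citation string from a structured record. The `Article` dataclass, its field names, and the functions `format_authors` and `format_ieee_article` are hypothetical names chosen for this example, not the knowledge base's actual tooling. The cutoff of three listed authors before "et al." is an assumption that mirrors example [2].

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record holding the elements a journal reference records."""
    authors: list[str]   # names already abbreviated, e.g. "S. F. Altschul"
    title: str
    venue: str           # abbreviated journal name
    volume: int
    issue: int
    pages: str           # "195–197" for a range, "R46" for a single locator
    year: int
    doi: str


def format_authors(authors: list[str], max_listed: int = 3) -> str:
    """Join author names; truncate with 'et al.' beyond max_listed (assumed cutoff)."""
    if len(authors) > max_listed:
        return ", ".join(authors[:max_listed]) + ", et al."
    if len(authors) <= 2:
        return " and ".join(authors)
    return ", ".join(authors[:-1]) + ", and " + authors[-1]


def format_ieee_article(index: int, a: Article) -> str:
    """Render one journal article in the style of example [2]."""
    # A dash in the page field means a range ("pp."); otherwise a single page ("p.").
    page_label = "pp." if ("–" in a.pages or "-" in a.pages) else "p."
    return (
        f'[{index}] {format_authors(a.authors)}, "{a.title}," {a.venue}, '
        f"vol. {a.volume}, no. {a.issue}, {page_label} {a.pages}, {a.year}. "
        f"DOI:{a.doi}."
    )


smith_waterman = Article(
    authors=["T. F. Smith", "M. S. Waterman"],
    title="Identification of common molecular subsequences",
    venue="J. Mol. Biol.",
    volume=147,
    issue=1,
    pages="195–197",
    year=1981,
    doi="10.1016/0022-2836(81)90087-5",
)
print(format_ieee_article(2, smith_waterman))
```

Books such as example [1] follow a different pattern (publisher and city instead of volume and pages) and would need a separate formatter.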

Classic Papers by Domain

Sequence Alignment

[1] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol., vol. 48, no. 3, pp. 443–453, 1970. DOI:10.1016/0022-2836(70)90057-4.
[2] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," J. Mol. Biol., vol. 147, no. 1, pp. 195–197, 1981. DOI:10.1016/0022-2836(81)90087-5.
[3] S. F. Altschul, W. Gish, W. Miller, et al., "Basic local alignment search tool," J. Mol. Biol., vol. 215, no. 3, pp. 403–410, 1990. DOI:10.1016/S0022-2836(05)80360-2.
[4] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 25, no. 14, pp. 1754–1760, 2009. DOI:10.1093/bioinformatics/btp324.
[5] H. Li, "Minimap2: pairwise alignment for nucleotide sequences," Bioinformatics, vol. 34, no. 18, pp. 3094–3100, 2018. DOI:10.1093/bioinformatics/bty191.

Sequence Assembly

[1] P. A. Pevzner, H. Tang, and M. S. Waterman, "An Eulerian path approach to DNA fragment assembly," Proc. Natl. Acad. Sci. USA, vol. 98, no. 17, pp. 9748–9753, 2001. DOI:10.1073/pnas.171285098.
[2] D. R. Zerbino and E. Birney, "Velvet: algorithms for de novo short read assembly using de Bruijn graphs," Genome Res., vol. 18, no. 5, pp. 821–829, 2008. DOI:10.1101/gr.074492.107.
[3] A. Bankevich, S. Nurk, D. Antipov, et al., "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing," J. Comput. Biol., vol. 19, no. 5, pp. 455–477, 2012. DOI:10.1089/cmb.2012.0021.
[4] S. Koren, B. P. Walenz, K. Berlin, et al., "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation," Genome Res., vol. 27, no. 5, pp. 722–736, 2017. DOI:10.1101/gr.215087.116.
[5] M. Kolmogorov, J. Yuan, Y. Lin, and P. A. Pevzner, "Assembly of long, error-prone reads using repeat graphs," Nat. Biotechnol., vol. 37, no. 5, pp. 540–546, 2019. DOI:10.1038/s41587-019-0072-8.

Variant Calling

[1] A. McKenna, M. Hanna, E. Banks, et al., "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Res., vol. 20, no. 9, pp. 1297–1303, 2010. DOI:10.1101/gr.107524.110.
[2] M. A. DePristo, E. Banks, R. Poplin, et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nat. Genet., vol. 43, no. 5, pp. 491–498, 2011. DOI:10.1038/ng.806.
[3] R. Poplin, P. C. Chang, D. Alexander, et al., "A universal SNP and small-indel variant caller using deep neural networks," Nat. Biotechnol., vol. 36, no. 10, pp. 983–987, 2018. DOI:10.1038/nbt.4235.
[4] S. Kim, K. Scheffler, A. L. Halpern, et al., "Strelka2: fast and accurate calling of germline and somatic variants," Nat. Methods, vol. 15, no. 8, pp. 591–594, 2018. DOI:10.1038/s41592-018-0051-x.
[5] K. Cibulskis, M. S. Lawrence, S. L. Carter, et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nat. Biotechnol., vol. 31, no. 3, pp. 213–219, 2013. DOI:10.1038/nbt.2514.

Protein Structure Prediction

[1] J. Jumper, R. Evans, A. Pritzel, et al., "Highly accurate protein structure prediction with AlphaFold," Nature, vol. 596, no. 7873, pp. 583–589, 2021. DOI:10.1038/s41586-021-03819-2.
[2] M. Baek, F. DiMaio, I. Anishchenko, et al., "Accurate prediction of protein structures and interactions using a three-track neural network," Science, vol. 373, no. 6557, pp. 871–876, 2021. DOI:10.1126/science.abj8754.
[3] Z. Lin, H. Akin, R. Rao, et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model," Science, vol. 379, no. 6637, pp. 1123–1130, 2023. DOI:10.1126/science.ade2574.
[4] R. Wu, F. Ding, R. Wang, et al., "High-resolution de novo structure prediction from primary sequence," bioRxiv, 2022. DOI:10.1101/2022.07.21.500999.
[5] A. W. Senior, R. Evans, J. Jumper, et al., "Improved protein structure prediction using potentials from deep learning," Nature, vol. 577, no. 7792, pp. 706–710, 2020. DOI:10.1038/s41586-019-1923-7.

Single-Cell Analysis

[1] R. Satija, J. A. Farrell, D. Gennert, et al., "Spatial reconstruction of single-cell gene expression data," Nat. Biotechnol., vol. 33, no. 5, pp. 495–502, 2015. DOI:10.1038/nbt.3192.
[2] F. A. Wolf, P. Angerer, and F. J. Theis, "SCANPY: large-scale single-cell gene expression data analysis," Genome Biol., vol. 19, no. 1, p. 15, 2018. DOI:10.1186/s13059-017-1382-0.
[3] C. Trapnell, D. Cacchiarelli, J. Grimsby, et al., "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells," Nat. Biotechnol., vol. 32, no. 4, pp. 381–386, 2014. DOI:10.1038/nbt.2859.
[4] R. Lopez, J. Regier, M. B. Cole, et al., "Deep generative modeling for single-cell transcriptomics," Nat. Methods, vol. 15, no. 12, pp. 1053–1058, 2018. DOI:10.1038/s41592-018-0229-2.
[5] G. X. Y. Zheng, J. M. Terry, P. Belgrader, et al., "Massively parallel digital transcriptional profiling of single cells," Nat. Commun., vol. 8, p. 14049, 2017. DOI:10.1038/ncomms14049.

Metagenomics

[1] D. E. Wood and S. L. Salzberg, "Kraken: ultrafast metagenomic sequence classification using exact alignments," Genome Biol., vol. 15, no. 3, p. R46, 2014. DOI:10.1186/gb-2014-15-3-r46.
[2] J. Qin, R. Li, J. Raes, et al., "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, vol. 464, no. 7285, pp. 59–65, 2010. DOI:10.1038/nature08821.
[3] D. T. Truong, E. A. Franzosa, T. L. Tickle, et al., "MetaPhlAn2 for enhanced metagenomic taxonomic profiling," Nat. Methods, vol. 12, no. 10, pp. 902–903, 2015. DOI:10.1038/nmeth.3589.
[4] S. Abubucker, N. Segata, J. Goll, et al., "Metabolic reconstruction for metagenomic data and its application to the human microbiome," PLoS Comput. Biol., vol. 8, no. 6, p. e1002358, 2012. DOI:10.1371/journal.pcbi.1002358.
[5] J. Sung, L. Zheng, V. Duvvuri, et al., "Metabolic modeling with objective quantification of the human gut microbiome in inflammatory bowel disease," Nat. Microbiol., vol. 7, no. 7, pp. 1126–1136, 2022. DOI:10.1038/s41564-022-01147-6.

Must-Read Reviews

The following papers are "map-level" literature for each domain and are recommended as first reading when entering a subfield. Most are reviews; the BLAST and AlphaFold entries are foundational primary papers:

Sequence alignment and search: S. F. Altschul et al., "Basic local alignment search tool," J. Mol. Biol., 1990. (Foundational BLAST work; essential reading for understanding heuristic search.)
Protein structure prediction: J. Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021. (AlphaFold, a watershed moment in structural biology.)
Single-cell technology: E. Papalexi and R. Satija, "High-dimensional genomic data analysis: methods and challenges," Nat. Methods, 2022. (Methodological review of single-cell high-dimensional data analysis.)
Metagenomics: C. Quince et al., "Shotgun metagenomics, from sampling to analysis," Nat. Biotechnol., 2017. (Complete methodology from wet lab to dry lab.)
Graph genomics: B. Paten et al., "Genome graphs and the evolution of genome inference," Genome Res., 2017. (Systematic review of graph genomics.)

Related Open Source Ecosystem Analysis

The following table compares this knowledge base with similar open-source projects in terms of product positioning, functional scope, and engineering practices:

Project Name	Core Function	Stars	Primary Language	License	Difference from This Project
Awesome-Bioinformatics	Algorithm and tool list	2.8k+	Markdown	CC0	Pure list, no structured metadata or generation pipeline
bioinformatics-workflows	Analysis workflow templates	N/A	Snakemake / Nextflow	Mixed	Focuses on workflows rather than algorithm ontology
biostars-handbook	Tutorials and guides	N/A	—	Commercial	Operational manual for beginners, not an architecture-grade knowledge base
OBF / BioPython	Tool library and community	N/A	Python	MIT/BSD	Provides algorithm implementations, not algorithm metadata indexing
This Project	Structured algorithm knowledge base + whitepaper	—	Python	MIT	Emphasizes a data-driven design, a generation pipeline, quality verification, and bilingual support

Engineering Insights

Building and maintaining this knowledge base yielded three engineering principles that we believe apply broadly to large-scale technical knowledge systems:

1. Single Source of Truth

Once a knowledge base grows past roughly 100 entries, handwritten documents scattered across multiple locations inevitably drift out of sync. Centralizing the data as structured YAML and generating every presentation layer from that one source is, in our experience, the only sustainable way to keep them consistent.
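A minimal sketch of this principle follows, under stated assumptions. The entry schema (`id`, `name`, `domain`, `doi`) and the functions `render_markdown` and `render_index` are hypothetical, and a Python list stands in for the parsed YAML so that the example needs only the standard library. The point is that both views read the same records, so neither can disagree with the other.

```python
import json

# Stand-in for parsed YAML data; the schema (id, name, domain, doi) is an assumption.
ENTRIES = [
    {"id": "needleman-wunsch", "name": "Needleman-Wunsch",
     "domain": "Sequence Alignment", "doi": "10.1016/0022-2836(70)90057-4"},
    {"id": "smith-waterman", "name": "Smith-Waterman",
     "domain": "Sequence Alignment", "doi": "10.1016/0022-2836(81)90087-5"},
    {"id": "velvet", "name": "Velvet",
     "domain": "Sequence Assembly", "doi": "10.1101/gr.074492.107"},
]


def render_markdown(entries):
    """Human-facing view: one heading per domain, one bullet per entry."""
    lines = []
    current_domain = None
    for e in sorted(entries, key=lambda e: (e["domain"], e["name"])):
        if e["domain"] != current_domain:
            current_domain = e["domain"]
            lines.append(f"## {current_domain}")
        lines.append(f"- {e['name']} (DOI:{e['doi']})")
    return "\n".join(lines)


def render_index(entries):
    """Machine-facing view: id -> DOI lookup serialized as JSON."""
    return json.dumps({e["id"]: e["doi"] for e in entries}, indent=2, sort_keys=True)


print(render_markdown(ENTRIES))
print(render_index(ENTRIES))
```

Correcting a DOI in `ENTRIES` changes both outputs on the next run, which is the property that hand-maintained copies cannot guarantee.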

2. Generation-Driven Documentation

Hand-editing Markdown becomes sharply less efficient beyond about 50 entries, and format drift becomes unavoidable. Generating documents with code directs human effort toward the data content rather than layout and formatting, which in our experience can cut maintenance costs by roughly an order of magnitude.

3. Validation Before Deployment

In CI/CD, any data change that fails validation must block the build. The order "validate, then generate, then deploy" must not be reversed; otherwise dead links, formatting errors, and data inconsistencies will reach the production environment.
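The ordering can be sketched as a small gate function, again under assumptions: the required fields, the loose DOI pattern, and the names `validate` and `build` are hypothetical, and the `generate` and `deploy` callables are placeholders for whatever the real pipeline does. The DOI check here is purely syntactic; detecting dead links would need a network request, which this sketch leaves out.

```python
import re
import sys

REQUIRED_FIELDS = ("id", "name", "domain", "doi")  # assumed schema
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")     # loose syntactic check only


def validate(entries):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    seen_ids = set()
    for pos, entry in enumerate(entries):
        label = entry.get("id") or f"entry #{pos}"
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{label}: missing field '{field}'")
        doi = entry.get("doi")
        if doi and not DOI_PATTERN.match(doi):
            problems.append(f"{label}: malformed DOI '{doi}'")
        entry_id = entry.get("id")
        if entry_id:
            if entry_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(entry_id)
    return problems


def build(entries, generate, deploy):
    """Validate, then generate, then deploy; nothing is generated if validation fails."""
    problems = validate(entries)
    if problems:
        for p in problems:
            print(f"ERROR {p}", file=sys.stderr)
        return 1  # a non-zero status is what blocks the CI job
    deploy(generate(entries))
    return 0


good = [{"id": "velvet", "name": "Velvet",
         "domain": "Sequence Assembly", "doi": "10.1101/gr.074492.107"}]
bad = [{"id": "velvet", "name": "Velvet", "domain": "", "doi": "gr.074492.107"}]

deployed = []
status_ok = build(good, generate=lambda es: f"{len(es)} entries", deploy=deployed.append)
status_bad = build(bad, generate=lambda es: f"{len(es)} entries", deploy=deployed.append)
print(status_ok, status_bad, deployed)
```

In a real CI job the script would end with something like `sys.exit(build(...))`, so that the failing status stops the later stages.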
