Evolution Notes
Overview
Since its inception in early 2024, this knowledge base has gone through three distinct phases, each with its own core objectives, key actions, and deliverables. Understanding these historical decisions helps anticipate future technical debt and directions for expansion.
Phase 1: List-Oriented Curation (2024 Q1–Q2)
Goal
Solve the "breadth of coverage" problem by establishing a multi-category algorithm directory covering major bioinformatics subfields. Core KPIs: >100 algorithm entries, >=10 top-level categories.
Key Actions
- Designed the initial YAML schema (v1), containing six required fields: id, name, description, purpose, time_complexity, category
- Established a hierarchy of 16 top-level categories and 30+ subcategories based on categories.yaml
- Manually curated the first 100+ algorithm entries, focusing on classic algorithms (Smith-Waterman, Needleman-Wunsch, BLAST, etc.)
- Built a minimum viable VitePress site supporting category browsing and algorithm detail pages
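The six required fields of schema v1 lend themselves to a simple presence check. The following is a minimal sketch, assuming entries have already been parsed from YAML into plain dictionaries; the function name `missing_required_fields` and the sample entry are hypothetical and are not taken from the project's code.

```python
from typing import Any

# The six required fields listed for schema v1.
REQUIRED_FIELDS_V1 = (
    "id",
    "name",
    "description",
    "purpose",
    "time_complexity",
    "category",
)


def missing_required_fields(entry: dict[str, Any]) -> list[str]:
    """Return the v1 required fields that are absent or empty in an entry."""
    return [
        field
        for field in REQUIRED_FIELDS_V1
        if field not in entry or entry[field] in (None, "")
    ]


# Hypothetical entry, shaped as it might look after parsing a YAML file.
entry = {
    "id": "smith-waterman",
    "name": "Smith-Waterman",
    "description": "Local sequence alignment by dynamic programming.",
    "purpose": "Find the best-scoring local alignment between two sequences.",
    "time_complexity": "O(mn)",
}

print(missing_required_fields(entry))  # prints ['category']
```

Returning the list of missing fields, rather than a boolean, lets a caller report every problem in an entry at once.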
Deliverables
- 195+ algorithm entries (exceeded target)
- A hierarchy of 16 top-level categories and 30+ subcategories
- Basic VitePress site (Chinese + English mirror)
Phase 2: Engineering Governance (2024 Q2–Q4)
Goal
Solve the "consistency and maintainability" problem by upgrading from "human-maintained Markdown" to a "data-driven generation system." Core KPIs: zero false positives in data validation, generator test coverage >85%, fully automated CI/CD.
Key Actions
- Introduced validate.py field rules and JSON Schema as dual validation mechanisms
- Refactored generate_docs.py to programmatically generate algorithm pages, category pages, and index pages instead of handwritten Markdown
- Established a CLI command suite (validate, stats, search, info, compare, export, vitepress)
- Integrated a code quality toolchain (ruff, mypy, pytest); raised test coverage to 89%
- Configured a GitHub Actions workflow that fully automates the push → validate → generate → build → deploy pipeline
- Extended YAML schema to v3, adding space_complexity, year, tags, difficulty, language, references, and other fields
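A command suite like the one listed above is commonly built on argparse subparsers. The sketch below is an assumption about structure, not the project's actual CLI: the program name `kb` and the help strings are invented for illustration, and only the seven subcommand names come from the list above.

```python
import argparse

# Subcommand names come from the list above; the help strings are guesses.
SUBCOMMANDS = {
    "validate": "check entries against the field rules and the schema",
    "stats": "print summary statistics",
    "search": "search entries by keyword",
    "info": "show a single entry",
    "compare": "compare several entries",
    "export": "export the data set",
    "vitepress": "generate the VitePress documentation",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subparser per command."""
    parser = argparse.ArgumentParser(
        prog="kb",  # hypothetical program name
        description="Knowledge base toolchain (illustrative sketch)",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in SUBCOMMANDS.items():
        subparsers.add_parser(name, help=help_text)
    return parser


args = build_parser().parse_args(["validate"])
print(args.command)  # prints validate
```

Keeping the command names in one mapping means the parser and any generated help or documentation stay in sync.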
Deliverables
- Data-driven VitePress documentation generator
- Complete toolchain with 8 CLI subcommands
- Python test suite with 89% coverage
- Fully automated CI/CD release pipeline
- Algorithm template file (templates/algorithm_template.yaml)
Phase 3: Whitepaper Positioning (2025 Q1–Present)
Goal
Solve the "professional persuasiveness" problem by elevating the knowledge base from an "algorithm list" to a "technical whitepaper and architecture academy." Core KPIs: average whitepaper page length >200 lines, academic citation coverage >85%, full Mermaid architecture diagram coverage.
Key Actions
- Rewrote all whitepaper generator functions (_generate_*) to output in-depth academic content (project overview, learning path, system architecture, data pipeline, quality assurance, references, evolution notes, CLI workflow)
- Unified academic citation formats: GB/T 7714 for Chinese, IEEE for English
- Introduced Mermaid architecture diagrams (data flow, CI/CD, learning path, system architecture) to enhance visual expressiveness
- Optimized homepage (Hero, Features, statistics dashboard, whitepaper entry points, research directions, latest additions)
- Enhanced algorithm pages: independent complexity analysis section, more professional link and tag presentation
- Established an OpenSpec specification-driven development (SDD) process; openspec/specs/ serves as the single source of requirements
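Because the whitepaper pages are generated programmatically, the Mermaid diagrams mentioned above can be emitted as plain text by the generator. The following is a minimal sketch of that idea for a linear data flow; the function name `mermaid_flowchart` is hypothetical, and the project's actual generator functions may build their diagrams differently.

```python
def mermaid_flowchart(stages: list[str], direction: str = "LR") -> str:
    """Render a linear chain of stages as Mermaid flowchart source."""
    lines = [f"flowchart {direction}"]
    # Pair each stage with its successor to produce one edge per line.
    for source, target in zip(stages, stages[1:]):
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


diagram = mermaid_flowchart(["load", "validate", "generate", "build", "deploy"])
print(diagram)
```

The returned string can be wrapped in a `mermaid` fenced block inside a generated Markdown page.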
Deliverables
- 14 in-depth whitepaper documents (28 pages in Chinese + English)
- Unified academic citation system (GB/T 7714 / IEEE)
- Architecture Decision Records (ADR)
- OpenSpec specification directory and proposal workflow
Technical Debt Register
| Debt Item | Impact Level | Description | Mitigation Plan |
|---|---|---|---|
| Insufficient bilingual coverage | Medium | Only ~60% of entries provide English descriptions | Fill gaps incrementally through community contributions and automated translation APIs |
| Low optional field completeness | Medium | Coverage of space_complexity, related_tools, and references is below 70% | Add non-blocking warnings to validate to guide contributors |
| Generator not templatized | Low | Currently uses Python f-string concatenation for Markdown; maintenance becomes difficult as complexity grows | Evaluate introducing Jinja2 template engine |
| No runtime API | Low | All queries must be completed at generation time; cannot support dynamic retrieval | Long-term plan for REST API layer |
| External links not continuously monitored | Low | paper_url / implementation_url may become invalid | Run link_checker.py more frequently in CI |
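The coverage figures in the register, and the non-blocking warnings proposed as mitigation, can be computed with a short helper. This is a sketch under assumptions: the function names, the warning format, and the sample entries are hypothetical, and the 70% threshold is borrowed from the table.

```python
from typing import Any


def field_coverage(entries: list[dict[str, Any]], field: str) -> float:
    """Fraction of entries in which `field` is present and non-empty."""
    if not entries:
        return 0.0
    filled = sum(
        1 for entry in entries if entry.get(field) not in (None, "", [], {})
    )
    return filled / len(entries)


def coverage_warnings(
    entries: list[dict[str, Any]], fields: list[str], threshold: float = 0.70
) -> list[str]:
    """Return warning strings for low-coverage fields; never raises."""
    warnings = []
    for field in fields:
        ratio = field_coverage(entries, field)
        if ratio < threshold:
            warnings.append(
                f"warning: {field} coverage {ratio:.0%} is below {threshold:.0%}"
            )
    return warnings


# Hypothetical sample data.
entries = [
    {"id": "a", "space_complexity": "O(n)", "references": ["ref-1"]},
    {"id": "b", "space_complexity": "", "references": ["ref-2"]},
    {"id": "c", "references": ["ref-3"]},
    {"id": "d", "space_complexity": "O(1)", "references": []},
]

for line in coverage_warnings(entries, ["space_complexity", "references"]):
    print(line)  # prints: warning: space_complexity coverage 50% is below 70%
```

Returning warnings as data instead of raising keeps the check non-blocking: the caller can print them and still exit successfully.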
Future Roadmap
Short-term (1–3 months)
| Task | Priority | Acceptance Criteria |
|---|---|---|
| Raise bilingual coverage to 75% | P0 | stats shows description_en coverage >=75% |
| Enhance algorithm page visualization | P1 | Add extended complexity analysis descriptions to the top 20 algorithm pages |
| Optimize VitePress search | P1 | Support local search filtered by complexity, year, and difficulty |
| Dead link auto-fix suggestions | P2 | When link_checker fails in CI, its output includes suggested replacement links |
Medium-term (3–6 months)
| Task | Priority | Acceptance Criteria |
|---|---|---|
| Introduce algorithm benchmark data fields | P0 | YAML schema v4 supports accuracy, runtime, and memory fields |
| Plugin system MVP | P1 | Support third-party data enrichment plugin registration and execution |
| Category page visualization enhancement | P1 | Category pages include algorithm distribution bar charts and trend line charts by era |
| Interactive complexity comparison tool | P2 | Support selecting multiple algorithms to generate complexity comparison tables |
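The planned complexity comparison could start as a plain Markdown table built from selected entries. The following sketch assumes entries are dictionaries carrying the schema's name, time_complexity, and space_complexity fields; the function name `complexity_table` and the `n/a` placeholder are hypothetical choices.

```python
def complexity_table(algorithms: list[dict[str, str]]) -> str:
    """Build a Markdown table comparing the complexity of several entries."""
    header = "| Algorithm | Time complexity | Space complexity |"
    divider = "|---|---|---|"
    rows = [
        "| {name} | {time} | {space} |".format(
            name=algorithm["name"],
            time=algorithm.get("time_complexity", "n/a"),
            space=algorithm.get("space_complexity", "n/a"),
        )
        for algorithm in algorithms
    ]
    return "\n".join([header, divider, *rows])


# Sample selection; the second entry omits the optional space_complexity field.
selected = [
    {"name": "Smith-Waterman", "time_complexity": "O(mn)", "space_complexity": "O(mn)"},
    {"name": "Needleman-Wunsch", "time_complexity": "O(mn)"},
]

print(complexity_table(selected))
```

Missing optional fields render as `n/a` instead of failing, which matters while optional field coverage remains incomplete.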
Long-term (6–12 months)
| Task | Priority | Acceptance Criteria |
|---|---|---|
| REST API read-only service | P1 | Provide /api/v1/algorithms endpoints with latency <200 ms |
| Multimodal content support | P2 | Support embedding algorithm flowcharts, pseudocode, and video tutorials |
| Community contribution platform | P2 | Algorithm proposal and review workflow based on GitHub Issues |
| Knowledge graph construction | P3 | Build interactive knowledge graphs based on category/tag/citation |
Design Pattern Records
The following three design patterns have repeatedly proven effective in the engineering of this knowledge base:
Repository Pattern
DataStore serves as the unified repository for algorithm and category data, encapsulating all data loading, index building, and query logic. If the underlying storage later migrates from YAML files to SQLite, PostgreSQL, or a graph database, business-layer code should require no modifications, because it depends only on the DataStore interface.
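The storage independence described here can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `AlgorithmBackend` protocol, the `InMemoryBackend` class, and the `get` and `by_category` methods are hypothetical and do not reflect the actual DataStore API.

```python
from typing import Any, Optional, Protocol


class AlgorithmBackend(Protocol):
    """Anything that can supply raw algorithm records (hypothetical interface)."""

    def load_algorithms(self) -> list[dict[str, Any]]: ...


class InMemoryBackend:
    """Stand-in for a YAML, SQLite, or PostgreSQL backend."""

    def __init__(self, records: list[dict[str, Any]]) -> None:
        self._records = list(records)

    def load_algorithms(self) -> list[dict[str, Any]]:
        return list(self._records)


class DataStore:
    """Illustrative stand-in; not the project's actual DataStore API."""

    def __init__(self, backend: AlgorithmBackend) -> None:
        # Build the id index once; callers never touch the backend directly.
        self._by_id = {r["id"]: r for r in backend.load_algorithms()}

    def get(self, algorithm_id: str) -> Optional[dict[str, Any]]:
        return self._by_id.get(algorithm_id)

    def by_category(self, category: str) -> list[dict[str, Any]]:
        return [r for r in self._by_id.values() if r.get("category") == category]


store = DataStore(
    InMemoryBackend(
        [
            {"id": "blast", "name": "BLAST", "category": "alignment"},
            {"id": "smith-waterman", "name": "Smith-Waterman", "category": "alignment"},
        ]
    )
)
print(store.get("blast")["name"])  # prints BLAST
print(len(store.by_category("alignment")))  # prints 2
```

Swapping the storage layer then means writing another class with a `load_algorithms` method; code that calls `get` or `by_category` stays unchanged.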
Template Method Pattern
The Chinese and English generators in generate_docs.py share the same traversal skeleton (traverse all algorithms to generate detail pages, traverse all categories to generate category pages), but defer language-specific content filling to subclass/function implementations. This pattern significantly reduces the marginal cost of adding new language versions (e.g., Japanese, German).
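A minimal sketch of this shared skeleton follows, written with an abstract base class. The class names, method names, and output paths are hypothetical; generate_docs.py may implement the same idea with plain functions rather than subclasses.

```python
from abc import ABC, abstractmethod
from typing import Any

Record = dict[str, Any]


class DocsGenerator(ABC):
    """Fixed traversal skeleton; subclasses fill in language-specific text."""

    lang: str

    def generate(self, algorithms: list[Record], categories: list[Record]) -> dict[str, str]:
        """Template method: the traversal order is the same for every language."""
        pages: dict[str, str] = {}
        for algorithm in algorithms:
            path = f"{self.lang}/algorithms/{algorithm['id']}.md"
            pages[path] = self.render_algorithm(algorithm)
        for category in categories:
            path = f"{self.lang}/categories/{category['id']}.md"
            pages[path] = self.render_category(category)
        return pages

    @abstractmethod
    def render_algorithm(self, algorithm: Record) -> str: ...

    @abstractmethod
    def render_category(self, category: Record) -> str: ...


class EnglishGenerator(DocsGenerator):
    lang = "en"

    def render_algorithm(self, algorithm: Record) -> str:
        return f"# {algorithm['name']}\n\nTime complexity: {algorithm['time_complexity']}\n"

    def render_category(self, category: Record) -> str:
        return f"# Category: {category['name']}\n"


class ChineseGenerator(DocsGenerator):
    lang = "zh"

    def render_algorithm(self, algorithm: Record) -> str:
        return f"# {algorithm['name']}\n\n时间复杂度：{algorithm['time_complexity']}\n"

    def render_category(self, category: Record) -> str:
        return f"# 分类：{category['name']}\n"


algorithms = [{"id": "blast", "name": "BLAST", "time_complexity": "O(mn)"}]
categories = [{"id": "alignment", "name": "Alignment"}]

for generator in (EnglishGenerator(), ChineseGenerator()):
    print(sorted(generator.generate(algorithms, categories)))
```

Under this layout, adding a Japanese or German edition would mean writing one more subclass with two rendering methods; the traversal itself is untouched.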
Pipeline Pattern
The entire data pipeline (load → validate → generate → build → deploy) is designed as a sequentially executed pipeline, where each stage's output serves as the next stage's input, and failure at any stage triggers a fail-fast mechanism. This pattern naturally aligns with the design philosophy of CI/CD workflows.
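The fail-fast behavior can be sketched in a few lines. The `run_pipeline` function, the `StageError` exception, and the toy stages below are hypothetical; in the real system the stages are CLI commands and CI jobs, not Python callables.

```python
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


class StageError(RuntimeError):
    """Raised when a stage fails; identifies which stage stopped the run."""


def run_pipeline(stages: list[Stage], initial: Any = None) -> Any:
    """Run stages in order, feeding each stage's output to the next."""
    value = initial
    for name, stage in stages:
        try:
            value = stage(value)
        except Exception as exc:
            # Fail fast: stop at the first failure and skip later stages.
            raise StageError(f"stage '{name}' failed: {exc}") from exc
    return value


executed: list[str] = []


def load(_: Any) -> list[dict[str, str]]:
    executed.append("load")
    return [{"id": "blast"}, {"name": "entry without an id"}]


def validate(entries: list[dict[str, str]]) -> list[dict[str, str]]:
    executed.append("validate")
    invalid = [entry for entry in entries if "id" not in entry]
    if invalid:
        raise ValueError(f"{len(invalid)} entry without id")
    return entries


def generate(entries: list[dict[str, str]]) -> list[str]:
    executed.append("generate")
    return [f"{entry['id']}.md" for entry in entries]


try:
    run_pipeline([("load", load), ("validate", validate), ("generate", generate)])
except StageError as error:
    print(error)  # prints: stage 'validate' failed: 1 entry without id

print(executed)  # prints ['load', 'validate']
```

Because validate raises, generate never runs, which mirrors how a failed CI step prevents the build and deploy steps from executing.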