CleanBook Technical Report
Document Information
| Property | Value |
|---|---|
| Project | CleanBook - AI Bookmark Classification System v2.0 |
| Tech Stack | Python + Machine Learning + NLP |
| Version | v1.0 |
| Created | 2025-07-30 |
| Updated | 2026-04-16 |
| License | MIT |
Table of Contents
- Project Overview
- System Architecture
- Core Technology Stack
- Machine Learning Models
- Algorithm Implementation
- Performance Optimization
- Data Processing Pipeline
- Challenges & Solutions
- Performance Metrics
- Future Roadmap
Project Overview
Introduction
CleanBook is an AI-powered bookmark management tool based on machine learning and natural language processing. The system automatically analyzes, classifies, and organizes browser bookmarks, transforming chaotic collections into well-structured libraries.
Technical Highlights
- Multi-algorithm fusion: Combines rule engine, machine learning, and ensemble classifiers
- High-performance processing: Multi-threaded concurrent processing with intelligent caching
- Adaptive learning: Supports online learning and continuous model optimization
- Modular design: Clear layered architecture, easy to extend and maintain
- AI-driven classification: Integrates multiple AI techniques to reach 91.4% classification accuracy
System Architecture
Overall Architecture
┌─────────────────────────────────────────┐
│ User Interface Layer │
│ CLI / Web Interface / API │
├─────────────────────────────────────────┤
│ Business Logic Layer │
│ Bookmark Processor / AI Classifier │
├─────────────────────────────────────────┤
│ AI Algorithm Layer │
│ Rule Engine / ML Classifier / Cache │
├─────────────────────────────────────────┤
│ Data Access Layer │
│ Config Management / Model Storage │
└─────────────────────────────────────────┘
Core Components
| Component | File | Responsibility |
|---|---|---|
| AIBookmarkClassifier | src/ai_classifier.py | Unified classification interface |
| RuleEngine | src/rule_engine.py | Rule-based fast classification |
| MLClassifier | src/ml_classifier.py | ML-based classification |
| BookmarkProcessor | src/bookmark_processor.py | Workflow orchestration |
Core Technology Stack
Programming Language & Frameworks
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Core | Python | 3.10+ | Primary language |
| ML | scikit-learn | 1.4.x | Machine learning |
| NLP | jieba | 0.42.x | Chinese tokenization |
| NLP | langdetect | 1.0.x | Language detection |
| Parsing | BeautifulSoup4 | 4.12.x | HTML parsing |
| CLI | rich | 13.x | Terminal UI |
| CLI | click | 8.1.x | CLI framework |
Key Dependencies
# Machine learning
scikit-learn>=1.4.2
numpy>=1.26.4
pandas>=2.2.2
# NLP
jieba>=0.42.1
langdetect>=1.0.9
# Web & parsing
beautifulsoup4>=4.12.3
lxml>=5.2.2
# CLI & UI
rich>=13.7.1
click>=8.1.7
tqdm>=4.66.4
Machine Learning Models
Algorithm Comparison
| Algorithm | Accuracy | Training Time | Prediction Time | Best For |
|---|---|---|---|---|
| Random Forest | 78.4% | 2.3s | 0.8ms | High-dimensional features |
| SVM | 73.3% | 15.2s | 1.2ms | Nonlinear problems |
| Logistic Regression | 88.8% | 1.8s | 0.3ms | Fast, explainable |
| Naive Bayes | 88.8% | 0.5s | 0.2ms | Text classification |
| Gradient Boosting | 85.3% | 8.7s | 0.9ms | Complex relationships |
| SGD | 88.8% | 1.2s | 0.4ms | Online learning |
| Ensemble | 91.4% | 12.3s | 2.1ms | Best overall |
Feature Engineering
TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=500,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=1,
    lowercase=True,
    stop_words=None       # no built-in stop-word list
)
Feature Types
- Text features: TF-IDF of title + URL
- Numerical features: URL length, title length, domain depth
- Categorical features: Content type, language, domain category
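The three feature families can be combined into one matrix by stacking the sparse TF-IDF output with a dense numeric block. The sketch below assumes illustrative bookmark records (the URLs and the domain-depth heuristic are examples, not the project's exact implementation):

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

bookmarks = [
    {"title": "Python tutorial for beginners", "url": "https://docs.python.org/3/tutorial/"},
    {"title": "Daily news headlines", "url": "https://news.example.com/today"},
]

# Text features: TF-IDF over title + URL, as described above
texts = [b["title"] + " " + b["url"] for b in bookmarks]
vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
text_features = vectorizer.fit_transform(texts)

# Numerical features: URL length, title length, domain depth (path segments)
numeric = np.array([[len(b["url"]), len(b["title"]),
                     b["url"].rstrip("/").count("/") - 2] for b in bookmarks])

# Combine sparse text features and dense numeric features into one matrix
X = hstack([text_features, numeric])
```

Scaling the numeric columns before stacking is usually advisable so they do not dominate the unit-normalized TF-IDF values.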
Ensemble Strategy
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier()),
        ('lr', LogisticRegression(max_iter=1000)),
        ('nb', MultinomialNB())
    ],
    voting='soft'  # average predicted class probabilities
)
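A minimal end-to-end sketch of soft voting in action, using toy titles and labels (the training data here is invented for illustration, not the project's corpus):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = ["python list comprehension guide", "numpy array broadcasting docs",
          "breaking election news tonight", "sports scores and league news"]
labels = ["tech", "tech", "news", "news"]

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=0)),
                    ('lr', LogisticRegression(max_iter=1000)),
                    ('nb', MultinomialNB())],
        voting='soft'  # requires predict_proba on every base estimator
    )
)
ensemble.fit(titles, labels)
prediction = ensemble.predict(["python numpy guide"])[0]
```

Soft voting averages the three probability distributions, so a model that is confidently right can outvote two that are weakly wrong; this is typically why the ensemble edges out each base learner.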
Algorithm Implementation
Rule Engine
Pre-compiled regex patterns for fast matching:
import re

class RuleEngine:
    def __init__(self, config):
        self.config = config
        self._compile_rules()  # Pre-compile patterns once at startup

    def classify(self, features) -> ClassificationResult:
        # Domain matching, title matching, URL pattern matching
        pass
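A concrete sketch of what pre-compiled rule matching looks like, with a hypothetical rule table (the domains and categories are illustrative, not CleanBook's actual rules):

```python
import re

# Hypothetical rule table: compiled domain pattern -> category
RULES = [
    (re.compile(r"(^|\.)github\.com$"), "development"),
    (re.compile(r"(^|\.)arxiv\.org$"), "research"),
]

def classify_domain(domain: str):
    """Return the first matching category, or None to fall through to ML."""
    for pattern, category in RULES:
        if pattern.search(domain):
            return category
    return None
```

Compiling patterns once at startup avoids re-parsing the regex on every bookmark, which matters when the rule table is consulted tens of thousands of times.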
Similarity Calculation
Multi-dimensional similarity for deduplication:
def calculate_similarity(bookmark1, bookmark2) -> float:
    url_sim = url_similarity(bookmark1['url'], bookmark2['url'])
    title_sim = title_similarity(bookmark1['title'], bookmark2['title'])
    domain_sim = 1.0 if bookmark1['domain'] == bookmark2['domain'] else 0.0
    # Weighted combination
    weights = {'url': 0.4, 'title': 0.4, 'domain': 0.2}
    total_sim = (url_sim * weights['url'] +
                 title_sim * weights['title'] +
                 domain_sim * weights['domain'])
    return total_sim
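The helpers `url_similarity` and `title_similarity` are not shown in the source; one plausible stdlib-only implementation uses `difflib.SequenceMatcher` (an assumption, not the project's actual code):

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def title_similarity(t1: str, t2: str) -> float:
    # Character-level ratio; a real implementation might tokenize first
    return SequenceMatcher(None, t1.lower(), t2.lower()).ratio()

def url_similarity(u1: str, u2: str) -> float:
    # Identical host + normalized path counts as an exact match
    p1, p2 = urlparse(u1), urlparse(u2)
    if (p1.netloc, p1.path.rstrip('/')) == (p2.netloc, p2.path.rstrip('/')):
        return 1.0
    return SequenceMatcher(None, u1, u2).ratio()
```

Both return values in [0, 1], so they plug directly into the weighted combination above.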
Caching Strategy
from collections import OrderedDict

class LRUCache:
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # Mark as recently used
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # Remove least recently used
        self.cache[key] = value
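For pure functions the standard library offers the same eviction policy via `functools.lru_cache`, which can wrap a classification call directly. A sketch, with a hypothetical stand-in for the real pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def classify_url(url: str) -> str:
    # Hypothetical stand-in for the full rule/ML classification pipeline
    return "development" if "github.com" in url else "other"

classify_url("https://github.com/user/repo")
classify_url("https://github.com/user/repo")  # second call is served from cache
stats = classify_url.cache_info()             # hits/misses for monitoring
```

A hand-rolled `LRUCache` is still useful when values must be invalidated explicitly or when keys are not hashable function arguments.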
Performance Optimization
Parallel Processing
from concurrent.futures import ThreadPoolExecutor

class BookmarkProcessor:
    def __init__(self, max_workers=4):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def process_batch(self, bookmarks, batch_size=50):
        results = []
        for i in range(0, len(bookmarks), batch_size):
            batch = bookmarks[i:i + batch_size]
            futures = [self.executor.submit(self.process_single, b)
                       for b in batch]
            results.extend([f.result() for f in futures])
        return results
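For a flat list of independent items, `executor.map` offers the same thread-pool fan-out with less bookkeeping and preserves input order. A self-contained sketch with a hypothetical `process_single`:

```python
from concurrent.futures import ThreadPoolExecutor

def process_single(bookmark):
    # Hypothetical per-bookmark work (parse, extract features, classify)
    return bookmark["title"].strip().lower()

bookmarks = [{"title": " Python Docs "}, {"title": "News "}]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, regardless of completion order
    results = list(executor.map(process_single, bookmarks))
```

Threads are appropriate here because the per-bookmark work is dominated by I/O and parsing; CPU-bound batches would call for `ProcessPoolExecutor` instead.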
Memory Optimization
- Lazy loading of ML models
- Streaming processing for large files
- Object pooling for frequently created objects
- Periodic garbage collection
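Lazy model loading can be implemented with a property that defers the expensive load until first access. A minimal sketch under the assumption that the real loader is something like `joblib.load` (the path and loader here are placeholders):

```python
class LazyModel:
    """Defer loading a heavy ML model until it is first needed."""

    def __init__(self, path):
        self.path = path
        self._model = None  # nothing loaded at construction time

    @property
    def model(self):
        if self._model is None:
            self._model = self._load()  # loaded once, on first access
        return self._model

    def _load(self):
        # Placeholder for e.g. joblib.load(self.path)
        return {"loaded_from": self.path}
```

This keeps baseline memory low for users who only run the rule engine and never touch the ML path.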
I/O Optimization
- Chunked file reading
- Async I/O operations
- Buffered writes for export
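Chunked reading keeps memory flat even for very large bookmark exports; a generator sketch (chunk size and file contents are illustrative):

```python
import os
import tempfile

def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield a large file in fixed-size chunks instead of reading it whole."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

# Demo: write a small file, then stream it back in 16-byte chunks
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
    f.write("<a href='x'>bookmark</a>" * 10)
    path = f.name

data = "".join(read_in_chunks(path, chunk_size=16))
os.remove(path)
```

Downstream parsing can consume the generator incrementally rather than holding the whole export in memory.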
Data Processing Pipeline
HTML Input
│
▼
Parse (BeautifulSoup)
│
▼
Extract Features
- Domain parsing
- Language detection
- Content type detection
│
▼
Deduplicate
- URL normalization
- Similarity calculation
│
▼
Classify
- Rule engine (fast path)
- ML classifier (if needed)
- LLM fallback (optional)
│
▼
Organize
- Group by subject
- Group by resource_type
│
▼
Export
- HTML (Netscape format)
- Markdown (reports)
- JSON (structured data)
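The stages above compose naturally as a sequence of functions, each taking the previous stage's output. A sketch with stub stages (the real implementations live in the components listed earlier; the data here is invented):

```python
def run_pipeline(data, stages):
    """Thread data through each pipeline stage in order."""
    for stage in stages:
        data = stage(data)
    return data

# Stub stages; real versions do BeautifulSoup parsing, similarity-based
# deduplication, and rule/ML classification
def parse(html):
    return [{"title": "Python Docs", "url": "https://docs.python.org/"},
            {"title": "Python Docs", "url": "https://docs.python.org/"}]

def deduplicate(items):
    return list({b["url"]: b for b in items}.values())

def classify(items):
    return [dict(b, category="development") for b in items]

result = run_pipeline("<html>...</html>", [parse, deduplicate, classify])
```

Keeping each stage a pure function makes individual steps easy to test and to swap (e.g. an optional LLM fallback after the ML classifier).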
Challenges & Solutions
| Challenge | Solution |
|---|---|
| Mixed Chinese/English content | Separate processing pipelines for each language |
| Cold start (no training data) | Rule-based fallback with lowered confidence |
| Class imbalance in training data | SMOTE oversampling for minority classes |
| Large-scale data processing | Streaming + chunked processing |
| Real-time requirements | Multi-layer caching with LRU eviction |
| URL normalization | Strip tracking parameters, normalize protocols |
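The URL normalization step can be sketched with the standard library's `urllib.parse`; the tracking-parameter list below is a common illustrative set, not necessarily the project's exact list:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Common tracking parameters to strip (illustrative, not exhaustive)
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    p = urlparse(url)
    # Drop tracking parameters but keep meaningful query arguments
    query = [(k, v) for k, v in parse_qsl(p.query) if k not in TRACKING]
    return urlunparse((
        p.scheme.lower(),
        p.netloc.lower(),        # hostnames are case-insensitive
        p.path.rstrip("/") or "/",
        p.params,
        urlencode(query),
        ""                       # drop fragments: they rarely change content
    ))
```

Normalizing before the similarity check means trivially different URLs (case, trailing slash, tracking noise) deduplicate as exact matches.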
Performance Metrics
System Performance
| Metric | Value |
|---|---|
| Classification Accuracy | 91.4% |
| Processing Speed | ~50 bookmarks/second |
| Cache Hit Rate | 87-92% |
| Memory Usage (baseline) | ~45MB |
| Memory Usage (1000 bookmarks) | ~125MB |
Scalability Test
| Bookmark Count | Processing Time | Throughput |
|---|---|---|
| 100 | 2.3s | 43/s |
| 1,000 | 18.7s | 53/s |
| 5,000 | 89.2s | 56/s |
| 10,000 | 178.5s | 56/s |
Future Roadmap
Deep Learning Integration
- BERT-based text understanding
- Sentence embeddings for semantic similarity
- Transformer models for classification
Advanced Features
- Active Learning: Smart sampling for efficient labeling
- Model Drift Detection: Automatic retraining triggers
- Reinforcement Learning: Optimize classification strategy based on user feedback
Cloud & Scale
- Microservices architecture
- Distributed processing
- Auto-scaling deployment
Technical Achievements
- Multi-algorithm fusion: Successfully integrated 6 ML algorithms with 91.4% accuracy
- High-performance architecture: Concurrent processing, intelligent caching, memory optimization
- Modular design: Clear layered architecture, easy to maintain and extend
- Chinese optimization: Specialized processing for Chinese content
- Real-time capability: Supports real-time classification of large-scale data
Conclusion
CleanBook demonstrates the practical application of modern AI technologies in solving real-world bookmark management problems. The project will continue to explore deep learning, cloud-native architecture, and big data processing to provide users with more intelligent and efficient bookmark management solutions.
This document details the technical implementation of CleanBook's AI bookmark classification system, including architecture design, algorithm implementation, performance optimization, and future development directions.