
Configuration Reference

CleanBook's configuration file, config.json, is written in JSON and is fully customizable.

Configuration structure overview

json
{
  "show_confidence_indicator": false,
  "ai_settings": { ... },
  "llm": { ... },
  "title_cleaning_rules": { ... },
  "taxonomy": { ... },
  "processing_order": [ ... ],
  "category_order": [ ... ],
  "domain_grouping_rules": { ... },
  "priority_rules": { ... },
  "category_rules": { ... }
}

ai_settings

Core settings for AI processing.

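| Field | Type | Default | Description |
|---|---|---|---|
| confidence_threshold | float | 0.4 | Classification confidence threshold; results below this value are treated as "uncategorized" |
| use_semantic_analysis | boolean | true | Whether to enable semantic analysis |
| use_user_profiling | boolean | true | Whether to enable user profiling |
| cache_size | integer | 10000 | Size of the URL feature cache |
| max_workers | integer | 4 | Number of parallel workers |
| enable_learning | boolean | true | Whether to enable incremental learning |
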
json
{
  "ai_settings": {
    "confidence_threshold": 0.7,
    "use_semantic_analysis": true,
    "use_user_profiling": true,
    "cache_size": 10000,
    "max_workers": 4,
    "enable_learning": true
  }
}
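
To illustrate how a partial ai_settings block relates to the defaults listed in the table above, here is a minimal Python sketch. The helper name `resolve_ai_settings` and the strict rejection of unknown fields are assumptions made for illustration; this is not CleanBook's actual loader.

```python
import json

# Defaults copied from the ai_settings table above.
AI_DEFAULTS = {
    "confidence_threshold": 0.4,
    "use_semantic_analysis": True,
    "use_user_profiling": True,
    "cache_size": 10000,
    "max_workers": 4,
    "enable_learning": True,
}


def resolve_ai_settings(config_text: str) -> dict:
    """Hypothetical helper: fill in documented defaults for missing fields."""
    config = json.loads(config_text)
    overrides = config.get("ai_settings", {})
    # Assumption: fields not listed in the table are treated as mistakes.
    unknown = set(overrides) - set(AI_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown ai_settings fields: {sorted(unknown)}")
    return {**AI_DEFAULTS, **overrides}


settings = resolve_ai_settings('{"ai_settings": {"confidence_threshold": 0.7}}')
print(settings["confidence_threshold"], settings["max_workers"])
```

Only confidence_threshold is overridden in this call; every other field keeps its documented default.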

category_rules

Category rules are CleanBook's core feature. Each category can contain multiple matching rules.

Rule types

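| Type | Description | Example |
|---|---|---|
| domain | Matches the domain name | github.com, *.github.io |
| title | Matches keywords in the title | python, tutorial |
| url_ends_with | Matches the URL suffix | .pdf, .md |
| match_all_keywords_in | All keywords must match | ["python", "asyncio"] |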
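
The four rule types can be pictured as simple predicates. The Python sketch below is one possible reading, not CleanBook's implementation: the function names are hypothetical, matching is assumed to be case-insensitive, a plain domain such as github.com is assumed to match its subdomains too, and url_ends_with is assumed to test the URL path without the query string.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def match_domain(url: str, patterns: list[str]) -> bool:
    """Assumed semantics: wildcard patterns use glob matching; plain
    domains match the host itself or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if "*" in pattern:
            if fnmatch(host, pattern):
                return True
        elif host == pattern or host.endswith("." + pattern):
            return True
    return False


def match_title(title: str, keywords: list[str]) -> bool:
    """True if any keyword occurs in the title (case-insensitive)."""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def match_url_ends_with(url: str, suffixes: list[str]) -> bool:
    """True if the URL path ends with any suffix (query string ignored)."""
    path = urlparse(url).path.lower()
    return any(path.endswith(suffix.lower()) for suffix in suffixes)


def match_all_keywords_in(text: str, keywords: list[str]) -> bool:
    """True only if every keyword occurs in the text."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


print(match_domain("https://alice.github.io/blog", ["*.github.io"]))
print(match_all_keywords_in("Python asyncio guide", ["python", "asyncio"]))
```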

Rule configuration example

json
{
  "category_rules": {
    "💻 Programming/Python": {
      "rules": [
        {
          "match": "domain",
          "keywords": ["python.org", "pypi.org", "readthedocs.io"],
          "weight": 15
        },
        {
          "match": "title",
          "keywords": ["python", "django", "flask", "fastapi", "pip"],
          "weight": 10,
          "require_all": false
        }
      ],
      "description": "Python-related resources"
    },
    "🤖 AI/ML": {
      "rules": [
        {
          "match": "domain",
          "keywords": ["huggingface.co", "pytorch.org", "tensorflow.org"],
          "weight": 20
        },
        {
          "match": "title",
          "keywords": ["machine learning", "deep learning", "neural network", "LLM"],
          "weight": 12
        }
      ]
    },
    "📚 Docs": {
      "rules": [
        {
          "match": "url_ends_with",
          "patterns": ["/docs", "/documentation", ".pdf"],
          "weight": 8
        }
      ]
    }
  }
}

Weight system

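  • Each rule can set a weight (1-100)
  • The weights of all matching rules in a category are summed
  • A category is assigned only if the score exceeds confidence_threshold
  • The category with the highest total weight wins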
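
The accumulate-then-pick-the-highest logic above can be sketched in Python as follows. This page does not say how summed integer weights are mapped onto confidence_threshold (a float such as 0.4), so the sketch uses a hypothetical `min_score` parameter on the same scale as the weights; the function name `pick_category` is also an assumption.

```python
from collections import defaultdict


def pick_category(matched_rules, min_score):
    """matched_rules: (category, weight) pairs for rules that matched.

    min_score is a hypothetical stand-in for the threshold check; the real
    mapping from weights to confidence_threshold is not documented here.
    """
    totals = defaultdict(int)
    for category, weight in matched_rules:
        totals[category] += weight  # weights of matching rules accumulate
    if not totals:
        return None
    best = max(totals, key=totals.get)  # highest total weight wins
    return best if totals[best] > min_score else None


matches = [("Python", 15), ("Python", 10), ("AI/ML", 20)]
print(pick_category(matches, min_score=12))
```

Here the two Python rules sum to 25, which beats the single AI/ML rule at 20, so the bookmark is assigned to Python.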

title_cleaning_rules

This section removes noise from titles before classification.

json
{
  "title_cleaning_rules": {
    "prefixes": ["Sign in ·"],
    "suffixes": ["· GitHub"],
    "replacements": {
      "(7条消息)": ""
    }
  }
}
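
A minimal Python sketch of how such rules might be applied to a title. The order of operations (replacements first, then prefixes, then suffixes), the whitespace trimming, and the function name `clean_title` are assumptions; the page only defines the three fields.

```python
def clean_title(title: str, rules: dict) -> str:
    """Hypothetical helper applying title_cleaning_rules to one title."""
    # Assumption: replacements are plain substring substitutions.
    for old, new in rules.get("replacements", {}).items():
        title = title.replace(old, new)
    title = title.strip()
    for prefix in rules.get("prefixes", []):
        if prefix and title.startswith(prefix):
            title = title[len(prefix):].strip()
    for suffix in rules.get("suffixes", []):
        if suffix and title.endswith(suffix):
            title = title[: -len(suffix)].strip()
    return title


rules = {
    "prefixes": ["Sign in ·"],
    "suffixes": ["· GitHub"],
    "replacements": {"(7条消息)": ""},
}
print(clean_title("cleanbook/README.md at main · GitHub", rules))
print(clean_title("(7条消息)Python asyncio tutorial", rules))
```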

taxonomy

The taxonomy section manages the classification vocabulary and its synonyms.

json
{
  "taxonomy": {
    "subjects_file": "taxonomy/subjects.yaml",
    "resource_types_file": "taxonomy/resource_types.yaml"
  }
}

llm

LLM-related settings (optional).

json
{
  "llm": {
    "enable": false,
    "provider": "openai",
    "model": "gpt-4o-mini",
    "api_key_env": "OPENAI_API_KEY",
    "temperature": 0.0,
    "top_p": 1.0,
    "timeout_seconds": 25,
    "max_retries": 1
  }
}
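
| Field | Description |
|---|---|
| enable | Whether to enable LLM classification |
| provider | Provider, for example openai |
| model | Model name |
| api_key_env | Name of the environment variable that stores the API key |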
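
Because api_key_env holds the name of an environment variable rather than the key itself, the secret stays out of config.json. The Python sketch below shows one way such a lookup could work; `resolve_api_key` and its error handling are hypothetical, not CleanBook's actual code.

```python
import os


def resolve_api_key(llm_config: dict, env=os.environ):
    """Hypothetical helper: read the key from the variable named in api_key_env."""
    if not llm_config.get("enable", False):
        return None  # LLM classification is off, so no key is needed
    var_name = llm_config["api_key_env"]
    key = env.get(var_name)
    if not key:
        raise RuntimeError(f"LLM is enabled but {var_name} is not set")
    return key


llm_config = {"enable": True, "api_key_env": "OPENAI_API_KEY"}
# A plain dict stands in for the real environment in this example.
fake_env = {"OPENAI_API_KEY": "example-key"}
print(resolve_api_key(llm_config, env=fake_env))
```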

Usage

bash
# Use the default configuration
cleanbook -i bookmarks.html -o output/

# Specify a configuration file explicitly
cleanbook -i bookmarks.html -o output/ -c ./config.json

Complete example

json
{
  "ai_settings": {
    "confidence_threshold": 0.7,
    "use_semantic_analysis": true,
    "use_user_profiling": true,
    "cache_size": 10000,
    "max_workers": 4,
    "enable_learning": true
  },
  "category_rules": {
    "💻 Programming": {
      "rules": [
        {
          "match": "domain",
          "keywords": ["github.com", "stackoverflow.com", "gitlab.com"],
          "weight": 20
        },
        {
          "match": "title",
          "keywords": ["programming", "developer", "code", "github"],
          "weight": 8
        }
      ]
    },
    "🎨 Design": {
      "rules": [
        {
          "match": "domain",
          "keywords": ["figma.com", "dribbble.com", "behance.net"],
          "weight": 20
        }
      ]
    }
  },
  "taxonomy": {
    "subjects_file": "taxonomy/subjects.yaml",
    "resource_types_file": "taxonomy/resource_types.yaml"
  },
  "llm": {
    "enable": false,
    "provider": "openai",
    "model": "gpt-4o-mini",
    "api_key_env": "OPENAI_API_KEY"
  },
  "show_confidence_indicator": false
}


CleanBook · Offline-first bookmark cleanup