跳转至

DataMax Docs

清洗

Hi-Dolphin/datamax

清洗¶

能力清单¶

异常清理：HTML 标签、不可见字符、空白/换行规整、简繁转换
质量过滤：重复率、长度范围、数字占比
隐私脱敏：IP/邮箱/手机号/QQ/身份证/银行卡（含 Luhn 校验）

使用方式¶

管道式（推荐）¶

from datamax import DataMax

cleaned = DataMax(file_path="a.txt").clean_data([
    "abnormal",  # 异常清理
    "filter",    # 质量过滤
    "private"    # 隐私脱敏
])
print(cleaned["content"])  # 统一结构，含 lifecycle

细粒度类（进阶）¶

from datamax.cleaner import AbnormalCleaner, TextFilter, PrivacyDesensitization

text = "含 <b>HTML</b> 与 个人信息：182****，test@example.com"

text = AbnormalCleaner(text).to_clean()["text"]
text = TextFilter(text).to_filter().get("text", text)
text = PrivacyDesensitization(text).to_private()["text"]

小贴士¶

建议先“异常清理”再做“过滤”，最后做“隐私脱敏”
clean_data 会为输入/输出追加生命周期事件，便于审计与追踪

示例脚本¶

""" Clean text using DataMax cleaner pipeline. """ from datamax import DataMax

def main(): # Use a small text file from examples or replace with your own path input_path = "examples/parse/sample_document.txt"

cleaned = DataMax(file_path=input_path).clean_data(["abnormal", "filter", "private"])
print(cleaned.get("content", "")[:200])

if name == "main": main()