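Cleaning
Capabilities
- Abnormal-content cleanup: HTML tags, invisible characters, whitespace/newline normalization, Simplified/Traditional Chinese conversion
- Quality filtering: repetition rate, length range, digit ratio
- Privacy desensitization: IP addresses, email addresses, mobile numbers, QQ numbers, ID card numbers, bank card numbers (with Luhn validation)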
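The Luhn validation mentioned for bank card numbers can be illustrated with a minimal, standalone sketch. This is not DataMax's internal implementation; the function name `luhn_valid` is hypothetical and only shows the standard checksum a desensitizer could use to decide whether a digit run looks like a real card number.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Hypothetical helper for illustration; not part of the DataMax API.
    """
    # Allow the common separators used when card numbers are written out.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A check like this helps avoid masking arbitrary long digit sequences (order IDs, timestamps) that merely resemble card numbers.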
Usage
Pipeline style (recommended)
from datamax import DataMax

cleaned = DataMax(file_path="a.txt").clean_data([
    "abnormal",  # abnormal-content cleanup
    "filter",    # quality filtering
    "private",   # privacy desensitization
])
print(cleaned["content"])  # unified result structure, including lifecycle
Fine-grained classes (advanced)
from datamax.cleaner import AbnormalCleaner, TextFilter, PrivacyDesensitization

text = "Contains <b>HTML</b> and personal info: 182****, test@example.com"
text = AbnormalCleaner(text).to_clean()["text"]
text = TextFilter(text).to_filter().get("text", text)
text = PrivacyDesensitization(text).to_private()["text"]
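Tips
- Run abnormal-content cleanup first, then quality filtering, and privacy desensitization last
- clean_data appends lifecycle events for the input and output, which makes auditing and tracing easier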
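The recommended ordering can be sketched with the standard library alone. Everything below is a hypothetical stand-in (the function names, the thresholds, and the `[EMAIL]` placeholder are assumptions, not DataMax behavior). It shows one plausible reason for the order: markup can split a sensitive value so that a masking pattern misses it, and filter metrics are more meaningful on text that has already been cleaned.

```python
import re
from typing import Optional

# Illustrative patterns only; a real desensitizer covers many more cases.
TAG_RE = re.compile(r"<[^>]+>")
ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u200d\ufeff]")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_abnormal(text: str) -> str:
    """Step 1: drop HTML tags and zero-width characters, collapse whitespace."""
    text = TAG_RE.sub("", text)
    text = ZERO_WIDTH_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_filter(text: str, min_len: int = 5, max_digit_ratio: float = 0.5) -> bool:
    """Step 2: reject text that is too short or mostly digits (assumed thresholds)."""
    if len(text) < min_len:
        return False
    digit_count = sum(ch.isdigit() for ch in text)
    return digit_count / len(text) <= max_digit_ratio


def mask_private(text: str) -> str:
    """Step 3: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def clean(text: str) -> Optional[str]:
    """Apply the three steps in the recommended order; None means filtered out."""
    text = strip_abnormal(text)
    if not passes_filter(text):
        return None
    return mask_private(text)
```

For example, `mask_private("test@<b>example</b>.com")` leaves the input unchanged because the tag interrupts the address, while `clean("Mail test@<b>example</b>.com now")` yields `"Mail [EMAIL] now"` once the markup has been removed first.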
Example script
"""Clean text using DataMax cleaner pipeline."""
from datamax import DataMax


def main():
    # Use a small text file from examples or replace with your own path
    input_path = "examples/parse/sample_document.txt"
    cleaned = DataMax(file_path=input_path).clean_data(["abnormal", "filter", "private"])
    print(cleaned.get("content", "")[:200])


if __name__ == "__main__":
    main()