Crawler
Engines and Modes
- Engines: web (general web pages), arxiv (academic papers), or engine="auto" to aggregate across multiple engines
- Modes: synchronous and asynchronous calls are supported (the aggregation interface is wrapped internally)
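The aggregation behavior of engine="auto" can be sketched roughly as follows. This is a minimal illustration with dummy engine functions, not the library's actual implementation: the engine names mirror the list above, but `crawl_auto`, `web_engine`, and `arxiv_engine` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Dummy stand-ins for real engines; the actual datamax engines differ.
def web_engine(query):
    return {"engine": "web", "query": query, "items": ["page1"]}

def arxiv_engine(query):
    raise RuntimeError("arxiv unavailable")  # simulate a failing engine

ENGINES = {"web": web_engine, "arxiv": arxiv_engine}

def crawl_auto(query):
    """Call every registered engine concurrently; collect successes and failures."""
    results, errors = {}, {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(fn, query): name for name, fn in ENGINES.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                errors[name] = str(exc)
    return {"results": results, "errors": errors}
```

The point of the pattern is that one slow or broken engine does not abort the run; its error is recorded alongside the successful results.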
Unified Call (One-Liner)
from datamax.crawler import crawl
# specify an engine
data = crawl("https://example.com", engine="web")
# aggregate all available engines
mix = crawl("shipping", engine="auto")
Result Parsing (to Markdown)
from datamax.parser import CrawlerParser
md_vo = CrawlerParser(file_path="crawl_result.json").parse()
print(md_vo.to_dict()["content"])  # structured Markdown
Storage Adapters
- Local: JSON/YAML (LocalStorageAdapter)
- Cloud: CloudStorageAdapter placeholder (extensible to S3/GCS/Azure)
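To illustrate the role a local adapter plays, here is a minimal JSON-backed storage sketch. The class name and methods are hypothetical; the real LocalStorageAdapter API may differ.

```python
import json
from pathlib import Path

class SimpleLocalStorage:
    """Hypothetical sketch of a local JSON storage adapter."""

    def __init__(self, base_dir):
        self.base = Path(base_dir)
        self.base.mkdir(parents=True, exist_ok=True)

    def save(self, key, data):
        # Serialize one crawl result under a stable key.
        path = self.base / f"{key}.json"
        path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
        return path

    def load(self, key):
        return json.loads((self.base / f"{key}.json").read_text(encoding="utf-8"))
```

A cloud adapter would expose the same save/load surface while writing to an object store such as S3, GCS, or Azure instead of the local filesystem.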
Tips
- engine="auto" calls the registered engines concurrently and aggregates the success/failure results
- Crawling can be combined with parsing, cleaning, and labeling to form a complete pipeline (crawl → parse → clean → label)
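The full pipeline mentioned above (crawl → parse → clean → label) can be sketched with stub stages. Every function here is a placeholder; in practice you would substitute the real datamax calls (crawl, CrawlerParser, and your own cleaning/labeling steps).

```python
# Stub stages standing in for the real datamax components.
def crawl_stage(target):
    # Real code: crawl(target, engine="web")
    return {"url": target, "html": "<p>hello  world</p>"}

def parse_stage(raw):
    # Real code: CrawlerParser(file_path=...).parse().to_dict()
    return {"content": "hello  world"}

def clean_stage(doc):
    # Example cleanup: collapse repeated whitespace.
    doc["content"] = " ".join(doc["content"].split())
    return doc

def label_stage(doc):
    # Placeholder label; real labeling would use a model or rules.
    doc["label"] = "greeting"
    return doc

def pipeline(target):
    return label_stage(clean_stage(parse_stage(crawl_stage(target))))
```

Each stage takes and returns a plain dict, so stages can be swapped or reordered without touching the others.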
Example Script
""" Crawl a web page or search query and save result to a file. """ import json from datamax.crawler import crawl
def main(): # URL or keyword target = "https://example.com" result = crawl(target, engine="web")
with open("crawl_result.json", "w", encoding="utf-8") as f:
json.dump(result, f, ensure_ascii=False, indent=2)
print("Saved crawl_result.json")
if name == "main": main()