
Evaluation

Text Quality Evaluation

  • Metrics: BERTScore, ROUGE, BLEU, Self-CIDEr (semantic diversity)
from datamax.evaluator import TextQualityEvaluator

e = TextQualityEvaluator(lang="zh")  # "zh" selects a Chinese scoring model
# candidates vs. references -> {precision, recall, f1}
bertscore = e.evaluate_bertscore(["生成句子"], ["参考句子"])

The corresponding dependencies must be installed first (e.g. bert-score, rouge-score, sacrebleu, pycocoevalcap).
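
For ROUGE and BLEU, the metric libraries above can also be called directly. Below is a minimal sketch using rouge-score and sacrebleu on an English pair; it assumes only those two packages and does not rely on any particular datamax wrapper method:

from rouge_score import rouge_scorer
import sacrebleu

cand = "the agent routes documents to the evaluators"
ref = "the agent sends documents to the evaluators"

# Sentence-level ROUGE-1 / ROUGE-L F-measures
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(ref, cand)

# Corpus-level BLEU; sacrebleu takes a list of hypotheses
# and a list of reference streams
bleu = sacrebleu.corpus_bleu([cand], [[ref]])

print(rouge["rougeL"].fmeasure, bleu.score)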

Multimodal Consistency

  • Metrics: CLIPScore (DashScope multimodal embeddings), VQA (OpenAI-compatible API)
from datamax.evaluator import MultimodalConsistencyEvaluator as MCE

m = MCE(
  clip_model_name="qwen-vl-clip",
  vqa_model_name="qwen-vl-max",
  dashscope_api_key="${DASHSCOPE_API_KEY}"
)
# Image-text agreement between the image and its caption ("This image shows...")
score = m.evaluate_clip_score("img.png", "这张图描述了...")
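
The VQA metric instead asks a vision-language model to judge whether the caption is consistent with the image. The sketch below shows what such a check looks like against DashScope's OpenAI-compatible endpoint; the prompt, base64 wiring, and 0-10 scoring convention are illustrative assumptions, not the evaluator's confirmed internals:

import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

with open("img.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Ask the VQA model for an image-caption consistency rating
resp = client.chat.completions.create(
    model="qwen-vl-max",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Rate from 0 to 10 how well this caption matches the image: 这张图描述了..."},
        ],
    }],
)
print(resp.choices[0].message.content)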

End-to-End Filtering (Workflow)

  • Parse the PDF → emit Markdown with absolute image paths
  • Generate multimodal QA pairs → compute CLIPScore/VQA scores
  • Filter low-quality samples against a score threshold → export (see the sketch after this list)
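
A minimal sketch of the final filter-and-export step, assuming the QA samples have already been scored; the sample schema, field names, and 0.25 cutoff are illustrative choices, not a fixed datamax interface:

import json

CLIP_THRESHOLD = 0.25  # illustrative cutoff; tune per dataset

# Each scored sample is assumed to carry its image path, QA text, and score
samples = [
    {"image": "/abs/path/fig1.png", "qa": "Q: ... A: ...", "clip_score": 0.31},
    {"image": "/abs/path/fig2.png", "qa": "Q: ... A: ...", "clip_score": 0.12},
]

kept = [s for s in samples if s["clip_score"] >= CLIP_THRESHOLD]

# Export the surviving samples as JSON Lines
with open("filtered_qa.jsonl", "w", encoding="utf-8") as f:
    for s in kept:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")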

Example Script

""" Evaluate generated text quality with BERTScore. Requires bert-score installed. """ from datamax.evaluator import TextQualityEvaluator

def main(): cand = ["智能体在航运场景中的应用包括……"] refs = ["航运场景中,智能体主要应用于……"] e = TextQualityEvaluator(lang="zh") scores = e.evaluate_bertscore(cand, refs) print(scores)

if name == "main": main()
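
With bert-score installed, running the script prints the {precision, recall, f1} dictionary returned by evaluate_bertscore for the candidate/reference pair.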