Project 11 · Build Your Own Judge

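Engineering quiz: What is "LLM-as-judge"?

Which of the following explanations is the most accurate?

Using another LLM to evaluate the quality of the first LLM's output, for example using Qwen to judge whether a Llama response is "good".
An LLM evaluating its own output.
Human judgment that has nothing to do with LLMs.

Explanation: "LLM-as-judge" is a technique: instead of having humans evaluate every output (too slow), you use prompt engineering to have an LLM score outputs against criteria you define. It is fast, cheap, and reproducible. The drawback is that the judge LLM itself can be wrong, so you need to validate the judge's accuracy against human labels.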

Step-by-step guide: 4 steps to implement an "LLM Judge"

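Step 1 · Define the evaluation criteria. Write down what you want to evaluate: accuracy, harmfulness, style consistency, and so on. Give each criterion 1–2 sentences.

Example: "accuracy" = "whether the information and facts in the response are correct"; "harmfulness" = "whether the response teaches the user to do something harmful".

Step 2 · Write the judge prompt. Have the LLM assign scores according to the criteria. The prompt needs clear scoring rules and a few examples.

Example:
```
Evaluate the accuracy of this response.
1 point = completely wrong
5 points = completely correct
Examples: ...
```

Step 3 · Collect a "gold standard" (human-annotated reference answers). Sample 50–100 examples and have humans score them using the same rubric as the judge prompt.

Example: invite 3 evaluators, have each score the same 50 examples, and compare how consistent they are with one another.

Step 4 · Compare the judge's scores with the human scores and compute their agreement. Treat the judge as reliable only if agreement is ≥ 85%.

Example: the judge and the humans both say "good" on 42 examples and both say "bad" on 8. Agreement = (42 + 8) / 50 = 100%, because here all 50 examples fall into an agreeing cell. Any example on which the two disagree would count in the denominator only and pull the percentage down.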
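
The comparison in step 4 can be sketched in a few lines. This is a minimal sketch, assuming the judge and human scores have already been reduced to the same categorical labels ("good" / "bad"). The function names `percentAgreement` and `cohensKappa` and the sample counts are hypothetical. Cohen's kappa is not mentioned in the steps above; it is included as an optional extra because raw agreement can look high by chance when one label dominates.

```javascript
// Fraction of items on which the judge and the human gave the same label.
function percentAgreement(judgeLabels, humanLabels) {
  if (judgeLabels.length !== humanLabels.length || judgeLabels.length === 0) {
    throw new Error('Label arrays must be non-empty and of equal length');
  }
  let matches = 0;
  for (let i = 0; i < judgeLabels.length; i++) {
    if (judgeLabels[i] === humanLabels[i]) matches++;
  }
  return matches / judgeLabels.length;
}

// Cohen's kappa: (observed - expected) / (1 - expected), where "expected"
// is the agreement two independent raters would reach by chance.
function cohensKappa(judgeLabels, humanLabels) {
  const n = judgeLabels.length;
  const observed = percentAgreement(judgeLabels, humanLabels);
  const judgeCounts = new Map();
  const humanCounts = new Map();
  for (let i = 0; i < n; i++) {
    judgeCounts.set(judgeLabels[i], (judgeCounts.get(judgeLabels[i]) || 0) + 1);
    humanCounts.set(humanLabels[i], (humanCounts.get(humanLabels[i]) || 0) + 1);
  }
  let expected = 0;
  for (const [label, count] of judgeCounts) {
    expected += (count / n) * ((humanCounts.get(label) || 0) / n);
  }
  if (expected === 1) return 1; // both raters used one identical label throughout
  return (observed - expected) / (1 - expected);
}

// Hypothetical data for 50 items: 40 both "good", 5 both "bad",
// 3 judge "good" / human "bad", 2 judge "bad" / human "good".
const judge = [
  ...Array(40).fill('good'), ...Array(5).fill('bad'),
  ...Array(3).fill('good'), ...Array(2).fill('bad'),
];
const human = [
  ...Array(40).fill('good'), ...Array(5).fill('bad'),
  ...Array(3).fill('bad'), ...Array(2).fill('good'),
];

const agreement = percentAgreement(judge, human);
console.log(`agreement: ${agreement.toFixed(2)}`);                    // 0.90
console.log(`kappa: ${cohensKappa(judge, human).toFixed(2)}`);        // 0.61
console.log(`meets the 85% bar: ${agreement >= 0.85}`);               // true
```

In this made-up data the raw agreement clears the 85% bar while kappa is noticeably lower, which is the kind of gap worth inspecting before trusting the judge.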

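Hands-on: implement a "judge scoring function"

Task: write a `judgeResponse(aiResponse, criteria)` function that takes the AI's output and the evaluation criteria and returns a score from 1 to 5 together with the reasons.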

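// A simple judge function (rule-based keyword heuristics for Chinese text, no LLM call)
function judgeResponse(aiResponse, criteria) {
  let score = 5;
  const reasons = [];

  // Accuracy check: a crude heuristic that flags dates such as "1998年" (年 = "year")
  // as claims worth verifying. It cannot tell whether a date is actually wrong.
  if (criteria.factAccuracy) {
    const suspiciousDates = (aiResponse.match(/\d{4}年/g) || []).length;
    if (suspiciousDates === 0) {
      reasons.push('No suspicious dates found');
    } else {
      score -= 1;
      reasons.push(`Contains ${suspiciousDates} suspicious date(s)`);
    }
  }

  // Harmfulness check: single-character keywords (kill, poison, bomb, cheat, fraud).
  // Expect false positives, since these characters also appear in harmless words.
  if (criteria.harmless) {
    const harmfulWords = aiResponse.match(/杀|毒|炸|骗|诈/g);
    if (harmfulWords) {
      score -= 2;
      reasons.push('Contains harmful keywords');
    } else {
      reasons.push('No obviously harmful content');
    }
  }

  // Completeness check: length in characters as a rough proxy
  if (criteria.completeness) {
    if (aiResponse.length < 50) {
      score -= 1;
      reasons.push('Response is short and may be incomplete');
    } else {
      reasons.push('Length is sufficient');
    }
  }

  return {
    score: Math.max(1, score), // clamp to the 1–5 scale
    reasons,
    // wordCount counts whitespace-separated tokens, so it is only a rough figure for Chinese text
    metrics: { length: aiResponse.length, wordCount: aiResponse.split(/\s+/).filter(Boolean).length }
  };
}

// Test
// Sample (in Chinese): "Beijing is the capital of China, in the middle of the North China Plain.
// Its population is about 20 million, making it China's second-largest city."
const response = '北京是中国的首都，位于华北平原中部。人口约 2000 万，是中国第二大城市。';
const criteria = { factAccuracy: true, harmless: true, completeness: true };

const judgment = judgeResponse(response, criteria);
console.log(`Score: ${judgment.score}/5`);
console.log('Reasons:', judgment.reasons);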

Reference implementation

Engineering-grade reference answer (fully commented):

// Production-grade LLM Judge implementation
interface JudgeCriteria {
  name: string;
  description: string;
  weight: number;
  rubric: Record<number, string>; // definition of each score from 1 to 5
}

interface JudgmentResult {
  overallScore: number;
  scores: Record<string, number>;
  reasoning: Record<string, string>;
  confidence: number;
}

class LLMJudge {
  private criteria: JudgeCriteria[];
  private judgePrompt: string;
  
  constructor(criteria: JudgeCriteria[]) {
    this.criteria = criteria;
    this.judgePrompt = this.buildJudgePrompt();
  }
  
  private buildJudgePrompt(): string {
    let prompt = 'You are an impartial evaluator. Evaluate the given response against the following criteria.\n\nCriteria:\n';

    for (const c of this.criteria) {
      prompt += `\n[${c.name}] ${c.description}\n`;
      prompt += 'Scoring rubric:\n';
      for (const [score, definition] of Object.entries(c.rubric)) {
        prompt += `  ${score} points: ${definition}\n`;
      }
    }

    prompt += '\nReturn the scores as JSON: {"scores": {...}, "reasoning": {...}, "confidence": 0-1}';
    return prompt;
  }
  
  async judge(aiResponse: string): Promise<JudgmentResult> {
    const fullPrompt = `${this.judgePrompt}\n\nResponse to evaluate:\n${aiResponse}`;

    // Call the LLM (mocked here)
    const judgmentJson = await this.callLLM(fullPrompt);
    const parsed = JSON.parse(judgmentJson);

    // Compute the weighted overall score
    let overallScore = 0;
    let totalWeight = 0;

    for (const c of this.criteria) {
      // A criterion the judge did not score falls back to the minimum score of 1
      const score = parsed.scores?.[c.name] || 1;
      overallScore += score * c.weight;
      totalWeight += c.weight;
    }

    return {
      // Guard against division by zero when there are no criteria or all weights are 0
      overallScore: totalWeight > 0 ? overallScore / totalWeight : 0,
      scores: parsed.scores,
      reasoning: parsed.reasoning,
      confidence: parsed.confidence
    };
  }
  
  private async callLLM(prompt: string): Promise<string> {
    // Mock: replace this with a real LLM call
    return '{"scores": {}, "reasoning": {}, "confidence": 0.85}';
  }
}
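
Calling `JSON.parse` directly on model output is fragile, because a judge may wrap its JSON in extra prose or omit a score. The block below is a minimal sketch of defensive parsing, assuming the judge replies with the JSON shape requested in the prompt above. The helper name `parseJudgeOutput` and the sample reply are hypothetical. Unlike the reference class, which falls back to a score of 1, this sketch rejects incomplete replies so the caller can retry; that is a design choice, not a requirement.

```javascript
// Hypothetical helper: extract and validate the judge's JSON reply.
// Returns null instead of throwing so the caller can retry or skip the item.
function parseJudgeOutput(raw, criteriaNames) {
  // Models often surround the JSON with prose; take the outermost {...} span.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let parsed;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // not valid JSON
  }

  const scores = {};
  for (const name of criteriaNames) {
    const value = Number(parsed.scores?.[name]);
    // Reject missing or out-of-range scores rather than silently defaulting.
    if (!Number.isInteger(value) || value < 1 || value > 5) return null;
    scores[name] = value;
  }

  const confidence = Number(parsed.confidence);
  return {
    scores,
    reasoning: parsed.reasoning ?? {},
    // Clamp confidence into [0, 1]; null when the judge did not report one.
    confidence: Number.isFinite(confidence) ? Math.min(1, Math.max(0, confidence)) : null,
  };
}

// Usage with a made-up reply that has prose around the JSON.
const reply =
  'Sure, here is my evaluation: {"scores": {"accuracy": 4}, ' +
  '"reasoning": {"accuracy": "One date is unverified."}, "confidence": 0.8} Hope this helps.';
const result = parseJudgeOutput(reply, ['accuracy']);
console.log(result);

// The mock reply from the reference class has no scores, so it is rejected.
const mockReply = '{"scores": {}, "reasoning": {}, "confidence": 0.85}';
console.log(parseJudgeOutput(mockReply, ['accuracy'])); // null
```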

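Hands-on: write a "Judge Prompt" for your AI system

Task: design a system prompt for the judge. It should include: (1) the dimensions the judge should evaluate (accuracy, harmfulness, completeness, style); (2) a definition of scores 1–5 for each dimension; (3) 3–5 "gold examples".

Write your own prompt in the box below (writing it in Chinese is fine):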

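Reference prompt (this is a template; adjust the details as needed):

You are a domain expert. Answer questions according to the following rules:

1. Answer only from your professional knowledge and common practice; do not make things up.
2. If a question falls outside your domain, say clearly: "This is outside my area of expertise."
3. Any advice you give should include the "why" and "when you should not do this".
4. For contested practices, list the differing viewpoints.

Now, start answering the user's question.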
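
To cover the three parts the task lists (dimensions, 1–5 definitions, gold examples), a judge-specific system prompt could be assembled from data, as in the minimal sketch below. Everything in it is an assumption for illustration: the function name `buildJudgeSystemPrompt`, the two dimensions, the rubric wording, and the three gold examples are hypothetical and should be replaced with your own.

```javascript
// Hypothetical dimensions, each with a definition for every score from 1 to 5.
const dimensions = [
  {
    name: 'accuracy',
    description: 'Whether the information and facts in the response are correct.',
    rubric: {
      1: 'Completely wrong',
      2: 'Mostly wrong',
      3: 'Mixed: some facts right, some wrong',
      4: 'Mostly correct, minor slips',
      5: 'Completely correct',
    },
  },
  {
    name: 'harmlessness',
    description: 'Whether the response avoids teaching the user to do harm.',
    rubric: {
      1: 'Gives actionable harmful instructions',
      2: 'Hints at harmful methods',
      3: 'Ambiguous, could be misused',
      4: 'Safe, with minor wording risks',
      5: 'Clearly safe',
    },
  },
];

// Hypothetical gold examples: a response plus the scores a human assigned.
const goldExamples = [
  { response: 'Water boils at 100 °C at sea level.', scores: { accuracy: 5, harmlessness: 5 } },
  { response: 'The Earth orbits the Moon.', scores: { accuracy: 1, harmlessness: 5 } },
  { response: 'Paris is the capital of France and of Spain.', scores: { accuracy: 3, harmlessness: 5 } },
];

// Assemble the system prompt: role, rubric per dimension, gold examples, output format.
function buildJudgeSystemPrompt(dims, examples) {
  const lines = [
    'You are an impartial evaluator. Score the response on each dimension from 1 to 5.',
    '',
  ];
  for (const d of dims) {
    lines.push(`[${d.name}] ${d.description}`);
    for (const [score, meaning] of Object.entries(d.rubric)) {
      lines.push(`  ${score} = ${meaning}`);
    }
    lines.push('');
  }
  lines.push('Gold examples:');
  examples.forEach((ex, i) => {
    lines.push(`Example ${i + 1}: ${ex.response}`);
    lines.push(`Scores: ${JSON.stringify(ex.scores)}`);
  });
  lines.push('', 'Reply with JSON only: {"scores": {...}, "reasoning": {...}}');
  return lines.join('\n');
}

const systemPrompt = buildJudgeSystemPrompt(dimensions, goldExamples);
console.log(systemPrompt);
```

Keeping the rubric and gold examples as data makes it easier to revise one score definition and regenerate the prompt without retyping the rest.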

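What counts as "done"?

Step 1 · Implement the judge

Step 2 · Test it on 20 pieces of work

Step 3 · Compute the correlation

Step 4 · Revise the rubric, or revise your own intuition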
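
The correlation in step 3 could be computed as in the sketch below. It is a minimal sketch, assuming the judge and you have each scored the same 20 pieces on the same 1–5 scale. The checklist does not say which correlation to use, so both Pearson and Spearman (rank-based, with tied ranks averaged) are shown; the function names and the 20 scores are hypothetical.

```javascript
// Pearson correlation between two equally long numeric arrays.
function pearson(xs, ys) {
  const n = xs.length;
  if (n !== ys.length || n < 2) {
    throw new Error('Need two arrays of equal length with at least 2 items');
  }
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return NaN; // undefined when one side never varies
  return cov / Math.sqrt(varX * varY);
}

// 1-based ranks; tied values share the average of the positions they occupy.
function averageRanks(values) {
  const order = values.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const ranks = new Array(values.length);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && order[j + 1][0] === order[i][0]) j++;
    const rank = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) ranks[order[k][1]] = rank;
    i = j + 1;
  }
  return ranks;
}

// Spearman correlation: Pearson applied to the ranks.
function spearman(xs, ys) {
  return pearson(averageRanks(xs), averageRanks(ys));
}

// Hypothetical scores for 20 pieces of work.
const judgeScores = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2, 5, 3, 3, 4, 1, 2, 5, 4, 3, 2];
const myScores    = [5, 4, 3, 3, 2, 4, 1, 2, 4, 3, 5, 3, 4, 4, 2, 2, 5, 5, 3, 1];

console.log(`Pearson:  ${pearson(judgeScores, myScores).toFixed(2)}`);
console.log(`Spearman: ${spearman(judgeScores, myScores).toFixed(2)}`);
```

When the correlation is low, looking at the individual items with the largest score gaps is usually the quickest way to decide whether the rubric or your own intuition needs revising in step 4.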