Evaluating Large Language Models on the Livestock LIVED Benchmark
DOI:
CSTR:
Authors:
Affiliation:

College of Informatics, Huazhong Agricultural University

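Author biography:

Corresponding author:

CLC number:

Funding:

National Key Research and Development Program of China (2023ZD04061); Fundamental Research Funds for the Central Universities (2662025PY018 and 2662025DKPY005); Hubei Provincial Technology Innovation Program (2024BBA001). We thank the breeding pig project led by Researcher Liu Chousheng and Researcher Qiu Xiaotian of the National Animal Husbandry Service for its sponsorship.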


Evaluation of Large Language Models in the Livestock Domain Using the LIVED Benchmark

    Abstract (Chinese version):

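    To help large language models better serve the livestock industry, this study systematically evaluates 11 general-purpose large language models and 3 livestock-specific models on a professional examination benchmark dataset. The LIVED dataset constructed in this study comprises 11,037 Chinese-language livestock knowledge questions covering four types: multiple-choice, judgment, completion, and open-ended questions. The evaluation shows substantial performance differences among the models. Overall, the top three models are DeepSeek-R1 (accuracy 0.700), Kimi K2 (accuracy 0.681), and DeepSeek-V3.1 (accuracy 0.675). By question type, DeepSeek-R1 has the highest accuracy on multiple-choice questions (0.868), GPT-5 on judgment questions (0.842), Kimi K2 on completion questions (0.651), and Qwen3-235B-A22B on open-ended questions (0.504). Among the livestock-specific models, the Shennong Baixiao model performs best overall (0.630) but still trails DeepSeek-R1 by 7 percentage points. Error analysis shows that the errors of current large language models are concentrated in areas such as livestock economics and management and livestock production management. The results indicate that large language models still have clear deficiencies in the accuracy of professional knowledge and in reasoning ability, and that future development of vertical-domain models should prioritize these shortcomings.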

    Abstract:

    Large language models perform strongly in general domains, but specialized benchmarks for evaluating them in the livestock industry are still lacking. As a result, current assessments cannot accurately measure their proficiency in livestock-specific knowledge. To promote the application of large language models in animal husbandry, this study systematically evaluates 11 general-purpose models and 3 domain-specific models on LIVED, a professional examination benchmark dataset. LIVED is a Chinese livestock knowledge assessment dataset with 11,037 questions covering four types: multiple-choice, judgment, completion, and open-ended questions. The evaluation reveals substantial performance differences among the models. The top three models in overall performance are DeepSeek-R1 (0.689), Kimi K2 (0.672), and DeepSeek-V3.1 (0.670). By question type, DeepSeek-R1 achieved the highest accuracy on multiple-choice questions (0.847), GPT-5 on judgment questions (0.821), Kimi K2 on completion questions (0.651), and Qwen3-235B-A22B on open-ended questions (0.504). Among the specialized livestock models, Shennong Baixiao showed the best overall performance (0.627) but still lagged behind DeepSeek-R1 (0.689). Error analysis shows that most errors occur in livestock economics and management and in livestock production management. These findings suggest that large language models still have major limitations in specialized domain knowledge and reasoning ability. Future domain-specific models should prioritize overcoming bottlenecks in professional knowledge and reasoning performance.
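
The overall scores quoted in the English abstract can be tabulated to make the ranking and the gap between the best general-purpose model and the best livestock-specific model explicit. The following is a minimal sketch, not the authors' evaluation code: the dictionary holds only the four overall figures stated in the English abstract, and the helper names `rank_models` and `gap_in_points` are hypothetical.

```python
# Overall scores as quoted in the English abstract (only the four
# models whose overall figures are stated there; the remaining
# evaluated models are not listed in the abstract).
overall_accuracy = {
    "DeepSeek-R1": 0.689,
    "Kimi K2": 0.672,
    "DeepSeek-V3.1": 0.670,
    "Shennong Baixiao": 0.627,
}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def gap_in_points(scores, leader, other):
    """Score gap between two models, expressed in percentage points."""
    return round((scores[leader] - scores[other]) * 100, 1)


ranking = rank_models(overall_accuracy)
gap = gap_in_points(overall_accuracy, "DeepSeek-R1", "Shennong Baixiao")

for position, (model, score) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {score:.3f}")
print(f"Gap between DeepSeek-R1 and Shennong Baixiao: {gap} points")
```

With these figures, the gap between DeepSeek-R1 and Shennong Baixiao comes to 6.2 percentage points.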

History
  • Received: 2025-09-30
  • Last revised: 2025-12-16
  • Accepted: 2026-04-13
  • Published online:
  • Publication date: