Abstract: Large language models (LLMs) have shown strong performance in general domains, yet the livestock industry still lacks specialized benchmarks for evaluating them. As a result, current assessments cannot accurately measure LLM proficiency in livestock-specific knowledge. To promote the application of LLMs in animal husbandry, this study systematically evaluates the performance of 11 general-purpose models and 3 domain-specific models on LIVED, a professional examination benchmark. LIVED is a Chinese livestock knowledge assessment dataset with 11,037 questions spanning multiple-choice, judgment, completion, and open-ended formats. The evaluation reveals substantial performance differences among the models. The top three models by overall score are DeepSeek-R1 (0.689), Kimi K2 (0.672), and DeepSeek-V3.1 (0.670). By question type, DeepSeek-R1 achieves the highest accuracy on multiple-choice questions (0.847), GPT-5 on judgment questions (0.821), Kimi K2 on completion questions (0.651), and Qwen3-235B-A22B on open-ended questions (0.504). Among the specialized livestock models, Shennong Baixiao performs best overall (0.627) but still lags behind DeepSeek-R1 (0.689). Error analysis shows that most errors occur in the subdomains of livestock economics and management and of livestock production management. These findings suggest that LLMs still have major limitations in specialized domain knowledge and reasoning ability, and future domain-specific models should prioritize overcoming these bottlenecks.
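
To make the reported overall and per-question-type scores concrete, the following is a minimal sketch of how binary grading results could be aggregated into such accuracies. The record layout (fields `model`, `qtype`, `correct`) and the sample entries are hypothetical illustrations, not the actual LIVED data format or scoring pipeline.

```python
from collections import defaultdict

# Hypothetical graded records: one entry per (model, question) pair,
# with a binary correctness score (1 = correct, 0 = incorrect).
records = [
    {"model": "DeepSeek-R1", "qtype": "multiple_choice", "correct": 1},
    {"model": "DeepSeek-R1", "qtype": "judgment",        "correct": 0},
    {"model": "Kimi K2",     "qtype": "completion",      "correct": 1},
]

def accuracy(records, key_fn):
    """Aggregate binary scores into accuracy, grouped by key_fn."""
    tally = defaultdict(lambda: [0, 0])   # key -> [num_correct, num_total]
    for r in records:
        k = key_fn(r)
        tally[k][0] += r["correct"]
        tally[k][1] += 1
    return {k: correct / total for k, (correct, total) in tally.items()}

# Overall accuracy per model, and accuracy broken down by question type,
# mirroring the two granularities reported in the abstract.
print(accuracy(records, lambda r: r["model"]))
print(accuracy(records, lambda r: (r["model"], r["qtype"])))
```

Note that open-ended questions would require a grading step (human or model-based) to produce the binary `correct` field before this kind of aggregation applies; the abstract does not specify how LIVED scores them.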