Journal of Agricultural Big Data, 2024, Vol. 6, Issue (1): 1-8. doi: 10.19788/j.issn.2096-6369.100002

• Data Resources •

A Dataset for Constructing Agricultural Knowledge Graph

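CHEN Lei1,2, ZHOU Na1, ZHU PengXuan2, YUAN Yuan1,2,*

  1. School of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
  2. Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
  • Received: 2023-08-30 Accepted: 2023-11-27 Online: 2024-03-26 Published: 2024-01-26
  • Corresponding author: YUAN Yuan, E-mail: yuanyuan@iim.ac.cn
  • About the author: CHEN Lei, E-mail: chenlei@iim.ac.cn
  • Funding:
    National Natural Science Foundation of China (32071901); National Natural Science Foundation of China (32271981); National Basic Science Data Center project (NBSDC-DB-20)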

A Dataset for Constructing Agricultural Knowledge Graph

CHEN Lei1,2, ZHOU Na1, ZHU PengXuan2, YUAN Yuan1,2,*

  1. School of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
  2. Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
  • Received: 2023-08-30 Accepted: 2023-11-27 Online: 2024-03-26 Published: 2024-01-26

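Abstract:

Using information technology to raise the efficiency of agricultural production and to resolve the problems that arise in it is vital to China's agricultural development. The development of information technology has produced massive amounts of data, most of which are scattered across the Internet in fragmented, unstructured form. In agriculture in particular, traditional search engines cannot retrieve valuable agricultural information efficiently and accurately, and considerable time and effort is often needed to re-collect and organize it from massive, unorganized data. To address these problems, this paper uses web crawler technology to mine data from public agricultural websites and, through automated or semi-automated cleaning, denoising, and related steps, recombines the unstructured data into structured data that are ultimately stored as a knowledge graph. The resulting agricultural knowledge graph dataset contains entries for 8 481 subcategories in 11 major agricultural categories, such as grain crops, cash crops, fruits, and vegetables; each subcategory entry corresponds to one agricultural organism or drug. Specifically, it includes 461 grain crops, 2 208 cash crops, 1 294 fruits, 257 vegetables, 118 edible fungi, 1 161 flowers and trees, 142 aquatic products, 113 pesticides, 1 605 crop diseases and pests, 519 veterinary drugs, and 603 Chinese herbal medicines. The agricultural knowledge graph built from this dataset contains 90 508 triples; it is relatively large and covers a fairly wide range of categories, and it can provide basic data support for developing user-friendly intelligent applications such as agricultural knowledge question answering and recommendation systems. In addition, integrating agricultural domain knowledge graphs into generative large models helps achieve more efficient and accurate information retrieval and intelligent decision-making in vertical domains.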

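Data summary:

Items Description
Dataset name A Dataset for Constructing Agricultural Knowledge Graph
Specific subject area Computer Science and Technology (520); Other disciplines in Agronomy (210.99)
Research topic Agricultural knowledge graph; Data mining; Artificial intelligence
Time range 2020-2023
Geographical scope China
Data types and technical formats *.JSON
Dataset structure The agricultural knowledge graph data contain entries for 8481 subcategories in 11 major agricultural categories, such as grain crops, cash crops, fruits, and vegetables. Specifically, they include 461 grain crops, 2208 cash crops, 1294 fruits, 257 vegetables, 118 edible fungi, 1161 flowers and trees, 142 aquatic products, 113 pesticides, 1605 crop diseases and pests, 519 veterinary drugs, and 603 Chinese herbal medicines. The data for each major category are saved in a separate JSON file.
Volume of data 14.6 MB
Key indicators in the dataset Category of crops; Number of triples
Data accessibility CSTR:17058.11.sciencedb.agriculture.00016
DOI:10.57760/sciencedb.agriculture.00016
https://doi.org/10.57760/sciencedb.agriculture.00016
Financial support National Natural Science Foundation of China (32071901, 32271981); National Basic Science Data Center project (NBSDC-DB-20)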

Keywords: agricultural data, web mining, knowledge graph, dataset

Abstract:

Using information technology to improve the efficiency of agricultural production and to address the problems that arise in it is crucial for the development of agriculture in China. The development of information technology has generated massive amounts of data, most of which are distributed across the Internet in fragmented, unstructured forms. In agriculture in particular, traditional search engines struggle to retrieve valuable information efficiently and accurately, so users often spend considerable time and effort re-collecting and organizing information from massive unorganized data. To address these issues, this paper uses web crawler technology to mine data from publicly available agricultural websites. Through automatic or semi-automatic data cleaning, denoising, and other processes, the unstructured data are recombined into structured data, which are ultimately stored in the form of a knowledge graph. The resulting dataset for constructing an agricultural knowledge graph covers 11 major agricultural categories, such as grain crops, cash crops, fruits, and vegetables. Specifically, it includes 461 types of grain crops, 2 208 types of cash crops, 1 294 types of fruits, 257 types of vegetables, 118 types of edible fungi, 1 161 types of flowers and trees, 142 types of aquatic products, 113 types of pesticides, 1 605 types of crop diseases and pests, 519 types of veterinary drugs, and 603 types of Chinese herbal medicines, totaling 8 481 subcategories. The agricultural knowledge graph constructed from this dataset contains 90 508 triples, which can provide basic data support for developing user-friendly intelligent applications such as agricultural knowledge question answering and recommendation systems. Meanwhile, integrating the agricultural knowledge graph into generative large language models can help achieve more efficient and accurate information retrieval and intelligent decision-making in vertical domains.
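
To make the step of storing structured entries as knowledge graph triples concrete, the following is a minimal Python sketch. The record layout, the field names (`name`, `category`, `attributes`), the `belongs_to` relation, and the function `entry_to_triples` are hypothetical illustrations; the actual schema of the published JSON files is not described in this abstract, and the sample record is made up.

```python
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def entry_to_triples(entry: Dict) -> List[Triple]:
    """Flatten one entry record into (head, relation, tail) triples.

    Assumed (hypothetical) layout:
    {"name": ..., "category": ..., "attributes": {relation: value or list}}
    """
    head = entry["name"]
    triples: List[Triple] = [(head, "belongs_to", entry["category"])]
    for relation, value in entry.get("attributes", {}).items():
        # A multi-valued attribute yields one triple per value.
        values = value if isinstance(value, list) else [value]
        for tail in values:
            tail = str(tail).strip()
            if tail:  # skip empty values left over from cleaning
                triples.append((head, relation, tail))
    return triples


# Made-up example record, not taken from the dataset.
sample = json.loads("""
{
  "name": "wheat",
  "category": "grain crops",
  "attributes": {
    "family": "Poaceae",
    "common diseases": ["rust", "powdery mildew"],
    "notes": ""
  }
}
""")

triples = entry_to_triples(sample)
for triple in triples:
    print(triple)
```

In this sketch each attribute becomes a relation and each value a tail entity, so one entry with several attributes expands into several triples, which is consistent with the dataset holding many more triples than entries.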

Data summary:

Items Description
Dataset name A Dataset for Constructing Agricultural Knowledge Graph
Specific subject area Computer Science and Technology; Other disciplines in Agronomy
Research topic Agricultural knowledge graph; Data mining; Artificial intelligence
Time range 2020 - 2023
Geographical scope China
Data types and technical formats *.JSON
Dataset structure The constructed agricultural knowledge graph includes entry data for 11 major agricultural categories, such as grain crops, cash crops, fruits, and vegetables. Specifically, it includes 461 types of grain crops, 2208 types of cash crops, 1294 types of fruits, 257 types of vegetables, 118 types of edible fungi, 1161 types of flowers and trees, 142 types of aquatic products, 113 types of pesticides, 1605 types of crop diseases and pests, 519 types of veterinary drugs, and 603 types of Chinese herbal medicines, totaling 8481 subcategories. The data for each major category are saved in a separate JSON file.
Volume of data 14.6 MB
Key indicators in the dataset Category of crops; Number of triples
Data accessibility DOI:10.57760/sciencedb.agriculture.00016
CSTR:17058.11.sciencedb.agriculture.00016
https://doi.org/10.57760/sciencedb.agriculture.00016
Financial support National Natural Science Foundation of China (Grant Nos. 32071901, 32271981) and the National Basic Science Data Center database project (No. NBSDC-DB-20)
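
The per-category counts in the data summary can be checked against the stated total with a few lines of Python. The counts, the total of 8481 subcategories, and the 90 508 triples come from the summary and abstract above; the average number of triples per entry is a derived figure computed here, not one reported by the authors, and the variable names are illustrative.

```python
# Per-category entry counts as reported in the data summary.
category_counts = {
    "grain crops": 461,
    "cash crops": 2208,
    "fruits": 1294,
    "vegetables": 257,
    "edible fungi": 118,
    "flowers and trees": 1161,
    "aquatic products": 142,
    "pesticides": 113,
    "crop diseases and pests": 1605,
    "veterinary drugs": 519,
    "Chinese herbal medicines": 603,
}

REPORTED_TOTAL = 8481      # subcategories stated in the summary
REPORTED_TRIPLES = 90508   # triples stated in the abstract

total = sum(category_counts.values())
avg_triples = round(REPORTED_TRIPLES / total, 2)

print(len(category_counts), total)   # 11 8481
print(total == REPORTED_TOTAL)       # True
print(avg_triples)                   # 10.67
```

The eleven counts sum to the reported 8481, and the knowledge graph holds roughly 10.7 triples per entry on average.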

Key words: agricultural data, web mining, knowledge graph, dataset