Crop Trait Regulating-genes Knowledge Graph Datasets

doi:10.19788/j.issn.2096-6369.100051

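Abstract

Summary:

Multi-dimensional scientific data related to crop breeding are currently growing exponentially. These structured and semi-structured data are scattered across scientific databases in different fields, and no dataset yet links and fuses multi-dimensional data across species. This gap hinders the transfer and reuse of existing crop breeding knowledge, keeps crop breeding data from delivering their full value, and makes knowledge discovery for crop trait regulatory genes difficult. Guided by the principles of data reliability, practicality, and ease of use, this study selected the PubMed literature database together with Phytozome, Ensembl Plants, UniProt, RGAP, STRING, Pfam, KEGG, and GO as data sources, and used a multi-path knowledge extraction approach to extract entities and relationships from scientific data in each format. Structured data were handled by mapping-based extraction; semi-structured XML data by extraction based on Kettle data parsing; semi-structured FASTA data by extraction based on BLAST computation; and unstructured text by extraction based on large language models. After entity and relationship extraction, multi-source crop breeding knowledge was linked and fused through entity mapping and association on specific attributes. The result is the crop trait regulatory gene knowledge graph dataset, stored as structured data in .csv format. The dataset contains 13 entity datasets and 14 semantic relationship datasets. To verify its validity, the dataset was loaded into the Neo4j graph database, yielding a knowledge graph of crop trait regulatory genes with approximately 130,000 nodes and 550,000 semantic relationships that effectively supports linked retrieval of gene knowledge across species. The dataset provides a key semantic model and an important data foundation for crop breeding knowledge discovery tasks such as discovery of elite pleiotropic genes, cross-species gene function prediction, and discovery of potential pathway gene networks. Research and production organizations can use the dataset to build a knowledge base of crop trait regulatory genes, which in turn provides a key knowledge resource base for building crop breeding knowledge discovery service platforms.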

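Data summary:

Item	Description
Dataset name	Crop Trait Regulating-genes Knowledge Graph Datasets
Subject area	Other disciplines of agronomy (21099)
Research topic	Crops; trait-regulating gene knowledge graph; data mining
Data types and technical formats	.csv
Dataset structure	27 table files, comprising 13 entity datasets and 14 semantic relationship datasets linked and fused across rice, maize, wheat, and Arabidopsis thaliana.
Volume of dataset	32.18 MB
Key indicators	Transcriptome name, functional description, physical location, species, etc.
Data availability	CSTR: 17058.11.sciencedb.agriculture.00175; https://cstr.cn/17058.11.sciencedb.agriculture.00175 DOI: 10.57760/sciencedb.agriculture.00175; https://doi.org/10.57760/sciencedb.agriculture.00175
Financial support	Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII)

Keywords: crops, knowledge graph, breeding knowledge discovery, elite pleiotropic genes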

Abstract:

The seed industry is the cornerstone of national food security and of the effective supply of important agricultural products, and breeders have long aimed to cultivate new crop varieties that combine multiple excellent traits. Mining pleiotropic genes that regulate several such traits, for example drought resistance and disease resistance, therefore contributes directly to crop breeding research. With the accelerating application of information technology in crop breeding, multi-dimensional scientific data related to crop breeding have grown exponentially. These structured and semi-structured data are scattered across scientific databases in different fields, and a dataset that links and fuses multi-dimensional data across species is lacking. This hinders the transfer and reuse of existing crop breeding knowledge and prevents crop breeding data from delivering their full value, which poses challenges for knowledge discovery on crop trait regulatory genes. Based on the reliability, practicality, and ease of use of the data, the PubMed literature database, Phytozome, Ensembl Plants, UniProt, RGAP, STRING, Pfam, KEGG, and GO were selected as data sources, and entities and relationships were extracted from scientific data in different formats through multi-path knowledge extraction: mapping-based extraction for structured data, extraction based on Kettle data parsing for XML semi-structured data, extraction based on BLAST computation for FASTA semi-structured data, and extraction based on large language models for unstructured text. Building on this entity and relationship extraction, multi-source crop breeding knowledge was linked and fused through entity mapping and association on specific attributes. The resulting knowledge graph dataset of crop trait regulatory genes is stored as structured data in .csv format and consists of 13 entity datasets and 14 semantic relationship datasets. To verify the validity of the dataset, it was stored in the Neo4j graph database, forming a knowledge graph of crop trait regulatory genes with approximately 130,000 nodes and 550,000 semantic relationships that can effectively support linked retrieval of cross-species gene knowledge. The dataset provides a key semantic model and an important data basis for crop breeding knowledge discovery, including the discovery of elite pleiotropic genes, cross-species gene function prediction, and the discovery of potential pathway gene networks. Based on this dataset, research and production organizations can construct a knowledge base of crop trait regulatory genes, which provides a key knowledge resource base for building a crop breeding knowledge discovery service platform.

Data summary:

Items	Description
Dataset name	Crop Trait Regulating-genes Knowledge Graph Datasets
Specific subject area	Other disciplines of agronomy (21099)
Research topic	Crops; trait-regulating gene knowledge graph; data mining
Data types and technical formats	.csv
Dataset structure	27 table files, comprising 13 entity datasets and 14 semantic relationship datasets linked and fused across rice, maize, wheat, and Arabidopsis thaliana.
Volume of dataset	32.18 MB
Key index in dataset	Transcriptome name, functional description, physical location, species, etc.
Data accessibility	CSTR: 17058.11.sciencedb.agriculture.00175; https://cstr.cn/17058.11.sciencedb.agriculture.00175 DOI: 10.57760/sciencedb.agriculture.00175; https://doi.org/10.57760/sciencedb.agriculture.00175
Financial support	Chinese Academy of Agricultural Sciences Science and Technology Innovation Project (CAAS-ASTIP-2016-AII)
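
The layout summarized above (entity tables plus semantic relationship tables, all in .csv) lends itself to bulk import into Neo4j, the graph database the abstract names for storage. The following is a minimal sketch, not the authors' import procedure: it only builds Cypher `LOAD CSV` statements as strings. The file names, node labels, relationship type, and column names are hypothetical placeholders, since the actual file schema is not given here.

```python
def node_import_cypher(csv_file: str, label: str, key: str) -> str:
    """Build a Cypher statement that merges one node per CSV row.

    csv_file, label and key are illustrative; adapt them to the real files.
    """
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row\n"
        f"MERGE (n:{label} {{{key}: row.{key}}})\n"
        f"SET n += row"
    )


def rel_import_cypher(csv_file: str, rel_type: str,
                      start_label: str, start_key: str, start_col: str,
                      end_label: str, end_key: str, end_col: str) -> str:
    """Build a Cypher statement that links existing nodes, one edge per CSV row."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row\n"
        f"MATCH (a:{start_label} {{{start_key}: row.{start_col}}})\n"
        f"MATCH (b:{end_label} {{{end_key}: row.{end_col}}})\n"
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


if __name__ == "__main__":
    # Hypothetical file and column names, for illustration only.
    print(node_import_cypher("gene.csv", "Gene", "GeneID"))
    print(rel_import_cypher("protein_gene.csv", "CORRESPONDING_TO",
                            "Protein", "ProteinID", "ProteinID",
                            "Gene", "GeneID", "GeneID"))
```

Using `MERGE` rather than `CREATE` keeps the import idempotent, so re-running a statement on the same file would not duplicate nodes or relationships.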

Key words: crops, knowledge graph, crop breeding knowledge discovery, elite polyphenotype genes

张丹丹, 赵瑞雪, 宼远涛, 鲜国建. 作物性状调控基因知识图谱数据集[J]. 农业大数据学报, 2025, 7(2): 220-226.

ZHANG DanDan, ZHAO RuiXue, KOU YuanTao, XIAN GuoJian. Crop Trait Regulating-genes Knowledge Graph Datasets[J]. Journal of Agricultural Big Data, 2025, 7(2): 220-226.

Data type dimension	Entity name	Data attribute
Gene level	Gene	Gene identifier; species; physical location
Protein level	Protein	Protein identifier; species; functional description
Enrichment pathway level	Biological process	GO identifier; name
Trait level	Trait	Name; type
Note: GO, Gene Ontology.

GeneID	Species	Location
AT1G56060	Arabidopsis thaliana	1:20966528-20967180
AT2G40220	Arabidopsis thaliana	2:16796247-16797585
TraesCS1A02G223400	Triticum aestivum	1A:393612319-393613101
Zm00001d038001	Zea mays	6:151835642-151836388
LOC_Os05g28350	Oryza sativa Japonica Group	5:16600597-16601813
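
The `Location` values in the sample rows above follow a `chromosome:start-end` pattern. As a hedged illustration of how such a field could be split into typed parts before querying, the sketch below parses one value. The `Location` tuple and `parse_location` function are hypothetical helpers, not part of the dataset, and the pattern is inferred only from the rows shown.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    chromosome: str
    start: int
    end: int


# Pattern inferred from the sample rows: "<chromosome>:<start>-<end>".
_LOCATION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_location(text: str) -> Location:
    """Split a location string into chromosome, start and end coordinates."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start exceeds end in {text!r}")
    return Location(match["chrom"], start, end)


if __name__ == "__main__":
    print(parse_location("1A:393612319-393613101"))
```

Keeping the chromosome as a string matters because wheat chromosome names such as `1A` are not purely numeric.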

ProteinID	Species	Date_of_Creation
A0A384KK78	Arabidopsis thaliana	2018/11/7
A0A178WF56	Arabidopsis thaliana	2022/2/23
A0A3B5Y0P5	Triticum aestivum	2018/12/5
A0A3L6EAB2	Zea mays	2019/2/13
Q10R18	Oryza sativa Japonica Group	2010/10/5

Trait	Type
drought_resistance	stress_resistance
salt_resistance	stress_resistance
disease_resistance	disease_insect_resistance
insect_resistance	disease_insect_resistance
plant_height_reduce	growth_and_development
grain_weight_increase	economy

Object attribute	Described objects	Description
associates with	protein; trait	Describes the association between a protein and a trait
homologous to	protein; protein	Describes the homology between two proteins
interacts with	protein; protein	Describes the interaction between two proteins
corresponding to	protein; gene	Describes the correspondence between a gene and its protein
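
The relationship types in the table above are what make cross-species retrieval possible: a gene is tied to its protein, the protein to homologous proteins in other species, and those proteins to traits. The sketch below illustrates that traversal on a tiny in-memory set of triples. It is an assumption-laden toy, not the authors' query method: the identifiers `G1`, `G2`, `P1`, `P2` and their links are hypothetical placeholders rather than records from the dataset, and treating "homologous to" and "interacts with" as symmetric is an assumption.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); not real dataset records.
TRIPLES = [
    ("P1", "corresponding_to", "G1"),
    ("P2", "corresponding_to", "G2"),
    ("P1", "homologous_to", "P2"),
    ("P2", "associates_with", "drought_resistance"),
]

# Assumption: these relations hold in both directions.
SYMMETRIC = {"homologous_to", "interacts_with"}


def build_index(triples):
    """Index objects by (subject, relation), adding reverse edges for symmetric relations."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
        if relation in SYMMETRIC:
            index[(obj, relation)].add(subject)
    return index


def traits_via_homologs(gene_id, triples):
    """Return traits associated with proteins homologous to the gene's own proteins."""
    index = build_index(triples)
    proteins = {s for s, r, o in triples if r == "corresponding_to" and o == gene_id}
    traits = set()
    for protein in proteins:
        for homolog in index[(protein, "homologous_to")]:
            traits |= index[(homolog, "associates_with")]
    return traits


if __name__ == "__main__":
    print(traits_via_homologs("G1", TRIPLES))
```

In a graph database the same walk would be a single path query; the dictionary index here only stands in for that lookup.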