Journal of Agricultural Big Data
Crop Trait Regulating-genes Knowledge Graph Datasets
Received date: 2024-12-07
Accepted date: 2025-01-08
Online published: 2025-06-23
The seed industry is the cornerstone of national food security and of the effective supply of important agricultural products, and breeders have long sought to develop new crop varieties that combine multiple excellent traits. Mining pleiotropic genes that regulate several such traits, for example drought resistance and disease resistance, can therefore contribute substantially to crop breeding research. With the accelerating application of information technology in crop breeding, the volume of multi-dimensional scientific data related to breeding has increased exponentially. These structured and semi-structured data are scattered across scientific databases in different fields, and datasets that correlate and fuse cross-species, multi-dimensional scientific data are lacking. This gap hinders the transfer and reuse of existing crop breeding knowledge, prevents the value of crop breeding data from being fully realized, and makes it harder to discover knowledge about crop trait regulatory genes. Based on data reliability, practicability, and ease of use, the PubMed literature database, Phytozome, Ensembl Plants, UniProt, RGAP, STRING, Pfam, KEGG, and GO were selected as data sources, and entities and relationships were extracted from data in different formats by multi-path knowledge extraction: mapping-based knowledge extraction for structured data, knowledge extraction based on Kettle data parsing for semi-structured XML data, BLAST-based knowledge extraction for semi-structured FASTA data, and knowledge extraction based on large language models for unstructured text.
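As an illustration of the mapping-based path for structured data, the following is a minimal Python sketch and not the authors' pipeline. The column names, entity labels, relation names, and sample records are all hypothetical; the abstract does not specify the actual mapping rules.

```python
import csv
import io

# Hypothetical mapping rules (assumptions, not taken from the dataset):
# which column holds which entity type, and which column pairs form
# a semantic relationship.
ENTITY_COLUMNS = {"gene_id": "Gene", "species": "Species", "term_id": "Term"}
RELATION_RULES = [
    ("gene_id", "BELONGS_TO", "species"),
    ("gene_id", "ANNOTATED_WITH", "term_id"),
]


def extract(rows):
    """Map structured rows to a set of entities and a set of triples."""
    entities, triples = set(), set()
    for row in rows:
        # Every non-empty mapped column yields one (label, value) entity.
        for column, label in ENTITY_COLUMNS.items():
            if row.get(column):
                entities.add((label, row[column]))
        # A triple is emitted only when both ends of the rule are present.
        for head, relation, tail in RELATION_RULES:
            if row.get(head) and row.get(tail):
                triples.add((row[head], relation, row[tail]))
    return entities, triples


# Made-up sample records; the last row has no annotation term.
SAMPLE = """gene_id,species,term_id
G1,rice,TERM_X
G2,maize,TERM_X
G3,wheat,
"""

entities, triples = extract(csv.DictReader(io.StringIO(SAMPLE)))
```

Keeping the mapping rules in plain data structures means a new structured source can be added by editing the rules rather than the extraction loop. The XML, FASTA, and text paths named in the abstract rely on Kettle, BLAST, and large language models respectively and are not covered by this sketch.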
Building on this entity and relationship extraction, multi-source crop breeding knowledge was then associated and integrated through entity mapping and association on specific attributes. The result is a knowledge graph dataset of crop trait regulatory genes, stored as structured data in .csv format and consisting of 13 entity datasets and 14 semantic relationship datasets. To verify the validity of the dataset, it was loaded into the Neo4j graph database, yielding a knowledge graph of crop trait regulatory genes covering 130,000 nodes and 550,000 semantic relationships that can effectively support associative retrieval of cross-species gene knowledge. The dataset provides a key semantic model and an important data basis for crop breeding knowledge discovery, including the discovery of excellent pleiotropic genes, cross-species gene function prediction, and the discovery of potential pathway gene networks. Based on this dataset, research institutions and production units can construct a knowledge base of crop trait regulatory genes, which serves as a key knowledge resource for building a crop breeding knowledge discovery service platform.
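The kind of cross-species associative retrieval described here can be sketched in plain Python, assuming one entity table and one relationship table in .csv form. The table contents, column names, and identifiers below are made up for illustration, and an in-memory index stands in for the Neo4j database used by the authors.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for one entity table and one relationship table.
# Column names and identifiers are hypothetical.
GENES_CSV = """gene_id,species
G1,rice
G2,maize
G3,wheat
"""
GENE_FAMILY_CSV = """gene_id,family_id
G1,PF_A
G2,PF_A
G3,PF_B
"""


def load_species(text):
    """Entity table: map each gene to its species."""
    return {row["gene_id"]: row["species"]
            for row in csv.DictReader(io.StringIO(text))}


def load_family_members(text):
    """Relationship table: map each family node to its member genes."""
    members = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        members[row["family_id"]].add(row["gene_id"])
    return members


def cross_species_partners(gene_id, species_of, family_members):
    """Genes from other species that share a family node with gene_id."""
    partners = set()
    for genes in family_members.values():
        if gene_id in genes:
            partners.update(
                g for g in genes if species_of[g] != species_of[gene_id]
            )
    return sorted(partners)


species_of = load_species(GENES_CSV)
family_members = load_family_members(GENE_FAMILY_CSV)
print(cross_species_partners("G1", species_of, family_members))
```

In a graph database the same question is a two-hop traversal from a gene through a shared node to genes of another species; the dictionary index above only mimics that pattern on a toy scale.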
Data summary:
| Items | Description |
|---|---|
| Dataset name | Crop Trait Regulating-genes Knowledge Graph Datasets |
| Specific subject area | Other disciplines of agriculture |
| Research topic | Crops; trait-regulating gene knowledge graph; data mining |
| Data types and technical formats | .csv |
| Dataset structure | The dataset comprises 27 tables: 13 entity datasets and 14 semantic relationship datasets covering rice, maize, wheat, and Arabidopsis thaliana. |
| Volume of dataset | 32.18 MB |
| Key index in dataset | Transcriptome name, functional description, physical location, species, etc. |
| Data accessibility | CSTR: 17058.11.sciencedb.agriculture.00175 (https://cstr.cn/17058.11.sciencedb.agriculture.00175); DOI: 10.57760/sciencedb.agriculture.00175 |
| Financial support | Chinese Academy of Agricultural Sciences Science and Technology Innovation Project (CAAS-ASTIP-2016-AII) |
ZHANG DanDan, ZHAO RuiXue, KOU YuanTao, XIAN GuoJian. Crop Trait Regulating-genes Knowledge Graph Datasets[J]. Journal of Agricultural Big Data, 2025, 7(2): 220-226. DOI: 10.19788/j.issn.2096-6369.100051