Journal of Agricultural Big Data, 2026, Vol. 8, Issue (1): 128-134. doi: 10.19788/j.issn.2096-6369.100064


Arabidopsis Interacting Protein Knowledge Graph Dataset

ZHANG DanDan1,2, ZHAO RuiXue1,2,*, KOU YuanTao1,2,*, XIAN GuoJian1,3, LIU JianGuo1,2

  1 Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081, China
  2 Key Laboratory of Knowledge Mining and Knowledge Service for Agricultural Convergence Publishing, Beijing 100081, China
  3 Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
  • Received: 2025-05-29; Accepted: 2025-09-28; Online: 2026-03-26; Published: 2026-04-01
  • Contact: ZHAO RuiXue, KOU YuanTao

Abstract:

In crop breeding research, protein complexes formed through protein-protein interactions often bind to the promoters of downstream genes to regulate transcription, and thus play a crucial role in biological function. Discovering potential protein complexes is therefore essential for revealing the structure of protein-protein interaction networks, identifying downstream regulatory genes, and understanding the molecular mechanisms that regulate traits, all of which are key to developing high-quality, high-yield, multi-resistant new varieties. However, existing methods for predicting protein-protein interactions lack deep semantic associations across multi-dimensional data and typically consider only a single influencing factor, which makes it difficult to discover the structure of crop protein complexes. Guided by the principles of data reliability, practicality, and ease of use, this study selected the PlaPPISite and UniProt databases as data sources and used mapping-based knowledge extraction to integrate and associate protein-related datasets. The result is a knowledge graph dataset of Arabidopsis thaliana protein-protein interactions, stored as structured data in .csv format. The dataset comprises 11 entity datasets and 11 entity semantic relationship datasets. To verify its effectiveness, the data were loaded into a Neo4j graph database, yielding an Arabidopsis thaliana protein-protein interaction knowledge graph of approximately 68,713 nodes and 109,496 semantic relationships that can effectively support hierarchical knowledge association and discovery centered on protein entities. The dataset can provide a key semantic model and an important data foundation for the discovery of protein complexes. Relevant research and production units can use it to build an Arabidopsis thaliana protein-protein interaction knowledge base, providing a critical knowledge resource for constructing a crop breeding knowledge discovery service platform.
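As a rough illustration of how relationship files in .csv format might be prepared for a Neo4j graph database, the sketch below turns entity-relation-entity rows into Cypher MERGE statements. It is not the authors' import procedure: the column names (head, relation, tail), the generic Entity label, and the sample identifiers are all hypothetical and would need to be replaced by the headers actually used in the dataset.

```python
import csv
import io

# Hypothetical sample of a relationship file. The column names and the
# identifiers are assumptions, not values taken from the dataset.
SAMPLE_TRIPLES = """head,relation,tail
PROTEIN_A,INTERACTS_WITH,PROTEIN_B
PROTEIN_A,BELONGS_TO,FAMILY_X
"""


def escape(value: str) -> str:
    """Escape backslashes and single quotes for a Cypher string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def triples_to_cypher(csv_text: str) -> list[str]:
    """Turn each entity-relation-entity row into one Cypher statement."""
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        head = escape(row["head"])
        tail = escape(row["tail"])
        rel = row["relation"]
        # Relationship types cannot be quoted like strings, so only accept
        # plain alphanumeric names with underscores.
        if not rel.replace("_", "").isalnum():
            raise ValueError(f"unsafe relationship type: {rel!r}")
        statements.append(
            f"MERGE (h:Entity {{id: '{head}'}}) "
            f"MERGE (t:Entity {{id: '{tail}'}}) "
            f"MERGE (h)-[:{rel}]->(t)"
        )
    return statements


cypher_statements = triples_to_cypher(SAMPLE_TRIPLES)
for statement in cypher_statements:
    print(statement)
```

MERGE is used rather than CREATE so that an entity appearing in several rows yields a single node. For a dataset of this size, a bulk import route would likely be preferable to statement-by-statement generation.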

Data summary:

Item: Description
Dataset name: Arabidopsis Interacting Protein Knowledge Graph Dataset
Specific subject area: Other disciplines of agriculture
Research topic: Crops; Arabidopsis thaliana interacting protein knowledge graph; Data mining
Geographical scope: Global
Data types and technical formats: .csv
Dataset structure: This dataset is text data comprising 11 entity datasets and 11 semantic relationship datasets, all stored in .csv format. The entity datasets cover genes, proteins, traits, signaling pathways, gene symbols, protein families, domains, subcellular localization, cell components, molecular functions, and biological processes. The semantic relationship datasets cover a total of 10 semantic relationships, including related, interacting, corresponding, consistent, participating, expressing in, having protein domains, belonging, exercising functions, and participating; the data content consists of entity-relation-entity triples.
Volume of dataset: 17.32 MB
Key index in dataset: Transcriptome name, functional description, physical location, species, etc.
Data accessibility: CSTR:17058.11.sciencedb.agriculture.00253; https://cstr.cn/17058.11.sciencedb.agriculture.00253; DOI:10.57760/sciencedb.agriculture.00253; https://doi.org/10.57760/sciencedb.agriculture.00253
Financial support: Central Public-interest Scientific Institution Basal Research Fund (No. JBYW-AII-2025-20).
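To make the entity-relation-entity triple structure more concrete, the following minimal sketch reads triples from CSV text, counts them by relation, and builds a lookup of everything linked to a given entity. The column names (head, relation, tail) and the sample rows are hypothetical placeholders rather than actual dataset content; only the relation names are borrowed from the dataset structure description.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical rows in the entity-relation-entity shape. Column names and
# entity identifiers are assumptions made for illustration only.
SAMPLE = """head,relation,tail
PROTEIN_A,interacting,PROTEIN_B
PROTEIN_A,belonging,FAMILY_X
PROTEIN_B,exercising functions,FUNCTION_Y
"""


def load_triples(csv_text):
    """Parse CSV text into a list of (head, relation, tail) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["head"], row["relation"], row["tail"]) for row in reader]


def build_index(triples):
    """Map each head entity to the (relation, tail) pairs that leave it."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


triples = load_triples(SAMPLE)
relation_counts = Counter(relation for _, relation, _ in triples)
index = build_index(triples)

# Everything directly associated with one protein entity.
neighbors_of_a = index["PROTEIN_A"]
print(relation_counts)
print(neighbors_of_a)
```

The index covers outgoing links only; a protein-centered query that should also return incoming links would need a second index keyed on the tail entity.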

Key words: knowledge graph, breeding knowledge discovery, protein complex