Journal of Agricultural Big Data ›› 2026, Vol. 8 ›› Issue (1): 128-134. doi: 10.19788/j.issn.2096-6369.100064

• Data Resources •

拟南芥互作蛋白知识图谱数据集

张丹丹1,2, 赵瑞雪1,2,*, 寇远涛1,2,*, 鲜国建1,3, 刘建国1,2

  1 中国农业科学院农业信息研究所,北京 100081
  2 农业融合出版知识挖掘与知识服务重点实验室,北京 100081
  3 农业农村部农业大数据重点实验室,北京 100081
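  • Received: 2025-05-29 Accepted: 2025-09-28 Published: 2026-03-26 Online: 2026-04-01
  • Corresponding authors: ZHAO RuiXue, E-mail: zhaoruixue@caas.cn
    KOU YuanTao, E-mail: kouyuantao@caas.cn
  • About the author: ZHANG DanDan, E-mail: zhangdandan01@caas.cn
    Data author contributions

    ZHANG DanDan: data analysis, quality control, and manuscript writing.

    ZHAO RuiXue and KOU YuanTao: organization and management, manuscript supervision.

    XIAN GuoJian: data collection and curation, quality control.

    LIU JianGuo: data processing.

  • Funding:
    Central Public-interest Scientific Institution Basal Research Fund (No. JBYW-AII-2025-20)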

Arabidopsis Interacting Protein Knowledge Graph Dataset

ZHANG DanDan1,2, ZHAO RuiXue1,2,*, KOU YuanTao1,2,*, XIAN GuoJian1,3, LIU JianGuo1,2

  1 Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081, China
  2 Key Laboratory of Knowledge Mining and Knowledge Service for Agricultural Convergence Publishing, Beijing 100081, China
  3 Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
  • Received: 2025-05-29 Accepted: 2025-09-28 Published: 2026-03-26 Online: 2026-04-01

摘要:

在作物育种科学研究中,蛋白质通过相互作用所形成的蛋白质复合体往往结合下游基因的启动子来调控基因转录,在生命体中发挥重要的生物学功能。因此,蛋白复合体的潜在发现有助于揭示蛋白质-蛋白质相互作用网络结构、挖掘下游调控基因,更好地阐明性状的分子调控机制,是助力优质、高产、多抗新品种培育的关键。然而,现有蛋白互作关系预测方法缺少多维度数据深层次语义关联,仅限于单一影响因素的考量,难以发现作物蛋白复合体结构。本研究基于数据的可靠性、实用性、易用性等原则,选取PlaPPISite数据库与Uniprot数据库作为数据获取来源,采用映射知识抽取方式实现蛋白相关数据集的关联融合。最终,形成了拟南芥互作蛋白知识图谱数据集,并以.csv格式存储为结构化数据。该数据集包含11个实体数据集和11个实体语义关系数据集。为了验证该数据集的有效性,本研究采用Neo4j图数据库进行数据集存储。最终,形成了涵盖约68 713个节点和109 496条语义关系的拟南芥互作蛋白知识图谱,可有效支撑以蛋白为中心实体的层级知识关联检索与发现。拟南芥互作蛋白知识图谱数据集可以为蛋白复合体发现提供关键的语义模型和重要的数据基础。相关科研和生产单位可基于本数据集构建拟南芥互作蛋白知识库,为作物育种知识发现服务平台的构建提供关键的知识资源底座。

数据摘要:

项目:描述
数据集名称:拟南芥互作蛋白知识图谱数据集
所属学科:农学其他学科
研究主题:作物;拟南芥互作蛋白知识图谱;数据挖掘
数据地理空间覆盖:全球
数据类型与技术格式:.csv
数据库(集)组成:本数据集为文本数据,共包含11个实体数据集与11个语义关系数据集,以.csv格式存储。实体数据集涵盖基因、蛋白、性状、信号通路、基因符号、蛋白家族、结构域、亚细胞定位、细胞组分、分子功能、生物学过程共计11个实体数据集,数据内容包含实体名称以及根据实体特征提取的共性高频数据属性。语义关系数据集涵盖有关、互作、相对应、一致、参与、表达于、有……蛋白结构域、属于、行使功能、参与共计10个语义关系数据集,数据内容包含实体-关系-实体三元组。
数据量:17.32 MB
主要数据指标:转录组名称、功能描述、物理位置、物种等
数据可用性:CSTR:17058.11.sciencedb.agriculture.00253; https://cstr.cn/17058.11.sciencedb.agriculture.00253
DOI:10.57760/sciencedb.agriculture.00253; https://doi.org/10.57760/sciencedb.agriculture.00253
经费支持:中央级公益性科研院所基本科研业务费专项(JBYW-AII-2025-20)。

关键词: 知识图谱, 育种知识发现, 蛋白复合体

Abstract:

In crop breeding research, protein complexes formed through protein-protein interactions often bind to the promoters of downstream genes to regulate transcription and thereby perform important biological functions. Discovering candidate protein complexes therefore helps reveal the structure of protein-protein interaction networks, identify downstream regulated genes, and clarify the molecular mechanisms that regulate traits, which is key to breeding high-quality, high-yield, multi-resistant varieties. However, existing methods for predicting protein-protein interactions lack deep semantic associations across multi-dimensional data and consider only a single influencing factor, which makes it difficult to discover the structure of crop protein complexes. Guided by the principles of data reliability, practicality, and ease of use, this study selected the PlaPPISite and UniProt databases as data sources and used mapping-based knowledge extraction to link and integrate protein-related datasets. The result is a knowledge graph dataset of Arabidopsis thaliana interacting proteins, stored as structured data in .csv format and comprising 11 entity datasets and 11 entity semantic relationship datasets. To verify the validity of the dataset, it was loaded into the Neo4j graph database, yielding an Arabidopsis thaliana interacting protein knowledge graph of approximately 68,713 nodes and 109,496 semantic relationships that supports hierarchical, protein-centered knowledge association retrieval and discovery. The dataset provides a key semantic model and an important data foundation for protein complex discovery. Research and production organizations can use it to build an Arabidopsis thaliana interacting protein knowledge base, which in turn offers a key knowledge resource base for constructing crop breeding knowledge discovery service platforms.
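
The protein-centered, level-by-level retrieval mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Neo4j: it assumes the relation files have already been read into (head, relation, tail) tuples, treats relations as undirected for retrieval, and uses invented entity and relation names (`protein_A`, `interacts_with`, and so on) that are purely hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Index (head, relation, tail) triples so each entity lists its neighbours."""
    adjacency = defaultdict(set)
    for head, relation, tail in triples:
        # Assumption: relations are followed in both directions during retrieval.
        adjacency[head].add((relation, tail))
        adjacency[tail].add((relation, head))
    return adjacency


def retrieve_by_level(triples, center, max_depth=2):
    """Breadth-first search from `center`, grouping entities by hop distance."""
    adjacency = build_adjacency(triples)
    levels = defaultdict(set)
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for _relation, neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                levels[depth + 1].add(neighbour)
                queue.append((neighbour, depth + 1))
    return dict(levels)


# Hypothetical toy triples; the real dataset uses its own identifiers.
TOY_TRIPLES = [
    ("protein_A", "interacts_with", "protein_B"),
    ("protein_A", "has_domain", "domain_X"),
    ("protein_B", "participates_in", "pathway_Y"),
    ("gene_A", "corresponds_to", "protein_A"),
]

result = retrieve_by_level(TOY_TRIPLES, "protein_A", max_depth=2)
for level in sorted(result):
    print(level, sorted(result[level]))
```

In a graph database the same question would normally be expressed as a variable-length path query; the in-memory version only shows the shape of the result, with entities grouped by their distance from the central protein.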

Data summary:

Item: Description
Dataset name: Arabidopsis Interacting Protein Knowledge Graph Dataset
Specific subject area: Other disciplines of agriculture
Research topic: Crops; Arabidopsis thaliana interacting protein knowledge graph; Data mining
Geographical scope: Global
Data types and technical formats: .csv
Dataset structure: The dataset is text data comprising 11 entity datasets and 11 semantic relationship datasets, all stored in .csv format. The entity datasets cover genes, proteins, traits, signaling pathways, gene symbols, protein families, domains, subcellular localizations, cellular components, molecular functions, and biological processes (11 in total); each contains entity names together with common, frequently occurring attributes extracted from the characteristics of the entity. The semantic relationship datasets cover a total of 10 semantic relationship datasets, namely related to, interacts with, corresponds to, consistent with, participates in, expressed in, has protein domain, belongs to, performs function, and participates in; each contains entity-relation-entity triples.
Volume of dataset: 17.32 MB
Key index in dataset: Transcriptome name, functional description, physical location, species, etc.
Data accessibility: CSTR:17058.11.sciencedb.agriculture.00253; https://cstr.cn/17058.11.sciencedb.agriculture.00253
DOI:10.57760/sciencedb.agriculture.00253; https://doi.org/10.57760/sciencedb.agriculture.00253
Financial support: Central Public-interest Scientific Institution Basal Research Fund (No. JBYW-AII-2025-20).
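
As a rough illustration of the dataset structure described above (entity files plus entity-relation-entity triple files in .csv format), the following Python sketch reads one entity table and one relation table and checks that every entity referenced by a triple is defined. The column names (`id`, `name`, `head`, `relation`, `tail`) and the sample rows are assumptions made for this example; the actual column headers of the published files may differ.

```python
import csv
import io


def load_entities(handle, id_column="id"):
    """Read an entity CSV into a dict keyed by entity identifier."""
    return {row[id_column]: row for row in csv.DictReader(handle)}


def load_triples(handle, columns=("head", "relation", "tail")):
    """Read a relation CSV into a list of (head, relation, tail) tuples."""
    return [tuple(row[c] for c in columns) for row in csv.DictReader(handle)]


def dangling_entities(triples, known_ids):
    """Return identifiers used in triples but missing from the entity table."""
    referenced = {t[0] for t in triples} | {t[2] for t in triples}
    return referenced - set(known_ids)


# Hypothetical in-memory stand-ins for one entity file and one relation file.
PROTEIN_CSV = "id,name\nP1,protein_A\nP2,protein_B\n"
INTERACTION_CSV = (
    "head,relation,tail\n"
    "P1,interacts_with,P2\n"
    "P2,interacts_with,P3\n"
)

proteins = load_entities(io.StringIO(PROTEIN_CSV))
interactions = load_triples(io.StringIO(INTERACTION_CSV))
missing = dangling_entities(interactions, proteins)
print(len(proteins), "entities,", len(interactions), "triples, missing:", sorted(missing))
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")`. A consistency check of this kind is a common step before importing triples into a graph database.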

Key words: knowledge graph, breeding knowledge discovery, protein complex