拟南芥互作蛋白知识图谱数据集

doi:10.19788/j.issn.2096-6369.100064

Journal of Agricultural Big Data (农业大数据学报), 2026, Vol. 8, Issue 1: 128-134. doi: 10.19788/j.issn.2096-6369.100064

张丹丹¹,², 赵瑞雪¹,²,*, 寇远涛¹,²,*, 鲜国建¹,³, 刘建国¹,²

¹ 中国农业科学院农业信息研究所，北京 100081
² 农业融合出版知识挖掘与知识服务重点实验室，北京 100081
³ 农业农村部农业大数据重点实验室，北京 100081

收稿日期:2025-05-29 接受日期:2025-09-28 出版日期:2026-03-26 发布日期:2026-04-01
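Corresponding authors: ZHAO RuiXue, E-mail: zhaoruixue@caas.cn; KOU YuanTao, E-mail: kouyuantao@caas.cn.
About the author: ZHANG DanDan, E-mail: zhangdandan01@caas.cn.
Author contributions:
ZHANG DanDan: data analysis, quality control, and manuscript writing.
ZHAO RuiXue and KOU YuanTao: organization and management; manuscript supervision.
XIAN GuoJian: data collection and curation; quality control.
LIU JianGuo: data processing.
Funding:
Central Public-interest Scientific Institution Basal Research Fund (JBYW-AII-2025-20)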

Arabidopsis Interacting Protein Knowledge Graph Dataset

ZHANG DanDan¹,², ZHAO RuiXue¹,²,*, KOU YuanTao¹,²,*, XIAN GuoJian¹,³, LIU JianGuo¹,²

¹ Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081, China
² Key Laboratory of Knowledge Mining and Knowledge Service for Agricultural Convergence Publishing, Beijing 100081, China
³ Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China

Received: 2025-05-29; Accepted: 2025-09-28; Published: 2026-03-26; Online: 2026-04-01

摘要/Abstract

摘要：

在作物育种科学研究中，蛋白质通过相互作用所形成的蛋白质复合体往往结合下游基因的启动子来调控基因转录，在生命体中发挥重要的生物学功能。因此，蛋白复合体的潜在发现有助于揭示蛋白质-蛋白质相互作用网络结构、挖掘下游调控基因，更好地阐明性状的分子调控机制，是助力优质、高产、多抗新品种培育的关键。然而，现有蛋白互作关系预测方法缺少多维度数据深层次语义关联，仅限于单一影响因素的考量，难以发现作物蛋白复合体结构。本研究基于数据的可靠性、实用性、易用性等原则，选取PlaPPISite数据库与Uniprot数据库作为数据获取来源，采用映射知识抽取方式实现蛋白相关数据集的关联融合。最终，形成了拟南芥互作蛋白知识图谱数据集，并以.csv格式存储为结构化数据。该数据集包含11个实体数据集和11个实体语义关系数据集。为了验证该数据集的有效性，本研究采用Neo4j图数据库进行数据集存储。最终，形成了涵盖约68 713个节点和109 496条语义关系的拟南芥互作蛋白知识图谱，可有效支撑以蛋白为中心实体的层级知识关联检索与发现。拟南芥互作蛋白知识图谱数据集可以为蛋白复合体发现提供关键的语义模型和重要的数据基础。相关科研和生产单位可基于本数据集构建拟南芥互作蛋白知识库，为作物育种知识发现服务平台的构建提供关键的知识资源底座。

数据摘要：

项目	描述
数据集名称	拟南芥互作蛋白知识图谱数据集
所属学科	农学其他学科
研究主题	作物；拟南芥互作蛋白知识图谱；数据挖掘
数据地理空间覆盖	全球
数据类型与技术格式	.csv
数据库（集）组成	本数据集为文本数据，共包含11个实体数据集与11个语义关系数据集，以.csv格式存储。实体数据集涵盖基因、蛋白、性状、信号通路、基因符号、蛋白家族、结构域、亚细胞定位、细胞组分、分子功能、生物学过程共计11个实体数据集，数据内容包含实体名称以及根据实体特征提取的共性高频数据属性。语义关系数据集涵盖有关、互作、相对应、一致、参与、表达于、有……蛋白结构域、属于、行使功能、参与共计10个语义关系数据集，数据内容包含实体-关系-实体三元组。
数据量	17.32 MB
主要数据指标	转录组名称、功能描述、物理位置、物种等
数据可用性	CSTR:17058.11.sciencedb.agriculture.00253; https://cstr.cn/17058.11.sciencedb.agriculture.00253 DOI:10.57760/sciencedb.agriculture.00253; https://doi.org/10.57760/sciencedb.agriculture.00253
经费支持	中央级公益性科研院所基本科研业务费专项（JBYW-AII-2025-20）。

关键词: 知识图谱, 育种知识发现, 蛋白复合体

Abstract:

In crop breeding research, protein complexes formed through protein-protein interactions often bind to the promoters of downstream genes to regulate transcription and thereby perform important biological functions. Discovering candidate protein complexes therefore helps to reveal the structure of protein-protein interaction networks, identify downstream regulated genes, and clarify the molecular mechanisms that control traits, all of which support the breeding of high-quality, high-yield, multi-resistant varieties. However, existing methods for predicting protein-protein interactions lack deep semantic associations across multi-dimensional data and typically consider only a single influencing factor, which makes it difficult to discover the structure of crop protein complexes. Guided by the principles of data reliability, practicality, and ease of use, this study selected the PlaPPISite and UniProt databases as data sources and used mapping-based knowledge extraction to link and integrate protein-related datasets. The result is a knowledge graph dataset of Arabidopsis thaliana interacting proteins, stored as structured data in .csv format and comprising 11 entity datasets and 11 semantic relationship datasets. To verify the validity of the dataset, it was loaded into the Neo4j graph database, yielding an Arabidopsis thaliana interacting protein knowledge graph with approximately 68,713 nodes and 109,496 semantic relationships that supports hierarchical knowledge association, retrieval, and discovery centered on protein entities. The dataset provides a semantic model and a data foundation for protein complex discovery. Research and production organizations can use it to build an Arabidopsis thaliana interacting protein knowledge base, which in turn can serve as a knowledge resource base for crop breeding knowledge discovery service platforms.
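
The protein-centered hierarchical retrieval mentioned in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`head`, `relation`, `tail`), the relation labels, and every identifier in the sample rows are hypothetical placeholders, and the published .csv files may be organized differently.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical triples: IDs, relation labels and column names are placeholders,
# not values taken from the published dataset.
SAMPLE_CSV = """head,relation,tail
P_A,interacts_with,P_B
P_A,belongs_to,Family_X
P_B,participates_in,Pathway_Y
P_B,interacts_with,P_C
"""

def load_triples(handle):
    """Read (head, relation, tail) rows from a CSV file object."""
    return [(r["head"], r["relation"], r["tail"]) for r in csv.DictReader(handle)]

def build_index(triples):
    """Index triples in both directions so retrieval ignores edge direction."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
        index[tail].append((relation, head))
    return index

def neighbors_by_level(index, start, max_depth):
    """Breadth-first walk: map each depth to the entities first reached there."""
    seen = {start}
    levels = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for _relation, other in index[node]:
            if other not in seen:
                seen.add(other)
                levels.setdefault(depth + 1, []).append(other)
                queue.append((other, depth + 1))
    return {d: sorted(v) for d, v in levels.items()}

triples = load_triples(io.StringIO(SAMPLE_CSV))
index = build_index(triples)
levels = neighbors_by_level(index, "P_A", max_depth=2)
print(levels)
```

Once the data is loaded into Neo4j, a comparable lookup would typically be written as a variable-length Cypher pattern such as `MATCH (p {id: 'P_A'})-[*1..2]-(n) RETURN n`, where the property name `id` is again an assumption.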

Data summary:

Item	Description
Dataset name	Arabidopsis Interacting Protein Knowledge Graph Dataset
Specific subject area	Other disciplines of agriculture
Research topic	Crops; Arabidopsis thaliana interacting protein knowledge graph; Data mining
Geographical scope	Global
Data types and technical formats	.csv
Dataset structure	The dataset consists of text data comprising 11 entity datasets and 11 semantic relationship datasets, all stored in .csv format. The entity datasets cover 11 entity types: genes, proteins, traits, signaling pathways, gene symbols, protein families, domains, subcellular localizations, cellular components, molecular functions, and biological processes. Each contains entity names together with common high-frequency data attributes extracted according to entity characteristics. The semantic relationship datasets cover a total of 10 semantic relationship datasets: related to, interacts with, corresponds to, consistent with, participates in, expressed in, has protein domain, belongs to, performs function, and participates in. Each contains entity-relation-entity triples.
Volume of dataset	17.32 MB
Key index in dataset	Transcriptome name, functional description, physical location, species, etc.
Data accessibility	CSTR:17058.11.sciencedb.agriculture.00253; https://cstr.cn/17058.11.sciencedb.agriculture.00253 DOI:10.57760/sciencedb.agriculture.00253; https://doi.org/10.57760/sciencedb.agriculture.00253
Financial support	Central Public-interest Scientific Institution Basal Research Fund (No. JBYW-AII-2025-20).

Key words: knowledge graph, breeding knowledge discovery, protein complex

张丹丹, 赵瑞雪, 寇远涛, 鲜国建, 刘建国. 拟南芥互作蛋白知识图谱数据集[J]. 农业大数据学报, 2026, 8(1): 128-134.

ZHANG DanDan, ZHAO RuiXue, KOU YuanTao, XIAN GuoJian, LIU JianGuo. Arabidopsis Interacting Protein Knowledge Graph Dataset[J]. Journal of Agricultural Big Data, 2026, 8(1): 128-134.

Data type dimension	Entity name	Data attribute
Gene level	Gene Symbol	Function description
Protein level	Protein Family	Protein family identifier; name
Enrichment pathway level	Cellular Component	GO identifier; name
Trait level	Trait	Name; type
Note: GO: gene ontology

Trait	Type
Drought resistance	Stress resistance
Insect resistance	Disease and insect resistance
Plant height	Growth and development
Grain weight	Economic

Object attribute	Objects described	Description
interacts with	protein; protein	Describes the interaction relationship between two proteins
participates in	protein; signal pathway	Describes the participation of a protein in a signal pathway
belongs to	protein; protein family	Describes the membership of a protein in a protein family
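
The table above effectively assigns each object attribute a pair of entity types. A minimal sketch of how such a schema could be used to sanity-check triples before loading them is shown below. The relation labels are paraphrased from the table, while the entity identifiers and the `ENTITY_TYPES` lookup are hypothetical placeholders rather than part of the dataset.

```python
# Head and tail entity types per relation, paraphrased from the table above.
SCHEMA = {
    "interacts with": ("protein", "protein"),
    "participates in": ("protein", "signal pathway"),
    "belongs to": ("protein", "protein family"),
}

def check_triple(head, relation, tail, entity_types):
    """Return None if the triple fits SCHEMA, else a short error message."""
    if relation not in SCHEMA:
        return f"unknown relation: {relation!r}"
    expected = SCHEMA[relation]
    actual = (entity_types.get(head), entity_types.get(tail))
    if actual != expected:
        return f"{relation!r} expects {expected}, got {actual}"
    return None

# Hypothetical entity-type lookup; the identifiers are placeholders.
ENTITY_TYPES = {
    "P_A": "protein",
    "P_B": "protein",
    "Pathway_Y": "signal pathway",
    "Family_X": "protein family",
}

# A well-formed triple passes; a reversed one is reported.
print(check_triple("P_A", "interacts with", "P_B", ENTITY_TYPES))
print(check_triple("Family_X", "belongs to", "P_A", ENTITY_TYPES))
```

Checking types at load time is one simple way to catch reversed or mislabeled rows early; it is a design suggestion, not a description of the authors' quality-control procedure.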

Protein ID	Signal pathway
A0A0A7EPL0	Protein modification; protein sumoylation
A7XDQ9	Protein modification; protein glycosylation
F4IAG2	Glycan biosynthesis; starch biosynthesis
F4JKK0	Protein modification; protein ubiquitination
P50318	Carbohydrate biosynthesis; Calvin cycle
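
The rows above can be turned into protein-pathway triples with a short script. The sketch below assumes that the semicolon separates a broader category from a more specific pathway; that reading, the relation label `participates in`, and the function names are assumptions made for illustration.

```python
# Rows copied from the table above: (protein ID, signal pathway field).
ROWS = [
    ("A0A0A7EPL0", "Protein modification; protein sumoylation"),
    ("A7XDQ9", "Protein modification; protein glycosylation"),
    ("F4IAG2", "Glycan biosynthesis; starch biosynthesis"),
    ("F4JKK0", "Protein modification; protein ubiquitination"),
    ("P50318", "Carbohydrate biosynthesis; Calvin cycle"),
]

def split_pathway(field):
    """Split a pathway field on ';' into trimmed, non-empty levels."""
    return [part.strip() for part in field.split(";") if part.strip()]

def to_triples(rows, relation="participates in"):
    """Link each protein to the most specific (last) pathway level."""
    triples = []
    for protein_id, field in rows:
        levels = split_pathway(field)
        if levels:
            triples.append((protein_id, relation, levels[-1]))
    return triples

def group_by_category(rows):
    """Group protein IDs under the broadest (first) pathway level."""
    grouped = {}
    for protein_id, field in rows:
        levels = split_pathway(field)
        if levels:
            grouped.setdefault(levels[0], []).append(protein_id)
    return grouped

for triple in to_triples(ROWS):
    print(triple)
print(group_by_category(ROWS))
```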

Protein Family
PIAL protein ligase family
Peroxiredoxin-like PRXL2 family, PRXL2C subfamily
RING-type zinc finger family, LOG2 subfamily
Zinc-containing alcohol dehydrogenase family

Cellular Component
cytosol [GO:0005829]; plasma membrane [GO:0005886]
membrane [GO:0016020]; perinuclear region of cytoplasm [GO:0048471]
membrane [GO:0016020]; plasma membrane [GO:0005886]
membrane [GO:0016020]; nucleus [GO:0005634]
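
Each Cellular Component cell packs several GO terms into one string. A minimal sketch of splitting such cells into (GO identifier, name) pairs and deduplicating them into entity records is given below. It assumes that term names never contain a semicolon and that every identifier has the form `GO:` followed by seven digits; the function names are illustrative only.

```python
import re

# One GO term: a name followed by a bracketed identifier such as [GO:0005829].
GO_TERM = re.compile(r"\s*(?P<name>.+?)\s*\[(?P<go_id>GO:\d{7})\]\s*$")

# Cells copied from the Cellular Component table above.
CELLS = [
    "cytosol [GO:0005829]; plasma membrane [GO:0005886]",
    "membrane [GO:0016020]; perinuclear region of cytoplasm [GO:0048471]",
    "membrane [GO:0016020]; plasma membrane [GO:0005886]",
    "membrane [GO:0016020]; nucleus [GO:0005634]",
]

def parse_go_field(field):
    """Return a list of (go_id, name) pairs found in one table cell."""
    terms = []
    for chunk in field.split(";"):
        match = GO_TERM.match(chunk)
        if match:
            terms.append((match.group("go_id"), match.group("name")))
    return terms

def collect_entities(cells):
    """Deduplicate terms across cells into a {go_id: name} mapping."""
    entities = {}
    for cell in cells:
        for go_id, name in parse_go_field(cell):
            entities.setdefault(go_id, name)
    return entities

entities = collect_entities(CELLS)
for go_id, name in sorted(entities.items()):
    print(go_id, name)
```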