Multi-feature Fusion Based Keyword Extraction Method for Agricultural Databases

doi:10.19788/j.issn.2096-6369.000153

Abstract:

Automated keyword extraction from agricultural database texts is a key step toward their intelligent use and service delivery. Traditional keyword extraction methods struggle to capture deep semantic associations within a text, while methods based on semantic embeddings are prone to semantic representation bias and dilution of key information. To address both problems, this study proposes a more precise keyword extraction method tailored to agricultural database texts. First, co-occurrence word analysis is introduced to strengthen the semantic associations and edge-weight accuracy of the TextRank word graph, forming a feature statistics module that extracts candidate keywords. Second, the Bert-base-Chinese pre-trained model encodes the text as vectors, and candidate keywords are extracted by vector similarity. Finally, a multi-source fusion decision step weights and combines the outputs of the two modules, together with word position, to produce the final keyword list. The resulting method, BWE-COW-TR, fuses BERT semantic embeddings, the TextRank word graph, and co-occurrence word analysis. In keyword extraction experiments on an agricultural science and technology literature dataset, the method achieved a precision of 49.83%, a recall of 58.29%, and an F1 score of 0.5373, all significantly higher than those of the baseline models. In relative terms, its F1 score is 70.90%, 51.74%, and 45.77% higher than those of KeyBERT, TF-IDF, and TextRank, respectively. These results show that the proposed method outperforms the commonly used KeyBERT, TF-IDF, and TextRank methods for keyword extraction from agricultural database texts.

Keywords: agricultural information, TextRank, BERT, co-occurrence words, extraction, semantic embedding model
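The three-stage design described in the abstract (a co-occurrence-weighted word graph, a semantic similarity score, and a weighted fusion with word position) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the edge-weight formula, the position feature, or the fusion weights, so the window size, the weights `alpha`, `beta`, and `gamma`, and every function name here are assumptions. The semantic and position scores are hypothetical placeholder numbers standing in for BERT cosine similarities and a position feature.

```python
from collections import defaultdict


def cooccurrence_graph(sentences, window=3):
    """Undirected word graph. Edge weight = number of times two words fall
    inside the same sliding window of `window` tokens (assumed weighting)."""
    weights = defaultdict(float)
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                if v != w:
                    weights[(w, v)] += 1.0
                    weights[(v, w)] += 1.0
    return dict(weights)


def textrank(weights, damping=0.85, iterations=50):
    """Weighted TextRank by fixed-count power iteration."""
    out_sum = defaultdict(float)
    incoming = defaultdict(list)
    for (src, dst), wt in weights.items():
        out_sum[src] += wt
        incoming[dst].append((src, wt))
    scores = {node: 1.0 for node in out_sum}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) + damping * sum(
                wt / out_sum[src] * scores[src] for src, wt in incoming[node])
            for node in scores
        }
    return scores


def normalize(scores):
    """Scale scores so the largest value is 1 (empty input stays empty)."""
    if not scores:
        return {}
    top = max(scores.values())
    return {k: v / top for k, v in scores.items()} if top > 0 else dict(scores)


def fuse(graph_scores, semantic_scores, position_scores,
         alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted sum of the three normalized scores (weights are hypothetical)."""
    g, s, p = (normalize(x) for x in (graph_scores, semantic_scores, position_scores))
    fused = {
        w: alpha * g.get(w, 0.0) + beta * s.get(w, 0.0) + gamma * p.get(w, 0.0)
        for w in set(g) | set(s)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Toy, pre-tokenized input; a real pipeline would tokenize Chinese text first.
sentences = [["wheat", "rust", "resistance"], ["wheat", "yield"], ["wheat", "rust"]]
graph_scores = textrank(cooccurrence_graph(sentences))
# Placeholder values, not model output: stand-ins for BERT similarity and position.
semantic_scores = {"wheat": 0.62, "rust": 0.71, "resistance": 0.55, "yield": 0.40}
position_scores = {"wheat": 1.0, "rust": 0.8}
print(fuse(graph_scores, semantic_scores, position_scores)[:3])
```

Normalizing each score to a common scale before the weighted sum keeps one module from dominating only because its raw scores are larger; whether the published method normalizes in this way is not stated in the abstract.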

DU RuoPeng, ZHANG Jie, KOU YuanTao. Multi-feature Fusion Based Keyword Extraction Method for Agricultural Databases[J]. Journal of Agricultural Big Data, 2026, 8(2): 155-162.

Figures/Tables (4): Figure 1, Figure 2, Table 1, Table 2


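Hardware environment	Software environment
CPU: Intel i5 13490	Operating system: 64-bit Windows 10 Pro
Memory: 32 GB	Development languages: JDK 1.8.0 and Python 3.8
Hard disk: 500 GB	Development tools: NetBeans 8, PyCharm 2022
GPU: RTX A5000 24 GB	Open-source framework: PyTorch 2.1.2
	Tokenizer: Jieba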

Model (all metrics in %)	P@3	P@5	P@10	R@3	R@5	R@10	F1@3	F1@5	F1@10
KeyBERT	32.36	29.16	23.61	22.71	34.11	55.19	26.69	31.44	33.08
TF-IDF	38.18	32.84	25.01	26.79	38.41	58.44	31.48	35.41	35.02
TextRank	41.93	34.19	24.19	29.42	39.99	56.55	34.58	36.86	33.89
Co-TextRank	48.41	43.79	29.64	33.96	51.22	65.83	39.91	47.22	40.87
BWE-COW-TR	59.39	49.83	32.29	41.67	58.29	75.48	48.98	53.73	45.23
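
The headline figures in the abstract (precision 49.83%, recall 58.29%, F1 0.5373) appear to correspond to the top-5 columns of the results table above. As a quick consistency check, the sketch below recomputes F1 as the harmonic mean of precision and recall and derives the relative F1 gains from the reported F1@5 values. The function names are illustrative; the numbers are copied from the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# (P@5, R@5, reported F1@5) in percent, copied from the results table.
rows = {
    "KeyBERT": (29.16, 34.11, 31.44),
    "TF-IDF": (32.84, 38.41, 35.41),
    "TextRank": (34.19, 39.99, 36.86),
    "BWE-COW-TR": (49.83, 58.29, 53.73),
}

proposed_f1 = rows["BWE-COW-TR"][2]
for name, (p, r, reported) in rows.items():
    line = f"{name}: recomputed F1@5 = {f1(p, r):.2f}, reported = {reported:.2f}"
    if name != "BWE-COW-TR":
        line += f", relative gain of BWE-COW-TR = {relative_gain(proposed_f1, reported):.2f}%"
    print(line)
```

The gains are computed from the rounded F1 values as reported in the table; starting from unrounded F1 values can shift the last digit.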