Journal of Agricultural Big Data ›› 2026, Vol. 8 ›› Issue (2): 155-162. doi: 10.19788/j.issn.2096-6369.000153

• Data Processing and Analysis •

Multi-feature Fusion Based Keyword Extraction Method for Agricultural Databases

DU RuoPeng, ZHANG Jie, KOU YuanTao*

  1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
  • Received: 2026-01-22 Accepted: 2026-03-09 Published: 2026-06-26 Online: 2026-06-26
  • Corresponding author: KOU YuanTao, E-mail: kouyuantao@caas.cn
  • About the author: DU RuoPeng, E-mail: duruopeng@caas.cn
  • Funding:
    Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project of the Ministry of Science and Technology (2021ZD0113705); Big Data and Knowledge Service Innovation Team Project of the Science and Technology Innovation Program, Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2021-AI-06)

Abstract:

Automated keyword extraction from agricultural database texts is a key step toward their intelligent use and service. Traditional keyword extraction methods struggle to capture deep semantic associations within a text, while semantic-embedding-based methods are prone to semantic representation bias and dilution of key information. To address both problems, this study proposes a more precise keyword extraction method tailored to agricultural database texts. First, co-occurring word analysis is introduced to strengthen the semantic associations and edge-weight accuracy of the TextRank word graph, forming a feature statistics module that extracts candidate keywords. In parallel, the Bert-base-Chinese pre-trained model encodes the text as vectors, and candidate keywords are extracted by vector similarity. Finally, a multi-source fusion decision step weights and fuses the outputs of the two modules together with word position to produce the final keyword list. The resulting method, BWE-COW-TR, combines BERT semantic embedding, the TextRank word graph, and co-occurring word analysis. In keyword extraction experiments on an agricultural science and technology literature dataset, the method achieved a precision of 49.83%, a recall of 58.29%, and an F1 score of 0.5373, all significantly higher than those of the baseline models. Its F1 score is 70.90%, 51.74%, and 45.77% higher in relative terms than those of KeyBERT, TF-IDF, and TextRank, respectively. These results indicate that the proposed method outperforms the commonly used KeyBERT, TF-IDF, and TextRank methods for keyword extraction from agricultural database texts.
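The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard F1 definition (harmonic mean of precision and recall) and reads the quoted F1 improvements as relative gains; the helper names `f1_score` and `implied_baseline` are illustrative and do not come from the paper. The baseline value it derives is back-calculated under that reading, not a figure reported on this page.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(f1: float, relative_gain_pct: float) -> float:
    """Baseline F1 implied by a relative gain (assumption: gain = (new - old) / old)."""
    return f1 / (1 + relative_gain_pct / 100)


# Precision and recall as stated in the abstract, converted from percentages.
f1 = f1_score(0.4983, 0.5829)
print(f"F1 = {f1:.4f}")

# Back-calculated illustration for the 70.90% gain quoted against KeyBERT.
print(f"Implied KeyBERT F1 (assumed relative gain): {implied_baseline(f1, 70.90):.4f}")
```

Under these assumptions the recomputed F1 agrees with the stated 0.5373 to four decimal places, which suggests that precision and recall are given as percentages while F1 is given as a fraction.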

Key words: agricultural information, TextRank, BERT, co-occurrence, extraction, semantic embedding model
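
The three ingredients named in the abstract (a co-occurrence-weighted TextRank graph, embedding similarity, and word position) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the window size, damping factor, min-max normalisation, the fusion weights `(0.4, 0.4, 0.2)`, and the function names are hypothetical, and the hand-written two-dimensional vectors merely stand in for BERT embeddings. This is not the authors' BWE-COW-TR implementation.

```python
import math
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Edge weight = number of times two words co-occur within the sliding window."""
    weights = defaultdict(float)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                weights[(w, v)] += 1.0
                weights[(v, w)] += 1.0
    return weights


def weighted_textrank(tokens, window=3, damping=0.85, iterations=50):
    """Weighted TextRank: each node passes score along edges in proportion to edge weight."""
    weights = cooccurrence_graph(tokens, window)
    out_sum = defaultdict(float)
    incoming = defaultdict(list)
    for (u, v), w in weights.items():
        out_sum[u] += w
        incoming[v].append((u, w))
    score = {n: 1.0 for n in set(tokens)}
    for _ in range(iterations):
        score = {
            n: (1 - damping)
            + damping * sum(w / out_sum[u] * score[u] for u, w in incoming[n])
            for n in score
        }
    return score


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so the three signals are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def fuse(tokens, word_vecs, doc_vec, weights=(0.4, 0.4, 0.2), top_k=3):
    """Weighted sum of graph, semantic, and position scores (weights are assumptions)."""
    graph = min_max(weighted_textrank(tokens))
    semantic = min_max({w: cosine(word_vecs[w], doc_vec) for w in graph})
    position = {w: 1 - tokens.index(w) / len(tokens) for w in graph}  # earlier = higher
    a, b, c = weights
    fused = {w: a * graph[w] + b * semantic[w] + c * position[w] for w in graph}
    return sorted(fused, key=lambda w: (-fused[w], w))[:top_k]


# Toy input: pre-segmented tokens and made-up 2-D vectors in place of BERT embeddings.
tokens = ["wheat", "rust", "resistance", "gene", "wheat", "yield", "gene", "wheat"]
word_vecs = {
    "wheat": [1.0, 0.1], "rust": [0.2, 1.0], "resistance": [0.3, 0.9],
    "gene": [0.9, 0.3], "yield": [0.5, 0.5],
}
doc_vec = [1.0, 0.2]
top = fuse(tokens, word_vecs, doc_vec)
print(top)
```

In a realistic setting the tokens would come from a Chinese word segmenter and the vectors from a pre-trained encoder; the sketch keeps both as plain inputs so that the fusion step can be read in isolation.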