Journal of Agricultural Big Data, 2019, Vol. 1, Issue (2): 114-120. doi: 10.19788/j.issn.2096-6369.190210

• Applied Research •

Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding

Lei Wu1, Xiaohe Liang1, Jisiguleng Wu1, Rui Wang2,*

  1. Agricultural Information Institute of CAAS, Beijing 100081
  2. Institute of Computer Technology, CARS, Beijing 100081
  • Received: 2019-04-05 Online: 2019-06-26 Published: 2019-08-21
  • Contact: Rui Wang E-mail: 13811805186@163.com
  • About the author: Lei Wu, female, Ph.D.; research interest: information analysis; E-mail: wulei@caas.cn
  • Funding:
    National Social Science Fund of China, Youth Project "Research on Multi-Source Knowledge Transfer in the Agricultural Domain Based on Graph Models" (18CTQ028); National Natural Science Foundation of China, General Program "Research on Multi-Granularity Knowledge Fusion Methods in the Agricultural Big Data Environment" (31671588); Basic Scientific Research Fund for Central Research Institutes, project "Analysis of Development Trends in Key Agricultural Disciplines" (Y2017ZK05)

Abstract:

[Objective] Subject terms in large-scale scientific and technological literature are neither standardized nor unified, so it is difficult to construct a search query that satisfies both recall and precision requirements. To address this problem, this paper proposes a query reformulation method based on word embedding and validates it experimentally in the field of "functional gene mining and assisted breeding technology based on multi-omics big data". [Methods] First, the dataset was cleaned and the text was mapped to word vectors, with each article represented by the average of all of its word vectors. Second, a random forest classifier was trained on the average word vectors of the articles in the training set. Finally, the classifier was applied to the texts in the test set, and the articles classified as positive formed the retrieval dataset. [Results] A search query was constructed for the field of "functional gene mining and assisted breeding technology based on multi-omics big data". We compared the topic word clouds and topic co-occurrence clusters of three datasets: the small dataset extracted by the search query, the dataset obtained with the expanded search query, and the retrieval dataset extracted by the proposed method. The retrieval dataset covered the subject terms and topic clusters found in the other two datasets and also revealed additional subject terms and topic clusters belonging to the field. [Conclusion] The results indicate that the method achieves good recall and precision. It produced a dataset for the field of "functional gene mining and assisted breeding technology based on multi-omics big data" that is adequate for analysis. The method is also extensible: in future work it can be applied to constructing datasets for other target fields.

Key words: big data, query reformulation, word embedding, random forest, data mining, natural language processing, machine learning, deep learning

CLC number:

  • G354.2
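
The Methods part of the abstract represents each article by the average of its word vectors before passing it to a random forest classifier. The sketch below illustrates only that averaging step, under stated assumptions: the 3-dimensional embedding table, the sample sentence, and the helper names `tokenize` and `average_vector` are all hypothetical and are not taken from the paper, and the cleaning rule is a crude stand-in for whatever preprocessing the authors used.

```python
from typing import Dict, List, Optional, Sequence

# Toy 3-dimensional embedding table (made-up values for illustration only).
# In practice the vectors would come from a trained word-embedding model.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "gene": [1.0, 0.0, 2.0],
    "breeding": [3.0, 2.0, 0.0],
    "rice": [2.0, 4.0, 4.0],
}


def tokenize(text: str) -> List[str]:
    """Rough cleaning step: lowercase, turn non-letters into spaces, split."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


def average_vector(
    tokens: Sequence[str], embeddings: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Return the element-wise mean of the vectors of all known tokens.

    Tokens missing from the embedding table are skipped. If no token is
    known, None is returned so the caller can decide how to handle it.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# "assisted" is not in the toy table, so only three vectors are averaged.
doc_vector = average_vector(
    tokenize("Gene-assisted rice breeding, 2019."), TOY_EMBEDDINGS
)
print(doc_vector)  # [2.0, 2.0, 2.0]
```

Each document vector produced this way has a fixed length regardless of article length, which is what allows a standard classifier such as a random forest to be trained on it. Skipping out-of-vocabulary tokens is one possible design choice; the abstract does not say how the authors handled them.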