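Applied Research

Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding

  • 1. Agricultural Information Institute of CAAS, Beijing 100081
    2. Institute of Computer Technology, CARS, Beijing 100081
Wu Lei, female, PhD; research interest: information analysis; E-mail: wulei@caas.cn

Received date: 2019-04-05

  Online published: 2019-08-21

Funding

National Social Science Fund of China, Youth Project "Research on Multi-Source Knowledge Transfer in the Agricultural Domain Based on Graph Models" (18CTQ028); National Natural Science Foundation of China, General Program "Research on Multi-Granularity Knowledge Fusion Methods in the Agricultural Big Data Environment" (31671588); Fundamental Research Funds for Central Research Institutes, project "Analysis of Development Trends in Key Agricultural Disciplines" (Y2017ZK05)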

Cite this article

Wu Lei, Liang Xiaohe, Wujisiguleng, Wang Rui. Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding[J]. Journal of Agricultural Big Data, 2019, 1(2): 114-120. DOI: 10.19788/j.issn.2096-6369.190210

Abstract

[Objective] Subject terms in large collections of scientific and technical literature are often neither standardized nor unified, so it is difficult to construct a search query that satisfies both recall and precision requirements. To address the mismatch between the terms in a search query and the terms used in the literature, this paper proposes a query reformulation method based on word embeddings and validates it experimentally in the field of "functional gene mining and assisted breeding technology based on multi-omics big data". [Methods] First, the dataset was cleaned and the text was mapped to word vectors, with each article represented by the average of all of its word vectors. Second, a random forest classifier was trained on the average word vectors of the articles in the training set. Finally, the articles in the test set were classified, and the articles labeled positive formed what we call the retrieval dataset. [Results] A search query was constructed for the field of "functional gene mining and assisted breeding technology based on multi-omics big data". We compared the subject-term clouds and the subject co-occurrence clusters of three datasets: the small dataset extracted by the search query, the dataset returned by the expanded search query, and the retrieval dataset extracted by the proposed method. The retrieval dataset covered the subject terms and subject clusters found in the other two datasets and also revealed additional subject terms and clusters that belong to the field. [Conclusion] The results indicate that the method achieves good recall and precision. It produced a dataset for the field of "functional gene mining and assisted breeding technology based on multi-omics big data" that is adequate for analysis, and the method can be extended to building datasets for other target fields in future research.
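The three steps in the Methods part of the abstract (average the word vectors of each article, train a classifier on the averaged vectors, keep the articles classified as positive) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-dimensional toy vectors, the example documents, and the function names `document_vector`, `fit_forest`, and `predict` are all hypothetical, and a small bagged ensemble of decision stumps stands in for a full random forest so that the example needs only the Python standard library.

```python
import random
from collections import Counter

# Toy 2-dimensional word vectors (hypothetical values). A real pipeline would
# load vectors trained on the cleaned corpus with a word-embedding model.
WORD_VECTORS = {
    "gene": [0.9, 0.1], "breeding": [0.8, 0.2], "genome": [0.85, 0.15],
    "railway": [0.1, 0.9], "signal": [0.2, 0.8], "traffic": [0.15, 0.85],
}

def document_vector(tokens, vectors=WORD_VECTORS):
    """Represent a document by the average of its in-vocabulary word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the document has a word vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({x[feature] for x in X}):
        for above in (0, 1):  # label predicted when the value exceeds the threshold
            pred = [above if x[feature] > threshold else 1 - above for x in X]
            errors = sum(p != t for p, t in zip(pred, y))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, above)
    return best[1:]  # (feature, threshold, above)

def fit_forest(X, y, n_trees=25, seed=0):
    """Simplified stand-in for a random forest: each stump is trained on a
    bootstrap sample and restricted to one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        feature = rng.randrange(len(X[0]))         # random feature choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, x):
    """Majority vote over all stumps; returns 1 (in field) or 0 (out of field)."""
    votes = Counter(above if x[f] > th else 1 - above for f, th, above in forest)
    return votes.most_common(1)[0][0]

# Labeled training documents: 1 = belongs to the target field, 0 = does not.
train_docs = [
    (["gene", "breeding"], 1), (["genome", "gene"], 1), (["breeding", "genome"], 1),
    (["railway", "signal"], 0), (["traffic", "railway"], 0), (["signal", "traffic"], 0),
]
X_train = [document_vector(tokens) for tokens, _ in train_docs]
y_train = [label for _, label in train_docs]
forest = fit_forest(X_train, y_train)

# Classify unseen documents; the positives form the "retrieval dataset".
candidates = [["gene"], ["railway"]]
retrieval_dataset = [d for d in candidates if predict(forest, document_vector(d)) == 1]
print(retrieval_dataset)
```

In practice the stump ensemble would normally be replaced by a library random forest (for example scikit-learn's `RandomForestClassifier`), and the vectors would have far more than two dimensions; the averaging step itself stays the same.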
