Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding

doi:10.19788/j.issn.2096-6369.190210

Abstract

[Objective] Terms in the scientific and technical literature are neither standardized nor unified, which makes it difficult to achieve both high recall and high precision when constructing a search strategy. This paper proposes a query reformulation method based on word embedding to address the mismatch between the terms in a search strategy and the terms used across a massive literature dataset, and evaluates it empirically in an agricultural application domain. [Methods] First, the dataset was cleaned and the text was mapped to word vectors, with each article represented by the average of all of its word vectors. Second, a random forest classifier was trained on the average word vectors of the training literature. Finally, the classifier was applied to the test set, and the articles classified as positive formed what we call the retrieval dataset. [Results] We analyzed the topic clouds and topic co-occurrence clusters extracted from three datasets on functional gene mining and assisted breeding technology based on multi-omics big data: the small dataset, the extended retrievable dataset, and the retrieval dataset. The retrieval dataset covered the topics and topic clusters found in the other two datasets and also surfaced additional topics and topic clusters belonging to the field. [Conclusion] The results indicate that the method achieves good recall and precision. A domain dataset for functional gene mining and assisted breeding technology based on multi-omics big data was constructed. In future work, the method can be extended to other fields and applied to other tasks.

Key words: big data, query reformulation, word embedding, random forest, data mining, natural language processing, machine learning, deep learning

CLC Number: G354.2

Lei Wu, Xiaohe Liang, Jisiguleng Wu, Rui Wang. Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding[J]. Journal of Agricultural Big Data, 2019, 1(2): 114-120.
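
The document-representation step described in the Methods (each article represented by the average of its word vectors) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy embedding table, the whitespace tokenizer, and the names `TOY_EMBEDDINGS`, `tokenize`, and `average_vector` are all assumptions introduced here, and real embeddings would come from a trained word-embedding model.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "gene": [1.0, 0.0, 0.0],
    "breeding": [0.0, 1.0, 0.0],
    "omics": [0.0, 0.0, 1.0],
    "data": [1.0, 1.0, 0.0],
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a stand-in for real text cleaning)."""
    return text.lower().split()


def average_vector(tokens: Sequence[str],
                   embeddings: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Represent a document as the mean of its in-vocabulary word vectors."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to the zero vector (an assumed convention).
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


doc = "Gene data for breeding"
doc_vector = average_vector(tokenize(doc), TOY_EMBEDDINGS, dim=3)
print(doc_vector)
```

Vectors such as `doc_vector` would then serve as the feature rows for the random forest classifier named in the abstract, for example scikit-learn's `RandomForestClassifier`; the abstract does not state which implementation or hyperparameters were used.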

No.	Search query	Articles	Reason for keeping or discarding
1.1	TS=(big data OR large data)	565792	Used to select the large dataset
1.2	TS=(omics big data OR omics large data OR big data OR large data)	565792	Shows that TS=(big data OR large data) subsumes TS=(omics big data OR omics large data)
1.3	TS=(omics big data OR omics large data)	606	Used to select the small dataset
1.4	TS=(big data OR large data) AND TS=(omics)	606	Same as 1.3
1.5	TS=(big data OR large data) AND TS=(multi-omics)	58	The selected dataset is too restrictive
2.1	TS=(function* gene* mining)	5418	The selected dataset is too restrictive
2.2	TS=(function* gene* OR gene* mining)	1170759	Used to select the small and large datasets
2.3	TS=(function* gene* mining OR function* gene* OR gene* mining)	1170759	Shows that TS=(function* gene* OR gene* mining) subsumes TS=(function* gene* mining)
2.4	TS=(function* gene* OR gene* mining) NOT TS=(function* gene* mining)	1165341	Shows that 2.2 subsumes 2.1
3.1	TS=(assisted breeding techni* OR breeding techni* OR assisted reproductive techni* OR reproductive techni*)	16232	The selected dataset is too restrictive
3.2	TS=(breeding techni* OR reproductive techni*)	16232	Same as 3.1
3.3	TS=(breeding OR reproductive)	305575	Used to select the small and large datasets
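
The containment arguments in the table rest on simple count arithmetic: if query B is contained in query A, then adding B to A with OR leaves the count unchanged, and removing B from A with NOT lowers the count by exactly B's count. The sketch below checks this against the counts reported in rows 1.1, 1.2, 2.1, 2.2, and 2.4. The function names are hypothetical helpers introduced here for illustration.

```python
def union_shows_containment(broad: int, union: int) -> bool:
    """|A OR B| == |A| holds exactly when B adds no records beyond A."""
    return union == broad


def difference_shows_containment(broad: int, narrow: int, difference: int) -> bool:
    """|A NOT B| == |A| - |B| holds exactly when B lies entirely inside A."""
    return difference == broad - narrow


# Counts as reported in the table above.
print(union_shows_containment(broad=565792, union=565792))        # rows 1.1 and 1.2
print(difference_shows_containment(broad=1170759, narrow=5418,
                                   difference=1165341))           # rows 2.2, 2.1, 2.4
```

Both checks pass for the reported counts, which is consistent with the table's conclusion that the broader queries subsume the narrower ones.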

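Cluster	Meaning
Red cluster	Mechanisms of gene-disease association and treatment
Blue cluster	Biological data analysis
Green cluster	Gene markers and integration
Yellow cluster	Association between genes and environment or region
Pink cluster	Organism function, phenotype, and genotype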

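Cluster	Meaning
Red cluster	Organism function, phenotype, and genotype
Green cluster	Gene markers and integration
Dark blue cluster	Metabolomics
Yellow cluster	Mechanisms of gene-disease association and treatment
Purple cluster	Biological data analysis
Light blue cluster	Gene specificity in microarray gene expression data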

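Cluster	Meaning
Blue cluster	Molecular design breeding and genomic selection research
Green cluster	Large-scale biological data analysis and data visualization research
Red cluster	Systems biology research
Yellow cluster	Gene-level disease resistance research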
