Journal of Agricultural Big Data ›› 2019, Vol. 1 ›› Issue (2): 114-120. doi: 10.19788/j.issn.2096-6369.190210

Method and Agricultural Empirical Study of Query Reformulation Based on Word Embedding

Lei Wu1, Xiaohe Liang1, Jisiguleng Wu1, Rui Wang2,*

  1. Agricultural Information Institute of CAAS, Beijing 100081
  2. Institute of Computer Technology, CARS, Beijing 100081
  • Received: 2019-04-05 Online: 2019-06-26 Published: 2019-08-21
  • Contact: Rui Wang E-mail: 13811805186@163.com

Abstract:

[Objective] Terms in scientific and technological literature are neither standardized nor unified, which makes it difficult to achieve both high recall and high precision when constructing a search strategy. This paper proposes a query reformulation method based on word embedding to address the mismatch between the terms used in a search strategy and those found in a massive literature dataset, and evaluates it empirically in an agricultural domain. [Methods] First, the dataset was cleaned and its text was mapped to word vectors, with each article represented by the average of all its word vectors. Second, a random forest classifier was trained on the averaged word vectors of the training literature. Finally, the classifier was applied to the test set, and the articles classified as positive formed what we call the retrieval dataset. [Results] We analyzed topic clouds and topic co-occurrence clusters extracted from three datasets on functional gene mining and assisted breeding technology based on multi-group large data: the small dataset, the extended retrievable dataset, and the retrieval dataset. The retrieval dataset covered the topics and topic clusters found in the other two datasets and additionally revealed more topics and topic clusters belonging to the field. [Conclusion] The results indicate that the method achieves good recall and precision. A domain dataset for functional gene mining and assisted breeding technology based on multi-group large data was constructed. In the future, the method can be extended to other fields and applied to other tasks.
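
The three steps described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny 3-dimensional embedding table, the whitespace tokenizer, the toy documents, and the function names (`doc_vector`, `train_classifier`, `retrieve`) are all hypothetical, and scikit-learn's `RandomForestClassifier` is assumed as the random forest. A real run would presumably use embeddings trained on the cleaned literature corpus.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy embeddings (assumption); real vectors would be trained
# on the cleaned literature dataset, e.g. with a word2vec-style model.
EMBEDDINGS = {
    "gene": np.array([0.90, 0.10, 0.00]),
    "breeding": np.array([0.80, 0.20, 0.10]),
    "genome": np.array([0.85, 0.15, 0.05]),
    "marker": np.array([0.70, 0.30, 0.10]),
    "tractor": np.array([0.10, 0.90, 0.20]),
    "irrigation": np.array([0.00, 0.80, 0.30]),
    "pump": np.array([0.10, 0.85, 0.25]),
    "soil": np.array([0.20, 0.70, 0.40]),
}
DIM = 3


def doc_vector(text):
    """Represent a document by the mean of its in-vocabulary word vectors."""
    vecs = [EMBEDDINGS[t] for t in text.lower().split() if t in EMBEDDINGS]
    # Documents with no known words fall back to the zero vector.
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)


def train_classifier(docs, labels, seed=0):
    """Fit a random forest on averaged word vectors (1 = on-topic, 0 = off-topic)."""
    X = np.vstack([doc_vector(d) for d in docs])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X, labels)
    return clf


def retrieve(clf, candidates):
    """Keep the candidates predicted positive: the 'retrieval dataset'."""
    X = np.vstack([doc_vector(d) for d in candidates])
    preds = clf.predict(X)
    return [d for d, p in zip(candidates, preds) if p == 1]


# Toy labeled training literature (assumption, for illustration only).
train_docs = [
    "gene breeding",
    "gene genome marker",
    "breeding marker",
    "tractor irrigation",
    "irrigation pump soil",
    "tractor soil",
]
train_labels = [1, 1, 1, 0, 0, 0]

clf = train_classifier(train_docs, train_labels)
candidates = ["genome marker", "pump soil"]
retrieved = retrieve(clf, candidates)
print(retrieved)
```

Averaging discards word order and weights every word equally, which keeps the document representation cheap to compute over a massive dataset; the classifier then decides topical relevance from the averaged vector alone rather than from exact term matches.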

Key words: big data, query reformulation, word embedding, random forest, data mining, natural language processing, machine learning, deep learning

CLC Number: G354.2