D-PAG: Cross-modal Wolfberry Pest Recognition Model Based on Parameter-Efficient Fine-Tuning

doi:10.19788/j.issn.2096-6369.000067

Abstract

As multimodal foundation models (large models) mature, transferring them efficiently to specific domains and tasks has become an active research topic. This study takes the multimodal large model CLIP as the base model and adapts it to wolfberry (goji berry) pest identification with parameter-efficient fine-tuning (PEFT) methods such as Prompt and Adapter. The result is D-PAG, a cross-modal parameter-efficient fine-tuning model for wolfberry pest recognition. First, learnable Prompts and Adapters are embedded in the input or hidden layers of the CLIP encoder to capture pest features. Gated units then integrate the Prompt and the Adapter to balance their learning capacity. Within the Adapter, a GCS-Adapter is designed to strengthen the attention mechanism for fusing cross-modal semantic information. To validate the method, experiments were conducted on the wolfberry pest dataset and on the fine-grained dataset IP102. With only 20% of the samples, D-PAG reached an accuracy of 98.8% on the wolfberry dataset, and with 40% of the samples it reached 99.5%. On IP102 it reached an accuracy of 75.6%, comparable to ViT. The approach transfers the foundational knowledge of multimodal large models to the specific domain of pest recognition with few additional parameters, offering a new technical solution for agricultural image processing problems.

Key words: wolfberry, pest identification, parameter-efficient fine-tuning, large model, CLIP
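
The abstract says that gated units integrate the Prompt and the Adapter, but this page does not give the formula. The sketch below is only an illustration under stated assumptions: a residual bottleneck adapter and a single scalar sigmoid gate that blends a prompt-branch feature with an adapter-branch feature. The function names (`bottleneck_adapter`, `gated_fusion`), the toy weights, and the blending rule are hypothetical and are not taken from the D-PAG paper.

```python
import math


def sigmoid(z):
    """Logistic function that maps a gate logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def bottleneck_adapter(x, w_down, w_up):
    """Residual bottleneck adapter: x + W_up * relu(W_down * x)."""
    hidden = [max(0.0, h) for h in matvec(w_down, x)]
    update = matvec(w_up, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


def gated_fusion(prompt_feat, adapter_feat, gate_logit):
    """Blend two branch features with one scalar gate (assumed form)."""
    g = sigmoid(gate_logit)
    return [g * a + (1.0 - g) * p for p, a in zip(prompt_feat, adapter_feat)]


# Toy 4-dimensional feature standing in for a CLIP hidden state (hypothetical values).
prompt_feat = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]                            # 4 -> 2 down-projection
w_up = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2], [0.0, 0.1]]   # 2 -> 4 up-projection

adapter_feat = bottleneck_adapter(prompt_feat, w_down, w_up)
fused = gated_fusion(prompt_feat, adapter_feat, gate_logit=0.0)  # gate weight 0.5
print(fused)
```

In this sketch only `w_down`, `w_up`, and `gate_logit` would be trained while the backbone stays frozen, which is the general idea behind keeping the number of added parameters small.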

XING JiaLu, LIU JianPing, ZHOU GuoMin, LIU LiBo, WANG Jian. D-PAG: Cross-modal Wolfberry Pest Recognition Model Based on Parameter-Efficient Fine-Tuning[J]. Journal of Agricultural Big Data, 2024, 6(4): 509-521.

References

[1]	DAI G, FAN J, TIAN Z, et al. PPLC-Net: Neural network-based plant disease identification model supported by weather data augmentation and multi-level attention mechanism[J]. Journal of King Saud University - Computer and Information Sciences, 2023, 35(5):101555. https://doi.org/10.1016/j.jksuci.2023.101555.
[2]	ZHOU Guomin. Review of the application progress of agricultural big data in China[J]. Journal of Agricultural Big Data, 2019, 1(1):16-23. DOI:10.19788/j.issn.2096-6369.190102. (in Chinese)
[3]	ZHANG Lingxu, HAN Rui, LI Wenming, et al. Research progress of big data deep learning systems and typical agricultural applications[J]. Journal of Agricultural Big Data, 2019, 1(2):88-104. DOI:10.19788/j.issn.2096-6369.190208. (in Chinese)
[4]	HUANG M L, CHUANG T C, LIAO Y C. Application of transfer learning and image augmentation technology for tomato pest identification[J]. Sustainable Computing: Informatics and Systems, 2022, 33:100646. https://doi.org/10.1016/j.suscom.2021.100646.
[5]	SAPNA N, RAJNI J, SUDEEP M, et al. Deep transfer learning model for disease identification in wheat crop[J]. Ecological Informatics, 2023, 75:102068. https://doi.org/10.1016/j.ecoinf.2023.102068.
[6]	BAO W, CHENG T, ZHOU X G, et al. An improved DenseNet model to classify the damage caused by cotton aphid[J]. Computers and Electronics in Agriculture, 2022, 203:107485. https://doi.org/10.1016/j.compag.2022.107485.
[7]	SHENG Y, LI X, QILEI H. Inception convolutional vision transformers for plant disease identification[J]. Internet of Things, 2023, 21:100650. https://doi.org/10.1016/j.iot.2022.100650.
[8]	SUDHESH K M, SOWMYA V, SAINAMOLE KURIAN P, et al. AI based rice leaf disease identification enhanced by Dynamic Mode Decomposition[J]. Engineering Applications of Artificial Intelligence, 2023, 120:105836. https://doi.org/10.1016/j.engappai.2023.105836.
[9]	CHODEY M D, SHARIFF N C. Pest detection via hybrid classification model with fuzzy C-means segmentation and proposed texture feature[J]. Biomedical Signal Processing and Control, 2023, 84:104710.
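[10]	LIANG Weijian, GUO Qingwen, WANG Chuntao, et al. Few-shot pest classification based on a spatial-attention-enhanced ResNeSt-101 network and transfer-based meta-learning (in English)[J]. Transactions of the Chinese Society of Agricultural Engineering, 2024, 40(6):285-297.
[11]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// International Conference on Machine Learning, PMLR. 2021:8748-8763. arXiv:2103.00020.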
[12]	COULIBALY S, KAMSU-FOGUEM B, KAMISSOKO D, et al. Explainable deep convolutional neural networks for insect pest recognition[J]. Journal of Cleaner Production, 2022, 371:133638. https://doi.org/10.1016/j.jclepro.2022.133638.
[13]	NIGAM S, JAIN R, MARWAHA S, et al. Deep transfer learning model for disease identification in wheat crop[J]. Ecological Informatics, 2023, 75:102068. https://doi.org/10.1016/j.ecoinf.2023.102068.
[14]	ZHOU C, ZHONG Y, ZHOU S, et al. Rice leaf disease identification by residual-distilled transformer[J]. Engineering Applications of Artificial Intelligence, 2023, 121:106020. https://doi.org/10.1016/j.engappai.2023.106020.
[15]	DAI G, FAN J, DEWI C. ITF-WPI: Image and text based cross-modal feature fusion model for wolfberry pest recognition[J]. Computers and Electronics in Agriculture, 2023, 212:108129. https://doi.org/10.1016/j.compag.2023.108129.
[16]	SZEGEDY C, ZAREMBA W, SUTSKEVER I, et al. Intriguing properties of neural networks[OL]. arXiv:1312.6199.
[17]	TRIPATHY S, TABASUM M. Autoencoder: An unsupervised deep learning approach[M]//Dutta P, Chakrabarti S, Bhattacharya A, et al(Eds.). Emerging Technologies in Data Mining and Information Security. Springer, 2023:261-267.
[18]	KINGMA D P, WELLING M. Auto-encoding variational bayes[OL]. arXiv:1312.6114.
[19]	HE K, CHEN X, XIE S, et al. Masked autoencoders are scalable vision learners[OL]. 2021. arXiv:2111.06377.
[20]	DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019:4171-4186. DOI:10.18653/v1/N19-1423.
[21]	ZHONG Z, FRIEDMAN D, CHEN D. Factual probing is [MASK]: Learning vs. learning to recall[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021:5017-5033. DOI:10.18653/v1/2021.naacl-main.398.
[22]	HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter efficient transfer learning for NLP[C]// International Conference on Machine Learning, PMLR. 2019: 2790-2799. https://proceedings.mlr.press/v97/houlsby19a.html.
[23]	LIU H, TAM D, MUQEETH M, et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA. 2024. DOI:10.5555/3600270.3600412.
[24]	BEN ZAKEN E, GOLDBERG Y, RAVFOGEL S. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models[C]// MURESAN S, NAKOV P, VILLAVICENCIO A (Eds.). Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland. 2022:1-9. DOI:10.18653/v1/2022.acl-short.1.
[25]	HU E J, SHEN Y, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[OL]. 2021. arXiv:2106.09685.
[26]	ZHOU K, YANG J, LOY C C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130:2337-2348. https://doi.org/10.1007/s11263-022-01653-1.
[27]	JIA M, TANG L, CHEN B C, et al. Visual prompt tuning[C]// Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIII, Springer-Verlag: 709-727. DOI:10.1007/978-3-031-19827-4_41.
[28]	XING J, LIU J, WANG J, et al. A survey of efficient fine-tuning methods for vision-language models - prompt and adapter[J]. Computers & Graphics, 2024, 119:103885. DOI:10.1016/j.cag.2024.01.012.
[29]	ROY S, ETEMAD A. Consistency-guided prompt learning for vision-language models[OL]. 2024. arXiv:2306.01195.
[30]	CHEN Lei, LIU Libo, WANG Xiaoli. A 2020 image-text cross-modal retrieval dataset of wolfberry pests in Ningxia[J]. China Scientific Data (Chinese and English online edition), 2022, 7(3):149-156. (in Chinese)
[31]	WU X, ZHAN C, LAI Y K, et al. IP102: A large-scale benchmark dataset for insect pest recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019:8779-8788. DOI:10.1109/CVPR.2019.00899.
[32]	GAO P, GENG S, ZHANG R, et al. CLIP-Adapter: Better vision-language models with feature adapters[J]. International Journal of Computer Vision, 2021. DOI:10.1007/s11263-023-01891-x.
[33]	KHATTAK M U, RASHEED H, MAAZ M, et al. MaPLe: Multi-modal prompt learning[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023:19113-19122. DOI:10.1109/CVPR52729.2023.01832.
[34]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:770-778. DOI:10.1109/CVPR.2016.90.
[35]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[OL]. arXiv:2010.11929.

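Transfer approach	Study	Research subject	Pretrained model (base model + pretraining scheme)	Modality
Model-based transfer	2023^[9]	Cotton pests + rice pests	LSTM+RNN, no pretraining	Unimodal
Pretraining and fine-tuning based transfer	2022^[4]	Tomato pests	CNN, supervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2022^[12]	Multiple pests (IP102)	Inceptionv3 (CNN), supervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2022^[6]	Cotton pests	DenseNet (CNN), supervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2022^[7]	Multiple plant diseases	ICVT (CNN+Transformer), supervised or unsupervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2023^[13]	Wheat diseases	EfficientNet (CNN), supervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2023^[14]	Rice leaf diseases	Transformer, supervised or unsupervised pretraining	Unimodal
Pretraining and fine-tuning based transfer	2023^[15]	Wolfberry pests	CNN+LSTM+Transformer, supervised or unsupervised pretraining	Multimodal (image + text)
Pretraining and fine-tuning based transfer	2024^[10]	Multiple plant diseases	ResNet-101, unsupervised pretraining	Unimodal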

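Ningxia wolfberry pest dataset, non-PEFT methods (columns give the share of samples used; F1 and Acc in %)
Method type	Model	40% F1	40% Acc	66% F1	66% Acc	83% F1	83% Acc	All F1	All Acc
Non-PEFT	ITF-WPI^[15]	69.97	76.36	86.88	84.90	92.91	97.89	92.97	97.94
Non-PEFT	CLIP (zero-shot)	F1: 3.1	Acc: 3.3

Ningxia wolfberry pest dataset, PEFT methods (columns give the number of shots, with the share of samples in parentheses; F1 and Acc in %)
Method type	Model	64 (10%) F1	64 (10%) Acc	96 (15%) F1	96 (15%) Acc	128 (20%) F1	128 (20%) Acc
PEFT	CoOp^[26]	91.5	92.2	93.5	94.2	95.2	95.6
PEFT	CLIP-Adapter^[32]	92.0	92.8	94.0	94.7	95.2	95.8
PEFT	Dual-LoRA*	92.7	93.6	96.2	96.6	97.5	97.7
PEFT	MaPLe^[33]	92.8	93.5	94.2	94.7	95.7	96.2
PEFT	D-PAG	95.6	96.1	96.6	97.2	98.5	98.8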

IP102, non-PEFT methods (F1 and Acc in %)
Method type	Model	Backbone	F1	Acc
Non-PEFT	ResNet-50^[34]	ResNet-50	40.1	49.4
Non-PEFT	ViT^[35]	ViT-B	-	75.6
Non-PEFT	CLIP (zero-shot)	ViT-B	4.6	12.0

IP102, PEFT methods (columns give the number of shots, with the share of samples in parentheses; F1 and Acc in %)
Method type	Model	Backbone	128 (40%) F1	128 (40%) Acc	256 (57%) F1	256 (57%) Acc	384 (66%) F1	384 (66%) Acc	All F1	All Acc
PEFT	CoOp^[26]	ViT-B	55.5	65.7	57.0	68.2	57.5	69.5	56.0	70.8
PEFT	CLIP-Adapter^[32]	ViT-B	51.5	65.1	52.3	67.3	52.3	69.5	52.9	71.5
PEFT	Dual-LoRA*	ViT-B	53.0	65.2	54.8	69.7	56.0	71.2	55.6	72.5
PEFT	MaPLe^[33]	ViT-B	51.8	63.7	53.3	66.9	54.1	68.8	58.2	73.8
PEFT	D-PAG	ViT-B	57.0	68.4	62.0	71.7	60.8	72.9	62.5	75.6
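
The tables report both F1 and Acc, and on IP102 the F1 values sit well below the accuracies (for example 62.5 versus 75.6 for D-PAG on all samples). This page does not state how F1 is averaged. The sketch below assumes macro-averaged F1, which weights every class equally and therefore drops when rare classes are misclassified. The function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced labels (hypothetical): class 0 is common, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]  # the single class-1 sample is missed

print("Acc:", accuracy(y_true, y_pred))
print("Macro-F1:", macro_f1(y_true, y_pred))
```

In the toy data one missed sample of a rare class lowers accuracy by a single sample's worth but sets that class's F1 to zero, which is one plausible reading of the gap between the two metrics on a long-tailed dataset.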

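Accuracy (%) by share of samples used
Model	Ningxia wolfberry pest dataset, 10%	Ningxia wolfberry pest dataset, 15%	Ningxia wolfberry pest dataset, 20%	IP102, 100%
Prompt+Adapter	94.2	97.0	97.7	75.1
Prompt+CS-Adapter	94.3	97.4	97.9	74.5
Gate (Prompt+GCS-Adapter) (this study)	96.1	97.2	98.8	75.6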

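Components enabled (√ = used, - = not used), evaluated on the Ningxia wolfberry pest dataset
Dual-Prompt	CS-Adapter	Gate	Accuracy (%, 20% of samples)	Added parameters (M)	Share of CLIP parameters (%)	Inference time (ms)
- (CLIP zero-shot)	-	-	3.3	0	0	5.45
√	-	-	96.9	0.23	0.15	5.85
√	√	-	97.9	6.41	4.29	6.50
√	√	√	98.8	6.42	4.29	6.64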
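
The parameter and inference-time columns of the table above can be cross-checked with simple arithmetic. The sketch below assumes that the parameter share is a percentage of the CLIP parameter count, and derives the base-model size implied by each row together with the latency overhead relative to zero-shot CLIP. The row labels and helper names are hypothetical. Because the shares are rounded to two decimals, the implied sizes are only approximate.

```python
# (configuration, added parameters in M, share of CLIP parameters in %, inference time in ms)
rows = [
    ("CLIP zero-shot", 0.0, 0.0, 5.45),
    ("Dual-Prompt", 0.23, 0.15, 5.85),
    ("Dual-Prompt + CS-Adapter", 6.41, 4.29, 6.50),
    ("Dual-Prompt + CS-Adapter + Gate", 6.42, 4.29, 6.64),
]


def implied_base_params(added_m, share_pct):
    """Base-model size (M) implied by added parameters and their percentage share."""
    return added_m / (share_pct / 100.0)


def latency_overhead_pct(time_ms, baseline_ms):
    """Relative increase in inference time over the baseline, in percent."""
    return (time_ms - baseline_ms) / baseline_ms * 100.0


baseline_ms = rows[0][3]
for name, added_m, share_pct, time_ms in rows[1:]:
    base = implied_base_params(added_m, share_pct)
    overhead = latency_overhead_pct(time_ms, baseline_ms)
    print(f"{name}: implied base ~{base:.1f}M, latency overhead {overhead:.1f}%")
```

Under these assumptions every row points to a base model of roughly 150M parameters, and the full configuration adds about a fifth to the zero-shot inference time.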
