Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models

Zhang, Jiarui; Khayatkhoei, Mahyar; Chhikara, Prateek; Ilievski, Filip

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 31 May 2023)

Abstract: Visual Question Answering is a challenging task, as it requires seamless interaction between perceptual, linguistic, and background knowledge systems. While the recent progress of visual and natural language models like BLIP has led to improved performance on this task, we lack an understanding of how well such models perform across different question and reasoning types. As our initial analysis of BLIP-family models revealed difficulty with answering fine-detail questions, we investigate the following question: Can visual cropping be employed to improve the performance of state-of-the-art visual question answering models on fine-detail questions? Given the recent success of the BLIP-family models, we study a zero-shot and a fine-tuned BLIP model. We define three controlled subsets of the popular VQA-v2 benchmark to measure whether cropping can help model performance. Besides human cropping, we devise two automatic cropping strategies: one based on CLIP multi-modal embeddings and one based on the gradients of the BLIP visual QA model. Our experiments demonstrate that the performance of BLIP model variants can be significantly improved through human cropping, and that automatic cropping methods can produce comparable benefits. A deeper dive into our findings indicates that the performance enhancement is more pronounced in zero-shot models than in fine-tuned models, and more salient with smaller bounding boxes than with larger ones. We perform case studies to connect quantitative differences with qualitative observations across question types and datasets. Finally, we see that the cropping enhancement is robust, as we gain an improvement of 4.59% (absolute) on the general VQA-random task by simply inputting a concatenation of the original and gradient-based cropped images. We make our code available to facilitate further innovation on visual cropping methods for question answering.
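
The general idea of automatic cropping followed by concatenation can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the sliding-window search, the function names (`crop`, `best_crop`, `concat_side_by_side`), and the toy brightness score are all assumptions made for illustration. In the setting the abstract describes, the scoring function would presumably be a question-conditioned relevance signal such as a CLIP image-text similarity or a gradient-based saliency measure.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image stored as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def best_crop(image: Image, score_fn: Callable[[Image], float],
              crop_h: int, crop_w: int, stride: int = 1) -> Box:
    """Slide a crop_h x crop_w window over the image and return the
    highest-scoring box. score_fn is a hypothetical stand-in for a
    question-conditioned relevance score."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")
    best_box, best_score = None, float("-inf")
    for top in range(0, height - crop_h + 1, stride):
        for left in range(0, width - crop_w + 1, stride):
            box = (top, left, top + crop_h, left + crop_w)
            score = score_fn(crop(image, box))
            if score > best_score:  # ties keep the earliest window
                best_box, best_score = box, score
    return best_box


def concat_side_by_side(original: Image, cropped: Image,
                        pad_value: float = 0.0) -> Image:
    """Place the cropped image to the right of the original, padding the
    shorter image at the bottom so both have the same number of rows."""
    height = max(len(original), len(cropped))

    def pad_rows(img: Image) -> Image:
        width = len(img[0])
        return img + [[pad_value] * width for _ in range(height - len(img))]

    return [l + r for l, r in zip(pad_rows(original), pad_rows(cropped))]


def mean_brightness(img: Image) -> float:
    """Toy score used only for this example: average pixel value."""
    return sum(sum(row) for row in img) / (len(img) * len(img[0]))


# Toy 4x4 image with a bright 2x2 patch in the lower middle.
toy_image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
]
toy_box = best_crop(toy_image, mean_brightness, crop_h=2, crop_w=2)
toy_combined = concat_side_by_side(toy_image, crop(toy_image, toy_box))
```

Here `toy_combined` is the kind of input the abstract's final experiment refers to: the original image with the selected crop appended, so the model sees both the global context and the enlarged detail region. A real pipeline would resize the crop and operate on RGB tensors rather than nested lists.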

Comments:	16 pages, 5 figures, 7 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2306.00228 [cs.CV]
	(or arXiv:2306.00228v1 [cs.CV] for this version)

Submission history

From: Jiarui Zhang [view email]
[v1] Wed, 31 May 2023 22:48:27 GMT (12175kb,D)
