
DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU


Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Maniparambil, Mayug, Vorster, Chris, Molloy, Derek, Murphy, Noel, McGuinness, Kevin (ORCID: 0000-0003-1336-6477) and O'Connor, Noel E. (ORCID: 0000-0002-4033-9135) (2023) Enhancing CLIP with GPT-4: harnessing visual descriptions as prompts. In: International Conference on Computer Vision Workshops 2023, 2-6 Oct 2023, Paris.

Abstract
Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate visually descriptive text and how this text can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOp by ~2% on average and by over 4% on 4 specialized fine-grained datasets. The code, prompts, and auxiliary text dataset are available at this https URL.
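The idea of building a zero-shot classifier from several descriptive sentences per class can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 3-dimensional vectors stand in for CLIP text and image embeddings, the helper names (`normalize`, `class_prototype`, `zero_shot_classify`) are hypothetical, and averaging sentence embeddings into one prototype per class is an assumed, commonly used way to combine prompts rather than a detail confirmed by the abstract.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototype(sentence_embeddings):
    """Average the unit-normalized sentence embeddings, then re-normalize."""
    unit = [normalize(e) for e in sentence_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_classify(image_embedding, class_to_sentence_embeddings):
    """Return the class whose prototype is most similar to the image."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, class_prototype(embs)))
        for name, embs in class_to_sentence_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings: two descriptive sentences per class.
# In practice these would come from a text encoder applied to
# generated visual descriptions of each class.
toy_text = {
    "forest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "river": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.3]],
}
toy_image = [0.2, 0.7, 0.1]  # stand-in for an image embedding

label, scores = zero_shot_classify(toy_image, toy_text)
print(label, scores)
```

With real encoders, only the embedding inputs would change; the prototype construction and cosine-similarity ranking stay the same under these assumptions.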
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Subjects: Engineering > Electronic engineering
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering
Publisher: Computer Vision Foundation
Official URL: https://openaccess.thecvf.com/content/ICCV2023W/MM...
Copyright Information: © 2023 The Authors.
Funders: Science Foundation Ireland under Grant number 18/CRT/6183
ID Code: 28982
Deposited On: 02 Oct 2023 10:32 by Mayug Maniparambil. Last Modified: 16 Nov 2023 14:02
Documents

Full text available as:

PDF (Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution 4.0
733kB
PDF (Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts, Supplementary) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution 4.0
470kB