The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose a zero-shot open-vocabulary semantic segmentation method which does not require any annotations. We propose to locally improve dense MaskCLIP features, computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that these self-supervised localization properties can be learnt directly from CLIP features, allowing us to obtain the best results with a single pass through the CLIP model. Our method, CLIP-DINOiser, requires only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision and no extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.
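To make the dense feature extraction mentioned above concrete, here is a minimal PyTorch sketch of a MaskCLIP-style modification of CLIP's last pooling layer: the query-key attention of the final block is bypassed, and its value and output projections are applied to every patch token, yielding per-patch embeddings that can be compared with CLIP text embeddings. The module names (v_proj, out_proj, ln_post, text_projection) and the zero_shot_masks helper are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def dense_clip_features(patch_tokens, v_proj, out_proj, ln_post, text_projection):
    """patch_tokens: (B, N, D) tokens entering CLIP's last attention block."""
    v = v_proj(patch_tokens)           # value projection, skipping query-key attention
    v = out_proj(v)                    # output projection of the last attention block
    v = ln_post(v)                     # final LayerNorm
    v = v @ text_projection            # project into the joint image-text space
    return F.normalize(v, dim=-1)      # (B, N, C) dense, text-comparable features

@torch.no_grad()
def zero_shot_masks(dense_feats, text_embeds):
    """Assign each patch to the most similar class prompt embedding."""
    text_embeds = F.normalize(text_embeds, dim=-1)   # (K, C) prompt embeddings
    logits = dense_feats @ text_embeds.t()           # (B, N, K) cosine similarities
    return logits.argmax(dim=-1)                     # per-patch class index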
Overview of CLIP-DINOiser, which leverages the quality of self-supervised features to improve the notoriously noisy MaskCLIP feature maps. An input image is forwarded through the CLIP image backbone and the MaskCLIP layer. The resulting features are then refined with our pooling strategy, which is guided by correlations predicted by a convolutional layer applied to CLIP features. If the queries contain a ‘background’ class, which aims at collecting unmatched stuff-like patches, we additionally pass the MaskCLIP features (blue arrows) through a second convolutional layer that produces an objectness mask used to improve the quality of the ‘background’ detection. Here, DINO acts as a teacher, ‘teaching’ CLIP how to extract localization information through light convolutional layers. At inference time, our method requires a single pass through CLIP plus two light convolutional layers.
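The PyTorch sketch below illustrates, under stated assumptions, the two light convolutional heads described above: one predicts patch-to-patch correlations from CLIP features and guides the pooling of MaskCLIP features, the other predicts an objectness map used to flag candidate ‘background’ patches. The layer shapes, thresholds and names (GuidedPooling, corr_head, obj_head, gamma, tau) are hypothetical choices for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedPooling(nn.Module):
    def __init__(self, dim, gamma=0.2, tau=0.5):
        super().__init__()
        self.corr_head = nn.Conv2d(dim, dim, kernel_size=1)  # trained to mimic DINO patch correlations
        self.obj_head = nn.Conv2d(dim, 1, kernel_size=1)     # trained to mimic a DINO-derived objectness map
        self.gamma, self.tau = gamma, tau

    def forward(self, clip_feats, maskclip_feats):
        """clip_feats, maskclip_feats: (B, C, H, W) dense CLIP / MaskCLIP feature maps."""
        B, C, H, W = maskclip_feats.shape
        q = F.normalize(self.corr_head(clip_feats).flatten(2), dim=1)  # (B, C, HW)
        corr = torch.einsum('bci,bcj->bij', q, q)             # (B, HW, HW) cosine patch affinities
        corr = corr * (corr > self.gamma)                     # keep only confident correlations
        corr = corr / corr.sum(dim=-1, keepdim=True)          # row-normalized pooling weights
        f = maskclip_feats.flatten(2)                         # (B, C, HW)
        pooled = torch.einsum('bij,bcj->bci', corr, f)        # correlation-guided feature averaging
        objectness = torch.sigmoid(self.obj_head(maskclip_feats))  # (B, 1, H, W)
        background = objectness < self.tau                    # patches likely to be ‘background’
        return pooled.view(B, C, H, W), background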
@article{wysoczanska2024clipdino,
title={CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation},
author={Wysocza{\'n}ska, Monika and Sim{\'e}oni, Oriane and Ramamonjisoa, Micha{\"e}l and Bursuc, Andrei and Trzci{\'n}ski, Tomasz and P{\'e}rez, Patrick},
journal={ECCV},
year={2024}
}
This work was supported by the National Science Centre (Poland) under Grant No. 2022/45/B/ST6/02817 and by a grant from NVIDIA providing one RTX A5000 24GB GPU used for this project.