CLIP-DINOiser: Teaching CLIP a few DINO tricks

Monika Wysoczańska, Oriane Siméoni, Michael Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez

The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose a zero-shot open-vocabulary semantic segmentation method that does not require any annotations. We propose to locally improve dense MaskCLIP features, computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the self-supervised feature properties we use can be learnt directly from CLIP features, allowing us to obtain the best results with a single pass through the CLIP model. Our method, CLIP-DINOiser, needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision or extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.

title = {CLIP-DINOiser: Teaching CLIP a few DINO tricks},
author = {Wysocza{\'{n}}ska, Monika and
Sim{\'{e}}oni, Oriane and
Ramamonjisoa, Micha{\"{e}}l and
Bursuc, Andrei and
Trzci{\'{n}}ski, Tomasz and
P{\'{e}}rez, Patrick},
journal = {arXiv},
year = {2023}

Published in arXiv, 2024

Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization

Monika Wysoczanska, Moran Beladev, Karen Lastmann Assaraf, Fengjun Wang, Ofri Kleinfeld, Gil Amsalem, Hadas Harush Boker

Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating the insights from reviews, we enhance our visual summaries by presenting the content most relevant to a user. Moreover, we achieve this without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we name CrossSummarizer, over the no-personalization and image-based clustering baselines.

@misc{wysoczanska2023tell, title={Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization}, author={Monika Wysoczanska and Moran Beladev and Karen Lastmann Assaraf and Fengjun Wang and Ofri Kleinfeld and Gil Amsalem and Hadas Harush Boker}, year={2023}, eprint={2310.19743}, archivePrefix={arXiv}, primaryClass={cs.LG} }

Published in AAAI 2024 - Innovative Applications in AI, 2023

CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free

Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni

The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decisions into a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
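The multi-scale aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `classify_patch` is a hypothetical stand-in for CLIP's zero-shot classifier, patches are taken on a simple non-overlapping grid (an assumption), and the foreground/background guidance is omitted.

```python
def aggregate_multiscale(image, patch_sizes, classify_patch, num_classes):
    """Sum per-patch class scores over several patch sizes, then take the
    per-pixel argmax. `image` is a 2D list of pixel values."""
    h, w = len(image), len(image[0])
    sums = [[[0.0] * num_classes for _ in range(w)] for _ in range(h)]
    for size in patch_sizes:
        # Assumption: non-overlapping grid of square patches at each scale.
        for top in range(0, h, size):
            for left in range(0, w, size):
                rows = range(top, min(top + size, h))
                cols = range(left, min(left + size, w))
                patch = [[image[r][c] for c in cols] for r in rows]
                scores = classify_patch(patch)
                # Every pixel of the patch receives the patch-level scores.
                for r in rows:
                    for c in cols:
                        for k in range(num_classes):
                            sums[r][c][k] += scores[k]
    return [
        [max(range(num_classes), key=lambda k: sums[r][c][k]) for c in range(w)]
        for r in range(h)
    ]


def toy_classifier(patch):
    """Hypothetical two-class scorer: class 1 score is the mean intensity."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    return [1.0 - mean, mean]


# Toy 4x4 image: left half dark (0), right half bright (1).
toy_image = [[0, 0, 1, 1] for _ in range(4)]
segmentation = aggregate_multiscale(toy_image, [2, 4], toy_classifier, 2)
```

In this toy case the coarse scale is ambiguous (the whole image averages to 0.5), and the finer scale resolves the two halves, which is the intuition behind combining several patch sizes.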

title={CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free},
author={Wysoczanska, Monika and Ramamonjisoa, Michael and Trzcinski, Tomasz and Simeoni, Oriane},
booktitle={Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year={2024} }

Published in WACV, 2023

Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?

Monika Wysoczanska, Tom Monnier, Tomasz Trzcinski and David Picard

Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about the objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are twofold. First, despite excellent performance on classical proxy tasks, such representations fall short of solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. In our proposed framework, we show how to methodologically approach this evaluation.

doi = {10.48550/ARXIV.2212.10292},
url = {},
author = {Wysoczańska, Monika and Monnier, Tom and Trzciński, Tomasz and Picard, David},
title = {Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?},
publisher = {arXiv},
year = {2022}}

Published in NeurIPS 2022 SSL Workshop, 2022

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

Jacek Komorowski, Monika Wysoczanska and Tomasz Trzcinski

The letter presents a deep neural network-based method for extracting global and local descriptors from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
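The first stage of the relocalization described above, coarse retrieval by global descriptor, can be sketched as a nearest-neighbour search. This is only an illustration under stated assumptions: the descriptors, database layout and names (`retrieve_candidates`, `scan_a`, and so on) are hypothetical, Euclidean distance is assumed as the similarity measure, and the second stage (local descriptor matching with RANSAC) is not shown.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_candidates(query_desc, database, k=2):
    """Return the k geo-tagged entries whose global descriptor is closest
    to the query descriptor (stage one: coarse position retrieval)."""
    ranked = sorted(
        database, key=lambda entry: l2_distance(query_desc, entry["descriptor"])
    )
    return ranked[:k]


# Hypothetical database of geo-tagged point clouds with toy 3D descriptors.
database = [
    {"id": "scan_a", "position": (0.0, 0.0), "descriptor": [0.9, 0.1, 0.0]},
    {"id": "scan_b", "position": (50.0, 10.0), "descriptor": [0.1, 0.8, 0.1]},
    {"id": "scan_c", "position": (52.0, 12.0), "descriptor": [0.2, 0.7, 0.1]},
]

query_descriptor = [0.12, 0.78, 0.1]
candidates = retrieve_candidates(query_descriptor, database, k=2)
coarse_position = candidates[0]["position"]
```

Each retrieved candidate would then be passed to the second stage, where local descriptors are matched and a robust estimator refines the full 6DoF pose.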

author={Komorowski, Jacek and Wysoczanska, Monika and Trzcinski, Tomasz},
journal={IEEE Robotics and Automation Letters},
title={EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale},

Published in Robotics and Automation Letters -> ICRA, 2022

MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition

Jacek Komorowski, Monika Wysoczanska, and Tomasz Trzcinski

We present a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure purposes in robotics or autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem when training a multimodal descriptor. The problem manifests itself when the network focuses on the modality that overfits more strongly to the training data. This drives the loss down during training but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate such risk when using a deep metric learning approach to train a multimodal neural network.
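A late fusion step of the kind mentioned above can be sketched as follows. The abstract does not specify the fusion operator, so this example assumes L2-normalization of each modality's descriptor followed by concatenation; the function names and toy values are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a descriptor to unit Euclidean norm (zero vectors are kept)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def late_fusion(cloud_descriptor, image_descriptor):
    """Fuse two per-modality descriptors that were computed independently.

    Assumption: fusion is a plain concatenation of normalized descriptors.
    """
    return l2_normalize(cloud_descriptor) + l2_normalize(image_descriptor)


# Toy descriptors standing in for the point cloud and image branch outputs.
fused = late_fusion([3.0, 4.0], [0.0, 2.0])
```

Normalizing each branch before concatenation keeps one modality from dominating the fused descriptor purely through its scale; it does not by itself address the overfitting-driven dominating modality problem the abstract describes.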

author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},
booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},

Published in International Joint Conference on Neural Networks (IJCNN 2021), 2021

Multimodal Dance Recognition

Monika Wysoczanska and Tomasz Trzcinski

Video content analysis is still an emerging technology, and the majority of work in this area extends from the still image domain. Dance videos are especially difficult to analyse and recognise as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach for dance video recognition. Our proposed method combines visual and audio information, by fusing their representations, to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For audio representation, we put the emphasis on capturing long-term dependencies, such as tempo, which is a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches through an exhaustive evaluation on the Let’s Dance dataset. Our method yields significantly better results than each single-modality approach. Results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.

Published in VISAPP 2020, 2020