Cześć! (Hi!)
I am a PhD student in Computer Vision and Machine Learning at the Computer Vision Lab of Warsaw University of Technology. I have also been lucky enough to spend 10 months as a visiting researcher in the valeo.ai team in Paris, and I have had long-term attachments with the Imagine group of Ecole des Ponts.
My research revolves around Multimodal Learning. In particular, I am interested in multimodal alignment, with an emphasis on text and vision, which I investigate through a variety of multimodal tasks. I work under the supervision of Tomasz Trzcinski.
I obtained my master's degree in Computer Science from Warsaw University of Technology and my BSc in Systems Engineering from Wroclaw University of Science and Technology.
News
- 07/2024: Very excited to start as Student Researcher at Google DeepMind! I’ll be working in the team led by Cordelia Schmid in Grenoble ⛰️
- 07/2024: Gave a talk on Open-Vocabulary Semantic Segmentation in my favorite academic lab - Imagine ❤️
- 07/2024: Our study of the single-query scenario for open-world semantic segmentation is available on arXiv!
- 07/2024: CLIP-DINOiser accepted to ECCV 2024 🍕 Very lucky to have worked with such great researchers: Oriane, Mic, Andrei, Tomasz and Patrick, thank you.
- 02/2024: Attending the AAAI-24 conference in Vancouver to present my Booking.com internship paper.
- 01/2024: I presented CLIP-DIY at WACV 2024 🌴 Honored to be part of the Doctoral Consortium and be mentored by Michael Black!
- 10/2023: We organized Women in Computer Vision Workshop at ICCV 2023 in Paris and it was something! Thanks to everyone who participated in the event!
- 07/2023: I graduated from Neuromatch Academy Summer School in Computational Neuroscience 🧠 It was a great experience!
- 01/2023: EgoNN accepted to ICRA 2023 as an oral presentation!
- 12/2022: Our research proposal titled “Dynamic neural networks for efficient machine learning” received 3-year funding from the Polish National Science Centre. 2.5 years into my PhD and I have funding - no more jobs on the side 👩🏭
- 10/2022: Our paper Towards Unsupervised VQA: Do off-the-shelf features know how to reason? got accepted to NeurIPS 2022 Workshop: Self-Supervised Learning - Theory and Practice! Thanks, Tom, Tomasz, and David!
- 09/2022: I received a Campus France scholarship for another 2-month visit to the IMAGINE team of Ecole des Ponts ParisTech in Paris this fall, to work on unsupervised image representations for VQA with David Picard.
- 07/2022: Starting my summer research internship at Booking.com in Amsterdam!
- 03/2022: My project proposal got accepted to NVIDIA Academic Hardware Grant Program and I received a GPU card to support my research. Thanks, NVIDIA!
Publications
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
Monika Wysoczańska, Oriane Siméoni, Michael Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose a zero-shot open-vocabulary semantic segmentation method, which does not require any annotations. We propose to locally improve dense MaskCLIP features, computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the used self-supervised feature properties can be learnt directly from CLIP features, therefore allowing us to obtain the best results with a single pass through the CLIP model. Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision nor extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.
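Below is a minimal sketch of the guided pooling idea described in the abstract: dense CLIP patch features are smoothed with patch-to-patch affinities predicted by a light convolutional head. The module names, shapes and random tensors are illustrative assumptions, not the released CLIP-DINOiser implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationHead(nn.Module):
    """Two light conv layers predicting patch affinities from CLIP features (illustrative)."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 1),
        )

    def forward(self, feats):                        # feats: (B, C, H, W)
        x = self.conv(feats)                         # (B, hidden, H, W)
        x = F.normalize(x.flatten(2), dim=1)         # (B, hidden, HW)
        return torch.einsum("bch,bcw->bhw", x, x)    # patch affinities (B, HW, HW)

def dinoise(clip_feats, affinities):
    """Re-pool dense CLIP features with the predicted affinities to smooth them."""
    B, C, H, W = clip_feats.shape
    w = affinities.clamp(min=0)                      # keep positive correlations only
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    flat = clip_feats.flatten(2)                     # (B, C, HW)
    return torch.einsum("bij,bcj->bci", w, flat).view(B, C, H, W)

# Toy usage with random features standing in for dense MaskCLIP outputs.
feats = torch.randn(1, 512, 14, 14)
smoothed = dinoise(feats, CorrelationHead()(feats))  # (1, 512, 14, 14)
```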
@inproceedings{wysoczanska2024clipdino,
title = {CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation},
author = {Wysocza{\'{n}}ska, Monika and
Sim{\'{e}}oni, Oriane and
Ramamonjisoa, Micha{\"{e}}l and
Bursuc, Andrei and
Trzci{\'{n}}ski, Tomasz and
P{\'{e}}rez, Patrick},
booktitle = {ECCV},
year = {2024}
}
Published in ECCV, 2024
A Study of Test-time Contrastive Concepts for Open-world, Open-vocabulary Semantic Segmentation
Monika Wysoczańska, Antonin Vobecky, Amaia Cardiel, Tomasz Trzciński, Renaud Marlet, Andrei Bursuc, Oriane Siméoni
Recent VLMs, pre-trained on large amounts of image-text pairs to align both modalities, have opened the way to open-vocabulary semantic segmentation. Given an arbitrary set of textual queries, image regions are assigned the closest query in feature space. However, the usual setup expects the user to list all possible visual concepts that may occur in the image, typically all classes of benchmark datasets, that act as negatives to each other. We consider here the more challenging scenario of segmenting a single concept, given a textual prompt and nothing else. To achieve good results, besides contrasting with the generic 'background' text, we study different ways to generate query-specific test-time contrastive textual concepts, which leverage either the distribution of text in the VLM's training set or crafted LLM prompts. We show the relevance of our approach using a new, specific metric.
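As a rough illustration of the single-query setup studied here, the sketch below assigns a patch to the query only if its embedding is closer to the query text than to any contrastive concept; all embeddings are random placeholders for CLIP-like features, and the concept list is a made-up example.

```python
import torch
import torch.nn.functional as F

def segment_single_query(patch_feats, query_emb, contrast_embs):
    """patch_feats: (HW, D), query_emb: (D,), contrast_embs: (K, D) -> boolean mask (HW,)."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    texts = F.normalize(torch.cat([query_emb[None], contrast_embs]), dim=-1)
    sims = patch_feats @ texts.T          # (HW, 1 + K) cosine similarities
    return sims.argmax(dim=-1) == 0       # True where the query wins

# Toy usage: a "dog" query contrasted with "background" plus generated concepts.
D, side = 512, 24
patches = torch.randn(side * side, D)
query = torch.randn(D)                    # e.g. embedding of "a photo of a dog"
negatives = torch.randn(3, D)             # e.g. "background", "grass", "leash"
mask = segment_single_query(patches, query, negatives).reshape(side, side)
```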
@article{wysoczanska2024studycc,
  title = {A Study of Test-time Contrastive Concepts for Open-world, Open-vocabulary Semantic Segmentation},
  author = {Monika Wysoczańska and Antonin Vobecky and Amaia Cardiel and Tomasz Trzciński and Renaud Marlet and Andrei Bursuc and Oriane Siméoni},
  journal = {arXiv},
  year = {2024}
}
Published on arXiv, 2024
Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization
Monika Wysoczanska, Moran Beladev, Karen Lastmann Assaraf, Fengjun Wang, Ofri Kleinfeld, Gil Amsalem, Hadas Harush Boker
Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating the insights from reviews in our visual summaries, we enhance the summaries by presenting the relevant content to a user. Moreover, we achieve it without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we coin CrossSummarizer, over the no-personalization and image-based clustering baselines.
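The sketch below illustrates the general cross-modal idea of letting review-derived aspects pick matching gallery images; the function, the aspect list and the embeddings are all hypothetical stand-ins, not the production CrossSummarizer system.

```python
import numpy as np

def review_guided_summary(image_embs, aspect_embs, k_per_aspect=1):
    """image_embs: (N, D), aspect_embs: (A, D) -> indices of selected images."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    asp = aspect_embs / np.linalg.norm(aspect_embs, axis=1, keepdims=True)
    sims = asp @ img.T                                   # (A, N) aspect-to-image scores
    picks = np.argsort(-sims, axis=1)[:, :k_per_aspect]  # best images per aspect
    return sorted(set(int(i) for i in picks.ravel()))    # de-duplicated summary

# Toy usage with random vectors standing in for cross-modal (CLIP-like) embeddings.
rng = np.random.default_rng(0)
images = rng.normal(size=(50, 256))
aspects = rng.normal(size=(4, 256))    # e.g. "breakfast", "pool", "view", "room"
print(review_guided_summary(images, aspects, k_per_aspect=2))
```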
@article{wysoczanska2024crosssummarizer,
  title = {Tell Me What Is Good about This Property: Leveraging Reviews for Segment-Personalized Image Collection Summarization},
  author = {Wysoczanska, Monika and Beladev, Moran and Lastmann Assaraf, Karen and Wang, Fengjun and Kleinfeld, Ofri and Amsalem, Gil and Harush Boker, Hadas},
  journal = {AAAI},
  year = {2024}
}
Published in AAAI - Innovative Applications in AI, 2024
CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni
The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
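A minimal sketch of the multi-scale idea described above: crops from several grid sizes are classified, their scores are pasted back into a per-pixel map, averaged across scales, and re-weighted by a foreground map. The classifier and the objectness estimator are placeholders here (random outputs), not actual CLIP or unsupervised localization models.

```python
import torch

def classify_crop(crop, num_classes):
    """Placeholder for CLIP zero-shot classification of a single crop."""
    return torch.softmax(torch.randn(num_classes), dim=0)

def multiscale_score_map(image, grid_sizes=(2, 4, 8), num_classes=3):
    _, H, W = image.shape
    score_map = torch.zeros(num_classes, H, W)
    for g in grid_sizes:                               # patch grids at several scales
        ph, pw = H // g, W // g
        for i in range(g):
            for j in range(g):
                crop = image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                scores = classify_crop(crop, num_classes)
                score_map[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] += scores[:, None, None]
    score_map /= len(grid_sizes)                       # aggregate scales into one map
    foreground = torch.rand(H, W)                      # stand-in for unsupervised objectness
    return score_map * foreground                      # foreground-guided class scores

segmentation = multiscale_score_map(torch.rand(3, 224, 224)).argmax(dim=0)
```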
@inproceedings{wysoczanska2023clipdiy,
title={CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free},
author={Wysoczanska, Monika and Ramamonjisoa, Michael and Trzcinski, Tomasz and Simeoni, Oriane},
booktitle={Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year={2024} }
Published in WACV, 2024
Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?
Monika Wysoczanska, Tom Monnier, Tomasz Trzcinski and David Picard
Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about the objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short when it comes to solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. In our proposed framework, we show how to methodologically approach this evaluation.
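To make the evaluation protocol concrete, here is a small sketch of a probe in that spirit: the visual features stay frozen and only a light attention-based module is trained to map them, together with a question embedding, to answer logits. Dimensions, the question encoder and the answer vocabulary are assumptions, not the module used in the paper.

```python
import torch
import torch.nn as nn

class ReasoningProbe(nn.Module):
    """Light attention-based reasoning head trained on top of frozen visual features."""
    def __init__(self, vis_dim=768, txt_dim=512, hidden=256, num_answers=1000):
        super().__init__()
        self.q_proj = nn.Linear(txt_dim, hidden)
        self.v_proj = nn.Linear(vis_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, num_answers)

    def forward(self, visual_feats, question_emb):
        # visual_feats: (B, N, vis_dim) frozen local or object-centric features
        # question_emb: (B, txt_dim) pooled question embedding
        q = self.q_proj(question_emb)[:, None]     # (B, 1, hidden) attention query
        kv = self.v_proj(visual_feats)             # (B, N, hidden)
        pooled, _ = self.attn(q, kv, kv)           # attend over image regions
        return self.head(pooled[:, 0])             # answer logits

probe = ReasoningProbe()
logits = probe(torch.randn(2, 36, 768),            # frozen visual features (not updated)
               torch.randn(2, 512))                # question embeddings
```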
@misc{wysoczanska2022reasoning,
doi = {10.48550/ARXIV.2212.10292},
url = {https://arxiv.org/abs/2212.10292},
author = {Wysoczańska, Monika and Monnier, Tom and Trzciński, Tomasz and Picard, David},
title = {Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?},
publisher = {arXiv},
year = {2022}}
Published in NeurIPS 2022 SSL Workshop, 2022
EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale
Jacek Komorowski, Monika Wysoczanska and Tomasz Trzcinski
The letter presents a deep neural network-based method for global and local descriptor extraction from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
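For intuition, here is a toy sketch of the two-stage pipeline: nearest-neighbour retrieval over global descriptors, followed by mutual nearest-neighbour matching of local keypoint descriptors; the 6DoF pose would then be recovered from those matches with a robust estimator such as RANSAC (omitted). All arrays are random placeholders for descriptors produced by an EgoNN-like network.

```python
import numpy as np

def retrieve(query_global, db_globals):
    """Coarse stage: index of the database cloud with the closest global descriptor."""
    return int(np.argmin(np.linalg.norm(db_globals - query_global, axis=1)))

def match_keypoints(desc_query, desc_db):
    """Fine stage: mutual nearest-neighbour matches between local descriptors."""
    sims = desc_query @ desc_db.T
    nn_q = sims.argmax(axis=1)                      # best database match per query keypoint
    nn_db = sims.argmax(axis=0)                     # best query match per database keypoint
    return [(i, j) for i, j in enumerate(nn_q) if nn_db[j] == i]

rng = np.random.default_rng(0)
db_globals = rng.normal(size=(100, 256))            # geo-tagged map of global descriptors
query_global = db_globals[42] + 0.01 * rng.normal(size=256)
candidate = retrieve(query_global, db_globals)      # -> 42
matches = match_keypoints(rng.normal(size=(128, 128)), rng.normal(size=(128, 128)))
# The matched keypoint pairs would next feed a RANSAC-based 6DoF pose estimator.
```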
@ARTICLE{9645340,
author={Komorowski, Jacek and Wysoczanska, Monika and Trzcinski, Tomasz},
journal={IEEE Robotics and Automation Letters},
title={EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale},
year={2022},
volume={7},
number={2},
pages={722-729},
doi={10.1109/LRA.2021.3133593}}
Published in IEEE Robotics and Automation Letters (presented at ICRA), 2022
MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition
Jacek Komorowski, Monika Wysoczanska, and Tomasz Trzcinski
We present a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure purposes in robotics or autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem when training a multimodal descriptor. The problem manifests itself when the network focuses on a modality with a larger overfit to the training data. This drives the loss down during training but leads to suboptimal performance on the evaluation set. In this work, we describe how to detect and mitigate such risk when using a deep metric learning approach to train a multimodal neural network.
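As a bare-bones illustration of descriptor-level late fusion, the sketch below embeds each modality with its own (placeholder) encoder and only combines the two descriptors at the very end; it is not the MinkLoc++ sparse-voxel or image backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionDescriptor(nn.Module):
    """Each modality gets its own encoder; descriptors are fused only at the end."""
    def __init__(self, cloud_dim=256, img_dim=256):
        super().__init__()
        self.cloud_encoder = nn.Linear(1024, cloud_dim)   # placeholder point-cloud backbone
        self.image_encoder = nn.Linear(2048, img_dim)     # placeholder image backbone

    def forward(self, cloud_feat, img_feat):
        d_cloud = self.cloud_encoder(cloud_feat)
        d_img = self.image_encoder(img_feat)
        fused = torch.cat([d_cloud, d_img], dim=-1)       # late fusion by concatenation
        return F.normalize(fused, dim=-1)                 # unit-length place descriptor

model = LateFusionDescriptor()
descriptors = model(torch.randn(4, 1024), torch.randn(4, 2048))  # (4, 512)
```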
@INPROCEEDINGS{9533373,
author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},
booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},
year={2021},
doi={10.1109/IJCNN52387.2021.9533373}}
Published in International Joint Conference on Neural Networks (IJCNN 2021), 2021
Multimodal Dance Recognition
Monika Wysoczanska and Tomasz Trzcinski
Video content analysis is still an emerging technology, and the majority of work in this area extends from the still image domain. Dance videos are especially difficult to analyse and recognise, as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach for dance video recognition. Our proposed method combines visual and audio information, by fusing their representations, to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For audio representation, we put the emphasis on capturing long-term dependencies, such as tempo, which is a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches by giving an exhaustive evaluation on the Let's Dance dataset. Our method yields significantly better results than each single-modality approach. The results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.
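A minimal sketch of prediction-level late fusion in this spirit: a motion-based classifier and an audio-based classifier each output class probabilities, which are averaged into the final dance-style prediction. Both classifiers are stubbed with random logits, and the equal weighting is an assumption.

```python
import torch

def late_fusion_predict(visual_logits, audio_logits, w_visual=0.5):
    """Fuse per-modality predictions into one probability distribution over styles."""
    p_visual = torch.softmax(visual_logits, dim=-1)
    p_audio = torch.softmax(audio_logits, dim=-1)
    return w_visual * p_visual + (1.0 - w_visual) * p_audio

num_styles = 10                                # e.g. the dance classes of Let's Dance
fused = late_fusion_predict(torch.randn(1, num_styles), torch.randn(1, num_styles))
prediction = fused.argmax(dim=-1)              # predicted dance style index
```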
Published in VISAPP2020, 2020