Publications
Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization
Monika Wysoczanska, Moran Beladev, Karen Lastmann Assaraf, Fengjun Wang, Ofri Kleinfeld, Gil Amsalem, Hadas Harush Boker
Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating the insights from reviews in our visual summaries, we enhance the summaries by presenting the relevant content to a user. Moreover, we achieve it without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we call CrossSummarizer, over the no-personalization and image-based clustering baselines.
@misc{wysoczanska2023tell, title={Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization}, author={Monika Wysoczanska and Moran Beladev and Karen Lastmann Assaraf and Fengjun Wang and Ofri Kleinfeld and Gil Amsalem and Hadas Harush Boker}, year={2023}, eprint={2310.19743}, archivePrefix={arXiv}, primaryClass={cs.LG} }
Published in AAAI 2024 - Innovative Applications in AI, 2023
CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni
The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
@inproceedings{wysoczanska2023clipdiy,
title={CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free},
author={Wysoczanska, Monika and Ramamonjisoa, Michael and Trzcinski, Tomasz and Simeoni, Oriane},
booktitle={Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year={2024} }
Published in WACV, 2023
Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?
Monika Wysoczanska, Tom Monnier, Tomasz Trzcinski and David Picard
Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about the objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short of solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. In our proposed framework we show how to methodologically approach this evaluation.
@misc{https://doi.org/10.48550/arxiv.2212.10292,
doi = {10.48550/ARXIV.2212.10292},
url = {https://arxiv.org/abs/2212.10292},
author = {Wysoczańska, Monika and Monnier, Tom and Trzciński, Tomasz and Picard, David},
title = {Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?},
publisher = {arXiv},
year = {2022}}
Published in NeurIPS 2022 SSL Workshop, 2022
EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale
Jacek Komorowski, Monika Wysoczanska and Tomasz Trzcinski
The letter presents a deep neural network-based method for extracting global and local descriptors from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
@ARTICLE{9645340,
author={Komorowski, Jacek and Wysoczanska, Monika and Trzcinski, Tomasz},
journal={IEEE Robotics and Automation Letters},
title={EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale},
year={2022},
volume={7},
number={2},
pages={722-729},
doi={10.1109/LRA.2021.3133593}}
Published in Robotics and Automation Letters -> ICRA, 2022
MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition
Jacek Komorowski, Monika Wysoczanska, and Tomasz Trzcinski
We present a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure purposes in robotics or autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem that arises when training a multimodal descriptor. The problem manifests itself when the network focuses on a modality with a larger overfit to the training data. This drives the loss down during training but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate such risk when using a deep metric learning approach to train a multimodal neural network.
@INPROCEEDINGS{9533373,
author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},
booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},
year={2021},
doi={10.1109/IJCNN52387.2021.9533373}}
Published in International Joint Conference on Neural Networks (IJCNN 2021), 2021
Multimodal Dance Recognition
Monika Wysoczanska and Tomasz Trzcinski
Video content analysis is still an emerging technology, and the majority of work in this area extends from the still image domain. Dance videos are especially difficult to analyse and recognise as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach for dance video recognition. Our proposed method combines visual and audio information, by fusing their representations, to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For audio representation, we put the emphasis on capturing long-term dependencies, such as tempo, which is a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches through an exhaustive evaluation on the Let’s Dance dataset. Our method yields significantly better results than each single-modality approach. Results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.
Published in VISAPP2020, 2020