Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?

Monika Wysoczanska, Tom Monnier, Tomasz Trzcinski and David Picard

Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation built from ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short at solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. Our proposed framework shows how to approach this evaluation methodically.
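The idea of reasoning over frozen features with attention can be illustrated with a small sketch. It is not the reasoning module from the paper: it only shows plain dot-product attention, where a question vector weights a set of fixed feature vectors. The toy `features` and `question` values are made up, and nothing is trained here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question, features):
    """Dot-product attention: weight each frozen feature by its similarity to the question."""
    scores = [sum(q * f for q, f in zip(question, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the feature vectors, one output dimension at a time.
    pooled = [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]
    return weights, pooled

# Hypothetical frozen features (one vector per image region or object).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical question embedding.
question = [2.0, 0.0]

weights, pooled = attend(question, features)
print(weights)
print(pooled)
```

In a full evaluation the features would come from a pretrained extractor whose weights stay fixed, and only the reasoning module on top would be trained.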

doi = {10.48550/ARXIV.2212.10292},
author = {Wysoczańska, Monika and Monnier, Tom and Trzciński, Tomasz and Picard, David},
title = {Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?},
publisher = {arXiv},
year = {2022}

Published in NeurIPS 2022 SSL Workshop, 2022

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

Jacek Komorowski, Monika Wysoczanska and Tomasz Trzcinski

The letter presents a deep neural network-based method for extracting global and local descriptors from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
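The first stage described above, coarse retrieval by global descriptor, can be sketched as a nearest-neighbour search. This is a minimal illustration and not the EgoNN code: the function names, the toy descriptors and the geo-tag positions are all hypothetical, and the second stage (local descriptor matching with RANSAC) is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_candidates(query, database, k=1):
    """Return the geo-tags of the k database entries closest to the query descriptor.

    database: list of (position, descriptor) pairs.
    """
    ranked = sorted(database, key=lambda entry: euclidean(query, entry[1]))
    return [position for position, _ in ranked[:k]]

# Toy database of geo-tagged global descriptors (all values are made up).
database = [
    ((0.0, 0.0), [1.0, 0.0, 0.0]),
    ((50.0, 10.0), [0.0, 1.0, 0.0]),
    ((120.0, -5.0), [0.0, 0.0, 1.0]),
]
query = [0.1, 0.9, 0.0]

candidates = retrieve_candidates(query, database, k=2)
print(candidates)
```

A linear scan is enough for a toy database; at city scale an approximate nearest-neighbour index would be the more likely choice.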

author={Komorowski, Jacek and Wysoczanska, Monika and Trzcinski, Tomasz},
journal={IEEE Robotics and Automation Letters},
title={EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale},

Published in Robotics and Automation Letters -> ICRA 2023, 2022

MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition

Jacek Komorowski, Monika Wysoczanska, and Tomasz Trzcinski

We present a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure purposes in robotics or autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem that arises when training a multimodal descriptor. The problem manifests itself when the network focuses on the modality that overfits more strongly to the training data. This drives the loss down during training but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate this risk when using a deep metric learning approach to train a multimodal neural network.
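Late fusion at the descriptor level can be sketched as follows. This is one common form of late fusion, assumed here for illustration and not necessarily the exact scheme used in MinkLoc++: each modality's descriptor is L2-normalized on its own and the two are concatenated. The function names and input values are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def late_fusion(point_cloud_desc, image_desc):
    """Fuse two per-modality descriptors by normalizing each and concatenating them."""
    return l2_normalize(point_cloud_desc) + l2_normalize(image_desc)

# Made-up descriptors standing in for the outputs of two separate branches.
fused = late_fusion([3.0, 4.0], [0.0, 2.0, 0.0])
print(fused)
```

Normalizing each part before concatenation keeps either modality from dominating distance computations purely because of its scale.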

author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},
booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},
title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},

Published in International Joint Conference on Neural Networks (IJCNN 2021), 2021

Multimodal Dance Recognition

Monika Wysoczanska and Tomasz Trzcinski

Video content analysis is still an emerging technology, and the majority of work in this area extends from the still image domain. Dance videos are especially difficult to analyse and recognise as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach for dance video recognition. Our proposed method combines visual and audio information, by fusing their representations, to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For audio representation, we put the emphasis on capturing long-term dependencies, such as tempo, which is a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches through an exhaustive evaluation on the Let’s Dance dataset. Our method yields significantly better results than either single-modality approach. The results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.
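For classification, a simple decision-level variant of late fusion averages the class probabilities produced by each modality. The sketch below is an assumption for illustration only and not the fusion used in the paper, which fuses representations; the class labels, probabilities and the `visual_weight` parameter are hypothetical.

```python
def fuse_scores(visual_probs, audio_probs, visual_weight=0.5):
    """Weighted average of per-class probabilities from two unimodal classifiers."""
    audio_weight = 1.0 - visual_weight
    return {
        label: visual_weight * visual_probs[label] + audio_weight * audio_probs[label]
        for label in visual_probs
    }

# Made-up outputs of a visual (motion) classifier and an audio classifier.
visual_probs = {"waltz": 0.6, "tango": 0.4}
audio_probs = {"waltz": 0.2, "tango": 0.8}

fused = fuse_scores(visual_probs, audio_probs)
prediction = max(fused, key=fused.get)
print(prediction)
```

With equal weights the confident audio prediction outweighs the weaker visual one, which is the kind of complementary behaviour a multimodal model is meant to exploit.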

Published in VISAPP2020, 2020