GTC 2020 S22193
Presenters: Shalini De Mello, NVIDIA; Siva Mustikovela, University of Heidelberg
Learning-based methods for viewpoint estimation of object categories (for example, faces or cars) require many images with labeled viewpoints. Viewpoint annotations are cumbersome to acquire and often contain errors. On the other hand, it is relatively easy to mine large collections of unlabeled images of a category from the internet. We investigate whether such image collections can be used to train viewpoint-estimation networks purely via self-supervision, where the only ground-truth label available is the image itself. We design a framework that leverages the analysis-by-synthesis paradigm and couples the viewpoint network with a viewpoint-aware synthesis network to supervise it. We additionally propose losses that enforce symmetry, realism, and better disentanglement of the image synthesizer's latent space to further supervise the viewpoint network. For faces, cars, buses, and trains, our technique performs competitively with existing fully supervised approaches.