Understanding Aesthetics with Deep Learning

Originally published at: https://developer.nvidia.com/blog/understanding-aesthetics-deep-learning/

To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event. — Henri Cartier-Bresson

As a child I waited anxiously for the arrival of each new issue of National Geographic Magazine. The magazine had amazing stories from around the world, but the stunningly beautiful photographs were more important…

Great post! I can't find aesthetic sorting on EyeEm's search results today (March 8th, 2016). Where can I try this?

Hi Appu Shaji.

Great article and very interesting work. You may remember me from IIT days :)

I have a technical question here. Why use a triplet hinge loss? Why not use a classifier to separate images as aesthetic vs. not? Maybe the idea is also to find images similar to high-quality ones using the embedding, and so allow image search.


Hi Vineeth,

Two major reasons for using the triplet hinge loss:

1. Often there is no single correct answer to whether an image is aesthetic or not, and bucketing it into a class is non-trivial. Our understanding of aesthetics is perceptual and relative, so this has to be approached as a ranking problem rather than a classification problem; hence the motivation for an implicit regression/ranking framework. The same is true for other tasks, like image similarity, where the boundaries are fuzzy. Furthermore, aesthetics is deeply rooted in the context/story embedded in a photo. Our sampling scheme is based on similarity from keyword detections, which indirectly enforces a context, so aesthetics is judged relatively (ranked) against other samples within that context.
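To make the ranking formulation concrete, here is a minimal sketch of a triplet hinge loss on embedding vectors. The Euclidean distance and the margin value are illustrative assumptions, not details confirmed by the post:

```python
import numpy as np

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on a triplet of embeddings: the anchor should be
    closer to the positive (more aesthetic, same context) than to the
    negative, by at least `margin`; otherwise a penalty is incurred."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

# If the negative is already far enough away, the loss is zero:
anchor, positive = np.zeros(2), np.zeros(2)
triplet_hinge_loss(anchor, positive, np.array([2.0, 0.0]))  # -> 0.0
# If the negative is too close, the loss pushes it away:
triplet_hinge_loss(anchor, positive, np.array([0.5, 0.0]))  # -> 0.5
```

Because the loss only compares relative distances, it never needs an absolute "aesthetic / not aesthetic" label, only an ordering within a context.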

2. Furthermore, there are on the order of O(N^3) training samples we can generate from a set of N images. For example, with a dataset of 2N images split into N good and N bad ones, there are C(N, 2) * C(N, 1) triplets available.
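The count above is easy to verify numerically; this small sketch just evaluates the C(N, 2) * C(N, 1) formula (the example sizes are my own):

```python
from math import comb

def triplet_count(n_good, n_bad):
    """Number of (anchor, positive, negative) triplets available when
    anchor/positive are drawn from the good images and the negative
    from the bad ones: C(n_good, 2) * C(n_bad, 1)."""
    return comb(n_good, 2) * comb(n_bad, 1)

# 1000 good + 1000 bad images (2000 total) already yield ~5e8 triplets:
triplet_count(1000, 1000)  # -> 499500000
```

This cubic growth is why triplet sampling can turn a modest labeled dataset into an effectively huge training set.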

And of course, as you mentioned, there are interesting applications of discovery (like similarity search) inside an embedding space.


Great post! May I ask why you use three images as input instead of two?


Photography is art
More than Art
I love it


Hi Appu Shaji,

Thank you for sharing this information. Could you please answer these questions regarding the steps for generating the training dataset?

1. How do the judges label the images, exactly? Is it 1.1 or 1.2?
1.1 Given 10 images sharing the same context (by a similarity measure), judges score each image in the range 1-5.
1.2 Given a pair of images sharing the same context, judges select the better one.

2. How does the sampling process work? Below is my understanding; is it correct?
- Let's say we have 100K images in total. First, the images are grouped based on keyword detection.
- We then sample images within each group separately, because the examples need to share the same context.
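The two steps described above (group by keyword, then sample within each group) could be sketched roughly as follows. The function name, the good/bad split per group, and the random selection strategy are all assumptions for illustration, not the post's confirmed pipeline:

```python
import random
from collections import defaultdict

def sample_triplets(images, labels, keywords, n_triplets, seed=0):
    """Group images by detected keyword (the 'context'), then sample
    (anchor, positive, negative) triplets within each group, so that
    aesthetics is ranked only against images sharing a context.
    labels[i] is True for 'good' images; keywords[i] is the group key."""
    rng = random.Random(seed)
    # Step 1: bucket images into (good, bad) lists per keyword group.
    groups = defaultdict(lambda: ([], []))
    for img, good, kw in zip(images, labels, keywords):
        groups[kw][0 if good else 1].append(img)
    # Step 2: sample triplets inside a single group at a time.
    eligible = [g for g in groups.values() if len(g[0]) >= 2 and g[1]]
    triplets = []
    while eligible and len(triplets) < n_triplets:
        good, bad = rng.choice(eligible)
        anchor, positive = rng.sample(good, 2)
        negative = rng.choice(bad)
        triplets.append((anchor, positive, negative))
    return triplets
```

Since anchor, positive, and negative are all drawn from one group, every triplet compares images within the same context, matching the description above.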

