|▲ ‘Google AI Forum 10th Round: AI Innovation and Computational Photography’ was held.|
According to the announcement, Google presented a ‘portrait mode’ that combines machine learning and computational photography on its new Pixel smartphones. Portrait mode automatically applies a soft background-blur (bokeh) effect so that the person stands out. This draws the eye to the subject rather than a cluttered background and lets the photographer take more artistic pictures. Portrait mode improves photographs through four steps, applying AI efficiently at each stage to deliver better results to users.
The first step is to create an HDR+ image at capture time. HDR+ is Google's computational photography technique for improving the quality of captured photos. To avoid blowing out highlights, HDR+ captures a burst of under-exposed frames, then aligns, averages, and merges them to reduce noise in the shadows.
It then amplifies those shadows, reducing global contrast while preserving local contrast, to produce pictures with high dynamic range, low noise, and sharp details even in dim lighting. The idea of aligning frames to reduce noise has been known for decades, but Google noted that its implementation is distinctive in that it works on bursts shot with a handheld phone camera.
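The align-and-merge idea behind HDR+ can be sketched numerically. The scene values and the per-frame shifts below are illustrative assumptions (the real pipeline estimates the shifts itself); this is a minimal sketch of why averaging aligned frames reduces shadow noise, not Google's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical burst: a flat gray scene; each frame gets sensor noise and a
# small horizontal shift simulating hand shake (shifts assumed known here).
scene = np.full((32, 32), 100.0)
shifts = [0, 1, 2, 3]
frames = [np.roll(scene, s, axis=1) + rng.normal(0.0, 10.0, scene.shape)
          for s in shifts]

# Align: undo each frame's shift so the same scene point maps to the same pixel.
aligned = [np.roll(f, -s, axis=1) for f, s in zip(frames, shifts)]

# Merge: averaging N aligned frames cuts noise by roughly a factor of sqrt(N),
# which is what lets HDR+ brighten the shadows without amplifying grain.
merged = np.mean(aligned, axis=0)

single_noise = np.std(frames[0] - scene)
merged_noise = np.std(merged - scene)
```

With four frames, the merged noise level comes out at roughly half the single-frame level, matching the square-root rule.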
|▲ HDR+ is Google's computational photography technique to improve the quality of captured photos.|
The second stage is machine-learning-based foreground-background segmentation. Here, the camera must decide which pixels belong to the foreground, typically a person, and which belong to the background. This is a tricky problem because, unlike chroma keying (a.k.a. green screening) in the movie industry, the background cannot be assumed to be a particular color such as green or blue. Instead, Google applied machine learning.
Google trained a convolutional neural network (CNN), written in TensorFlow, to estimate which pixels are part of a person and which are not. 'Convolution' means that the learned components of the network are organized as filters (weighted sums of the neighboring pixels around each pixel), so you can think of the network as simply filtering the image, then filtering the filtered image, and so on.
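To make "filtering" concrete, here is a minimal convolution sketch in plain NumPy (not TensorFlow, and not Google's learned filters; the 3x3 edge-detecting kernel is an illustrative assumption):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; each output value is a weighted sum
    of the neighboring pixels around that position (no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2:] = 1.0                       # a vertical edge down the middle
edge_filter = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
response = conv2d(image, edge_filter)    # responds where the edge is
```

In a CNN the kernel values are not hand-picked like this; they are learned from training data, and many such filters are stacked and applied to each other's outputs.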
‘Skip connections’ allow information to flow easily from the early stages of the network, where it reasons about color and edges, to the later stages, where it reasons about high-level features such as faces and body parts. Combining stages like this is important when you need not just to determine whether a photo contains a person, but to identify exactly which pixels belong to that person. The CNN was trained on almost a million pictures of people with hats, sunglasses, and ice cream cones.
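Why skip connections help pixel-exact masks can be shown with a toy 1-D sketch. This is not Google's network: the pooling, upsampling, and the 50/50 mixing weight are all illustrative assumptions. The point is that the deep (downsampled) path alone blurs a sharp person/background edge, and re-injecting the early full-resolution signal restores it:

```python
import numpy as np

# Toy 1-D "feature map": a sharp person/background boundary.
x = np.array([0., 0., 0., 1., 1., 1., 1., 0.])

def downsample(v):
    # Encoder step: average pooling halves the resolution (detail is lost).
    return v.reshape(-1, 2).mean(axis=1)

def upsample(v):
    # Decoder step: nearest-neighbour doubles the resolution back.
    return np.repeat(v, 2)

deep = upsample(downsample(x))      # high-level path alone: edge is smeared
with_skip = 0.5 * deep + 0.5 * x    # skip connection re-injects early detail

deep_error = np.abs(deep - x).sum()
skip_error = np.abs(with_skip - x).sum()
```

In a real segmentation network the skip path carries learned low-level features (edges, colors) rather than the raw input, and the combination is learned, but the effect is the same: the mask follows the person's outline at pixel precision.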
The third stage is ‘calculation of depth using a stereo algorithm’. The Pixel 2 doesn't have dual cameras, but it does have Phase-Detect Auto-Focus (PDAF) pixels, sometimes called dual-pixel autofocus (DPAF). This works by splitting every pixel on the image sensor chip into two smaller side-by-side pixels that are read from the chip separately. While many cameras, including DSLRs, use PDAF only to focus faster (for example during video recording), the Pixel 2 also uses it to compute depth maps.
PDAF pixels give you views through the left and right sides of the lens in a single snapshot, and these left-side and right-side images (or top and bottom) serve as input to a stereo algorithm similar to the one used in Google's Jump panorama stitcher. The algorithm first performs subpixel-accurate tile-based alignment to produce a low-resolution depth map, then interpolates it to high resolution using a bilateral solver.
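The core of tile-based alignment can be sketched as a search for the horizontal shift (disparity) that best matches a tile between the two views; nearer objects shift more. The synthetic left/right images and the whole-pixel, sum-of-squared-differences search below are simplifying assumptions (the real algorithm is subpixel-accurate and far more robust):

```python
import numpy as np

# Synthetic "PDAF half-images": a bright square displaced by a known
# disparity between the left and right views.
true_disparity = 3
left = np.zeros((16, 16))
left[6:10, 6:10] = 1.0
right = np.roll(left, true_disparity, axis=1)

def tile_disparity(left, right, y0, x0, size=4, max_shift=5):
    """Find the horizontal shift that best aligns one tile, scored by the
    sum of squared differences, as in tile-based stereo alignment."""
    tile = left[y0:y0+size, x0:x0+size]
    best_shift, best_cost = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        cand = right[y0:y0+size, x0+s:x0+s+size]
        if cand.shape != tile.shape:
            continue                     # shift runs off the image edge
        cost = np.sum((tile - cand) ** 2)
        if cost < best_cost:
            best_cost, best_shift = cost, s
    return best_shift
```

Running `tile_disparity(left, right, 6, 6)` recovers the disparity of 3 pixels; repeating this per tile yields the low-resolution depth map that the bilateral solver then upsamples.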
|▲ The third stage is ‘calculation of depth using a stereo algorithm’.|
Lastly, the fourth stage ‘puts it all together to render the final image’. This step combines the segmentation mask computed in the second step with the depth map computed in the third step to decide how much to blur each pixel of the HDR+ picture from the first step.
The rough idea is that pixels judged to belong to the person stay sharp, while pixels judged to belong to the background are blurred in proportion to how far they are from the in-focus plane, with these distances taken from the depth map. The blur is applied by replacing each pixel with a translucent disk whose size varies with that distance; compositing all these disks in depth order gives a good approximation of real optical blur.
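The combination step can be sketched on a toy 1-D image. The shapes, the linear blur-radius model, and the use of a simple box blur in place of disk compositing are all simplifying assumptions; the sketch only shows how mask and depth jointly control per-pixel blur:

```python
import numpy as np

# Toy inputs standing in for the three pipeline outputs:
img = np.zeros((1, 20))
img[0, 4] = 1.0                                  # sharp detail on the person
img[0, 14] = 1.0                                 # sharp detail in the background
mask = np.zeros((1, 20)); mask[0, :10] = 1.0     # step 2: left half = person
depth = np.where(mask == 1, 2.0, 8.0)            # step 3: background is farther
focus_depth = 2.0                                # the in-focus plane

def blur_radius(d):
    # Blur grows with distance from the in-focus plane (simple linear model).
    return int(abs(d - focus_depth))

out = img.copy()
for x in range(img.shape[1]):
    # Masked person pixels stay sharp; background pixels get depth-scaled blur.
    r = 0 if mask[0, x] == 1 else blur_radius(depth[0, x])
    if r > 0:
        lo, hi = max(0, x - r), min(img.shape[1], x + r + 1)
        out[0, x] = img[0, lo:hi].mean()   # box blur stands in for disk compositing
```

After the loop, the person's detail at index 4 is untouched while the background detail at index 14 has been spread across its neighborhood, which is the qualitative effect portrait mode produces.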
Separately, engineer Marc Levoy presented tips for shooting a nice portrait. First, stand close enough that the subject's head fills the frame, and for group shots, place everyone at the same distance from the camera. Also, for a stronger blur effect, leave some distance between the subject and the background, and have the subject take off dark sunglasses, wide-brimmed hats, and big scarves. In addition, when taking close-ups, adjust the focus so that the subject of interest remains sharp.
After the lecture, Marc Levoy said, “It is true that mobile phones cannot yet completely replace professional cameras due to technical and mechanical limitations, but they can deliver a certain level of photo quality to users. This is important for widening users' choices, and machine learning and computational photography are at the center of it.”