Augmenting Your Reality with Deep Learning

By: Gordon Cooper, Embedded Vision Product Marketing Manager, Synopsys

Augmented reality (AR) is gaining momentum thanks to the proliferation of cameras in mobile devices, improvements in silicon processing efficiency, and more advanced artificial intelligence (AI) algorithms. AR benefits from the emerging deep learning techniques associated with AI and embedded vision. Applications for AR include education, gaming, industrial systems, and even self-driving cars. Camera-captured images and video are an important aspect of AR systems and are also commonly used as inputs to embedded vision processing. Deep learning techniques like convolutional neural networks (CNNs) can be applied to the images generated in AR systems to create immersive experiences. Developing AR systems requires designers to address the performance, power, and area impact of deep learning, which can be mitigated by using embedded vision processors as a companion to the host CPU.

AR, VR & Mixed Reality

AR and virtual reality (VR) differ in fundamental ways. VR seeks to create an entirely new and immersive environment for the headset wearer using images and sounds. Your VR headset (Figure 1) could play back a recording of a café in Budapest or could use entirely new images to propel you into your favorite video game. A 100% simulated VR environment requires graphics engines to build the virtual worlds. AR, on the other hand, is only partially simulated and combines generated images or graphics with the real world. In the future, you could walk past that café in Budapest wearing AR goggles and see a list of the daily specials overlaid on its front window. The real-world aspect of AR requires computer vision to see and recognize the surroundings so the virtual content can be added.

Figure 1: VR goggles create an all-immersive experience

Mixed reality falls somewhere between AR and VR. Mixed reality might still be 100% simulated, but it can incorporate the positions of real-world elements into the virtual world: recognizing your hands so your cartoon self can hold a wand in a wizarding game, or recognizing your furniture and replacing it with cartoon furniture or perhaps furniture-sized rocks from an alien landscape. Like AR, mixed reality requires computer vision techniques to locate elements in the real world.

SLAM for Localization and Mapping

Today, robotics and headsets or goggles are the most common hardware devices requiring AR/VR/mixed reality. Significant research is being done to add AR to mobile phones, tablets, and automobiles as well. For hardware devices to see the world around them and augment that reality with inserted graphics or images, they must be able to determine their position in space and map the surrounding environment. In a controlled environment, markers (two-dimensional symbols like QR codes) allow a camera attached to goggles or embedded in a smartphone to determine its position and rotation relative to a surface. However, applications like automotive, where you can’t place markers along every stretch of road, must work in a markerless environment.
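
As a rough sketch of how a marker lets a camera determine its position and rotation, the snippet below assumes OpenCV and a calibrated camera (camera_matrix, dist_coeffs). The 10 cm marker size, the corner ordering, and the function names are illustrative assumptions, not details from this article.

    import cv2
    import numpy as np

    MARKER_SIDE_M = 0.10  # assumed physical side length of the printed marker (10 cm)

    def pose_from_marker(frame, camera_matrix, dist_coeffs):
        # Locate the four corners of a QR-style marker in the camera image
        found, corners = cv2.QRCodeDetector().detect(frame)
        if not found:
            return None
        # 3D marker corners in the marker's own coordinate frame (z = 0 plane),
        # assuming the detector returns corners in the order top-left, top-right,
        # bottom-right, bottom-left (image y axis points down)
        h = MARKER_SIDE_M / 2.0
        object_pts = np.array([[-h, -h, 0], [h, -h, 0], [h, h, 0], [-h, h, 0]],
                              dtype=np.float32)
        image_pts = corners.reshape(4, 2).astype(np.float32)
        # Solve the perspective-n-point problem for the camera pose relative to the marker
        ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, camera_matrix, dist_coeffs)
        return (rvec, tvec) if ok else None  # rotation and translation vectors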

Simultaneous localization and mapping (SLAM) algorithms come from robotics research and provide a geometric position for the AR system. SLAM algorithms can build 3D maps of an environment while tracking the location and position of the camera in that environment. The algorithms estimate the position of the sensor (built into the camera, cellphone, goggles, etc.) while modeling the environment to create a map (Figure 2). Knowing the sensor’s position and pose, combined with the generated 3D map of the environment, lets the device (and the user looking through it) move through the real environment.

Figure 2: In markerless environments, SLAM algorithms build a 3D map of the surroundings by identifying points and edges of objects and performing plane extraction from the data

SLAM can be implemented in multiple ways. Visual SLAM is a camera-only version that doesn’t rely on inertial measurement units (IMUs) or expensive laser sensors. Monocular visual SLAM, which has become very popular, relies on a single camera like the one in a mobile phone. A typical implementation of monocular visual SLAM includes several key tasks:

  1. Feature extraction, or the identification of distinct landmarks (like the lines forming the edge of a table). Feature extraction is often performed with algorithms such as ORB, SIFT, FAST, or SURF.
  2. Feature matching between frames to determine how the motion of the camera has changed.
  3. Camera motion estimation, including loop detection and loop closure (addressing the challenge of recognizing a previously visited location).

These tasks are computationally intensive and heavily influence the choice of hardware for an AR system.
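
The following is a minimal sketch of those three steps using OpenCV as an assumed toolchain (the article does not prescribe one): ORB feature extraction, brute-force matching between consecutive frames, and recovery of the relative camera motion from the essential matrix. A full SLAM system would add mapping, loop detection, and loop closure on top of this.

    import cv2
    import numpy as np

    orb = cv2.ORB_create(nfeatures=1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    def relative_motion(prev_gray, curr_gray, camera_matrix):
        # 1. Feature extraction: keypoints and binary ORB descriptors in each frame
        kp1, des1 = orb.detectAndCompute(prev_gray, None)
        kp2, des2 = orb.detectAndCompute(curr_gray, None)
        # 2. Feature matching between the previous and current frames
        matches = matcher.match(des1, des2)
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
        # 3. Camera motion estimation: essential matrix, then rotation and
        #    translation (translation is only known up to scale with one camera);
        #    inlier filtering is omitted here for brevity
        E, _ = cv2.findEssentialMat(pts1, pts2, camera_matrix, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, camera_matrix)
        return R, t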

Deep Learning for Perception

While SLAM provides the ability to determine a camera’s location in the environment and a 3D model of the environment, perceiving and recognizing items in that environment require deep learning algorithms like CNNs. CNNs, the current state-of-the-art for implementing deep neural networks for vision, complement SLAM algorithms in AR systems by enhancing the user’s AR experience or adding new capabilities to the AR system.

CNNs can be very accurate when performing object recognition tasks, which include localization (identifying the location of an object in an image) and classification (identifying the image class, e.g., dog vs. cat, Labrador vs. German Shepherd), based on pre-training of the neural network’s coefficients. While SLAM can help a camera move through an environment without running into objects, a CNN can identify that an object is a couch, refrigerator, or desk and highlight where it is in the field of view. Popular CNN graphs for real-time object detection, which combines classification and localization, are YOLO v2, Faster R-CNN, and the single-shot multibox detector (SSD).
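
As an illustration only, the snippet below runs a pretrained SSD model from torchvision; the framework, model choice, and score threshold are assumptions rather than something this article specifies, and a production AR system would run an equivalent graph on dedicated hardware.

    import torch
    import torchvision

    # Pretrained single-shot detector trained on COCO classes
    detector = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT").eval()

    def detect_objects(image_tensor, score_threshold=0.5):
        # image_tensor: float tensor of shape [3, H, W] with values in [0, 1]
        with torch.no_grad():
            result = detector([image_tensor])[0]
        keep = result["scores"] > score_threshold
        # "boxes" gives localization ([x1, y1, x2, y2] in pixels);
        # "labels" gives classification (indices into the COCO class list)
        return result["boxes"][keep], result["labels"][keep], result["scores"][keep]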

CNN object detection graphs can be specialized to detect faces or hands. With CNN-based face detection and recognition, AR systems can add a name and social media information above a person’s face in the AR environment. Using a CNN to detect the user’s hands allows game developers to place a device or instrument in the game player’s virtual hand. Detecting a hand’s presence is easier than determining its pose. Some CNN-based solutions require depth camera output as well as RGB sensor output to train and execute a CNN graph.

CNNs can also be applied successfully to semantic segmentation. Unlike object detection, which only cares about the pixels that could belong to an object of interest, semantic segmentation is concerned with every pixel. For example, in an automotive scene, a semantic segmentation CNN would label all of the pixels belonging to the sky, the road, buildings, and individual cars as groups, which is critical for self-driving car navigation. Applied to AR, semantic segmentation can find ceilings, walls, and floors as well as furniture or other objects in the space. Semantic knowledge of a scene enables realistic interactions between real and virtual objects.
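
A comparable sketch for per-pixel labeling, again assuming torchvision and a pretrained network (the model choice is illustrative):

    import torch
    import torchvision

    seg_model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

    def label_every_pixel(image_tensor):
        # image_tensor: normalized float tensor of shape [3, H, W]
        with torch.no_grad():
            logits = seg_model(image_tensor.unsqueeze(0))["out"]  # [1, classes, H, W]
        # argmax over the class dimension assigns one class label to every pixel
        return logits.argmax(dim=1).squeeze(0)  # [H, W] map of class indices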

Hardware Implementations

Both SLAM and CNN algorithms require a significant amount of computation per camera-captured image (frame). Making a seamless environment for the AR user, merging the real world with the virtual without significant latency, requires a video frame rate of 20-30 frames per second (fps). That means the AR system has about 33 to 50 ms to capture, process, render, and display results to the user. The faster it can complete those tasks, the higher the frame rate and the more natural the AR experience feels.

Consider a monocular (single-camera) SLAM system implemented in a System-on-Chip (SoC): computational efficiency and memory optimization are both critical design concerns. If the camera captures a 4K image at 30 fps, that means 8,294,400 pixels per frame, or 248,832,000 pixels per second, need to be stored and processed. Most embedded vision systems store each frame in external DDR memory and then, as efficiently as possible, transfer portions of that image for vision processing (Figure 3).
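
A quick back-of-the-envelope check of those numbers, assuming a 3840 x 2160 "4K" frame at 30 fps:

    WIDTH, HEIGHT, FPS = 3840, 2160, 30

    pixels_per_frame = WIDTH * HEIGHT            # 8,294,400 pixels
    pixels_per_second = pixels_per_frame * FPS   # 248,832,000 pixels
    frame_budget_ms = 1000 / FPS                 # ~33 ms to capture, process, and display

    print(f"{pixels_per_frame:,} pixels/frame, {pixels_per_second:,} pixels/s, "
          f"{frame_budget_ms:.1f} ms per frame")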

Figure 3: Vision data is stored in off-chip memory and transferred to the processor over the AXI bus

Processing the algorithms necessary for advanced AR systems on a CPU, such as a mobile phone’s application processor, is an inefficient approach. Offloading to a GPU, which is already present in an AR system for drawing graphics, will speed up SLAM and CNN calculations compared to the CPU. However, while the performance advances provided by GPUs helped usher in the era of AI and deep learning computing, implementing a deep learning algorithm on a GPU can require 100 W of power or more. The most optimized approach is to allocate embedded vision processing to dedicated cores.

Performance and power efficiency can be achieved by pairing a flexible CNN engine with a vector DSP. The vector DSP is designed to handle applications like SLAM, while the dedicated CNN engine supports all common CNN operations (convolution, pooling, elementwise operations) and offers the smallest area and power consumption because it is purpose-built for those computations.

For an SoC designer of an AR system, Synopsys’ EV6x Embedded Vision Processor IP provides an optimized solution to address performance and power concerns. The DesignWare® EV61, EV62, and EV64 Embedded Vision Processors integrate a high-performance 32-bit scalar core with a 512-bit vector DSP and an optimized CNN engine for fast, accurate object detection, classification, and scene segmentation. The vector DSPs are ideal for implementing SLAM algorithms and run independently of the CNN engine. The EV6x family delivers up to 4.5 TeraMACs/sec of CNN performance when implemented in 16-nm processes under typical conditions and supports multiple camera inputs with resolutions up to 4K. The processors are fully programmable and configurable and combine the flexibility of software solutions with the high performance and low power consumption of dedicated hardware.

Summary

Deep learning algorithms like CNNs will make new and improved AR systems possible, opening up new experiences in gaming, education, autonomous vehicles, and more. Building complex AR systems with stringent performance, power, and area requirements can be simplified by using embedded vision processors, like the EV6x Embedded Vision Processor IP, as a companion to the host CPU. EV processors give AR system developers the ability to combine deep learning with evolving SLAM techniques.

 

For more information: