Contextual scene analysis sounds a bit complicated, but it represents an important revolution that’s shaping the world of video analytics. First, let’s set the stage.

Let’s say that you’re responsible for video surveillance at a big-box home improvement store. You have a lot on your plate. You need to bolster your loss prevention capabilities. You need to keep known bad actors out of your store, especially given the recent spike in organized retail crime. And you want to minimize legal risk by guarding against bogus slip-and-fall claims (when someone deliberately stages an accident in order to file a false insurance claim).

The challenge with most video surveillance systems is that they require operators to physically watch all the video footage, and this usually happens after an event occurs (someone steals something, an act of violence occurs, or an insurance claim is filed).

But what if technology, namely video analytics, could make sense of a scene from video footage in real time and alert security proactively?

This is being done in some isolated use cases. For example, real-time watch list alerting can use your existing security cameras to identify known shoplifters when they enter the store. There are also people intelligence tools that leverage video analytics to track in-store foot traffic in real time and help keep customer queues short. But each of these types of video analytics is what I would call a single-threaded AI use case: you create AI to recognize queues or to recognize specific faces.

Increasingly, retail stores (as well as other commercial establishments such as stadiums, corporate buildings, airports and even casinos) will want to move from single-threaded AI to multi-threaded AI, where real-time video analysis is able to make sense of a scene (raw video footage) and determine if there are any anomalies. This type of multi-threaded AI is what we call contextual scene analysis, and it will enable precision alerts based on a more holistic understanding of live video surveillance.

For example, when someone falls down in a store, this could mean any number of things: Were they actually bending over to tie their shoes? Were they pushed by another person? Was it a staged "slip & fall" to generate a fraudulent claim? Is the person having a legitimate medical emergency? How well an organization responds in real time depends on the timeliness and quality of the alert.

Unfortunately, this technology does not exist today, but it’s being developed as we speak, fueled by a number of converging technologies, including neural networks, edge computing and semantic segmentation.

  • Neural Networks: Machine learning and, in particular, the spectacular development of deep learning approaches have revolutionized video analytics. The use of deep neural networks (DNNs) has made it possible to train video analysis systems that mimic human behavior, resulting in a paradigm shift. In fact, neural networks can be trained to quickly and accurately recognize individuals on a watchlist even when they are not looking directly at the camera, when they are wearing a mask or glasses, or when the camera is situated high on a wall or ceiling (creating extreme angles).
      
  • Edge Computing: Edge computing is enabling leading video analytics companies to embed their algorithms (neural networks) onto near-edge devices, such as an NVIDIA Jetson Xavier NX system-on-module, or directly on the chip of a smart camera. To pack in more AI functionality (more multi-threaded AI capabilities) and realize the vision of contextual scene analysis, video analytics vendors will need to harness the power of the edge, and do so in a power-efficient way.
  • Semantic Segmentation: Leading Vision AI companies are starting to exploit the power of semantic segmentation, which assigns every pixel in a video frame to one of a predefined set of classes. In a retail context, the pixels belonging to the shelves are assigned to the class “shelves” and the pixels corresponding to the aisles are labeled “aisle”, and this all happens in real time. If there’s a puddle in aisle 7, semantic segmentation would label it as “puddle,” ideally before a customer slips and falls.
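
To make the semantic segmentation idea concrete, here is a minimal sketch of its final step: turning per-pixel class scores into a label map by picking the highest-scoring class for each pixel. The class list, the toy scores and the function names are all assumptions made for illustration. A real system would get the scores from a trained neural network and would process full-resolution frames.

```python
from collections import Counter

# Hypothetical class list for a retail scene; a real model defines its own.
CLASSES = ["aisle", "shelves", "puddle"]


def label_pixels(scores):
    """Assign each pixel the class with the highest score (argmax).

    `scores` is a height x width grid; each cell holds one score per class,
    in the same order as CLASSES.
    """
    return [
        [CLASSES[max(range(len(CLASSES)), key=pixel.__getitem__)] for pixel in row]
        for row in scores
    ]


def class_counts(label_map):
    """Count how many pixels were assigned to each class."""
    return Counter(label for row in label_map for label in row)


# Toy 2 x 3 "frame": made-up scores standing in for a network's output.
frame_scores = [
    [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.6, 0.1, 0.3]],
    [[0.2, 0.7, 0.1], [0.3, 0.1, 0.6], [0.8, 0.1, 0.1]],
]

label_map = label_pixels(frame_scores)
counts = class_counts(label_map)

# Simplest possible alert rule: flag the frame if any pixel is a puddle.
puddle_detected = counts["puddle"] > 0
```

In this toy frame one pixel is labeled “puddle”, so `puddle_detected` is true. In practice an alert would more likely depend on the size and location of the labeled region than on a single pixel.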

By moving from single-threaded AI to multi-threaded AI, we can start to get some powerful real-time contextual analytics and achieve a more holistic understanding of a scene without a video operator having to watch screens around the clock.   

According to Professor Marios Savvides, Director of Carnegie Mellon University’s CyLab Biometrics Center and Oosto Chief AI Scientist: “We can also start to capture more metadata to gain greater context about what’s occurring in real time and ensure that the right types of alerts are sent to the right personnel. Instead of relying on surveillance professionals to monitor video footage 24×7, security teams can use their skills to respond to very precise alerts.”

Until now, it has been difficult for video analytics to distinguish a security threat from normal activity in real time. Contextual scene analysis changes this.

Contextual scene analysis gives video analytics software the ability to deeply understand the context of a scene: the location, the demographics of the person in the frame, anomalous patterns of behavior, and the interactions between objects in the scene can all be used to evaluate security risks. This near-human-level perception of video data makes it possible to automatically trigger appropriate action with granular, accurate alerts.
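
As a rough illustration of how several single-purpose signals might be combined into one contextual alert, the sketch below applies a few hand-written rules to the earlier example of a person falling. Everything in it is hypothetical: the signal names, the five-second threshold and the alert labels are assumptions for the sake of the example, and a production system would more likely learn these relationships than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class SceneSignals:
    """Outputs of separate, hypothetical single-purpose detectors for one event."""
    person_down: bool          # a pose model sees a person low to the ground
    seconds_down: float        # how long they have stayed there
    contact_with_other: bool   # another person touched them just before
    hazard_nearby: bool        # segmentation labeled a spill close to them


def classify_event(s: SceneSignals, crouch_limit: float = 5.0) -> str:
    """Combine the signals into one alert label using illustrative rules."""
    if not s.person_down:
        return "no_alert"
    if s.seconds_down < crouch_limit:
        return "no_alert"  # short crouch, e.g. tying a shoe
    if s.contact_with_other:
        return "possible_push"
    if s.hazard_nearby:
        return "possible_slip_on_hazard"
    # Nothing in the scene explains the fall: escalate to a human.
    return "review_medical_or_staged"


tying_shoe = classify_event(SceneSignals(True, 3.0, False, False))
pushed = classify_event(SceneSignals(True, 20.0, True, False))
slipped = classify_event(SceneSignals(True, 20.0, False, True))
unexplained = classify_event(SceneSignals(True, 20.0, False, False))
```

Each detector on its own would only report that a person is on the floor. It is the combination of signals that separates a customer tying a shoe from an event that needs an immediate response.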