NEW DELHI:
Atul Rai, the co-founder and chief executive of Gurugram-based Staqu Technologies, is eyeing the tender for a Lucknow smart city project for audio and video surveillance to improve security.
Rai already has a product called Jarvis that is used by Uttar Pradesh Police and other state police forces. It combines closed-circuit television (CCTV) cameras with artificial intelligence (AI)-based facial recognition.
In its new edition, Jarvis doesn’t just use cameras to watch crimes happen; it also employs microphones to listen to what’s going on in the city. “We have used audio analytics to detect incidents such as prison fights in Uttar Pradesh. Our target is to implement it in smart cities,” said Rai. The audio analytics tool is also being used by organizations in retail and manufacturing to detect distress sounds and accidents.
Staqu is one of the few companies in India that offer AI-based audio analytics tools. These systems can identify sounds like gunshots, a person’s scream or specific words that indicate distress. They use ‘convolutional neural networks’ (CNNs) to identify sound types. CNNs are typically used for image and video recognition, but here they’re being used to discern patterns in sounds. In principle, an audio surveillance system could alert the nearest hospital if an accident occurs, or contact the police if a group of people is planning a crime. “Every camera is capable of sending audio data using a mic. If a crime is being committed out of the field of view of this camera, audio can help in identifying if someone is in distress and needs help,” explained Rai.
According to Rai, there are many ways to use audio analysis for security. One is to identify a scene from its audio, such as a fight, violence or screaming. Another is to identify a person from their voice if they are not facing the camera. Voice matching can also help identify people with prior criminal records even after they are out of prison.
Rai said the Lucknow Smart City project has expressed interest in an audio and video solution and demos will be conducted soon. Jarvis is ‘language-independent’ and looks for specific sound symbols that can indicate distress or an accident, said Rai.
According to Rai, Jarvis’ accuracy has been tested against VoxCeleb, one of the largest audio-visual datasets for human speech. He claimed the system is 98.7% accurate. The company is also working on a new natural language processing (NLP)-based feature that will allow users to ask Jarvis for information, prompting Jarvis to scan data across all the cameras.
The use of audio symbols or voices for law enforcement has been gaining traction globally. In Europe, Interpol built a speaker identification solution to identify criminals from voice samples back in 2018, while police forces in the US have reportedly been building databases of criminals’ voice samples.
That said, solutions such as these come with significant privacy concerns. Pam Dixon, founder and executive director of the World Privacy Forum, a public interest research group, cautions that “much will depend on how the system is set up, implemented, and used.” Dixon points out that even assuming that these systems are without technical bias and are accurate, there will be questions on where recordings are stored and for how long. “These kinds of monitoring systems need to be transparent and should clearly say what words and sounds are being listened for. The policies for these systems need to be in place before they are built and used,” she insists.
N.S. Nappinai, Supreme Court advocate, agrees: “India doesn’t have a regulatory framework for CCTV cameras that are already in place in multiple countries. The same rule applies for audio, so stakeholders are aware of what is permissible and what is not.”