Context is key
Products with voice assistants can now be found in millions of households, from smart speakers and doorbells to TVs, and there are also many applications for voice control outside of the home. Other consumer devices such as headphones and wearables now have built-in voice technology, and the first cars with inbuilt digital assistants are taking to the roads; they will soon be followed by autonomous vehicles. Talking to these devices is already very much part of our lives, but the truth is that they’re not really that smart. Without human instruction, these devices remain dormant no matter what events are occurring around them.
Voice recognition technology has been with us for many years and, with advances in AI that have improved its accuracy, has now become a natural user interface. But while these devices can be operated remotely by voice or smartphone, they cannot assess and respond to a situation on their own. Important sound events in the home, such as a window breaking or a smoke alarm sounding, are all ignored by the current crop of voice assistants, just as emergency service sirens fall on the deaf ears of noise-cancelling headphones. These devices wait for the trigger word to wake and then rely on humans to instruct them. Just like the human brain, a truly intelligent smart home must be able to recognise and respond to a whole range of events.
To be capable of reacting to an event without human interaction, these devices need improved contextual awareness. Of all the possible inputs, sound provides the most opportunities to gain additional key contextual cues that are difficult to obtain otherwise. Adding cameras is not practical in many cases, and cameras are unlikely to be popular with consumers because of privacy concerns. In addition, unlike other stimuli, sound does not require direct line of sight or physical contact. Augmenting voice-controlled devices with sound recognition is therefore a natural route to delivering contextual intelligence in the home, and the technology can be applied to devices that already contain microphones.
Audio Analytic has already developed AI technology that can give devices a sense of hearing, accurately recognising sounds beyond speech and music. Products featuring our technology are used in 147 countries around the world, including the Hive Hub 360 and Freebox Delta, which are commercially available and offer a glimpse into the future of sound recognition (other customers cannot be named for reasons of confidentiality). Audio Analytic is on a mission to give all machines a sense of hearing that can improve performance, value and outcomes for end-users.
How do you teach devices to hear?
When we started developing our sound recognition technology, no tools or suitable dataset existed that could enable a machine learning system to be trained to the levels of sensitivity and accuracy necessary to deliver a great consumer experience. You can’t use YouTube data because it lacks the quality and diversity required for accuracy; moreover, this audio data is protected by copyright, so even if you trained a poor-quality DNN you wouldn’t be able to launch it commercially.
As a result, a new audio-specific dataset had to be built from scratch along with the machine learning (ML) tools and a specific deep neural network necessary to extract, model and analyse the ideophonic features of sounds.
To build our training dataset, we collected a diverse range of target and relevant non-target sounds, both in real-world environments and in our semi-anechoic sound lab. This meant smashing thousands of windows, setting off every conceivable smoke and CO alarm, encouraging dogs to bark and waiting for babies to cry. This audio data and its associated metadata are organised in Alexandria™, the world’s largest commercially exploitable audio dataset for consumer products. Relevant, high-quality audio data continues to be collected, enabling us to continually enrich our unique dataset and expand the range of sounds held within our rapidly growing proprietary taxonomy.
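To make the idea of a taxonomy-backed dataset concrete, here is a minimal sketch of how labelled recordings and their metadata might be organised in a hierarchical structure. This is purely illustrative: the class names, fields and labels are assumptions for the example, not the actual schema used by Alexandria™.

```python
# A simplified, hypothetical sketch of a taxonomy-backed audio dataset.
# None of these class or field names come from Alexandria(TM) itself.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Recording:
    """One labelled audio capture plus the context it was recorded in."""
    file_path: str                 # e.g. "glass_break/lab/take_0421.wav" (hypothetical)
    label: str                     # taxonomy label, e.g. "glass_break"
    environment: str               # "semi_anechoic_lab", "kitchen", ...
    sample_rate_hz: int = 16000
    metadata: Dict[str, str] = field(default_factory=dict)  # mic, distance, surface, ...


@dataclass
class TaxonomyNode:
    """A node in a hierarchical sound taxonomy, e.g. alarms -> smoke_alarm."""
    name: str
    children: List["TaxonomyNode"] = field(default_factory=list)
    recordings: List[Recording] = field(default_factory=list)

    def add(self, child: "TaxonomyNode") -> "TaxonomyNode":
        self.children.append(child)
        return child


# Building a tiny slice of such a taxonomy:
root = TaxonomyNode("sounds")
alarms = root.add(TaxonomyNode("alarms"))
alarms.add(TaxonomyNode("smoke_alarm"))
glass = root.add(TaxonomyNode("glass_break"))
glass.recordings.append(
    Recording("glass_break/lab/take_0421.wav", "glass_break",
              environment="semi_anechoic_lab",
              metadata={"surface": "stone", "mic_distance_m": "2.0"})
)
```

Organising recordings this way keeps the label hierarchy, the capture conditions and the raw audio linked together, which is what allows a taxonomy to grow without losing track of how each sound was collected.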
How are the sounds analysed?
For machine learning to be able to process the sound data systematically, we had to describe and organise it. A key difference between sound recognition and speech recognition is that speech is limited to the types of sounds that the human mouth can produce, and it follows set structures that make it predictable.
Similarly, music mostly results from physical resonance and is conditioned by the rules of various genres, so it has boundaries within which it is readily analysable. These characteristics of music and speech enabled music and speech recognition technologies to be developed long before sound recognition could be mastered.
Sound is different from music and speech. It is much more diverse, unbounded and unstructured than speech or musical composition. Think about a window being smashed and all the different ways glass shards can hit the floor randomly, without any particular intent or style. Then consider the effect that the acoustics of a room would have on this sound, and whether the glass fell onto wood, carpet or stone. Understanding the full extent of this variability is a prerequisite for mapping a sound’s characteristics so that a machine can process it the next time the sound is encountered.
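Our own feature extraction is proprietary, but a standard log-mel spectrogram is a common way to illustrate the point: two recordings of the “same” event, such as glass breaking onto carpet versus stone, land in the same feature space yet look very different, and a model has to learn to generalise across that variability. The sketch below uses the open-source librosa library and hypothetical file names purely as an illustration.

```python
# Illustration only: not the proprietary feature extraction described above.
import numpy as np
import librosa


def log_mel(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Load a clip and return a log-scaled mel spectrogram (n_mels x frames)."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


# Hypothetical file names for two captures of the same target sound:
carpet = log_mel("glass_break_carpet.wav")
stone = log_mel("glass_break_stone.wav")
print(carpet.shape, stone.shape)  # same feature space, very different content
```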
To overcome this challenge, we developed a specific terminology to describe sounds – a “language” of ideophones. “Ideophone” already appears in the Oxford English Dictionary, but with a slightly different meaning relating to artistic impressions. We define it as the building blocks from which all sounds, as opposed to speech, are described, and the term is essential to our work.
From sounds to Sound Recognition
AuditoryNET™, the DNN we created, models hundreds of ideophonic features from each sound, which can be combined in such a way as to recognise hundreds of ideophones.
Once the ideophonic characteristics of a particular target sound have been modelled, we produce a “sound profile” that uniquely identifies that sound. These individual profiles are embeddable into ai3™, the software platform that our customers license, which can be integrated into virtually any consumer device equipped with a microphone. The device is then able to recognise that sound when it occurs and initiate an appropriate programmed action. Indeed, ai3™ is so light in its demands on a system that it can readily be incorporated in products ranging from true wireless earphones, where battery life is a key consideration, to autonomous vehicles. ai3™ is also designed to run on the edge, meaning that all extraction, analysis and identification is done on the device and nothing is transmitted to the cloud, which reduces latency and addresses consumer privacy concerns.
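The overall pattern, loading sound profiles onto the device, scoring incoming audio locally and firing a programmed action on a match, can be sketched in a few lines. The class, method names and similarity measure below are assumptions made for illustration; they are not the real ai3™ API.

```python
# A minimal, hypothetical sketch of the on-device (edge) pattern described above:
# microphone-derived features are scored against loaded sound profiles locally,
# and a programmed action fires when a profile matches. Nothing leaves the device.
from typing import Callable, Dict
import numpy as np


class EdgeSoundRecognizer:
    def __init__(self, threshold: float = 0.8):
        self.profiles: Dict[str, np.ndarray] = {}          # label -> embedded profile
        self.actions: Dict[str, Callable[[], None]] = {}   # label -> programmed action
        self.threshold = threshold

    def load_profile(self, label: str, profile: np.ndarray,
                     on_detect: Callable[[], None]) -> None:
        self.profiles[label] = profile / np.linalg.norm(profile)
        self.actions[label] = on_detect

    def process_frame(self, features: np.ndarray) -> None:
        """Score one frame of locally extracted features against each profile."""
        features = features / (np.linalg.norm(features) + 1e-9)
        for label, profile in self.profiles.items():
            score = float(np.dot(features, profile))        # toy similarity measure
            if score >= self.threshold:
                self.actions[label]()                       # e.g. notify the user


# Usage sketch: a smart speaker reacting to a recognised window break.
recognizer = EdgeSoundRecognizer()
glass_profile = np.random.rand(64)                # stand-in for a real embedded profile
recognizer.load_profile("glass_break", glass_profile,
                        on_detect=lambda: print("Alert: possible break-in"))
recognizer.process_frame(glass_profile)           # a matching frame triggers the action
```

Keeping both the scoring and the action dispatch on the device is what allows the latency and privacy benefits described above: the audio never needs to be streamed anywhere for the decision to be made.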
The benefits of Sound Recognition
By giving consumer electronics the ability to hear, sound recognition technology can bring many benefits to consumers. Research we conducted in 2018 found that 90 per cent of smart speaker owners want their smart speaker to help protect their home while they are out, listening for sounds that may indicate an intruder or a triggered alarm and then taking action or alerting them. This re-emphasises that consumers want the ‘connected home’ to become the ‘intelligent home’, so that products like smart speakers and cameras can actively help keep their homes, families, pets and possessions safe and secure.
Sound recognition can also play a key role in assisted living and wellbeing, delivering significant benefits to carers and patients. For example, a device with sound recognition could detect repeated coughing or sneezing and provide early indicators of poor health. This could then prompt the carer or patient to take action, such as changing pollen filters, turning on the humidifier, booking a medical appointment or triggering the provision of self-care advice.
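As a rough sketch of how such a wellbeing flow could sit on top of recognised sound events, the snippet below counts cough events over a rolling window and prompts an action when they become frequent. The event name, window length and threshold are illustrative assumptions, not clinical guidance or part of any product.

```python
# Hypothetical sketch: turning recognised "cough" events into a wellbeing prompt.
from collections import deque
from typing import Optional
import time

WINDOW_SECONDS = 3600      # look at the last hour
COUGH_ALERT_COUNT = 20     # illustrative threshold, not a clinical guideline

cough_times: deque = deque()


def on_sound_event(label: str, timestamp: Optional[float] = None) -> None:
    """Called whenever the sound recognition layer reports an event."""
    if label != "cough":
        return
    now = timestamp if timestamp is not None else time.time()
    cough_times.append(now)
    # Drop events that have fallen outside the rolling window.
    while cough_times and now - cough_times[0] > WINDOW_SECONDS:
        cough_times.popleft()
    if len(cough_times) >= COUGH_ALERT_COUNT:
        print("Frequent coughing detected: suggest checking filters or booking an appointment")


# Usage sketch: simulate a burst of recognised coughs.
for i in range(COUGH_ALERT_COUNT):
    on_sound_event("cough", timestamp=1000.0 + i)
```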
Applications for sound recognition extend far beyond the smart home: hearables, mobile phones and automotive products can all benefit from gaining a greater sense of hearing. For example, a car that can recognise sounds both inside and outside the vehicle, whether driver-assisted or fully autonomous, can reduce the cognitive load on drivers and improve safety by responding to emergency vehicle sirens.
True wireless headphones that can alert users to hazards, or dynamically adjust noise cancelling and equalisation settings based on the acoustic scene around the wearer, offer huge value.
Speech is a great user interface and, for many consumers, their first foray into the world of AI. However, the wider world of sounds offers products the ability to be truly intelligent and even more helpful.