Sound Recognition – Dr. Chris Mitchell, Audio Analytic – Voice Tech Podcast ep.026


Episode description

Dr Chris Mitchell is the CEO and Founder of Audio Analytic, which develops sound recognition software that adds context-based intelligence to consumer technology products. In our conversation, you'll learn what state-of-the-art sound recognition systems can do, the types of sound events that are typically recognised, which consumer products they're integrated into, and the many benefits and new possibilities the technology affords to developers and users.

We discover the difference between sound recognition and speech recognition, how sound recognition provides the all-important context that voice-enabled devices need to make the right decisions, and how smart devices can take advantage of this contextual knowledge. Then we dive into some of the technical details of how it all works, including 'better than real-time' processing, edge computing vs the cloud, the need to train custom acoustic models, and how these machine learning models can run on low-resource devices like headphones using TinyML. Chris briefly explains the process of integrating the ai3 framework into your products, then we tackle the all-important question of data privacy and security.

Many of the smart devices of the future will rely on sound recognition to understand the context of their environments. Chris and his team are at the cutting edge of the sound recognition field and are long-time experts in the domain, so there's no better person to introduce us to this important technology.

Links from the show:

Episode transcript


Powered by Google Cloud Speech-to-Text

Welcome to the Voice Tech Podcast. Join Carl Robinson in conversation with the world's leading voice technology experts. Discover the latest products, tools and techniques, and learn to build the voice apps of the future.

Hello and welcome back to another episode of the Voice Tech Podcast. My name is Carl Robinson, and today's episode is entitled Sound Recognition. You're about to hear me talk with Dr Chris Mitchell, CEO and founder of Audio Analytic. Audio Analytic develop sound recognition software that adds context-based intelligence into consumer technology products. In our conversation you'll learn what their state-of-the-art sound recognition systems can do, the types of sound events that are typically recognised, which consumer products they're integrated into, and the many benefits and new possibilities the technology affords to developers and users. We discover the difference between sound recognition and speech recognition, we learn how sound recognition provides the all-important context for voice-enabled devices to make the right decisions, and how smart devices can take advantage of this contextual knowledge. Then we dive into some of the technical details of how it all works, including

'better than real-time' processing, edge computing versus the cloud, the need to train custom acoustic models, and how these machine learning models can run on low-resource devices like headphones using TinyML. Chris then briefly explains the process of integrating the ai3 framework into your products, and we tackle the all-important question of data privacy and security. Over the course of the conversation it became clear that the utility of many of the smart devices of the future will rely on sound recognition to understand the context of their environments. Chris and his team are at the cutting edge of the sound recognition field and are long-time experts in the domain, so really there's no better person to introduce us to this important technology. Don't forget, there's a great voice conference coming up soon: the Voice Connected Home 2019 is happening in Cologne, Germany on May 7th and 8th. It's Europe's first voice event focused on brand companies and their voice projects,

with the likes of E.ON, Tchibo, Samsung, Google, the BBC and many more in attendance. Look out for my talk there as well, and if you want to meet up and possibly record some audio together, just drop me an email. The link to register for tickets is in the show notes, and you can get 20% off with the code VOICETECH. I've got two free tickets to give away as well, so drop me an email via voicetechpodcast.com and I'll pick two winners at random very shortly. This episode is supported by Dabble Lab, who provide free video tutorials and code templates to help you learn to build your own voice apps. At youtube.com/dabblelab you'll find over 150 video tutorials for both beginners and experts, containing step-by-step instructions for how to build apps for Amazon Alexa, Google Assistant, Microsoft Cortana, Jovo, Bixby and more, as well as how to use industry-leading tools such as Jovo and Bespoken. All the free templates are at skilltemplates.com and all the tutorial videos are at youtube.com/dabblelab,

so if you're a voice app developer, or you're thinking of becoming one, you're going to want to check out this amazing resource and take your skills to the next level. I want to say a huge thank you to Dabble Lab, Manning Publications, and all the other companies and individuals who continue to sponsor the show. It's really thanks to you that I can continue to produce these episodes, and the Voice Tech audience is very grateful for your contributions. OK, so now it's my pleasure to bring you today's guest, Dr Chris Mitchell. I'm very happy to be on the line with Dr Chris Mitchell, CEO and founder of Audio Analytic, the pioneers of sound recognition. They've developed a patented sound recognition software framework called ai3, which we're going to hear all about. It enables consumer technology companies to embed context-based intelligence into their products. Chris, I'm very happy to have you on the show. Welcome!

As some of the listeners know, I've had a little bit of experience in this field, but I'm still a complete beginner, so I'm very interested in the whole sound recognition, sound synthesis and transformation field, and I'm ready to get a little bit technical in this conversation. So to start us off: what does your company do, what kind of clients do you serve, and what kind of problems do you solve?

We're based in Cambridge, UK, and we have a sales office in California, in San Francisco. We have a piece of technology that came about because we wanted to give machines, in the broadest sense, a sense of hearing, and that for us meant sounds beyond speech and music. Those sounds are very different, very distinct from speech and music, and they needed very specialist techniques so that you

can get the detection rates very high. The other part of what we do is we run this all on the devices themselves, so on the edge of the network, in an area that's becoming known as TinyML: incredibly compact hardware. We run on anything from small worn headphone units, something like the AirPods in terms of form factor, right up to smart home equipment itself, and we'll get into some of the customers we work with and the applications.

So sound recognition, this is distinct from speech recognition. When we talk about sounds, could you give me examples of the types of sounds that your technology recognises?

Yeah, sure. It breaks out into roughly four components; that's the way the company does it. We break them down into health and wellbeing related sounds, which are things like coughing, sneezing and crying; safety and security related sounds, things like a glass window being broken as somebody breaks into a house, or a smoke or CO alarm going off, that being an indicator of fire; then we also deal with communication, which is around improving the way audio is picked up from the environment, based on a better understanding of the acoustic scene you're sitting in; and also the entertainment side, in terms of improving the way music is projected back into the environment, again based on a better understanding of that scene.

I understood three of the four, but the communications one I didn't completely get, so could you explain that a little bit more?

If my phone is trying to pick up sounds from the environment, there are audio pickup systems ranging from complicated microphone arrays through to simplistic ones. If you better understand the acoustic scene in which the device is operating, you can pick up sounds from it better, so you improve the quality of the pickup.

What you might call acoustic scene analysis: analysing the audio environment in order to make product improvements. And what's the business model?

We're a software licensing company, so we license our software to some of the biggest brands in the world, who use the technology in the products they sell in the marketplace. We typically focus on consumer products at the moment, so that ranges from things like cameras,

video doorbells and smart speakers, through smart home devices like thermostats, to headphones, mobile phones, and autonomous driving and connected cars. Those are the sorts of areas we focus on.

Interesting, OK. So what's the deliverable? Is it an SDK, is there an API, or do you actually send people into these companies to set things up?

The first thing is to configure the embedded engine to detect one or more sounds, and then people work with us to understand how to design products around that, because when products can hear and respond to the sounds that happen around them, you can unlock entirely new experiences. Designing products that listen for sound is

a very new thing. Obviously designing products that respond to human voices is itself a relatively new thing, especially on the natural language side, so designing products for sound is even newer, right at the cutting edge.

So these sound profiles, are they things that you develop for specific use cases, for specific clients, to respond to their exact needs?

A sound profile might be, say, the environment that a mobile phone operates in for a specific manufacturer, but they're not so much bespoke to a customer. More often people come to us, we have a catalogue of sound profiles, and they say: I'm going to make a safety and security product, and I want glass break and smoke and CO alarm detection on this device. So the choice of profiles tends to be

driven by what type of device it is.

OK, so there are a couple of use cases there. There's the use case of helping improve the product, where I guess the results go back to the engineers and they tweak the mics and so on. And there's also a reactive, event-driven kind of process where everything happens automatically in the software: there's an event, then something is triggered and an action is taken within the software itself, say a notification is sent to another subsystem, and that subsystem takes action.

We see ourselves as providing that contextual understanding of all the sounds beyond speech and music, and whoever you provide it to next, whether it be another machine or a person, it's the same from our perspective, the same thing at the end of the day.
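
To make the event-driven pattern concrete, here is a minimal Python sketch of the kind of dispatch layer Carl describes: the recogniser emits labelled sound events, and a thin routing layer passes each one to whichever subsystem should react. All names here (SoundEvent, the handler functions, the label strings) are hypothetical illustrations, not Audio Analytic's actual API.

```python
# Hypothetical sketch of routing recognised sound events to subsystems.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SoundEvent:
    label: str        # e.g. "glass_break", "smoke_alarm" (illustrative)
    timestamp: float  # seconds since the audio stream started

def notify_security(event: SoundEvent) -> None:
    print(f"[{event.timestamp:.1f}s] alerting security app: {event.label}")

def notify_owner(event: SoundEvent) -> None:
    print(f"[{event.timestamp:.1f}s] push notification: {event.label}")

# Each event label fans out to the subsystems that care about it.
HANDLERS: Dict[str, List[Callable[[SoundEvent], None]]] = {
    "glass_break": [notify_security],
    "smoke_alarm": [notify_security, notify_owner],
    "baby_cry":    [notify_owner],
}

def dispatch(event: SoundEvent) -> None:
    """Route a recognised sound event to its subscribed subsystems."""
    for handler in HANDLERS.get(event.label, []):
        handler(event)

dispatch(SoundEvent("smoke_alarm", 12.4))
```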

OK, let's hear a few of those sounds now, so that everyone's got an idea of the kinds of sounds, the environments, and the quality of the audio that you're picking up, in which you're trying to detect the event: basically, you're looking for that sound event within the sound file. We've got a few sound files ready, so let's play them.

So we just heard glass breaking, a smoke alarm, a dog barking and a baby crying, and as you can hear, the audio quality isn't crystal clear. So you have to deal with some quite challenging environments?

We have to deal with quite a few complex environments, and that's one of the very important challenges for sound. Speech, by comparison, is obviously a relatively narrow-band signal, and it's got a whole bunch of constraints around it in various ways, from the language to the types of acoustic content that can appear in it. Sound is much broader, and the way it reaches the types of devices

we run on is affected by the environment they're in. So we actually have, here in Cambridge, anechoic facilities that the company owns, and we do a lot of recordings there: very clean, neutral recordings for the data, up to world-class levels. We also have a huge amount of expertise in data augmentation and other techniques that allow us to scale that up, so it sounds like the sample was captured in, say, a wooden house, or a brick house, or a London skyscraper.

An anechoic chamber is one of those rooms where you have to imagine all the walls, the ceiling and the floor are made of these polystyrene or foam triangles, very shallow triangles, and as the sound touches the

wall it just disappears through multiple reflections into the base of the wall, and nothing is reflected back. So you get this eerie effect where you're in this, I don't know, no-man's-land, for want of a better word. It's a strange experience to be in one.

What you do find there, and this is the secret if you're measuring impulse responses, which everybody always loves doing because they're extremely loud and extremely quick, is that we have a range of blank-firing guns that we use, and they sound incredibly different in that sort of environment than they do anywhere else.

So a gun's sound is especially tied up with its environment, and you're measuring the sound of a gun in an environment which is completely different to any environment in which you'd actually measure the sound in the real world.

So how is it that you're able to better detect the sound of a gun in the real world, having recorded the original in an anechoic chamber? How does that work?

Yes, it will sound different. But obviously, if we needed to record guns in every significant environment that exists all over the place, or indeed babies or dogs or any of the other sounds we've talked about, clearly that level of scalability would be cost-prohibitive: you can't put everybody on planes to fly all over the world and record in everybody's houses. So you record the essence of what a baby crying is, and then you extrapolate that by adding the noise and acoustics of the other environments.
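
Audio Analytic's exact augmentation pipeline is proprietary, but the general technique Chris describes (record a clean 'essence' of the sound, then extrapolate it into other environments) is commonly implemented by convolving the clean recording with a room impulse response and mixing in background noise at a controlled signal-to-noise ratio. A minimal sketch, with illustrative parameters:

```python
# Minimal data-augmentation sketch: place a clean (e.g. anechoic)
# recording into a simulated room, then add environmental noise at a
# target SNR. Assumes `noise` is at least as long as the clean clip.
import numpy as np
from scipy.signal import fftconvolve

def augment(clean: np.ndarray, room_ir: np.ndarray,
            noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Simulate one environment for one clean recording."""
    # Convolving with the room's impulse response adds its reverberation.
    wet = fftconvolve(clean, room_ir)[: len(clean)]
    # Scale the noise so the mixture hits the requested SNR.
    sig_pow = np.mean(wet ** 2)
    noise = noise[: len(wet)]
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wet + gain * noise
```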

That is incredibly useful in scaling the data up. We're the only company who attempts to do sound recognition with this level of quality of data. We've built it all into a thing called Alexandria, which is our custom audio library for this type of sound recognition, and it's the largest library of its type for sound recognition challenges. The only things that come close in size are sound effects libraries or YouTube data, but that data tends to be biased in terms of either quality, or whether it's real data or not. A lot of the car alarms on the internet are extremely high-quality sound effects library versions of what somebody thinks a car alarm sounds like. If you're trying to detect, from a video doorbell, somebody coming up the driveway or

a car being broken into, you want to be very sure it is the actual sound you're looking for, not what some synthesized version has decided it should sound like.

OK, so it sounds like it's not just secret sauce, it's not just clever algorithms that you've developed over a prolonged period: it's the data set that really stands apart.

In machine learning, data quality is the thing that gives you the strongest foothold, and everything else is built on top of that. Once you've got that solid foundation, you can start looking into the data, and one of the fundamental differences between what we do and speech is that the acoustic models are clearly very different. To deliver in the speech world, a standard acoustic model might be based on MFCCs or something of that nature. They've been shown to model the vocal

tract and the vocal system very well. But clearly a window breaking in my house isn't generated by a voice, so that whole world of sound beyond speech is very different, and the acoustic models that have been developed in the speech world over thirty-odd years of research and refinement don't really capture the sounds that we're after. On the other extreme, the thing that does quite a lot of the heavy lifting in modern speech recognition systems and personal assistants, your Alexa and OK Google, is the language model, and the language models constrain the need to look at certain patterns coming out of the acoustic models, or the intermediate step.

A glass break or a dog bark clearly isn't attempting language in the same way, so language models don't apply directly either. So the two fundamental building blocks that have had so much effort and research on the speech side just don't translate, and if you take those away, you're left with something that would just generate false alarms too often and wouldn't be good enough for commercial application.

OK, listen up. If you've been listening to the show for a while now and feel like it's time to contribute, just go to voicetechpodcast.com/donate, where you can contribute $1 a month, $4 a month or $8 a month to help keep the show running. This podcast wouldn't be possible without generous contributions from listeners such as yourself. If you've never used Patreon before, then why not take this opportunity to give it a go? They've made it super simple and secure to donate to independent content creators like myself, and there's never any commitment, so you can cancel whenever you like. If you'd like to see the options, my Patreon page is at voicetechpodcast.com/donate. Patreon tiers are also available for businesses, so if you have a voice tech product that you'd like to promote on the podcast, newsletter and blog, then you can become a Voice Tech Champion. Head over to voicetechpodcast.com/donate for more details. OK, so now let's get back to this amazing episode.

OK, so let's talk about some of the case studies that Audio Analytic have been involved in, some of the products you've built your technology into or licensed it to. Could you give us a couple of examples of what's out there in the market right now?

Obviously there's the huge French telco Free, who also operate over quite a lot of mainland Europe. They have something called the Freebox Delta, which was launched at an event in Paris just before Christmas. It was the first instance of a set-top box / smart speaker having a French-speaking Alexa on it,

and it's a very high-end product with lots of technologies, including our sound recognition capabilities. They'd already sold 100,000 units of that Freebox Delta device by shortly after Christmas, so they seem to be selling at a good pace, and Free have a fantastic reputation for being very innovative.

And high-end. Their mobile phone tariffs are famously cheap, so a lot of people use them, but I wasn't aware of the Freebox Delta. I can see it's got Alexa, Netflix, 4K TV, and speakers from Devialet, who produce the Phantom, I think it is, that amazing speaker that looks like a nautilus, like something that floats around in the sea. It's a premium speaker

for your fancy home. So they've got all of that tech built in, plus Audio Analytic. But what are they using the sound recognition for? It's a set-top box that sits in people's living rooms providing all this content and services, so what do they need Audio Analytic for?

One of its smart value propositions is safety and security. They've got a range of sounds on there: dog bark detection, useful if there's somebody at the house when there shouldn't be; smoke and CO alarm detection; baby cry detection. So they've got a range of different sounds that enable a number of applications as part of their smart home suite, and they also expose those events to software.

To what software then? I mean, if they can detect a dog bark, what are they going to do with that event? What does that interface with?

It goes into the apps that go with the device, and then the owner gets the alerts themselves. So if it's your house and the smoke alarm is going off, you know you need to deal with that; or if somebody's breaking in and you hear glass breaking, you call the police and make sure somebody is responding to it.

Interesting. Is that a third-party app that you'd choose to install on your Freebox, as opposed to something that just comes as standard with the device?

I believe it's something that comes with it. I'm not using a Freebox Delta myself, as they don't supply here in the UK, so I can't vouch for it first-hand, but that's what one of my guys explained to me,

and clearly they've got plans to go further with this. So it's already got the sound recognition technology in it.

It takes a while to get into a product as big and as widespread as that in France, so congratulations. Do you have any other case studies you can share with us, to give us a more complete view of how this is being used?

A UK example springs to mind, although this product is available in the US as well: there's the energy giant Centrica. Centrica is a multi-billion-dollar energy provider; they own brands such as British Gas in the UK, who supply electricity and gas, and British Gas have a brand called Hive, with the Hive hub,

and we're on the Hive Hub 360. This is a device that sits as part of their broader smart home offering. They have a thermostat that sells in the UK better than the Nest thermostat does, a very popular product in itself, plus different smart home devices like light bulbs and plugs, and this hub that goes with it. Given that British Gas is an energy provider, their brand values are about taking care of you and your loved ones in the house. So it's a safety and security offering, but they've also got a health and wellbeing offering, and there's the smoke and CO alarm detection: given that they're ultimately pumping a potentially poisonous gas to a lot of people in the UK, it's

important to make sure that everybody is notified about that. So they're providing a solution to a potential problem that doesn't affect that many people, but when it does affect someone, the consequences are obviously quite serious, and it provides a lot of reassurance to know something is there 24/7, monitoring the environment of your home in a way that it just couldn't be monitored before. So this is something genuinely new and novel, and it does provide tangible benefits.

That's great. I never knew British Gas were so forward-thinking in terms of technology; it's great to see. Let's talk a little bit about context then. I understand that one of the big advantages of being able to analyse the audio environment is that you can understand the context of where the device is, where the user is. Why is that so important, and what kind of benefits does that bring?

Most audio pickup systems and audio production systems take great pains to pick up sounds from the environment in a very, for want of a better word, respectful way, very clean. Before we set up this interview, you took great pains to make sure the environment was clean, so you get a nice crisp reception, and we've got decent microphones on all of our devices here. But the real world has lots of sound in it already that we wouldn't want to pick up, lots of sounds that we have no choice but to live with, because the world itself, and especially cities, are incredibly diverse, complicated sound environments. To ignore that complexity and that contextual information is to just not understand how to best work with the environment. So we feel that

knowing that you're in one type of acoustic scene or another, whether it be a busy coffee shop, or a train station, or the commute in the morning, can help you configure devices to behave more appropriately in those environments from an audio pickup perspective. On my train in this morning, you might choose to turn on some form of active noise cancellation, which clearly comes at a battery cost, so it's not something you want to be doing all the time, especially if there's no need for it. In the office I'm sitting in at the moment there is no noise, so having active noise cancellation on would just be wasting resources. That sort of subsystem control, whether at the top level or a finer level, can only be driven by a

contextual understanding of the sound scene in which one is sitting, and that's something we feel sound recognition as a whole, as a discipline, contributes.

Yeah, I can definitely see that, especially when it comes to things like voice assistants. If the device is aware of your audio environment, it could choose to lower the volume, or choose not to reveal certain information about you because it recognises you're in a public environment, for instance, or that you're at work. There's an infinite number of ways a device could configure its behaviour, not just at a conversational level but through all of its interactions, based on what it recognises about where you are. It makes me think of Google, and how they try to capture as much information about your environment as possible: where you are from Google Maps, what you're buying through Gmail, etc. This is another level, but it's hyper-localised, so it

has to be real-time, it has to be right now. The environment isn't going to persist for long periods of time, I assume, so it has to be an instantaneous reaction to what's going on.

It doesn't have to be, but it's very easy to construct it that way, and I'm glad you see it like that, because a useful contrast to draw in your mind is between physical scenes and acoustic scenes. They can be synonymous, but they don't have to be: a coffee shop can sound very different at 7 a.m. than it does at 12. Just knowing that physically you're at the train station doesn't tell you what it sounds like acoustically, and if you're trying to drive acoustic decisions off that, it's really the acoustic scene context you want.

Right, yeah, absolutely. OK, and looking forward then: for devices there's an ongoing trend to make them smarter, to connect them to the internet, IoT and all of this.

So what does it mean for a device to be smart, and how does sound recognition play into all of that, from your perspective?

For me, if you compare it to personal assistants, if you take those as a sort of reference point, they're great from a command-and-control point of view. They will let you say 'I want to do this', and as long as they understand what you're saying, which they mostly do, they'll try to figure out how to do it. But that's not the way we interact with anybody else, with most people in our lives. Hopefully it's about understanding the context in which you're doing something. If you want a personal assistant to be more caring, it needs to be more aware of your needs, the environment in which you're operating, and what you're trying to do.

That's where context understanding becomes important. To give you an idea: my little girl is still young enough to wake up in the middle of the night and wonder where she is. One of the things we have is a little night light in her room that you can turn on to a very low, warm glow, which is very comforting for her. Now, if I'm in there and she's crying and I'm trying to speak to Alexa in her room, it's early in the morning, I'm quite tired, she's crying away, and Alexa is repeatedly asking me what I want without understanding there's a crying baby there, then I get very annoyed very quickly. Understanding that that's a stressful environment to be in, understanding that I'm

unlikely to be wanting certain things, feeding that in, controlling it, and enhancing its pickup: give it that contextual understanding, and suddenly Alexa stops being the 'I'm sorry, I didn't understand that' you've now heard three times while getting annoyed. Now she's being more helpful, more caring, and she starts to feel more like a personality in that kind of environment.

Yeah, it's really giving that social intelligence, that emotional intelligence, to these devices, so that they don't interrupt you, they're not so awkward, they're not so intrusive in your life. It's giving them that human-level understanding of what's going on around them so that they can respond appropriately. So sound recognition is one of the many pieces in the puzzle that go towards making these devices real companions, or colleagues, or useful assistants in a wide variety of contexts. I find that really exciting.

Voice Chops Tuesday is a weekly newsletter to help you build better voice apps. Whether you're looking to sharpen your research chops, your development chops or your design chops, there's something in there for everyone. Just go to voicetechpodcast.com/newsletter, and look forward to your Tuesdays.

So let's move on to how it works then, because I'm really keen to get onto the technical side of things. Could you describe for us the main pipeline, the main components involved in the sound processing process you follow? Is it one process that applies one-size-fits-all, or are there custom pipelines for different situations? And what kind of components do you use?

In terms of the acoustic model part, we had to develop our own; we could not find anything that worked.

So we came up with our own components for the acoustic model, which we've come to call ideophones. The ideophones give us the best trade-off between a compact representation of the acoustic space in which we're operating and the accuracy of the overall system itself.

Can we go a little bit higher level then. You're talking about training a machine learning model to recognise a certain sound event, so it's a classification algorithm you're using?

Yes, our basic task is classification, multiclass classification: you're trying to categorise sound, and ultimately to spot a specific sound. At the top level, we take a huge volume of data out of Alexandria; we have custom tooling that takes it out of the enterprise storage system we built and pushes it through to the machine learning process, which breaks it down into this ideophone-based model, which defines the space in which we make our decisions.

So when you say that, do you mean the features? You're extracting the features from the audio, on which you're then going to train?

That's right: the shape of the space, in terms of the multi-dimensional space over which you're going to operate, over which you're going to make all

your decisions. A very, very crude example to illustrate, a simplified example: a car alarm is going to have two tones in it, and it's going to alternate between those two tones. So you could have one feature that captures the low tone, one feature capturing the high tone, and then you need something to say that it alternates between those two tones in certain patterns at certain times. If you can capture that information, you can say: that is a very simplified version of a car alarm.
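
As a toy illustration of Chris's simplified car alarm, here is a crude two-tone detector: one feature tracks a low tone, one tracks a high tone, and a final check looks for alternation between them. The frequencies, frame size and thresholds are all made up for illustration; the real ideophone features are proprietary.

```python
# Toy two-tone "car alarm" check, purely illustrative.
import numpy as np

def dominant_tone(frame: np.ndarray, rate: int) -> float:
    """Return the frequency (Hz) with the most energy in one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / rate)
    return float(freqs[int(np.argmax(spectrum))])

def looks_like_two_tone_alarm(audio: np.ndarray, rate: int,
                              low=600.0, high=900.0, tol=50.0) -> bool:
    """Crude check: the dominant tone alternates between ~low and ~high."""
    frame_len = rate // 10  # 100 ms frames (arbitrary choice)
    labels = []
    for i in range(0, len(audio) - frame_len, frame_len):
        f = dominant_tone(audio[i:i + frame_len], rate)
        if abs(f - low) < tol:
            labels.append("L")
        elif abs(f - high) < tol:
            labels.append("H")
        else:
            labels.append("?")
    seq = "".join(labels)
    # Alarm-like: both L->H and H->L transitions, few unknown frames.
    return "LH" in seq and "HL" in seq and seq.count("?") < len(seq) // 4
```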

Of course, the real world is not that wonderfully simple, because otherwise it would be trivial and somewhat uninteresting. The variation that drives the differences between all the car alarms we hear in the real world is everything from the acoustic coupling of the sound to the device's microphone, through to the natural variation of the sound itself: different cars, different ways of fitting the alarms in the cars, different environments in which the alarms are going off.

And the feature extraction: how much of that is manual, custom feature-engineering work on your end, and how much of it is a complete end-to-end solution?

We find it's a mixture of both. We've computed a couple of hundred different features, the things we call ideophones, and they form a solid base on which to operate going forward; we've found them to be pretty stable. In Alexandria we have something like 200-300 specific sound events that people want to detect,

and honestly a very large number of sound events that are just part of the everyday background environment, and we generally find that that couple of hundred features is good enough to represent the larger corpus of events that we're trying to detect.

OK, so you have one single model that detects between 200 and 300 sounds, and it's the same model that does all of that?

The sound profiles are derived from it. It recognises the full range of sounds, and the individual products we sell draw from it.

I'd like to understand how that works. Is there a single model that you train to cover all 300 events that you've identified as being important to your customers? Or is it a multi-stage kind of thing: if it's a health kind of audio event, does it go into a health model? Is there some kind of sorting at the beginning that passes the sound on to more and more specialised models, or is it just one model that does everything?

Whatever it is, it must run in real time and fit in memory on the device, otherwise we're going to increase the device's cost quite literally. So this is all real-time, or better-than-real-time, operation, and it's all on the devices.
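
One plausible reading of the 'single model, many profiles' idea in the exchange above is a single multiclass classifier that scores every sound class it knows, with each product's sound profile simply selecting the subset of classes that product cares about. This is purely an illustrative guess at the architecture, not Audio Analytic's actual design:

```python
# Illustrative guess: one shared multiclass model, per-product profiles.
import numpy as np

CLASSES = ["baby_cry", "dog_bark", "glass_break", "smoke_alarm", "background"]

def detections(scores: np.ndarray, profile: set, threshold: float = 0.7):
    """Return detected events restricted to one product's sound profile."""
    return [(label, float(score))
            for label, score in zip(CLASSES, scores)
            if label in profile and score >= threshold]

# A security camera only loads the safety-related profile:
camera_profile = {"glass_break", "smoke_alarm"}
frame_scores = np.array([0.05, 0.10, 0.92, 0.01, 0.40])  # one model's output
print(detections(frame_scores, camera_profile))  # -> [('glass_break', 0.92)]
```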

I'd underline the fact that the training of these networks on this data requires all the horsepower, but the inference, the prediction when it actually hears an event in the real world, is much less resource-intensive, right? So that runs on the devices: it's not just the capture, the prediction itself is on the device. Right, OK. What have been the main technological drivers behind this kind of technology? I'm interested in knowing what level of cutting-edge techniques versus tried-and-tested signal processing you use, and also whether the availability of new components has enabled this sort of product to come to market now. What are the key drivers?

Capturing the audio data in a way that gives us the high level of understanding that the speech world has had for quite some time: a proper understanding of prior probabilities of occurrence, of expected sound levels, a proper understanding of the variation within the sound and what drives those variations. You need to be certain, if you're selling a product mass-market, that it's not going to arrive at your house and not work for your baby just because we hadn't heard your type of baby cry before, for example. That level of certainty is underpinned by understanding the uncertainty and the variability within the data itself.

Is the data coming from mobile phones, I assume?

No, we go out and record a lot of it ourselves, and then run it through those techniques I talked about, from the anechoic chambers upwards,

to really understand the variability within the sounds we work with.

So could you have done this 20 years ago, for instance? Is it just the case that nobody had come along and done it, or are there other elements?

One element is realising the power you've got within a standard consumer device at the price points we're at. Think of something like a camera: a home security camera might be selling for £40-50 a box or something like that. Clearly 20 years ago that would have been a very different device, and there's been the IoT explosion driven by cheaper ARM-series processors and the like. And then, more recently,

there are some of the newer compute engines that go with that, like the recently announced Arm ML platforms we've done some work on (there's a press release on our website), looking at how you get the most out of a relatively low-cost processor, so you can get it into the hands of the consumer at a £40-50 price point. Going alongside that would be the algorithms themselves, and the fundamental research into how you do sound recognition, a lot of which we've done in-house. You can only borrow so much from the speech world and the music world; very quickly you find yourself dealing with a different type of machine learning and a different type of

problem. That's what I'm interested in, because you're saying that the speech world can only really take you so far: you needed a huge amount of custom work on top of that for the algorithms, and you've had to develop your own data sets. It's not the case that you've just got this sound data coming off mobile phones; you had to go out and get the right kind of data. So from the research and development side of things, it sounds as though you haven't been able to rely too much on the existing state of the art. Could this have been done 10 or 20 years ago?

To some extent, but you have to think about how well it would have worked: the performance rates wouldn't have been there. There are modern techniques we use as a starting point, and then we build on top of them and modify them for sound recognition. Clearly the advances in generic deep neural

networks and things like that have helped, in terms of driving forward some of these topologies and architectures. And then there's the very cutting-edge stuff around TinyML. I was at one of Google's campuses a couple of weeks ago, in Mountain View, at the inaugural TinyML group out there, with everyone from the silicon vendors right up to the software and algorithm developers, looking at: how do you optimise that entire stack to make sure that inference can happen accurately, precisely and economically on edge-type devices?

That's really close to my interests. Can you give us some of the key takeaways that you got from that? What are the key ingredients to making these models run on these very low-resource

devices? There are the obvious problems: if you're a silicon manufacturer, your instruction set matters, and a lot of the obvious stuff applies generically to any neural network. Multiply-accumulate type operations occur a lot, so if you can get those accelerated, you're doing pretty well. The interesting part is that this is an extremely dynamic space: people are constantly exploring the types of network topologies and configurations and constantly discovering new things. For the silicon guys, with their lead times, when you build new silicon or a new design it might be multiple years out. So you've got the silicon guys saying, 'please can the algorithm

guys tell us what algorithms are going to be used next year and the year after', and the algorithm people going, 'well, it changed last Wednesday and it might be completely different by next Thursday'. That's a lot to ask when you're investing in a piece of silicon to lay down, so you've got that tension going on, which is interesting in itself. The other observation that should be made is that, beyond those basic calculation blocks, small changes in the domain you're working in can translate into large changes in the way you might choose to make the algorithms operate optimally on silicon. For example, image recognition, speech recognition, sound recognition, music recognition: those are quite dramatic changes in domain, and they translate into dramatic differences. So the

idea that there is a universal neural network type of piece that just works holds to a level, but not if you're actually going to rely on these systems for real high-performance, mass-market applications. That's where the domain differences start to dominate the decisions of anyone thinking about producing custom, domain-specific hardware for these types of applications.

Is that right? Each domain has its own set of topologies and configurations?

And those are constantly in motion at the moment, because the field is so new, which gives the hardware designers some incredibly complex problems to deal with. Whether you're building software to do it or building silicon to do it, if you're trying to optimise for small, low-power embedded devices, it's the same set of problems.
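
To make the multiply-accumulate point concrete: almost all of the inference cost of a small neural network reduces to loops like the one below, which is why edge silicon focuses on accelerating int8 MACs. A toy illustration, not any particular chip's kernel:

```python
# One quantised dense layer: int8 multiply-accumulates, then rescale.
import numpy as np

def dense_int8(x_q: np.ndarray, w_q: np.ndarray, scale: float) -> np.ndarray:
    """int8 weights/activations; accumulate in int32 to avoid overflow."""
    acc = w_q.astype(np.int32) @ x_q.astype(np.int32)  # the MAC loop
    return acc.astype(np.float32) * scale  # back to real-valued activations

rng = np.random.default_rng(0)
x = rng.integers(-128, 127, size=64, dtype=np.int8)        # quantised input
w = rng.integers(-128, 127, size=(32, 64), dtype=np.int8)  # quantised weights
print(dense_int8(x, w, scale=1e-4).shape)  # -> (32,)
```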

Moving to the developer side of things then: if I was a developer and I wanted to integrate Audio Analytic's ai3 framework into my product, what would I need to do? What would I need to download, what skills do I need, and how do I process the data so that I can get the best results?

Somebody would help you through the process. In terms of the actual development part, it's a C library that's provided, a relatively simple C library to use. At a very high level, you're feeding audio in, in frames, and then you're saying 'please look out for these sounds' by loading in a range of

sound profiles. Then, as frames come in, once enough audio look-back has been built up from the frames you've been adding, it starts to produce events out. For every frame of audio you feed in, within those events it would say 'I detected a baby crying' or 'I detected a dog barking', if those happen to be present within the audio stream.
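
The real ai3 product is a C library whose actual function names are not public, so here is a hypothetical Python equivalent of the integration pattern Chris describes (load profiles, feed fixed-size frames, consume events). Every identifier is invented for illustration, including the frame size:

```python
# Hypothetical sketch of frame-based integration with a sound recogniser.
import wave
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Event:
    label: str
    score: float

class StubEngine:
    """Stand-in for the real engine: loads profiles, buffers frames."""
    def __init__(self, profiles: List[str]):
        self.profiles = profiles
    def process_frame(self, frame: bytes) -> Iterable[Event]:
        return []  # the real library would emit detections here

def run_detector(engine, wav_path: str, frame_ms: int = 20) -> None:
    """Feed a WAV file to the engine frame by frame and print events."""
    with wave.open(wav_path, "rb") as wav:
        frame_len = wav.getframerate() * frame_ms // 1000
        while True:
            frame = wav.readframes(frame_len)
            if not frame:
                break
            for event in engine.process_frame(frame):
                print(f"{event.label} detected (score={event.score:.2f})")

# Usage (hypothetical file and profile names):
# run_detector(StubEngine(["baby_cry", "dog_bark"]), "living_room.wav")
```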

I assume you're looking for raw audio, with no preprocessing done on it? I know from my previous episodes that's often preferred.

We always prefer raw if we can get it. Obviously if you are doing preprocessing, it doesn't necessarily preclude anything; we'd just need to have a chat or a look at the data sheets to understand what preprocessing you're doing, because some sorts of preprocessing might destroy the very characteristics of the sounds we're looking for. To take the obvious case: if you filter out half the frequencies through a high-pass filter of some sort, then clearly that's going to damage the ability of the system to do its job. But generally we find the types of audio pathways configured in consumer electronic devices are perfectly fine; we very rarely these days run into a device that can't provide the audio input we need. That's been one of the great advantages of personal assistants and the focus on sound. It used to be that sound was fairly far down a product manager's list in consumer electronics product development, and it normally went something like: as long as I can hear something, then I'm happy, and the test would be, can they hear something?

Standards have crept up, because the focus for consumers has become: I want good-quality sound, I want to be heard clearly, I want intelligent sound, etc. The effort that's gone into making sure the audio subsystems are well designed has improved the overall industry dramatically.

OK, that's good to hear. Do you have any plans to make your technology available through some kind of web API, so the calls could be made remotely, as opposed to just on-device?

No, we're an embedded company, so we license software that people put into their devices. We feel that, from a privacy point of view, that's the best way to do sound recognition. Unlike speech, there's no obvious wake-word type configuration, so if you were doing it in the cloud, you'd be streaming audio up to the cloud pretty much 24/7, and I don't know about you, but I don't want a product in my house that's streaming audio out 24/7. I've no problem with Alexa and the wake word, I'm fine with that, but not 24/7 streaming. Plus there's the internet bandwidth, and on the other side there would be the cost of receiving and processing all of that audio in the cloud. So we think the edge is the natural

fit for the way sound recognition needs to be done, if it's being done properly.

You're talking about the real-time use case now, and that makes perfect sense. What about asynchronous use cases? Maybe there's just not much call for it, but if response time wasn't so important to me and I just wanted to know what events are present in an audio file, I suppose I could do that with ai3 too. Do solutions like that exist, and have you come across use cases that require it?

Have you got an iPhone or an Android phone? On the iPhone there's a

capability called Memories. I can say, 'hey, show me pictures of Daisy on the summer holiday', and it knows what my little girl looks like, so it shows me pictures of Daisy. I would like to be able to say 'show me pictures of Daisy laughing', or 'videos of Daisy laughing'. A little girl laughing, especially your own child, is extremely emotive, and it's what you remember a lot of the time. That's clearly a sound-based event, and it doesn't do that through the Memories functionality. That's the sort of thing that could be done post-event, which is the scenario you describe. Obviously for us the audio is still captured on the phone; whether it's done at the event or after the event, it's the same audio data, it just happens to be captured

and analysed later, which is a perfectly permissible use case, and you could use our libraries to do that: you could put them into back-end systems to run over all that data in batch and then make those tags searchable.

I'm surprised that Google doesn't record a portion of audio with every photo taken or added to Google Photos. It seems like, if you got the audio with it, they could index it and allow searching on it in whatever the equivalent of Memories would be on an Android or Pixel phone.

It certainly doesn't do it today, and it takes me ages to find that video of Daisy laughing on the summer holiday, so I would like that; it's a real-world problem.

Yeah, the use case deserves it and it's not there yet. That's a nice segue onto the subject of data privacy

and security. Obviously nobody wants something listening all the time, and yet that's what your software relies on. But it's sound-event-based: there's no wake word, it's processing the sound that it's hearing all the time, it's not sending it anywhere, and the audio is just disposed of, I assume, immediately after it's been processed, once an event has or hasn't been detected. So what are the details? How do you address the concerns of users, what are the data storage requirements, and how do you adhere to GDPR and all the regulations that are coming in?

Audio holds a special place in people's hearts and needs to be treated with respect and appropriate measures. As you point out, we operate on the edge of the network, so

no audio needs to be transmitted for the purposes of sound recognition, and that obviously simplifies quite a lot of the privacy conundrum, or the machine learning trade-off: yes, users want privacy, but they also want well-functioning products, and to have well-functioning products you need data. So that's the equation that has to be squared. We square it by doing a lot of the heavy lifting of collecting that data ourselves, largely because even if we had a very large number of devices out there looking for glass breaks, and were somehow getting the audio recordings back, we still wouldn't get that many instances of glass break; it's just not that common an occurrence,

and you need to capture it in certain ways if you're going to do it properly. Now, obviously we do have teams who go out and place devices in people's houses, with permission, and have that recording going on, and the privacy of that is taken very seriously by the company: people have been kind enough to let us in, and that deserves respect.

Let's talk about the future applications of sound recognition then. Could you give us your view on what new developments we're likely to see over the next year or two? What do you think is coming down the pipe?

So, there's exciting stuff, and there's stuff I'm going to have to not talk about, to avoid giving too much away.

There's an awful lot going on. In terms of sound profiles we're working on at the moment, we just did an outdoor glass break, as opposed to our indoor glass break, and we're just starting on car alarm detection for outdoor cameras; clearly there's not much of a market for indoor car alarm detection. And then we work through a range of sounds, often driven by what our partners require, which is why I find it hard to talk much beyond what's publicly known, because that would give you hints about what our customers and partners are going to put out.

It's very much customer-driven: when a new capability is coming out, a customer's product typically takes that sound first, and other people take it afterwards, of course, but that's always the way those new sound capabilities get out into the marketplace, because of the nature of our business. And the headphone side is an area of significant interest to us: making these very low-power devices context-aware in the ways I've described in this interview. I think we'll see some exciting things coming out in that space, so I'm quite excited about that,

partly because headphones are one of those few devices where the microphone goes everywhere with you, at ear level, so it really takes sound recognition out into the world with you. Few other devices meet that criterion.

With regards to podcasts, that's very relevant. People say they don't really use voice assistants much for podcasts, and I don't either, but if the assistant was in my ear, then yes, I would ask it to just play the latest podcast, because it can hear me; it's on my head already. Just that one element, that mobility, changes everything, and I can imagine the potential for sound recognition being phenomenal, because it's with you everywhere you go. So what are the big unanswered questions in the field right now, the hot research areas? What's likely to come that's really going to change things at a product level over the next few years?

Let me take the one closest to home first. I think it's: how many sounds is enough to be able to provide contextual understanding? How many sounds do we need to do that? That's an interesting, challenging question for us; we don't know the right answer yet, and there are design implications that sit around it, so it's quite a complicated one. It's driven by the environments in which you want to provide contextual understanding, and so on. Generally, do you want to give something the same level of contextual understanding as a reasonably young adult? And beyond that you're into specialisms, because

many people specialise in different contextual understandings, especially from an acoustic point of view: musicians would be the obvious example of people who really understand a subspace of the spectrum. So that's an ongoing question for us, and I don't know if we'll ever formally answer it, though I would like to think we would. In terms of the field more broadly: as the number of sounds goes up, clearly the complexity of the problem goes up, and hence the resource usage on the devices, while the trends are constantly pushing down on that in the opposite direction. So if you want to provide context that's quite broad for an application, you're going to have to choose among the subsections of that context, otherwise devices will

start to lose their sense of hearing, which is what we're here to give them.

Final question then: what's on the horizon for Audio Analytic? What can we look forward to, and what will you be focusing your energies on over the next 6 to 12 months?

That's a big question. Predicting a couple of months out is one thing; 12 months out is crystal-ball territory, isn't it? For me, I'm enjoying immensely trying to explore this new area and what it means. We'll be constantly challenging ourselves to understand: what is the limit of sound recognition? Where does it stop? Where does it meet the other fields? In what ways does speech recognition take over, and where does sound recognition stop, including those grey areas over which they overlap?

Those are some of the questions I hope to be tackling, and then there's the part I put my heart into, which is getting more and more of these products into the hands of customers, helping them experience what sound recognition is, because the ultimate adjudicator of what people want sound recognition to do will be people exploring sound recognition for the first time, the second time, the third time, on multiple different products. Trying to predict that type of information is just very, very challenging.

You've pre-empted my next question, which was going to be: where are we likely to see sound recognition in the next year that we don't currently see it? Do you have any ideas, any clues that you can give out?

I think you'll start to see some of it coming up in the automotive space,

and you'll see more of it in the smart home space, and other areas a little bit later. But those are everywhere you are: at home, in the car, just walking about. We've got ears for a reason; we use them pretty much every single day, and it's amazing, if you start to view the world through sound, how much we rely on that peripheral sense. Then if you look back at products, you think: why do you not understand that same sense? Once you get yourself to that position, it's just frustrating, because the device is going to annoy me, it's not going to behave in the way I want it to, and it's going to be infuriating

rather than a delightful experience, and ultimately that's what making consumer products is about; that's what we're trying to achieve.

Absolutely fantastic. Thank you very much for coming on the show, and best of luck.

OK, so you just heard from Dr Chris Mitchell, the CEO and founder of Audio Analytic. That's all for today; I hope you enjoyed listening. As always, you can find the show notes at voicetechpodcast.com. If you've enjoyed this episode, there are many ways to support me: you can tell one friend or colleague about this episode; you can leave a quick review on iTunes at voicetechpodcast.com/itunes, which would be very much appreciated; if you'd like to write for the blog, you can submit your original content at voicetechpodcast.com/publish; and of course you can become a sponsor, just like the many sponsors I mentioned at the beginning of the show, at voicetechpodcast.com/donate. Your contribution really does make the show possible. I'll be back soon with another episode, and I look forward to seeing you at the conferences if you choose to attend, but until then, I'll be in your ears. Thank you for listening to the Voice Tech Podcast.
