Machine Learning Signals – Christopher Oates, audEERING – Voice Tech Podcast ep.020


Episode description

Christopher Oates is a Senior Audio DSP Engineer at audEERING, an audio analysis company that specialises in emotional artificial intelligence. Chris explains how the human voice production system works, and introduces us to a technique called linear predictive coding (LPC) which can extract the features of the voice.

We then focus on machine learning for audio, including using expert audio knowledge along with machine learning methods, leveraging the openSMILE toolkit for feature extraction, and signal processing techniques. Chris explains things really well and even brought along some audio clips to help illustrate the signal concepts.

He then reveals some of the latest projects at audEERING, including using speech analytics in gaming applications, such as whisper detection in a ninja game! It’s an awesome episode that is jam-packed with useful and interesting information. Enjoy!

Links from the show:

Episode transcript


Powered by Google Cloud Speech-to-Text

Welcome to the Voice Tech Podcast. Join me, Carl Robinson, in conversation with the world’s leading voice technology experts. Discover the latest products, tools and techniques, and learn to build the voice apps of the future. Though it’s very synthetic-sounding and robotic, because it’s not a natural signal, you can definitely hear that ‘ah’ characteristic.

Hello, welcome back. This is episode 20. I want to say a big thank you to you, whether you’re a longtime listener or you’ve just discovered the podcast and decided to give it a shot. It’s a real privilege for me to be able to bring you these interviews, and I’m very grateful to you for helping me get this far. Today’s episode is entitled Machine Learning Signals, in which you’ll hear my conversation with Christopher Oates, Senior Audio DSP Engineer at audEERING. Some of you will already have heard my interview with Florian Eyben, the CTO of audEERING, back in episode 14, where we got a good overview of the company’s products, so I’m not going to cover that again in this episode. But briefly, for those of you who haven’t heard that episode: audEERING are an audio analysis company based just outside Munich in Germany. They specialise in emotional artificial intelligence, and their products are able to automatically analyse acoustic scenes and the emotional states of a human speaker, which is also known as voice emotion analytics.

They do a lot more than emotion, though. In fact, they are the creators of openSMILE, which is the open-source toolkit for real-time audio feature extraction and classification. It’s the most widely used tool for audio analysis in both industry and academia, and today with Chris you’ll hear me discuss some of the finer points of the toolkit, to help you understand how it all works. In our conversation you’ll learn all about the human voice production system and a technique called linear predictive coding, LPC, which allows us to extract the features of the voice. Then we focus on machine learning: using expert audio knowledge along with machine learning methods. We discuss the openSMILE toolkit for feature extraction, and we delve into some signal processing methods and fundamentals to help you understand things in a little more depth. Lastly, Chris gets onto some of the cool applications that you can build using this technology. audEERING have just started venturing into gaming applications, using things like

whisper detection. Chris explains things really well and even brought along some audio clips to help illustrate the signal concepts, so even if you’ve got no experience in this area, I think it’s a fantastic introduction to the subject. I’ve got some exciting updates to share with you. First of all, about Voice Tech Tuesday, the weekly voice technology newsletter: three issues have been sent out now, and it continues to have a fantastic response, a really good open rate with a good click rate, so I think you guys are really liking what I’m sending out so far. I’m definitely noticing that the stuff focused on building voice apps, the developer tools and the how-to and tactical stuff, is getting more of a response than the historical or general-interest kind of articles, so I’m going to focus more on that. The next one goes out on Tuesday, obviously, so definitely sign up for it if you haven’t already, because now you can also listen to the newsletter thanks to binge with

I was contacted by Alex, the co-founder, or the founder, I think; yes, she’s the founder. It’s a text-to-speech plug-in for WordPress, so all I have to do on my end is copy my newsletter content into my blog and hit generate, which is a button that appears because of the plug-in, and in a few seconds, literally, I’ve got an audio version that I attach to the newsletter. I’ve been talking with Alex, and she’s been making lots of improvements to the audio player and getting feedback from the community. It’s been exciting to be part of the journey, and I’m excited to say that I’ve now got it generating audio for all my articles on the blog as well as the newsletter that goes out. So if you want to sign up for Voice Tech Tuesday, just go to /newsletter. I’m also inviting guest authors to publish content. In addition to the Voice Tech blog, we now have a Medium publication, so if you write about voice tech on Medium, all you have to do is send us your article link, and then we’ll add you as an author and get your stuff published. Whether you’re on Medium or not,

if you’ve got some content you’d like to share with the Voice Tech audience, just head to /publish, where you can read the editorial guidelines and submit your articles for publication. In fact, like I’m saying, all the articles on the Voice Tech blog are now converted to audio with beans with, so if you publish an article with us, we’ll generate an audio version for you too. I’m very proud to say that this episode is sponsored by Dabble Lab, our first champion sponsor. They’re perfectly aligned with helping voice developers build better voice apps. I really couldn’t ask for more from my first sponsor, and I’m thrilled to be able to introduce you to them today. So what’s Dabble Lab? They’re a company that helps developers, businesses and agencies build custom solutions for Amazon Alexa, Google Assistant, Twilio Autopilot and other emerging digital assistant platforms. Their YouTube channel contains over 153 tutorial videos aimed squarely at voice developers, and it contains step-by-step instructions on building for multiple platforms.

These are very high-quality videos. Steve and his team have gone the extra mile to ensure that they’re very up to date, and have even gone as far as putting a date on every video so you can see how recent they are. We’re talking short, concise videos of between 5 and 15 minutes each, so you can get in, get what you need and get coding quickly. If that sounds interesting, go to /dabblelab, that’s d-a-b-b-l-e lab. They don’t just offer the videos; they also offer code templates for building Alexa skills, so you can get one up and running in no time at all. The template code is available for free at their sister site, skilltemplates.com, and all the tutorials and the code samples are absolutely free. You should also know that Dabble Lab is an experienced development shop that’s exclusively focused on conversational applications. The video content is produced by their team of in-house experts, but also by other industry experts in the field, such as Jan König from Jovo, who is now contributing a weekly video called dojo the Wednesdays. So if this sounds like

something that you’d find useful, go check out /dabblelab and let me know what you think. You can also sponsor the show for as little as $1 a month, or you can become a Voice Tech champion like Dabble Lab and promote your business through our website, weekly newsletter and podcast episodes. More details can be found at /donate. OK, so without further ado, it’s my pleasure to bring you today’s guest, Christopher Oates. OK, I’m here in person with Christopher Oates, Senior Audio DSP Engineer at audEERING. Christopher, hello and welcome. Thanks, it’s very nice to be here in central Paris in your lovely apartment. And you’re in Paris on holiday, I believe? I was here for a little business trip, and then myself and the girlfriend are going to enjoy the sights and the sounds of Paris. Excellent. I’m really pleased that you could make it and that we can meet in person. As listeners will probably know, I’ve already done

one interview with the CTO of audEERING, Florian Eyben, a couple of episodes ago; that was episode 14. So if you want to find out what audEERING is all about, and all about the openSMILE library that they developed, which some of you will already have heard about, then I encourage you to go back and listen to that episode. But for those who haven’t listened to it, can you just give us a quick intro to what audEERING is, and tell us a bit about your background and how you ended up working there? audEERING is a speech analytics company, and also an audio company. We primarily focus on speech, and we combine artificial intelligence with audio DSP. With speech, we primarily focus not on what someone has said, like you would get with Alexa, Cortana and Siri, but on how something was said, because a lot of meaning comes across in the tone of the voice and in the articulation. Right, so it’s not all about

the text, what you say; it’s really about how you say it. Right. So the title of the last podcast episode was Voice Emotion Analytics, but that’s not all that audEERING does, right? It’s not just about the emotion; it’s more general. Yeah, it’s all about how it’s said, and all of the details in the voice signal that aren’t the words, basically. Exactly. OK, great. Tell us a bit about your background. I’m from Ireland originally. I did a bachelor’s degree in musicology. When I started I was a drummer, and I thought, this is going to be great, I’m going to play drums in the studio for my degree, it’ll be fun. At some point along the way I discovered signal processing and scientific methods, and thought it was much more interesting, and I achieved much more there than I did in my drumming. Then I found a job at Dolby. OK, so Dolby is what we see at the cinema. Yes, and Dolby as well for the home. And that also got me into the

world of compression, audio compression. OK. Yes, that was Dolby in Germany, but there’s also one of their competitors, which is Fraunhofer. They are a government-run institute, and they’re the inventors of MP3 and MP4. Yes, I worked there for a bit. And then how did you find audEERING? audEERING had a position with a focus on voice analytics, and this is something I had knowledge about but hadn’t done a deep dive into, and this was way more exciting. It’s a wave that’s coming. Yeah, for sure. There’s a lot of surround sound stuff and codec compression; a lot has been done, and it’s been around for 20 years or so. This was a very fresh, very young industry. There’s a lot of research that has happened, with good results, but bringing stuff to market hasn’t really come about yet. So it’s a good wave to ride, if you like. Yeah, and audEERING are running away with that. We all are, one way or another, for sure. So audEERING

won another award. This one’s a little bit more unique than the other ones: it’s the Bavarian Innovation Award, for the state of Bavaria in Germany. OK. It’s very, very general; 170 companies applied, and it’s only every two years as well, and we were in the top three. That’s quite an achievement, I think. Congratulations. Yeah, audEERING are going from strength to strength; really fantastic. Yes. I would also like to thank you, Carl, for doing this. I think this is a really great service you’re providing to the community. I found your podcast a few months ago and I started listening because I was hungry to learn more about the world of audio tech, and now I work in audio tech. I’m also a proud patron, and I would say to people listening that if you’ve listened to three episodes and you’re planning to listen to a fourth, you should strongly consider becoming a patron. I really appreciate that. Thanks so much. I’m very honoured to have you as a patron

and now a guest on the show. I really appreciate it. OK, so let’s get onto the topic today. We’re going to talk about digital signal processing as it relates to machine learning. Really, it’s going to be for people looking to use machine learning and AI for audio in one way or another, but who want to understand the concepts behind digital signal processing a little bit better, so that they can manage the data and use it for the many possible applications that exist today. We’re going to start from a very fundamental place. As this is the Voice Tech Podcast, I want to start by discussing the voice and how voice sounds are produced, and then this will tie into smarter machine learning applications, in which you narrow in on the particular problem that you’re trying to solve, and also adding a lot of data as well, to get the result that you need: a robust model. It’s going to be a bit more technical than some of the other episodes, but

Chris, you’ve got some experience teaching and writing tutorials and these kinds of things, so you’ll be all right on that end. In your company as well, you said that you were used to explaining these complex audio concepts to programmers, AI and machine learning type developers who are into machine learning but don’t know anything yet about audio. Exactly. OK, well, you’re in the right place. So, beginning with the voice production system, which is your larynx, your voice box, and in addition your mouth and what we call your articulators: your lips, your teeth, even your nose. These all add a certain characteristic to the sound of your voice, and that makes you sound like you, essentially. It starts out at the origin, which is the voice box. Inside the voice box there are what are called vocal cords. They’re like two little elastic bands, and you have muscles which are stretching them for you,

and you essentially blow air through them and they vibrate. An example is an ‘ah’ sound. What’s happening with the ‘ah’ sound is you’re stretching your vocal cords, you’re pushing air through, and the vocal cords are vibrating at a certain frequency. That frequency is known as the fundamental pitch, or fundamental frequency, of your voice. Everyone’s is a little bit different: some people have higher, some people have lower. OK, so it’s a bit like a woodwind instrument, where you’re blowing air past something to make it vibrate, and the vibration of that thing transmits a signal into the air, which is then carried. Yes. More specifically, for the voice, the vocal cords open and close, and they release essentially puffs of air, or little pulses. Pulses we’ll get onto later. OK, so puffs of air are essentially little pieces of sound, if you like? Yes, and the time in between the pulses determines the pitch. OK, so closer-together pulses will result in a higher pitch, and more spaced-out

pulses result in a lower pitch. And these pulses are happening very, very fast. I mean, what’s the distance in time between each of these pulses? We’re talking fractions of a second. If something vibrates 100 times a second, that’s 100 pulses a second. OK, yeah, of course. I’ve heard of the human vocal range; I think it’s roughly 50 to 550 Hz, something like that overall, but for a man and a woman it’s different. Exactly. Women have a higher register; their vocal cords are vibrating faster than male vocal cords. Yeah. OK, great. All right, so that’s one of the main components of the whole voice production system. There are two components: you have what we call the source, which is your vocal cords, and then you have the articulators, or the filter,

which is your mouth, lips, teeth and tongue. The tongue is very important: it’s how you make a distinction between different vowels, how you make different vowel sounds. For example, there are a bunch of things that the vocal cords can do which result in different sounds. I already mentioned that if they are stretched, you make an ‘ah’ sound. OK. But if you relax them, you speak like this, like when you whisper. Your vocal cords are not engaged in this case, so you’re just pushing air through a tube, essentially, and that creates the whisper sound, which is why it sounds noisy and there’s no pitch to it. Yes, because nothing’s vibrating, essentially. That’s an important distinction, right? It’s referred to in some cases as voiced versus unvoiced: voiced is where the voice box is engaged and it’s adding that vibration, and unvoiced is where it’s not; it’s just using the hiss of the air.
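The idea that the time between pulses sets the pitch, while a whisper has no pulses and therefore no pitch, can be sketched in a few lines of Python. This is an illustrative toy rather than anything played in the episode: the function names, the 16 kHz sample rate and the idealised one-sample pulses are all assumptions.

```python
import random

def pulse_train(f0_hz, sample_rate=16000, n_samples=1600):
    """Voiced source: one idealised pulse every 1/f0 seconds."""
    period = round(sample_rate / f0_hz)  # samples between neighbouring pulses
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def whisper_noise(n_samples=1600, seed=0):
    """Unvoiced source: random noise with no repeating pulses, so no pitch."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def pitch_from_pulses(signal, sample_rate=16000):
    """Pitch in Hz = 1 / (average time between neighbouring pulses)."""
    positions = [i for i, x in enumerate(signal) if x > 0.5]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap_s = (sum(gaps) / len(gaps)) / sample_rate
    return 1.0 / mean_gap_s

print(pitch_from_pulses(pulse_train(100)))  # pulses 10 ms apart
print(pitch_from_pulses(pulse_train(200)))  # pulses 5 ms apart: higher pitch
```

Real glottal pulses are not single samples and real recordings are noisy, so practical pitch trackers use more robust methods such as autocorrelation; the sketch only shows the spacing-to-pitch relationship.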

There are other sounds which also come under the unvoiced category, which are hisses; they’re usually called plosives. ‘S’ syllables and the ‘sh’ sound of ‘sugar’ are examples. What you’re doing there is you’re not engaging your vocal cords, but you’re pushing air through the space between your teeth and your lips. For example, with ‘sss’ (you guys can try that), all you’re doing is pushing air through to make the sound. You can make different sounds by what you do with your lips on top of that: I can push my lips forward, or I can pull my lips back, for very different sounds. Yes. And this kind of stuff comes up in accent detection and dialect detection, where people have different ways of pronouncing the same words, and they’re basically doing different things with their mouths. If you can understand what they’re doing with their mouth, you can then look for this in a speech signal, for example, and use these as triggers for accent detection. A nice, easy

example is the Indian accent. If you have a native Indian speaking English as a second language, they do certain things that we don’t do as they speak. Take the ‘d’ sound: when I say ‘d’, I’m using my tongue pushed to the front of my mouth, and they push the tongue to the roof of their mouth and you get a pop sound. I watched a video about this; it’s really interesting. On the word ‘don’t’, they don’t say it the same way; they say ‘don’t’ with that pop sound. So this is a really obvious indicator that you can look for in a signal when analysing someone’s speech to tell what dialect they have. And that’s why it’s important to understand the mechanism of production, because you can actually spot these things and perform classification tasks, identification tasks, etc., on the signal. With machine learning,

you can either just throw as much data as possible at your network and hope it learns something, or you do it a little smarter way, where you understand the problem that you have and focus on the characteristics that you’re looking for, and you distil the speech signal down to those very simple parameters or characteristics, train on those, learn that space, and try to classify within that space only. That makes sense. This is what you do when you try and educate your child: you don’t give them the whole world all at once; you simplify it, you present it in a form which accentuates the bits that you want them to actually pay attention to, and that makes the learning process easier. Exactly. Let me give you another example. Again, we’re still on the voice box: the crackling in your voice, and it’s related to the work we did on Parkinson’s. I’m going to try to do an impression: if you go ‘aaah’, like that. What’s happening there is your vocal cords vibrate, and you can hear the little pulses in my voice, but they’re not

vibrating in a periodic fashion; they’re non-periodic. Periodic means the pattern is repeating in the pulses: the distance between the pulses is fixed. On a nice clean vowel it’s very regular, everything is uniform. Yep. When you go to that crackly sound, it becomes non-uniform. With the Parkinson’s disease detection we did, this is one of the things that we were looking for: irregularity in the pulses. So we looked for instances where we could see the pulses in the voice, which occurred on vowel sounds, and looked for irregular spacing between these, and then the extent of the irregular spacing will tell you something about the extent of the Parkinson’s and how little control they have. It’s pathological; it’s not something they think about. No, of course, it’s a result of the disease. Yeah, exactly, exactly.
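One common way to put a number on this kind of irregular pulse spacing is local jitter: the average change between neighbouring pulse periods, divided by the average period. The sketch below is a minimal Python version that assumes the pulse times have already been detected. It is offered as an illustration of the idea, not as the feature audEERING actually used, and the function name and example timings are made up.

```python
def local_jitter(pulse_times_s):
    """Mean absolute difference between consecutive pulse periods,
    divided by the mean period (0.0 means perfectly regular pulses)."""
    periods = [b - a for a, b in zip(pulse_times_s, pulse_times_s[1:])]
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_period = sum(periods) / len(periods)
    return (sum(diffs) / len(diffs)) / mean_period

# A clean vowel: one pulse every 10 ms.
regular = [i * 0.010 for i in range(20)]

# A crackly voice: periods alternating between 9 ms and 11 ms.
irregular = [0.0]
for i in range(20):
    irregular.append(irregular[-1] + (0.009 if i % 2 == 0 else 0.011))

print(local_jitter(regular))    # essentially zero
print(local_jitter(irregular))  # clearly larger
```

As the conversation stresses, a single number like this is only one indicator and would be combined with other features before drawing any conclusion.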

Parkinson’s detection is not simple, of course, and there’s no one feature, no one characteristic of the voice, that can say for sure whether someone has or hasn’t got the disease, but it’s one of those things that is a good indicator and can be used with others. You’re using human intelligence to get a head start on the problem. I like that: using human intelligence there, not expecting the machine to just figure it out. You know, we’ve got the most powerful brains available, I think. All right, so that’s a great intro to the voice production system.

OK, listeners, listen up. I’ve just launched Voice Tech Tuesday, a new weekly newsletter that helps you build better voice apps. Just go to /newsletter to sign up. It’s a nice mix of useful stuff and fun stuff, it’s easy to read and it’s not too long. The next issue goes out on Tuesday, so sign up now in time for the next issue. Also, all the past issues are available through that link too. It’s /newsletter. Do it now; I look forward to your Tuesdays. So we can move on. We’ve covered the voice box, but there’s also the articulators, the other important part. So as I said, at some point something is vibrating, this is the source, and then there’s a filter on top, and that’s the shape of your mouth, essentially. If you’ve ever blown over a bottle with a little bit of water in it, you get a nice tone, and

if you fill up the bottle with a little more water, you get a higher-pitched tone, as there’s less space for the air to vibrate within the bottle. Yeah. That’s exactly what you’re doing with your mouth for different vowel sounds: you’re changing the shape, and you’re allowing different frequencies to become excited, or to pass through, and the rest get attenuated. That is essentially the difference between vowel sounds, so ‘ah’ versus ‘oo’. With ‘oo’, your lips are pushed out and you’ve now made the space in your mouth longer, so longer, lower frequencies have space to vibrate and become predominant in the signal, whereas the shorter frequencies get attenuated because they’re not resonating within the mouth cavity. OK, that’s a good simple example. You can imagine, you know, you make the mouth longer and the longer waveforms

get magnified and the shorter waveforms get attenuated. Exactly. And then you have the opposite end, where you have an ‘ee’ sound: we’ve pulled the lips back, and now those low frequencies that were resonating no longer do; there’s no space for them, only for the high frequencies. OK, excellent. Let’s move on then to linear predictive coding. It sounds like a very complicated term, but it’s not that complicated, right? I’m going to do my best. First I’m going to explain what it’s good for and why we use it, and then I’m going to make an attempt to explain how it works. It’s used to get you the characteristics of the voice? Exactly. When I talked about the difference between these vowel sounds, they have different formants, so different parts of the spectrum are excited, as I just discussed. So what’s a formant? It’s not a word we use in everyday speech. A formant is basically: you have your spectrum,

and as I said, different parts of the spectrum get excited, so you get little bumps if you look at it as a spectrum. OK, so a spectrum, just for people who have zero experience with digital signal processing: why don’t you describe a spectrum? What is that? What comes up in the work is actually the spectrogram, so I’ll describe that as well. Have you ever seen an infrared image? That’s kind of what it looks like, and it basically shows you what frequencies are excited, or are present, in a signal. Excited in your voice, right, but present in a signal. So where there is energy present, you get a nice strong red or white-yellowy colour, and then it goes to black where there’s no signal energy, if you like. If you want a visual, simply Google ‘speech spectrogram’ and you’ll see exactly what we’re talking about. It’s very intuitive when you see it; it’s hard to describe in words, I find. It’s true, but if you can see it, it’s trivial; it’s very easy to understand what you’re working with.
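For readers who prefer code to pictures, a spectrogram is just a table of numbers: cut the signal into short overlapping frames, and for each frame measure how much energy sits at each frequency. The pure-Python sketch below uses a deliberately slow, direct Fourier transform so that it needs no libraries. The frame length, hop size, 8 kHz sample rate and function names are illustrative assumptions; real tools use an FFT.

```python
import cmath
import math

def sine(freq_hz, sample_rate, n_samples):
    """A plain sine tone, used here as a test signal."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]

def spectrogram(signal, frame_len=200, hop=100):
    """Return one list of magnitudes per frame; index k is frequency
    k * sample_rate / frame_len. Bright cells in a picture are large values here."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

sample_rate = 8000
frames = spectrogram(sine(200, sample_rate, 800))
peak_bin = max(range(len(frames[0])), key=frames[0].__getitem__)
print(peak_bin * sample_rate / 200)  # frequency of the brightest row, in Hz
```

With a voice instead of a single sine tone, each frame would show a stack of harmonics, and the brighter groups among them are the formants discussed in the conversation.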

So if you look at that picture, you will see lines. For example, if I give you a long ‘aaah’, you will see a series of lines in that picture. These are the frequencies, or the harmonics, in the voice, and you will also notice that some are brighter than others, and these are the formants. So if a group of the harmonics are nice and bright, that’s said to be a formant, and where they’re attenuated, there’s no formant. OK. Typically you have two, three or four formants, and their location in that picture determines what sound, or what vowel sound, has been said. Let me clarify my understanding: we’ve got a fundamental frequency, which is what we call the pitch of the sound, and we’ve got all the harmonics of that pitch, of which it sounds like there are many, maybe, I don’t know, 20 or 50 or something, and certain groups of those harmonics will be brighter, have a higher magnitude

than others, and those are the formants: the groups of high-magnitude harmonics. Yes. I’ve prepared some sound examples, so we can take a deep dive into this. Sound examples, please. The problem is that when you’re trying to describe it in words, you can’t get that far, because we can’t do images here, obviously, but sound examples will definitely help. So, imagine you have the fundamental frequency, which is the lowest frequency. I can generate similar sounds using sine waves in a program called Audacity, and I can play them back for you. So first we have a fundamental frequency, say at 200 Hz. OK, that’s what a 200 Hz sine tone sounds like. It sounds very synthetic; it doesn’t sound natural at all. Now, what’s important to understand is that the next harmonic follows some rules.

The next harmonic will be at 400: the first was at 200, and the next one will be at 400, so the next one is always double the fundamental. I will play that for you now. That sounds quite different. Yeah, there’s a little more character here; what’s happening is not so simple, though it still sounds very synthetic. Was that only the harmonic, or was that the fundamental plus the harmonic? The fundamental plus the harmonic. OK. As we add harmonics, the signal gets more complex. Now let’s add the third one. That’s only a little more complex again. Now let’s jump ahead, jump up to 20: add 20 harmonics. Very, very different. Actually, if you look at the waveform, it looks more like pulses, whereas the first one was a very smooth wave. There’s a lot more happening in the signal; it’s much, much more complex. Understood. OK, and this, then, is essentially the source.
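The demonstration was done in Audacity, but the same experiment is easy to reproduce in code. The sketch below stacks sine waves at whole-number multiples of a 200 Hz fundamental; the function name, the 16 kHz sample rate and the equal amplitude for every harmonic are assumptions made for illustration.

```python
import math

def harmonic_tone(f0_hz, n_harmonics, sample_rate=16000, n_samples=800):
    """Sum of sine waves at f0, 2*f0, 3*f0, ... (all at equal amplitude)."""
    return [
        sum(math.sin(2 * math.pi * f0_hz * h * n / sample_rate)
            for h in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]

smooth = harmonic_tone(200, 1)    # the plain 200 Hz sine tone
pulsey = harmonic_tone(200, 20)   # fundamental plus harmonics up to 4000 Hz

# The single sine never exceeds 1.0, while the 20-harmonic signal piles up
# into tall, narrow peaks once per period, which is why it looks like pulses.
print(max(abs(x) for x in smooth))
print(max(abs(x) for x in pulsey))
```

Both signals repeat every 1/200 of a second (80 samples at this sample rate), so they have the same pitch; only the shape within each period changes.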

Earlier I mentioned it’s a source-filter model: the source being the larynx, the filter being your articulators, lips, mouth, teeth and tongue. What the LPC model does is model the articulators. So we can generate the source from a collection of sine waves, OK, generating the harmonics. So the signal, the sine waves we’ve just been hearing, that’s equivalent to the voice box, the larynx. Yes. And now we’re going to add the articulators, the lips, tongue and so on, to modify that signal? Yes. What I did is I recorded myself saying a few vowels, ‘ah’ and ‘oo’, and what I’m going to do is take that filter, which was generated through the LPC method, the linear predictive coding method, and apply it to those sine waves. Hold on a second. So separately, you recorded yourself making vowel noises, which are a combination of voice box plus articulators, your mouth,

and you used LPC to extract the characteristics of your articulators? Exactly, exactly. OK. Specifically, which frequencies are resonating and which frequencies are not. OK. We basically capture that information in what we call a filter, but it essentially captures the settings of my voice. Yeah, for those particular sounds. So I’ll just play the synthesised sounds. I’ll play first an ‘ah’. So it’s very synthetic and robotic, because this is not a natural signal, and one wouldn’t expect it to sound natural, but you can definitely hear that ‘ah’ characteristic. Yeah, it’s amazing. It’s a little quieter, but you can definitely hear the ‘ah’ in that. It doesn’t sound anything like real speech, though. No. Its magnitude is totally flat, which doesn’t happen with real speech;

there’s no variance, there’s no randomness. When I speak, there’s randomness in my voice, and this is all very clean, but it still sounds like what we want it to sound like, an ‘ah’. You know, it sounds similar to the voice of the late Stephen Hawking. Now the other one, ‘oo’. You can kind of hear it. It sounds a bit lower in brightness, if you like. It’s a little bit less convincing. Yeah, a little less convincing as well. But essentially what you’ve done there is you’ve attenuated the higher frequencies, because that’s what you do in your mouth: my lips are extended and the low frequencies are resonating, so when I apply the filter to my sine waves, only the low frequencies remain, and that gets me closer to the ‘oo’ sound. Perhaps it’s because with an ‘oo’ sound there are other mechanisms or processes going on, with vibration of

something, which aren’t being modelled simply by formants. There are some other characteristics of an ‘oo’ sound, and perhaps that’s why it sounds less convincing than the ‘ah’ sound did. Quite possibly. An interesting thing: the LPC algorithm is not really made for unvoiced sounds, which we talked about earlier. It’s made for voiced sounds, but you can still use it to a degree. Unvoiced sounds still have a spectral shape, albeit very different to the vowels, so we can still use it. So I have an ‘s’. For an ‘s’, the source is no longer sine waves; now the source is noise, because that’s what you’re doing when you have an ‘s’ sound: you’re basically generating turbulence within your mouth, and this sounds noisy. So we can apply an LPC filter, apply the LPC coefficients, to noise, and we get a kind of an ‘s’ sound.
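The whole experiment, estimating a filter from a recording and then applying it to a different source, fits in a short pure-Python sketch. It uses the textbook autocorrelation method with the Levinson-Durbin recursion, which is one standard way to compute LPC coefficients and not necessarily the exact procedure used for the clips in the episode. The stand-in ‘recording’, the 8 kHz sample rate, the filter order of 2 and all names are assumptions for the demo; real speech analysis typically uses a higher order on short windowed frames.

```python
import math
import random

def lpc(signal, order):
    """Estimate predictor coefficients a[1..order] so that
    x[n] is approximated by sum(a[k] * x[n-k]) (Levinson-Durbin recursion)."""
    r = [sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def all_pole_filter(source, coeffs):
    """Shape a source with the 'mouth' filter: y[n] = e[n] + sum(a[k] * y[n-k])."""
    out = []
    for n, e in enumerate(source):
        y = e
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                y += a_k * out[n - k]
        out.append(y)
    return out

# Stand-in for a recorded vowel: noise shaped by a known resonance near 500 Hz
# at an 8 kHz sample rate (both numbers are made up for the demo).
rng = random.Random(0)
true_coeffs = [2 * 0.9 * math.cos(2 * math.pi * 500 / 8000), -0.9 ** 2]
recording = all_pole_filter([rng.gauss(0, 1) for _ in range(4000)], true_coeffs)

estimated = lpc(recording, order=2)  # should land close to true_coeffs

# Re-use the estimated filter on two different sources.
buzz = [sum(math.sin(2 * math.pi * 200 * h * n / 8000) for h in range(1, 11))
        for n in range(800)]                  # harmonic source, voiced
hiss = [rng.gauss(0, 1) for _ in range(800)]  # noise source, unvoiced
voiced_out = all_pole_filter(buzz, estimated)
unvoiced_out = all_pole_filter(hiss, estimated)
```

The design mirrors the source-filter split from the conversation: `lpc` only ever looks at the recording, and `all_pole_filter` only ever needs the coefficients, so the same estimated ‘mouth shape’ can be driven by a buzz for a vowel or by noise for a hiss.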

It sounds quite like noise, but there’s a certain quality there that it’s been able to capture. Yeah. I also have the ‘sh’ sound of ‘sugar’. Yeah, I can definitely hear it. It does go to show that it was originally intended more for voiced sounds. Exactly. So, to give a quick recap, what this is doing is modelling the articulators of a voice, which give the voice its shape, a certain characteristic. OK, so what are some of the applications for LPC then? There are many. One is speech-to-text. Essentially, what speech-to-text algorithms do is look at your spectrum, as we talked about earlier, and look for the characteristics of an ‘oo’ or an ‘ah’,

and you can label those, and then you can concatenate them all together and generate words. So basically you have a recording of a word, you try and figure out what the spectral shape is, and from that you figure out what word was said, essentially. OK, so it’s about identifying which parts of the word... I guess it’s based at the phoneme level? Exactly, exactly. For example ‘bat’ and ‘cat’, these kinds of things: the ‘a’ sound and the sound at the end, when you put those together in that sequence, will have a specific spectral shape, and you look for that. You have a catalogue of different spectral shapes, then you do a mapping between your catalogue and your recording, and you see which spectral shape it is most like in your catalogue of spectral shapes. Right, OK. Is it more like the one with the lower frequencies excited,

or boosted. OK, this sounds like the speech-to-text algorithms of the past; I think that nowadays they use more, you know, neural networks and so on. Exactly, this is the starting point, and it's important to understand where it all comes from. The same goes if you're doing speech synthesis: whether you're doing a basic version like I showed you or something with more advanced methods, you're doing something very similar, where you are imposing the spectral shape that you want on a source sound to generate those sounds, or even to generate the unvoiced sounds. All right, makes sense: if you've got them, you can use them however you want. And then there's dialect detection as well: if you know the differences between dialects and how the words are pronounced, you can look specifically for those differences and then say, OK, someone's from the north of England or the south of England. Right, and identifying a speaker, even? Yes.

looking for interesting
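Editor's sketch of the catalogue lookup described above: each frame's spectral shape is compared against a small catalogue and labelled with the closest entry. The four-band envelopes and the three labels are invented for illustration, and plain Euclidean distance is an assumption; the episode does not say which distance or which features a real recogniser would use.

```python
import math


def spectral_distance(shape_a, shape_b):
    """Euclidean distance between two spectral envelopes of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(shape_a, shape_b)))


def closest_shape(frame_envelope, catalogue):
    """Return the catalogue label whose stored shape is nearest to the frame."""
    return min(
        catalogue,
        key=lambda label: spectral_distance(frame_envelope, catalogue[label]),
    )


def label_frames(frames, catalogue):
    """Label every frame; concatenating the labels gives a rough phoneme string."""
    return [closest_shape(frame, catalogue) for frame in frames]


# Hypothetical catalogue: band energies in dB, low band first.
CATALOGUE = {
    "ah": [30.0, 24.0, 12.0, 6.0],
    "ee": [28.0, 10.0, 22.0, 8.0],
    "s": [4.0, 8.0, 20.0, 30.0],
}

labels = label_frames([[5.0, 9.0, 19.0, 28.0], [29.0, 23.0, 13.0, 7.0]], CATALOGUE)
```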

Are you enjoying the show but haven't yet become a sponsor? Now's your chance. Just go to / donate. Your contribution really does make the show possible. Listeners can support at the bit, nibble and byte levels, that's $1, $4 and $8 a month respectively, to help keep the show running. There are also new Patreon tiers available for businesses. If you've got something to promote, how about becoming a Voice Tech champion? The written word champion tier allows you to put your company logo and link on every weekly email newsletter [unclear]. You also get your company logo and link on the front page of the website, and you get priority publishing for articles that you submit to the Voice Tech blog. Become a spoken word champion to get all of that, plus a 30-second pre-roll promotional message spoken by me on every biweekly podcast episode, plus a larger logo with premium placement on the weekly email newsletter and website. Just go to / donate. OK, let's get back to this amazing episode.

OK, so we've heard about LPC and formants, which characterise the filter of the voice, the articulators. We can use those for speech-to-text, speech synthesis, identifying accents and regions, these kinds of things. But that's just one of the many, many features that we can extract and use to identify human voices, right? Exactly. So can you tell us about some of the other features that exist, and how you extract those? Yes, so our starting point here is the openSMILE software, which I work on at audEERING. Yeah, and Florian introduced that very well, actually. This is a little bit of a deeper dive, explaining how to use it in particular, OK, and what it means. So it's called a large feature space extractor, so we can break some of those terms down. A feature being some

set of measurements that we apply to the signal. The LPC that we talked about earlier is one example. Some easy examples to understand are, for instance, the loudness of a signal: how loud is it? OK, everyone has an intuition about this. But also dynamic range: how quiet to how loud does a signal get? OK. This is important, for example, for how excited someone is. If a person is excited about what they're talking about, they'll have a high dynamic range, being way more expressive than if they're bored and talking in a monotone. Exactly, exactly. You can look for excited-speaker characteristics through the dynamic range, and basically all that is is the minimum and the maximum of the loudness of the signal. Right. And you can take all of these characteristics of a signal. So those are some that we have English words for, that I can explain and give you an idea of what they are, but there are many, many more. So we do six

thousand plus: we extract over 6,000 characteristics of a human voice. It's enormous, and they don't all have nice names to make them intuitive to understand. Sure. To understand them you need to take a deeper dive into DSP and what's possible in terms of analysing a signal. So I talked about loudness, which is just a measure of the signal energy. We can talk about pitch, which is just a measure of where the frequencies are, the fundamental frequency. Harmonic-to-noise ratio, for example: that's good for telling how hoarse and scratchy someone's voice is. If they have more noise when they speak, the voice won't be as clear. The harmonic-to-noise ratio is quite good at distinguishing, yeah, the clarity of someone's speech. I actually used the harmonic-to-noise ratio when I did my project on voice emotion conversion: we used it to separate the voiced from the unvoiced phonemes. Yeah, because the unvoiced

ones have a lot more noise in them, so it's an easy way to just filter those out. And when you do that, you have a feature over time. There are two different kinds of features. There's a feature over time, which is things like loudness or harmonic-to-noise ratio, where you're measuring the signal in little chunks, taking a value and storing it somewhere. That's one way, and you can train on those features with machine learning algorithms that are time-dependent, like recurrent neural networks or LSTM networks. Then there are other methods, like support vector machines, which work on single characteristics; there's no time dependency there. And for that you need to summarise things like your harmonic-to-noise ratio. So that's, say, the average loudness over the entire file? Exactly. Or possibly the maximum, if you're looking for a peak to see if someone

[unclear] breaks a certain threshold, and then they can be classified [unclear]; it's just a yes or no. So basically, to sum up the features: they are essentially measurements of the signal in some meaningful way. Right, right. And there are fundamental measurements and there are also derived features, right? As you were explaining, once you've got that signal-to-noise ratio, you can choose to use it in a millisecond window, or you can take an average of the whole thing, and you can process it in a hundred ways, and that's how you get to 6,000. Right. And then this gives you a feature space. For a particular class of signal, male/female is an easy example to work with: you can have a male class and a female class, and the features will cluster. Take pitch, for example: the pitch values, or the mean pitch values, of the

males will cluster in one part of this feature space, and the females, because they're higher pitched, will cluster in a different part of the feature space. We kind of generate these sorts of constellations, if you imagine stars in the sky: the different classes are different galaxies, so to say. And then you can use simple algorithms like support vector machines to draw a line between these two clusters. The decision boundary. Exactly, and then use this as your classification criterion. OK, so anyone familiar with the fundamentals of machine learning will know what we're talking about there, and if you haven't got those fundamentals, I encourage you to check out the many resources online that describe this in detail. All right, so luckily for the machine learning guys, there is another method which I find interesting, which you can use with openSMILE, in particular with the 6,000 features. So yes, you can run into the problem of overfitting, but with the way it's designed you don't actually need to know anything about the audio itself, necessarily.
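Editor's sketch of the two kinds of feature discussed above: a measurement taken in little chunks over time (here, loudness as RMS level in dB per frame), and summary values over the whole file (mean, minimum, maximum and dynamic range) that a non-sequential classifier such as a support vector machine could consume. The function names and frame size are illustrative assumptions and not openSMILE's API; openSMILE computes its own, far larger feature sets.

```python
import math


def frame_rms_db(signal, frame_len, floor=1e-10):
    """Loudness contour: RMS level in dB for each non-overlapping frame."""
    levels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        # The floor keeps log10 defined for digital silence.
        levels.append(10.0 * math.log10(max(mean_square, floor)))
    return levels


def summarise(contour):
    """Collapse a time-varying contour into fixed-length summary features."""
    return {
        "mean": sum(contour) / len(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),  # dynamic range in dB
    }


# Toy signal: one quiet frame (amplitude 0.1) followed by one loud frame (1.0).
toy_signal = [0.1] * 100 + [1.0] * 100
loudness = frame_rms_db(toy_signal, frame_len=100)
stats = summarise(loudness)
```

The contour `loudness` is what a time-dependent model would train on; the dictionary `stats` is the per-file summary.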

Sometimes you do, but you can get away with a bit of clever feature selection, using feature-selection algorithms. Out of 6,000 features, most of them might be meaningless for the problem. Take a male/female classification: pitch matters, but most of the others might be meaningless. Loudness, for example, is not particularly meaningful for a male/female classification, which is intuitive: both males and females can be loud or quiet, near to or far from the microphone. For other applications it is useful. Right. What you can do is see, through these support vector machines, which features become prominent in terms of separating the two classes, and which turn out to be useful. And through clever feature selection you can get rid of the useless features and just remain with the useful ones, and discover them yourself, so to speak,

not really knowing anything about what the features are doing in particular. Yeah, I know there are algorithms that perform this feature selection automatically, is that right? Random forest is one, and there are other ones. But I think what you were saying when we talked before the show is that there's obviously room for human intelligence as well: if you know that some features will be more useful than others, then you should use that. Of course. It's good job security for me, as the DSP side of things is still relevant. For example, as I said already, in this male/female case you know something about the physiology of the voice, and that pitch is a good discriminator to begin with, and you can put that into your algorithm. It won't be a perfect classification, since some females have low-pitched voices, but it will get you a big chunk of the way there. And then you can add more features to try and discover other characteristics about the voice that maybe

we don't know about, to better separate the two classes. The same comes with Parkinson's detection: we're looking for a crackly voice, but there is way more going on. It's a multivariate problem, and it can be far beyond our understanding, and that's where the power of machine learning comes in. How many features should we use? You said there are 6,000; we shouldn't necessarily use all of them in every project, right? So how do we start? Do we start with one feature and gradually add, or should we start with all of them and remove? My idea is that you should start with one and gradually add. But to do that, which one do you pick from the 6,000? This is where you need human intelligence, and common sense even, as a starting point. Interesting. And obviously it's problem-specific, but I think if you can get up to 60% accuracy with one feature, for example, then you know you're onto something. Yeah, sure.
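Editor's sketch of "start with one feature": rank the candidate features by how well each one separates the two classes, then classify on the best one alone. The Fisher-style score and the midpoint threshold are stand-ins chosen for brevity; they are not the support vector machine or random forest methods mentioned in the conversation, and the eight data points are invented.

```python
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Squared gap between class means divided by the summed class variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    gap = (mean(values_a) - mean(values_b)) ** 2
    return gap / spread if spread > 0 else float("inf")


def rank_features(class_a, class_b):
    """Feature names sorted from most to least separating."""
    scores = {
        name: fisher_score([row[name] for row in class_a],
                           [row[name] for row in class_b])
        for name in class_a[0]
    }
    return sorted(scores, key=scores.get, reverse=True)


def threshold_classifier(values_a, values_b, label_a, label_b):
    """One-feature decision boundary at the midpoint between the class means."""
    mean_a, mean_b = mean(values_a), mean(values_b)
    midpoint = (mean_a + mean_b) / 2.0
    a_is_low = mean_a < mean_b

    def predict(value):
        return label_a if (value < midpoint) == a_is_low else label_b

    return predict


# Invented per-file summaries for two classes.
MALE = [{"mean_pitch_hz": 110, "loudness_db": -20}, {"mean_pitch_hz": 120, "loudness_db": -12},
        {"mean_pitch_hz": 105, "loudness_db": -25}, {"mean_pitch_hz": 130, "loudness_db": -18}]
FEMALE = [{"mean_pitch_hz": 210, "loudness_db": -19}, {"mean_pitch_hz": 190, "loudness_db": -13},
          {"mean_pitch_hz": 225, "loudness_db": -24}, {"mean_pitch_hz": 200, "loudness_db": -17}]

best = rank_features(MALE, FEMALE)[0]
predict = threshold_classifier([r[best] for r in MALE], [r[best] for r in FEMALE],
                               "male", "female")
```

On this toy data the ranking puts pitch ahead of loudness, which matches the intuition given in the conversation that loudness says little about male versus female.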

For the rest of the percentage gain, yeah, you can use the brute-force methods to make that up, but you've already got yourself a great starting point through a little bit of common sense. And as I said, it's obviously problem-dependent. So I worked on something recently, whisper detection. Whisper detection, OK, tell us about that. We know, as I explained earlier, that when you whisper you don't engage your vocal cords, you just push air through. So this is an obvious starting point: you're looking for noisy characteristics and looking to detect that noise in the signal. So you're just working with the noise, but does the noise have a spectrum? You were saying before that it does. Yes, it still goes through; you still move your mouth in the same way, and you're still generating that spectral shape, as we talked about earlier, where some frequencies are getting excited. But this time it's the source: when we talk about the source-filter model, the filter is the same but the source is now different. So it's not pulses anymore, it's just turbulent air causing a buzzing noise. OK. There's this

slight difference in the spectrum because of this. You can even learn this through a mapping. What we do is we have a database of whispered speech and a database with the same texts in normal speech, so it's a parallel database: spoken with a normal voice and spoken with a whispered voice. And then you learn the mapping between the two. What does that look like? Do you divide it by phoneme? How do you do it? Exactly, exactly: you map from a phoneme in normal voice to the same phoneme in whispered voice. And what does that give you? Once you've got that mapping, it allows you to detect that this phoneme is being said, but in a whispered form. That's on the more complex side. On the simple side, you can always simplify your examples, your use case. In my use case, I know we're dealing with some sort of speech; we're not dealing with traffic noise, say. So I can look for the increases in magnitude

to see whether there's some activity, and if I don't see any harmonic structure in my spectrogram, then I can assume we're working with a whispered sound. I see. All right, so let's differentiate between whisper detection, which is just "are they whispering or not?", and actual speech-to-text on whispered speech, which is what we talked about before. Sure, sure. We're not there yet on the speech-to-text part; we're taking baby steps to simplify the problem. First you need to be able to tell whispered from normal speech; this is required, and then we can go on and do speech-to-text. I see, so you've been working on the detection first of all, that works, and now you guys are working on the actual speech-to-text on whispered speech? That's a work in progress, a new direction for us. I can talk about the reason we're doing this in the first place.
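Editor's sketch of the "simple side" described above: if a frame has energy but no harmonic (periodic) structure, treat it as whisper-like. Periodicity is estimated here with normalised autocorrelation over lags matching a 60 to 400 Hz pitch range; that estimator, the thresholds and the frame length are all assumptions for illustration, not audEERING's detector, which the episode does not describe in this detail.

```python
import math
import random


def rms(frame):
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def periodicity(frame, min_lag, max_lag):
    """Highest normalised autocorrelation over the given lag range (about 1.0 = periodic)."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        head, tail = frame[:-lag], frame[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if den > 0:
            best = max(best, num / den)
    return best


def classify_frame(frame, sample_rate, energy_floor=0.01, periodicity_floor=0.5):
    """Label a frame 'silence', 'voiced' or 'whisper-like' (thresholds are guesses)."""
    if rms(frame) < energy_floor:
        return "silence"
    min_lag = sample_rate // 400  # shortest pitch period considered
    max_lag = sample_rate // 60   # longest pitch period considered
    if periodicity(frame, min_lag, max_lag) >= periodicity_floor:
        return "voiced"
    return "whisper-like"


SAMPLE_RATE = 16000
FRAME_LEN = 1024
rng = random.Random(1)
tone = [0.5 * math.sin(2 * math.pi * 150 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
hiss = [rng.gauss(0.0, 0.3) for _ in range(FRAME_LEN)]
```

As noted in the conversation, this only works because the use case already rules out other noise sources such as traffic; a noisy non-speech frame would also come out as whisper-like here.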

This is in a gaming scenario, and the kind of scenario we discussed in one of my meetings was that you are, for example, a ninja sneaking through somewhere, and you want to talk to your teammates. Yeah, yeah. So you have to whisper. If you don't whisper, someone will find you, you'll be discovered and you lose the game, for example. Yeah, right. In this very simple scenario there's no speech-to-text required; all that's required is that the game knows whether you're whispering or not whispering. If you're talking normally, you get discovered; if you're whispering, you don't get discovered as you progress through the game. That's very basic. Exactly, very simple. So that's something I can actually announce here: we're going in this direction, using our technology in the realm of gaming, using voice as a controller, using emotion as a controller in the game. And actually what we're going to do, I think it's in February, is run a competition

where we will basically hand out our tech as a Unity plugin, and game developers can take our tech. It will give you some sort of classification: whispered or not whispered, angry, sad; this comes with the suite of emotions that we offer already. And then you can use that. OK. It's up to the game designers to think creatively and come up with gaming scenarios, yeah, which take advantage of that and improve or enhance the gaming experience. Absolutely. Well, gaming is often used as a testbed for these new technologies, and it's brilliant to hear that you guys are launching that. I can't wait to see it in games; I'm a bit of a gamer myself, and it'd be fantastic. Sending your troops into battle, you need to give a rousing, encouraging speech. Right, which we can detect, since we have this arousal measure. If it's not rousing enough,

there's no energy, the pitch is not going up and down or fluctuating, you don't sound too excited, and maybe your troops don't do as well. But I think there are so many applications for this whisper detection. I'm just thinking in particular, I mean, just today I was seeing another application when it comes to the elderly, who maybe don't have particularly strong voices, you know. And there are use cases at night: if you want to speak to Alexa or whatever, you can whisper so as not to disturb the person next to you, and then it knows that you're whispering, so it whispers back, or it's quieter. You know, it knows the context that you're in because you're whispering; there must be a reason why you're whispering. And also enterprise: I've been thinking about it, when we're at work we're probably going to be, for a while, a bit embarrassed to just be talking to our voice assistants, but probably more prepared to just whisper at them. And with the release of all these smart glasses, have you seen the augmented reality glasses we'll be wearing? You know, if you could just whisper,

I think that's going to be better than talking to yourself, right? You can imagine having tiny little mics [unclear] that pick up the tiny articulations in your voice [unclear], with the answers personalised to you as well [unclear]. It's a really exciting area, isn't it? What you're describing here is just the first set of use cases to get people aware of the technology. Yeah. And that's good advice for people starting off with machine learning, anyone listening: in the beginning, think small, take baby steps, give yourself a simple problem to solve. Right, yeah, and use that as a jump-off point to much more complex problems. Yeah, yeah, that makes sense. Right, now let's talk about encoding, compressed audio and its implications for machine learning, which I wrote a little blog

post about, which you can find on the website. Can we talk about using compressed audio, like an MP3, as input into a machine learning model? OK. If you train a model on original audio only and then you go to test it on MP3 audio at, let's say, to give an easy example, a very low data rate, so we know it's different, we can hear it's different... And what would the model do differently? First of all, the model could be an emotion detector, for example. OK. It's been trained on raw audio. Yes, high-fidelity audio. But then you give it an MP3-encoded voice and ask what emotion it is. More than likely it will get the wrong answer, because the spectrum, which is what it uses to extract features from, is now radically changed. I see. So a typical thing they do with MP3 is they cut the higher frequencies. You talked in an earlier podcast about how we can only hear frequencies up to a certain point,

for example 20 kilohertz is what people say, though once you get older this drops off significantly. MP3, for example, automatically cuts, I think it's 17 kilohertz: everything above 17 kilohertz is cut automatically. Right, because it's out of range, if you like, and it doesn't add anything significant. You might lose a bit of brightness in the audio signal, in the speech. That high-frequency content is really only for people with the best ears, or the best hi-fi systems are going to show you the difference. Exactly. But if you've got a feature which you're using to separate your classes, for example your emotion classes, and which uses those frequencies, now those frequencies are gone. The values you get back from those features are now useless. We talked about this feature space earlier: they'll now be in a different part of the feature space. OK. So the model won't be able to classify them like it did, even though perceptually, to us, they sound the same. Understood. Yeah. So what are the recommendations then? Is it the case that we should

only use uncompressed audio, or is it that if we want to use MP3 audio, we have to train on it as well? That's the important part, exactly. So yeah, there's no getting around it: it's such a huge data source. YouTube, audiobooks, these are compressed audio, and so the data is out there to be used. It's not like we can say, OK, we just won't use this data. On the machine learning side, there are a few things you can do. You can augment: make sure you include compressed audio in the training of your model. I see, compressed and uncompressed together? Exactly: versions of the same thing, say my voice speaking, and then encoded and decoded to generate a new file that is added to your training data. OK. And the chances that, when you test on compressed audio, you'll get the right classification dramatically improve. It's the same as when you augment with noise: you want different noise

conditions, so you add traffic noise to your speech, for example, to make sure that your classification works robustly in a noisy environment. I see, so it's just another use case, another condition that you have to account for when you're training a model. That makes sense; I guess you have to be honest about how it's going to be used. The other thing is to pick features which are not only good for classifying whatever you're trying to classify, male/female speech for example, but which are also not affected by these compression artefacts. And in order to do this, you need to understand what the algorithm is actually doing and how it is changing the audio, which I write about in the blog post. That'll be up on the site soon after this, if it hasn't been already, so I encourage listeners to check it out; this is the post exactly on this topic, so you can get a reference.
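Editor's sketch of the augmentation idea discussed above: every training example is kept in its original form and also added in a degraded form with the same label. A real pipeline would run each file through an actual MP3 encode/decode round trip; the moving-average low-pass used here is only a crude stand-in for the loss of high frequencies, and all names are hypothetical.

```python
def moving_average_lowpass(signal, width=5):
    """Crude low-pass filter, standing in for a codec that discards high frequencies."""
    half = width // 2
    smoothed = []
    for n in range(len(signal)):
        window = signal[max(0, n - half):n + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def augment(dataset, degrade):
    """Return the dataset with a degraded copy of each (signal, label) pair appended."""
    augmented = []
    for signal, label in dataset:
        augmented.append((signal, label))
        augmented.append((degrade(signal), label))  # same label, altered spectrum
    return augmented


# Toy data: a rapidly alternating signal is pure high-frequency content.
buzz = [1.0 if n % 2 == 0 else -1.0 for n in range(20)]
training_set = [(buzz, "angry"), ([0.2] * 20, "sad")]
augmented_set = augment(training_set, moving_average_lowpass)
```

The same `augment` function would serve for noise augmentation by passing a function that mixes traffic noise into the signal instead.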

What other unique challenges are there around voice data when we're training machine learning models? Know your data; know what's inside your data. A nice example I've heard: someone is doing classification between a violin and a viola, two reasonably similar-sounding instruments. Say you make a mistake in your recording, and all your violin sounds start, let's say, one second after your viola sounds, and you're measuring time in your model. How do you mean, start one second after? So these are snippets of violin and viola sounds, and there's silence at the beginning of the violin ones, each time, in the sound file? Exactly, exactly. Right. That's a huge difference, technically speaking, in terms of the properties of the signal: a massive, massive difference between the two classes. And when you're doing your labelling and your machine learning on those labels,

your algorithm will think, here's a massive difference between the classes, and it ends up training on the silence: those with the silence and those without the silence. The same goes for noise: if you have one class of recordings in a noisy environment, or even just a different kind of noisy environment, I see, and the other class not, you're training the model on the noise rather than the signal. Exactly. That's why, when they create these databases, they make sure that all the conditions are equal except what you're trying to classify. Good point. You record on the same day, for example, with the same microphone and the same sample rate: all these fine details, which will be important, yes, when you want to do a classification. So the takeaway message is that you really have to control and know your data, what's inside it, and whether it really represents what you're actually trying to classify. Absolutely, it's hugely important. OK, good tip. You mentioned that audEERING is soon to release the

whisper detection feature as part of the openSMILE library. What other things are audEERING, or you personally, going to be working on over the next six months? So this gaming thing is going to be a big push for me: trying to think of useful classification scenarios from the voice, things that we would want to classify and that could be used in a gaming scenario. So that's something I'm working on. I also have an openSMILE tutorial series which I'm producing, which will go on YouTube and will help people get started with openSMILE, with everything from downloading it, to explaining the concepts, a little bit like what I talked about here, the feature space and what that is, to running openSMILE out of the box, using it in a purely machine learning way where you do feature selection, and also customising it for yourself. If you have that little bit of DSP knowledge, you can customise it and write your own

feature sets. We do the feature extraction, but you can customise it yourself, and get tips and pointers on how to interpret the output, for example. Many, many things. And this is the first part of the series; hopefully there'll be a second part of the series as well, where we take a deeper dive into the machine learning stuff and not just the DSP and using the software. OK. There'll be a little bit of educational stuff along the way: I try and motivate the DSP stuff with machine learning stuff, like I've done today, yes, and give you an insight into how the DSP is working, because if you're an informed user, like I said, you can use your own intelligence to home in on the problem that you're trying to solve. Yeah, a little knowledge really goes a long way. That'll be super useful to anyone thinking of using audio in their machine learning models, and the openSMILE library, which is, you know, the industry leader here.

What other priorities are you going to be working on? Personally, I'll be working on improving the robustness of our algorithms. Now, someone, a student say, could probably do some very decent emotion detection with the databases that are out there now, OK, and some savvy algorithms. The difference that separates us from just any old person who has a little bit of knowledge in machine learning and DSP is that we have a whole team of actual researchers and developers to focus in on the corner cases, the cases where it doesn't work: expose these, work on them, find out why they don't work, and improve the software. You can get a long way with simple methods, up to 60% or so, but it's those last steps up to 90% where it takes a real team of people, experts in many different fields. I'm the DSP guy at audEERING, OK, we have our machine learning guys, we have our linguistics guys, and they work on prosody, which is, like,

again, how you say things, not what you say. And we also have our development guys, who put it into embedded environments. So we are growing and covering many more aspects of what it takes to bring these algorithms and these models to fruition and make sure that they work in a robust and highly consistent way. And you're hiring as well, still? Yeah, we've had a massive explosion of people. I was at the beginning of this kind of wave: I think three years ago, or four years ago, there were like five people in the company, and now it's more than 50 if you include our annotators. We have people annotating data, which is a hugely important topic. And if you leave those out, it's around 36 or 37, I think, now. So it's exploding, and there's more coming: in the new year we have two or three more people at least coming, and I think that's going to continue a little bit further. So yeah, if people are interested in applying, do it, do it quickly. Yeah.

[unclear] Right, brilliant. Thank you very much for being on the show today; it's been so enlightening. I feel like I really understand the subject a lot better now, and when I come to train my next machine learning models on audio data, I'm going to have a lot more at my disposal, more knowledge and more tools as well, to take advantage of the data that I've got. That sounds good. Thank you very much for having me.

OK, you just heard from Christopher Oates, the Senior Audio DSP Engineer at audEERING. I won't keep you too much longer, as I'm sure you're all keen to go off and become whispering ninjas. Again, I want to say a huge thank you to our first champion sponsor, Dabble Lab, and I highly encourage all of you to go and check out their many tutorials on YouTube. Dabble Lab have over 150 different tutorials on a range of different platforms, very highly produced, and they're sure to help you in your voice development activities. So check it out, it's at / dabblelab. So that's all for today; I hope you enjoyed listening. You can find all the show notes, and if you enjoyed this episode, there are many ways to support me: you can simply tell one friend or colleague about this episode, you can leave a review on iTunes at / iTunes, or become a sponsor at / donate. I'll be back soon with another episode. Until then, I've been your host, Carl Robinson. Thank you for listening to the Voice Tech Podcast.
