Deaf Person Calling – Benjamin Etienne, Rogervoice – Voice Tech Podcast ep.006

Benjamin Etienne Rogervoice

Episode description

Benjamin Etienne is a data scientist at Rogervoice, a mobile app that allows deaf and hard-of-hearing people to use the telephone. Ben shares his inspirational story about how he taught himself data science and machine learning in the evenings, so he could work in a more technical role. He tells us why he’s not keen on Kaggle competitions, and why getting a job in data science is the best way to master it.

Ben introduces us to the challenges faced by the deaf and hard-of-hearing community, and how they can overcome them with the help of voice technology. We cover how Rogervoice works from both functional and technical standpoints, and discuss the pros and cons of using a commercial cloud-based speech API versus developing a custom in-house speech-to-text system. Ben reveals his reasoning behind his choice of machine learning models, and describes the advantages of using connectionist temporal classification (CTC).

We then discuss the state of data science today, the limitations of current models and data preprocessing techniques, and how an understanding of the underlying psychology and neurology of users can help us design more effective voice technologies.

Links from the show

Episode transcript


Powered by Google Cloud Speech-to-Text

when she’s in the office she likes to stand up and tell everyone that she’s going to make a call. Welcome back to the Voice Tech Podcast! We’ve got a great episode for you today, but before we get into that, just a quick update: the downloads are steadily growing, which is fantastic. A massive thank you to everyone who listens on a regular basis, and a big welcome to all the new listeners. Because so many new people have joined us, I wanted to take a few seconds to introduce the show again. This show is focused on the voice technology itself, so we talk about the how it works as much as the what it does. I conduct interviews with people who have actually implemented voice technologies in a project, such as academic researchers, CTOs, engineers and software developers, voice interface designers, project managers,

human-computer interaction experts and more. We delve into all aspects of voice interfaces and their enabling technologies, such as NLP, voice synthesis, machine learning and AI, as well as applications of these techniques such as chatbots and social robotics, and related fields such as psychology and emotion. By listening to the show you’ll gain a good overview of the voice technology ecosystem and stay up to date on the latest developments. My aim and hope is that these conversations will be entertaining, inspirational and informative, and will not only give you ideas for voice applications that you can build, but also introduce you to some of the tools and techniques that you’ll need in order to actually build them. Having only been in the field for a couple of years, I’m relatively new to all this too, so I’m very happy that you’ve chosen to join me on this journey to find out more about voice technology, and we can look forward to seeing the blocks fall into place as we build our understanding together. Learning to build machine learning-based voice interfaces is no easy task;

anyone who’s tried it will tell you that, and as with all complex tasks, it helps to have an expert share their experiences and advice to steer you in the right direction. So today we’re joined by one such expert. He is Benjamin Etienne, a data scientist at Rogervoice, a mobile app that allows deaf and hard-of-hearing people to use the telephone. He shares his story about how he taught himself data science and machine learning in the evenings so he could work in a more technical role. He tells us why he’s not keen on Kaggle competitions, and why getting a job in data science is the best way to master it. He introduces us to the challenges faced by the deaf and hard-of-hearing community, and how they can overcome them with the help of voice technology. We cover how Rogervoice works from both functional and technical standpoints, and we discuss the pros and cons of using a commercial cloud-based speech API versus developing your own custom in-house speech-to-text system.

Ben reveals his reasoning behind his choice of machine learning models, and he describes the advantages of using connectionist temporal classification (CTC). We then discuss the limitations of current models and data preprocessing techniques, and how improving our understanding of the underlying psychology and neurology of users can help us design more effective voice technologies. While you’re listening to the podcast, I recommend that you sign up for the Voice Tech Roundup, the monthly newsletter for this podcast. It contains the top five voice news tweets of the month, links to the latest episodes, as well as other unmissable goodies such as exclusive newsletter offers. It’s quick, easy and free to sign up: just go to voicetechpodcast.com/newsletter. So with that said, it’s now my pleasure to bring you Benjamin Etienne. I’m here with Benjamin Etienne. He is a data scientist at Rogervoice,

currently working on automatic speech recognition, emotion detection in speech and speaker identification. Rogervoice is a mobile app that captions your phone calls; their mission is to break communication barriers for the deaf and hard of hearing. Hello Ben. Hello. Nice to have you on the show, thanks for joining us this evening. Congratulations on the win last night for France; hopefully we’ll see the same thing in the final. So could you start by telling us a bit about your background? Because, as we were saying before the podcast was recording, it’s pretty interesting how you’ve moved between different roles.

I have a Master of Science in fluid mechanics; basically I did what many, many French people do, the classes préparatoires and grandes écoles system, which I think is not very well understood abroad but is the classic background in France. That could be a podcast for another day, as the French engineering education system is a bit particular. So anyway, I got into coding and programming, and I started my career at a consulting company, where I worked for different industries ranging from retail to the automotive industry. So I started out in the consulting industry, and I started doing

more technical things in my first job, but I wasn’t really working on real data problems; it was stuff like scraping data and producing the same analyses on a regular basis. Then I got more and more interested in the subject: the more I got involved with data, the more I could show people the results of my analyses, and they could give me directions or maybe suggest improvements based on the results of my analysis. I really liked that data-driven side of my job, and that’s why I decided to change company.

I then worked for a year in the energy sector in France, and I did several projects ranging from visualisation, building dashboards, to more advanced work. That’s when I discovered this field, and I decided to move on to work on a project that I found important and wanted to spend my time at work on. That’s why I chose Rogervoice: because I really like the product and what they’re doing.

I’ve been at Rogervoice for about a year now, doing data science in action: emotion detection, speech recognition, speaker diarization. So your path to becoming a data scientist was a transition from a project-management kind of role to a very technical role. That’s right. I started sticking my nose into things like random forests and linear regressions, for example. This is something that, when you have a scientific or engineering background, you’ve heard of before (you know what a linear regression is, of course), but you don’t really apply it

to real use cases. So that’s when I started to look around on the internet for ways to get started in data science. Obviously there are lots of resources available on the internet, with explanations for every problem. As I said, my job was mostly a consulting job, so this was something that I did as a hobby first, and then I quickly realised that I wanted this to be my job. And that shouldn’t be understated, because consulting itself, I know, is a very demanding job that requires long hours at times.

To still have the energy to come home in the evening and work on machine learning, which is by no means the easiest subject to learn, really takes some doing, but you did it. You were basically self-taught along the way, is that right? The machine learning side, yes, I’m self-taught. What were some of the courses that you took? Can you recommend anything that was particularly helpful, or anything that you felt was a waste of time and would recommend not doing? I think there’s this one: if you really want to start working with data and you don’t have a lot of experience in this field, one course I definitely recommend, and it’s the entry point for a lot of people, is the Andrew Ng course on Coursera.

Now a lot of people have taken that one, and I feel it’s getting more and more popular; sometimes you have to pay for it now, but I think it’s really clear. It’s a basic course, everything is really well explained, and it’s not too complicated at first, and if you want to go deeper, you can. I think it’s the best entry point I could recommend. There’s another course I was thinking of, which is the course on neural networks, Neural Networks for Machine Learning; I don’t remember who teaches it. This one is a bit more difficult, because it goes deeper.

It goes into the theory, and you see lots of things like the Bayesian approach behind neural networks, which I think is very interesting, and you see different types of networks: recurrent networks, convolutional networks, LSTM units or GRUs. So if you want to start, I suggest you start with the Andrew Ng course, and then if you really want to go further, go with the neural networks course. Good. And did you get involved in Kaggle competitions? With machine learning, is it one of those things where it’s fine, you can do the courses, you can do the exercises,

as in school, but you don’t really learn it until you get your hands on real problems? So how did you tackle that? You do need to get your hands on your own problem. I mean, it’s fine to watch videos, but at one point you need to write your own code, because that’s the only way you’ll learn. As for the data, you can find lots of simple toy datasets available on the web, depending on what you want to do, classification or regression; there are lots of different datasets and examples you can see on GitHub or in blog posts. But I think the best way for me to get my hands on real problems was to get a job as a data scientist. I mean, you can try Kaggle competitions, but to me there’s one major problem with Kaggle:

the sizes of the datasets are way too large to fit on a standard computer if you want to get results in time and get good results. And it’s a shame, because people are just stacking multiple models together, getting an ensemble of all these models, and gaining 0.1% or 0.2%. To me it’s just such a waste, because the problem with this approach is that people just think that the bigger the machine you have, the better the results you will have, and by doing that you just get rid of all the feature design and feature engineering, which I think is the most interesting part of machine learning. To me, the hardest part is to get the data and to extract the right features from the data; putting it into a model is just 10% of the work.

I mean, with Kaggle you just have to let your machine go through the whole dataset and then spit out the result; there’s no real intelligence behind it, right? OK, excellent. So you found a job in data science, and that really accelerated your learning. And what was it that attracted you to speech processing in particular? Was that by chance? How did you find Rogervoice, and how did you decide that it was the company for you? After my past jobs, I started looking at the different fields in deep learning, and to me there are three main fields in deep learning: natural language processing, computer vision and speech processing. And I think speech processing is lagging a bit behind.

I found NLP really interesting; I found computer vision a bit less interesting, I don’t know why, maybe because you hear so much about convolutional neural networks that it didn’t really attract me anymore. What attracted me with speech is how you process the audio, and how you map audio to text; I found this fascinating. And at the same time, a member of my family was using Rogervoice, and so I found out for myself a little bit about the use case, which is just incredible: using speech-to-text to help deaf people. It’s not just tagging pictures in order to sell products online.

It’s a noble cause; we can talk more about that in a second. You were telling me before as well that you’re really attracted to the psychological aspects of speech processing, and you read a lot about psychology, is that right? Well, I could read only technical papers about how to design neural networks and which one performs best, but if you want to understand all the processing that’s done on the audio, on the speech audio, I think it’s interesting to also have the psychological perspective, in order to understand why we designed such filters for the audio, for example, or why we use Fourier transforms. OK, so you feel like understanding at least the basics of the psychological effects of speech on the human body helps you in your job as a data scientist and speech processing engineer?
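The Fourier transform Ben mentions can be made concrete with a minimal sketch (this is an illustration, not Rogervoice code): the FFT turns a raw waveform into its frequency content, which is the representation that speech filters and filterbanks operate on.

```python
import numpy as np

# Minimal sketch: why speech pipelines start with a Fourier transform.
# A pure 440 Hz tone sampled at 16 kHz; the FFT magnitude peaks at 440 Hz,
# turning a time-domain waveform into the frequency content that
# downstream filters and models actually work with.
sample_rate = 16000
duration_s = 1.0
t = np.arange(int(sample_rate * duration_s)) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)

spectrum = np.abs(np.fft.rfft(waveform))           # magnitude spectrum
freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)

peak_hz = freqs[np.argmax(spectrum)]
print(round(peak_hz))  # 440
```

Real pipelines then pool these FFT bins into mel-scaled filterbanks, a design motivated by human hearing, which is exactly the psychology-meets-engineering point Ben is making.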

Sure. I’m not just interested in designing neural networks and letting my computer run for days and weeks; I want to have the big picture around it. You can tell me that planes and boats and birds are not the same, but still, planes fly and birds fly. So you can tell me, who cares whether our speech recognition algorithms are or are not really inspired by the brain? But to me it’s interesting to hold both pictures, and maybe joining them together brings some inspiration. I think that’s part of the data scientist’s job,

not only to program and to invent better algorithms, but also to find new ways and new ideas to tackle problems. Because sometimes you feel like, once you reach a certain percentage on a speech task, the job is done, but you can always try to find new approaches that allow you to build cheaper models or let you go faster. It’s a more lateral way of thinking, making connections between different things. OK, let’s talk about Rogervoice and the product that you guys are working on. So give us a brief description of the business: what is Rogervoice, and how does it work?

Rogervoice is a captioned-call app that is used by people who are deaf or hard of hearing, and it allows them to communicate over the telephone line or over a voice-over-IP connection. It’s a mobile app, and there is also what is called a widget, an accessibility plugin for websites, where a call would be just like a Skype call, say between a deaf user and an employee from a bank. And then you have the standalone mobile app for the consumer offering.

All right, so clearly it solves the problem of allowing deaf or hard-of-hearing people to communicate over the telephone, which until now they couldn’t do. Is this the first solution that enables deaf people to do that? The use case is unique. I mean, speech-to-text already exists, VoIP already exists, but the uniqueness of the solution lies in the fact that we combine these two technologies into one. And I think this really helps hard-of-hearing and deaf people to communicate, because we have lots of examples, testimonies of people saying that they had never made a phone call before, because they couldn’t: they had to have someone next to them to make the call for them. And we had a lot of people telling us that it was the first time they ever made a call, thanks to our solution, and that’s really something.

It must put smiles on your faces when you read that; there’s no such thing as a better motivating force than knowing that you’re making such an impact. Can you tell us some of the stories that you’ve heard? I was thinking about a businesswoman who lives in Paris; I think she might be around 45. For most of her life she was afraid of making phone calls, and she couldn’t find another way to reach people; she just couldn’t answer the phone. And now she’s using the app, and she likes to say that when she’s in the office, she likes to stand up and tell everyone that she’s going to make a call.

She’s one of our best users, in the way that she really likes using the product. And we also have testimonies of people saying it’s the first time I’ve been able to call my uncle, who is deaf. So it works the other way round as well? Yes, it’s often the other way round: the family will tell the deaf person that they should use this app. Yeah, it’s a bit like having a relative who lives on the other side of the world, and you just couldn’t speak to them because telephone calls are too expensive, and then along comes Skype, and suddenly everyone can talk to each other. This is the next level now, using this technology to open communication up to people who couldn’t even use the telephone before. OK, so you have the mobile app and you have the website widget;

what’s the vision for the company? Where are you going from here? What do you plan to add to this service? So we started with the basic captioning system, text-to-speech and speech-to-text. We are now also offering video calls, so you can call someone and, just like a regular video call, see the face of the other person on your phone. What we are also planning to do is to add sign language interpreters, so that deaf people who prefer to use sign language instead of speaking or reading can still communicate with a third party. Our goal is really to be the central platform for communication for deaf and

hard-of-hearing people. So you can think of speech-to-text, you can think of interpreters, you can think of video calls: we really want to bring all of these together, so that when a deaf person wants to talk to another person, they have lots of different options to choose from. Something I learnt when we talked earlier was that not all deaf people are the same. Of course, there’s deaf and there’s hard of hearing; there are people who were born deaf or became deaf very early on, and then people who have lost their hearing later on, and they have different needs. Depending on how people lose their hearing, they have different ways to communicate. So who is it that prefers to use sign language? People who were born deaf, or who became deaf quite early. You have two different populations, basically: you have the people who use

sign language, and who didn’t want to learn the oral language, so they just use sign language; and a population which uses oral communication, so they went to school to learn how to speak, or at least to try to speak, and they usually don’t use sign language. So we have two different populations among these people. And then there’s a third population, the people who became deaf, or at least lost some of their hearing, through age. So there are a number of different ways that these people come to Rogervoice. So the basic system allows the person who is deaf or hard of hearing to type a message into the mobile app, and it sends it,

and it is converted through text-to-speech into spoken audio, computer-generated spoken audio, which the hearing person at the other end of the line then hears. They may be using the Rogervoice app on their phone, or they may be using a standard telephone; we can talk more about that in a moment. And then when the hearing person replies, their voice is converted into text using speech-to-text, and that’s displayed in a textual manner. And you were saying that some of the deaf people are happy with the accuracy, but they don’t really like the voice, because when you use text-to-speech it’s a synthesized voice, it’s not a natural voice, and some people don’t like that. And on the other side, some hearing people don’t like speaking to a computer; they feel like there’s a computer in between them. And it’s a lot less expressive as well. But if you’re someone who’s used to

communicating using sign language, then you’re physically looking at the person, using your hands, and you’re not just signing the words, you’re actually expressing yourself at the same time. Exactly. With text only, which is what we are using right now, you lose some information: when you talk to someone, you have the verbal message and the non-verbal, which is gestures but also the intonation. Are you in a certain emotional state? Are you talking very loud, or very quietly? These are cues that you don’t have right now when you do speech-to-text, because you only have the text, and sometimes a message is a bit ambiguous and you need those cues in order to disambiguate it. So with video, when you use sign language, it’s easier for these people to communicate, but using the text only is still

not very satisfying in certain situations, where you need more than just the text to understand what the person wants to say. Absolutely, and the hearing person probably can’t do sign language, so that expression will only be going one way, unless you have the video both ways. To compensate for the shortcomings of the text-only approach, you said that you are developing emotion recognition; can you tell us how that works? The personality of the person who is speaking plays an important role, as does the cultural background: emotions are not displayed the same way in Japan as they are in Italy, for example.

Two people won’t be expressing themselves the same way; they won’t be using the same intonation, or the same volume of voice. So what we are doing is trying to find some cues, like for example a sudden variation in the spectrogram. We use the spectrogram of the audio, and we are trying to find, in the intonation, unusual signals which might indicate that the person is excited. OK, so you are analysing the speech that’s coming from the hearing person who’s talking, and you are detecting sharp changes in the frequency, as an example, and then displaying that to the deaf or hard-of-hearing person through the Rogervoice app, or you plan to at least.
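The "sudden variation in the spectrogram" idea can be sketched in a few lines; this is not Rogervoice's actual pipeline, just an assumed illustration: compute a magnitude spectrogram with a framed FFT, then measure how much the spectrum changes from one frame to the next (spectral flux), flagging large jumps as candidate excitement cues.

```python
import numpy as np

def spectrogram(waveform, frame_len=400, hop=160):
    """Magnitude spectrogram via a framed, windowed FFT (16 kHz assumed)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_flux(spec):
    """Frame-to-frame change in the spectrum; large values = sudden variation."""
    return np.linalg.norm(np.diff(spec, axis=0), axis=1)

# Synthetic example: a quiet 200 Hz tone that abruptly becomes a loud
# 2000 Hz tone halfway through; the flux peaks around the transition.
sr = 16000
t = np.arange(sr) / sr
wave = np.where(t < 0.5,
                0.1 * np.sin(2 * np.pi * 200 * t),
                1.0 * np.sin(2 * np.pi * 2000 * t))
flux = spectral_flux(spectrogram(wave))
print(int(np.argmax(flux)))
```

A real system would of course look at many more cues (pitch, energy, speaking rate) and feed them to a classifier, but the spectrogram-plus-change-detection recipe is the core of what Ben describes.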

We’re still designing it, because it’s hard to find something which appeals to people. Should we use colour? Should we use emojis? There are lots of ways of representing a person’s emotions, so it’s not easy, because you need to present it in a way which is understandable and not complicated. I’ve not seen anything like that except emojis. Which would be quite amusing, to have emojis popping up the entire time you’re talking to somebody, right next to the text. It wouldn’t have to be an emoji. But the fact is that if you look at papers dealing with emotion recognition in speech, you see that sometimes humans themselves are quite bad at distinguishing, for example, anger versus happiness.

They say that the best human accuracy in certain situations is around 70 percent, maybe, for four emotions. Emotions like sadness versus anger, I think, are quite easy to distinguish, but anger and happiness, if you just look at the spectrogram and you just look at the features, are really close. And so sometimes people mistake an angry person for a happy person, and vice versa. Interesting. OK, so the machine learning has a hell of a task in front of it if humans only get to 70 percent. It’s not like word recognition in speech-to-text, for instance, where I understand the best computer systems can now actually beat humans in word recognition, but only just; is that 96 to 98 percent? I’m not sure, I don’t know what the state of the art is, but it’s extremely high, whereas emotion at 70 percent

sets a very low bar by comparison. OK.

OK listeners, listen up: if you haven’t done so already, head over to voicetechpodcast.com/newsletter and leave your email address. The Voice Tech Roundup is a monthly newsletter that contains the top five voice news tweets of the month by engagement, links to the latest episodes, as well as other interesting bits and pieces such as exclusive newsletter offers. It’s only one email a month, so you’ll always be glad to see it arrive in your inbox. To get it, just go to voicetechpodcast.com/newsletter and subscribe now. Well, that leads us nicely on to some of the technical details of how Rogervoice is built. So could you describe the technical stack, and how all the pieces fit together? So our architecture is based on microservices. With microservices, you basically split your architecture: instead of

building one monolithic block, you have different parts, which are different APIs basically. You split your system into different blocks based on their functionality: you have a block which deals with speech recognition, a block which deals with the audio, a block which deals with text-to-speech, a block which deals with emotion recognition. There isn’t really one big API for all these modules, and modules might contain different APIs associated with them. The whole back end is in Node, and we use WebRTC for communication. WebRTC, what does that do? It basically deals with the communication protocol; it allows you to send

media, media streams, between devices. OK, so it’s sending the voice audio over the internet, basically. OK. So we have a part which is VoIP, voice over the internet, and we have a part which is the link to the public telephone network, the PSTN. PSTN stands for public switched telephone network, the network you use every day with your mobile phone. So we have a bridge between the two, between the device, which is on VoIP, and the telephone network, and to bridge the two we use a switch called FreeSWITCH, which is open source and which basically routes all the incoming and outgoing calls between the two legs.

The switch does the routing between the two legs: a call from one Rogervoice app to another is free, using voice over IP, but if you want to call from a Rogervoice app to a standard telephone, then obviously that incurs charges, because you have to pay for the use of the telephone line, as everybody does. We offer one hour free when you subscribe, one hour of public telephone access. And then, because obviously you have to pay for the PSTN and for the infrastructure, if you want to have your own Rogervoice number, in order to be reachable on it, you can purchase a dedicated number, which is like a
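The routing decision Ben describes (app-to-app stays on free VoIP, app-to-PSTN draws on the subscriber's free hour) can be condensed into a toy sketch. This is purely illustrative; the real logic lives in FreeSWITCH configuration, and the function name and return values here are assumptions.

```python
# Illustrative routing rule, not Rogervoice's actual implementation:
# app-to-app calls stay on VoIP and are free, while calls that leave
# for the PSTN are only allowed while the subscriber has credit left.
def route_call(caller_is_app, callee_is_app, remaining_pstn_seconds):
    if caller_is_app and callee_is_app:
        return ("voip", 0)                        # pure VoIP, no charge
    if remaining_pstn_seconds > 0:
        return ("pstn", remaining_pstn_seconds)   # bridge out to the phone network
    return ("blocked", 0)                         # no credit for the telephone leg

print(route_call(True, True, 0))      # ('voip', 0)
print(route_call(True, False, 3600))  # ('pstn', 3600)
```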

standard telephone number for your country, so anyone can call it from a normal telephone, but it will go through to your Rogervoice app. OK, so that’s the architecture. Then what do you use for the actual speech recognition? So we use an API, a commercial API; I can’t give the name. And in the development environment we use a homemade speech recognition system. For that we have been working with Python and TensorFlow. Why TensorFlow, you might ask: because I think it’s the most production-ready deep learning framework available today.

It’s used by Google, so it’s been tested and retested and really tested, so I think if you want to put something into production, TensorFlow is your best move. I think PyTorch is becoming more and more popular these days; it’s definitely more flexible than TensorFlow, and I think it may be easier to learn than TensorFlow. But I’ve been using TensorFlow also because I was using something in my algorithm which at the time was only available in TensorFlow, which is connectionist temporal classification. OK, we can talk about that in a moment. So just to clarify, then: you said you’re using a commercial speech API, but you’re also developing a custom in-house automatic speech recognition system. Why are you using both, and are they both in production right now?

We’re not using the in-house system in production right now, because it hasn’t reached, let’s say, the same performance level as the commercial API we use. So we’re trying to improve it. We don’t know if we’ll be able to reach the same performance as the commercial one, but at least we want to get reasonably comparable performance, so we’re still working on it. And we use a commercial API in production right now just because it’s the off-the-shelf solution that gives good results. When you want to ship your product, sometimes you can’t afford to spend months doing R&D; you just have to get going, to get a product out as soon as possible, and using the commercial API gets the system up and running immediately.

But it is interesting, because I’ve spoken to a number of people who are developing their own custom in-house automatic speech recognition systems because they focus on a restricted domain, where they are able to get better results than the commercially available APIs, simply because they are dealing with a much smaller vocabulary, or they use audio which has a particular characteristic to it, like poor quality of a particular kind. But I suppose in the Rogervoice case you are transcribing and synthesizing any words, because it’s casual conversation between any two people, who could be talking about anything. So I guess you’ve got the general problem, the one that these commercial APIs are looking to tackle, so it makes sense that you guys are using them. And yes, it’s a bit of

a challenge, isn’t it, to develop your own system that matches that in a generalized context. Exactly. You can have different tasks, from the simplest to the most complicated. The simplest one is basically digits, so spoken digits, 1 2 3 4 5, and recognising isolated words. It’s a limited vocabulary and it’s quite constrained, so it’s one of the easiest things you can do in speech recognition. If you want to start designing your own speech recognition system, just try to do this first: maybe use a really small dataset, simple words, and you’ll see already if you’re on the right track. The most complicated situation is, as you said,

spontaneous speech, conversational speech. If you look today at the data used by Google, by Facebook, by Microsoft, by Amazon for their speech recognition systems, it’s collected through the “OK Google” and “Hey Siri” utterances, so that data has been formatted, in a sense. When you talk to a speech device, you basically already have your sentence made up: you know what you’re going to say, and you say it in a way that’s not like you’re making up the sentence as you speak. It’s not natural, spontaneous speech.

It’s pre-planned, rather than a natural interaction of making and responding to requests. The available corpora, for example LibriSpeech, have been made by people reading sentences, reading books. So once again, read text. And read speech and spontaneous speech are not really the same, because the intonation is not the same, the hesitation in the voice is not the same as in conversational speech, where you have stops, you have non-words. That leads to the question then: where do you get your data

for training your models, for Rogervoice? Can you do the YouTube thing, for example, as a good source of data? There is a database, as you said, a video collection of TED talks, which is in English. The problem with that is that most of our users are in France, and most of the resources available on the internet for speech recognition are in English, of course. We can pay for datasets in French, but as I said, they are mostly people reading out sentences, so they’re not really what we want. So I think what’s really important

is that if you want to design your own speech recognition system, you have to have in mind why you want to use this speech recognition system, what the use case is. If you want it for people speaking on the telephone, you definitely need to have data corresponding to people speaking in a natural way. If you use data of people reading out numbers or isolated words, the model will not generalize well to long sentences. How do you think the community can get together and develop new datasets? Because I hear it again and again that one of the biggest roadblocks to producing high-accuracy systems is just a lack of quality data, or appropriate data. To my mind there are two options. The first option is that, in a supervised learning framework, you need to have

more data: the more data you have, the more accurate your algorithm will be. And if you know you need more data, then you can just try to gather data, by being creative, by thinking of ways you can do this or that. Web scraping is an option you can think of. Or maybe transfer learning: you start with a model which has been trained on English, you keep the lower layers and retrain the rest, in order to have a model which works with your target language. The second option, I think, is that if you can’t find all the data you need, then maybe you need to find another model which is less data-consuming, or switch to unsupervised learning. And I know it’s something a couple of people have tried, unsupervised learning for speech recognition. It’s quite difficult, I think, because
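The transfer-learning idea Ben mentions, keep the pretrained lower layers frozen and retrain only the top on the new language, can be sketched in a few lines. This is a toy numpy illustration, not Rogervoice’s pipeline: the layer sizes are invented, and random arrays stand in for real English-pretrained weights and French speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained "English" model: two dense layers (hypothetical weights).
W1 = rng.normal(size=(20, 16)) * 0.3   # lower layer: generic acoustic features
W2 = rng.normal(size=(16, 5)) * 0.3    # upper layer: language-specific head

def forward(x):
    h = np.tanh(x @ W1)    # frozen feature extractor
    return h @ W2          # trainable head

# Small "French" dataset (random stand-ins for real features and targets).
X = rng.normal(size=(100, 20))
Y = rng.normal(size=(100, 5))

init_loss = np.mean((forward(X) - Y) ** 2)

# Fine-tune only W2 with gradient descent; W1 is never updated.
lr = 0.05
for _ in range(300):
    h = np.tanh(X @ W1)
    grad_W2 = h.T @ (h @ W2 - Y) / len(X)   # MSE gradient w.r.t. W2 only
    W2 -= lr * grad_W2

loss = np.mean((forward(X) - Y) ** 2)
```

The point of the sketch is the asymmetry: only the head sees gradient updates, so the small target-language dataset has far fewer parameters to fit.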

a word might be spelt in a unique way, but people have thousands and thousands of different ways of saying this word, depending on the person, depending on the position of the word inside the sentence. You have coarticulation phenomena when you speak, which introduces a lot of variability in the speech. So I think unsupervised learning might be quite tricky, because it’s hard to find invariant features in speech in order to recognise them. And it sounds like you would need a huge dataset for that as well, if you’re expecting to just leave the system to figure out the structure by itself. It’s true that today we’re a bit stuck with supervised learning, I mean, because

obviously if you don’t have the data, you need to be creative and find other ways. You can try to add more parameters to your model, but you shouldn’t have more parameters when you don’t have much data. Or you can try to add language models, in order to correct the outputs of your recogniser. So you really have to be creative. And I think this may be one of the biggest issues in the years to come: companies like Google and Microsoft and Amazon are the best because they have the data, and the day we find ways to achieve similar results with less data, I think it will be a major breakthrough in deep learning. In the age where the internet is now considered a fundamental human right, I think we’ll move to
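The idea of using a language model to correct a recogniser’s outputs is often implemented as rescoring: re-ranking acoustically plausible hypotheses by how likely their word sequences are. A minimal sketch, with a hypothetical toy bigram table and invented candidate transcripts (not anything from Rogervoice):

```python
import math

# Toy bigram language model (hypothetical probabilities).
bigram_logprob = {
    ("call", "me"): math.log(0.30),
    ("call", "knee"): math.log(0.001),
    ("me", "back"): math.log(0.20),
    ("knee", "back"): math.log(0.001),
}
UNSEEN = math.log(1e-4)  # crude smoothing for unseen bigrams

def lm_score(words):
    """Sum of bigram log-probabilities over the word sequence."""
    return sum(bigram_logprob.get(pair, UNSEEN)
               for pair in zip(words, words[1:]))

def rescore(candidates, lm_weight=1.0):
    """Pick the (acoustic_logprob, words) candidate with the best
    combined acoustic + language-model score."""
    return max(candidates,
               key=lambda c: c[0] + lm_weight * lm_score(c[1]))[1]

# Two acoustically similar hypotheses; the language model disambiguates.
candidates = [
    (-5.0, ["call", "knee", "back"]),  # slightly better acoustics
    (-5.2, ["call", "me", "back"]),    # far better language-model score
]
best = rescore(candidates)
```

Real systems use much larger n-gram or neural language models, but the combination rule (acoustic score plus weighted LM score) is the same shape.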

a point where data will be considered a fundamental human right, just because it’s so crucial to everything that we do. Something I’ve read in this field, just to give your listeners a reference in case they don’t know about it, is a paper on one-shot learning by Google. It explores a new way of dealing with neural networks: it introduces memory, well, not introduces, because it’s always been there before, but it really tries to understand how to learn from as few examples as possible. And what’s quite nice in this paper is that they achieve amazing performances. I won’t describe the experiment because it’s a big topic, but the idea is to allow

the networks to design their own memory system, so that they don’t need to see the same picture thousands of times to know it’s a dog. Really like we do, you know: we can generalize and recognise new examples of something that we’d only seen once or twice. OK, let’s talk about models. I wanted to ask you what models you’re currently working on at Rogervoice, what you tried and discarded, and why you chose them. So we are using a model heavily inspired by the DeepSpeech model. OK, DeepSpeech. So it’s an end-to-end model with LSTMs, and you can decide to use convolutions or not

in the lower layers to process the audio, but that doesn’t make a big difference to me. The most important part is the way you use the RNN or the LSTM, because these are used with this CTC loss, the connectionist temporal classification, which has become quite standard today; it’s one of the most widely used in the papers I’ve read. So what does it do, at what point in the network is it used, and why is it the preferred choice nowadays? The CTC works frame by frame. So you have your audio spectrogram, and for each slice of the spectrogram you try to output a probability distribution over the characters. You do this for each frame, and you end up with several possible paths.

So the system works at the level of letters of the alphabet: it slices the audio up into something like 10-millisecond slices, analyses the frequencies in each slice, and then, over all the frames, you end up with character probabilities. And then with these you just calculate the probability of the paths. You have several paths leading to the target sentence, and you use this kind of trick: you get the overall probability of producing your target by combining the probabilities over the distinct slices of your spectrogram. I don’t know if it’s really clear.
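The path-summing trick Ben describes is the CTC forward algorithm. The sketch below is an illustrative numpy implementation, not Rogervoice’s code; the tiny per-frame probability matrix and the target sequence are invented for the example (symbol 0 is the CTC blank).

```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """Total probability of all frame-level paths that collapse to
    `target`, via the CTC forward algorithm.
    probs: (T, C) array of per-frame character distributions."""
    T = probs.shape[0]
    # Extended target with blanks between and around the labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                 # stay on same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]        # advance one symbol
            # Skip a blank when adjacent labels differ.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # Valid paths end on the last label or the trailing blank.
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# 3 frames, 3 symbols (0 = blank); target "ab" encoded as [1, 2].
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8]])
p = ctc_forward(probs, [1, 2])  # sums paths like (1,1,2), (1,0,2), (0,1,2)…
```

Production implementations work in log space for numerical stability, but the dynamic program summing over all collapsing paths is exactly this.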

I think the point is that you don’t have the acoustic model, the linguistic model and the pronunciation model that you had in the GMM-HMM systems, which were state of the art a few years ago. You just do everything in the same block: you go from the audio straight to the characters. There’s no pronunciation dictionary in between, there’s no phonemes, it’s just a straight audio-to-character mapping. So a much simpler model. Does it require more or less data to do that? Basically, the general rule in deep learning is that you should try to have more data points than the number of parameters in your model.

Parameters; it’s a prerequisite for the training to work that way. But how do you count the amount of data? You need more examples than the number of weights that you have? Yes. If your network has a single layer with 200 weights, for example, you should try to have at least 200 observations. So if you have a model with 8 million parameters, then you need 8 million examples, or thereabouts. That’s why everything you see in image and computer vision, which requires millions and millions of parameters, needs tens of millions of images to train the network. So I’m a bit sceptical.
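Ben’s rule of thumb, roughly one training example per parameter, is easy to make concrete by counting the weights and biases of a network. The layer sizes below are invented purely for illustration (29 outputs would suit an English character set plus blank and apostrophe-like symbols):

```python
def dense_params(layer_sizes):
    """Total weights + biases for a stack of fully-connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical network: 100 inputs -> 200 hidden -> 50 hidden -> 29 chars.
n = dense_params([100, 200, 50, 29])   # 20200 + 10050 + 1479 = 31729
# Rule of thumb from the episode: at least one observation per parameter.
min_examples = n
```

Even this tiny network already asks for tens of thousands of observations under the rule of thumb, which is why multi-million-parameter speech models are so data-hungry.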

CTC really is something. If you don’t know where to start, I suggest you start with CTC. So pick a framework which has CTC included, because not all frameworks have CTC included: TensorFlow has a CTC loss included, but PyTorch doesn’t have it natively, you need to clone some libraries from GitHub, so it’s a bit complicated, but you can still do it. So what kind of preprocessing of the data do you do, if anything? How do you prepare the data to put into these models? The really standard processing relies on spectrograms and mel-scale spectrograms. There are lots of reasons for using them, but basically the first thing to understand is that you slice your audio into chunks, where each chunk is believed to be stationary, so that you can calculate the frequencies within it.

Stationary usually means that the frequencies within that chunk are constant, stable; they don’t change. So it has to be quite a small chunk, 10 milliseconds or so, and you overlap the windows as you go through your audio so that you don’t lose information. Then you stack these 10-millisecond slices, and on the y-axis you have the frequencies, on the x-axis you have the time, and this gives you a heat map of the audio signal. From there you can move on to the MFCCs, although studies have shown that MFCCs are not really necessary for neural networks, because the role of the MFCC is to decorrelate the signal of the spectrogram. Correlated features used to be a problem for a lot of models, but neural networks can deal with the correlated nature of the features in the spectrogram.
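The slicing scheme Ben describes (short quasi-stationary chunks, overlapping windows, frequencies stacked into a heat map) is a magnitude spectrogram. A minimal numpy sketch, with a pure 440 Hz tone standing in for real telephone audio; the function name and parameters are my own, not Rogervoice’s:

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=10, overlap=0.5):
    """Magnitude spectrogram: slice the audio into short, overlapping
    chunks (assumed quasi-stationary) and take an FFT of each one.
    Returns shape (n_freq_bins, n_frames): frequency on the y-axis,
    time on the x-axis, i.e. the 'heat map' of the audio."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    window = np.hanning(frame_len)            # taper each chunk's edges
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # rfft gives frame_len // 2 + 1 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 second of a 440 Hz tone at 16 kHz (stand-in for real speech).
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
```

With 10 ms frames at 16 kHz the frequency resolution is 100 Hz per bin, so the tone’s energy lands around bins 4 and 5; mel filtering and log compression would normally be applied on top of this before feeding a network.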

They don’t need decorrelated features; you can just put all the data in, and as long as the data contains the information the network needs, it will find it. Is there anything you do to the data, maybe, to enhance the features? Something that puzzles me relates to the slicing. I don’t think the brain does a regular 10-millisecond slicing when it listens to speech. So for this purpose you should look at things like a variable frame rate for the spectrogram, so different

sizes of the window on which you calculate the frequencies, for example. There’s also an interesting part, going back to the psychology: there have been working papers in the last few years saying that speech and the brain waves lock together. I don’t know if you know about brain waves, alpha waves, beta waves, theta waves; basically a multitude of neurons firing together. They are said to represent cerebral activity at different scales, so you have for example a scale from 1 to 3 Hz, another from 4 to 10 Hz, etc. So you have different activities

going on in your brain when you sleep, when you speak, when you do nothing; your brain is always active. And the studies have shown that with speech, in fact, the brain waves synchronise with the speech, and these brain waves are thought to allow the brain to distinguish between syllables, phonemes and maybe even words. So the brain waves actually slice the audio signal at different time scales, to isolate words, syllables, phonemes. It’s interesting, and I think it’s never been tested before, so if you want to go ahead and try maybe a filter on your signal based on how the brain waves behave... well, if I were looking to do a PhD in the next three years, that would be the one. OK, that is really interesting. So when you’re listening to somebody

talk, your brain waves are actually coming into sync with the words they’re using. This probably explains why some people are a lot easier to listen to than others: they’re literally putting you on their brainwave. Apparently, for people suffering from dyslexia, it’s because their brain waves can’t synchronise with the speech, so they have difficulty understanding what you say, and difficulty writing, it appears. There’s lots of interesting stuff going on with the brain, in psychology and neuroscience, and it informs your data science work when you think of these things in a holistic way. So what is on the horizon for you and for Rogervoice over the next six to twelve months? Well, first of all, we need to improve our speech recognition engine. Obviously we’re not going to beat Google

today, even if we wish we could; maybe the gap can be lowered. The main thing is I’d like to come up with a model which is cheaper in terms of computation and in terms of the data necessary to train it. As I said, I really think models today are way, way too complicated. Maybe I’m wrong, maybe they do need to be this complicated, but I’d really like to have something which trains more like the brain trains. Like when you learn a language: you don’t just learn for 2,000 hours straight. You start by saying basic sentences, then you keep adding more words to your vocabulary, and you build it up. Your vocabulary is like a tree, it’s like a tree growing. In fact, that’s a problem I see with neural networks sometimes: the number of parameters is fixed, it’s not dynamic. So you

update the weights with gradient descent, that’s how it works, but maybe backpropagation is not the best way of training a neural network. Even Geoffrey Hinton, who helped invent backpropagation, said he wasn’t really sure that backpropagation was the right way, the most efficient way, to train a network, and that maybe that’s not how the brain learns. So I’d really like to see neural networks growing like a tree, or evolving. There are also genetic algorithms, which is something that has not been fully explored to date. With genetic algorithms, I think in deep learning it might be worth looking at, maybe something more about the architecture of

the deep learning model evolving to fit the problem, to find a solution to the problem, as opposed to the training of the model, and the two things can work in conjunction, is that right? You can do both things at the same time, at least I think so. I mean, it’s not just about having the biggest servers and the most GPUs, the most powerful GPUs and computers. Deep learning is also about trying to find smart ways to achieve tasks that the brain does and that we’re trying to replicate on computers. It’s not a question of brute force and computational power; I think we need to be smart, finding ways of doing the same with fewer resources. That’s a research perspective as to

the right way to do things, as opposed to an industry perspective, which is just, you know, get results any way you can, as quickly as possible. It’s often cheaper to just buy a bigger computer than it is to come up with a better algorithm. It’s true that NVIDIA, for example, might be very happy today, given the number of GPUs they are selling. But if you really want to do AI today... artificial intelligence, to me, is sometimes not very intelligent, because you’re just adding more layers and more parameters and more computational power in order to gain a couple of percent, and I think that’s not intelligence. It’s just pattern recognition in very, very restricted domains; every model is very specialised in a particular type of data. Yeah, it’s a long way from being

true intelligence, but we’ll get there one day. OK, so where can people find out more about you online? We have www.rogervoice.com, where you can download the app; it’s available for iPhone and Android. And we are on Twitter, we are on LinkedIn, we are on Facebook; you can look for Rogervoice on the social networks. OK, great, I would definitely encourage everyone to check it out.

You just heard from Benjamin Etienne, a data scientist at Rogervoice, a mobile app that allows deaf and hard-of-hearing people to use the telephone. I was inspired by Ben’s story in a couple of ways. Firstly, as always, it’s great to hear about how new technology changes the lives of people who genuinely need help. Having been a product manager for a mobile health app myself, I know how moving it can be to hear from real users about how the product that you’ve helped build is improving their lives on a daily basis, and as Ben says, it’s much more motivating to work on this type of product than it is to just build products designed to sell stuff. It was also inspiring to hear how Ben deliberately guided his career from a non-technical, then semi-technical, project management role into a truly technical role, through a combination of dedicated self-learning and careful choice of job roles. I think it shows that wherever you are in life, you have the

power to change it for the better, if you have the passion and the drive to make it happen. I love finding out about new technologies such as these, so if you have a product that has a great story behind it, please get in touch and we can feature it on the podcast. That’s all for today, and I hope you enjoyed the episode. As always, you can find the show notes, with links to the resources mentioned in this episode, at voicetechpodcast.com, and you can also follow me on Twitter @voicetechpodcast. Please subscribe to the Voice Tech Roundup, the monthly newsletter containing links to the latest episodes, top tweets and exclusive newsletter offers: head over to voicetechpodcast.com/newsletter and subscribe now. I’ll be back soon with another episode. Until then, I’ve been your host, Carl Robinson. Thank you for listening to the Voice Tech Podcast.
