
Speech to Text – Eric Bolo, Batvoice – Voice Tech Podcast ep.001


Episode description

Eric Bolo is the CTO of Batvoice Technologies, a speech analytics startup based in Paris, France. Eric talks about building a custom speech-to-text system for their flagship product, Call Watch.

He introduces us to speech analytics and audio-mining, and describes some typical applications. We go into detail about speech-to-text (STT) technologies, and discuss the pros and cons of using cloud STT services such as Google speech versus building a custom STT system yourself.

Eric tells us about the latest open source tools and frameworks for building STT systems, and how to get that precious voice data to train our models. We learn how to build and annotate a custom voice dataset ourselves, and hear his advice on starting a voice first company.

This is a great first episode to kick off the series! Eric is super smart, with excellent technical skills and a real passion for voice technology. We already know each other quite well, so I couldn’t think of anyone I’d rather have as my first guest on the show. I know you’re gonna enjoy hearing what he had to say!

Links from the show

Episode transcript

Powered by Google Cloud Speech-to-Text

the growing availability of data really has helped speech recognition reach a new level

Welcome to the Voice Tech Podcast. My name is Carl Robinson, and I’ll be your host for this brand new podcast series about voice technology. Thank you for joining us for episode 1. Unlike some other technology shows, here we’ll be focused on the technology itself: we’ll talk as much about how it works as about what it does. The series will include interviews with people who have actually implemented voice technologies in a project, such as academic researchers, CTOs, engineers and software developers, and also voice interface designers, product managers, human-computer interaction experts and many more.

We will delve into all aspects of voice interfaces and their enabling technologies, such as natural language processing and natural language understanding, voice synthesis and conversion, machine learning and AI, audio engineering and signal processing, as well as applications of these techniques such as chatbots and conversational interfaces, human-computer interaction and social robotics, and related fields such as the psychology of conversation, language and emotion.

My aim is to provide you with a good overview of the voice tech ecosystem: how existing techniques are being used to build voice technology right now, as well as some new techniques that are currently being researched. My hope is that these conversations will inspire you and give you ideas for new voice applications that you can build, and also introduce you to some of the tools and techniques that you’ll need to actually build them.

The Voice Tech Podcast is for anyone interested in the coming voice revolution, whether you’re currently working with voice technology or not. Of course, it will be of particular interest to software developers, coders, makers, startup founders, AI and technology enthusiasts, students and lifelong learners.

So, a little bit about me: I’m a machine learning engineer in training, specializing in voice technology. I’m currently working on voice emotion conversion at a research laboratory in Paris, France. I’m just as excited as you are about the voice-first movement that’s building up around us. I think voice is an extremely important technology that’s going to have far-reaching implications for not only how we access services in our daily lives, but also how we communicate with each other and behave toward each other, and ultimately what it means to be human.

And now, on with the show. My very first guest is Eric Bolo, CTO of Batvoice Technologies, who lives and works right here in Paris. Eric tells us about how he has built a custom speech-to-text system as part of the speech analytics product offered by Batvoice. It’s a really great episode. I’m so pleased with how it turned out considering it’s the first one I’ve done, at least in part because Eric and I already know each other quite well, as I actually worked for him as an intern at Batvoice last year. He’s a great guy, he’s super smart, with excellent technical skills, and he’s really fun and inspiring to work with, so I can’t think of anyone I’d rather have as my first guest on the show. I know you’re going to enjoy hearing what he has to say too.

So with that, I give you Eric Bolo.

Today we’ll be talking about speech-to-text, audio mining and speech analytics in the context of a business intelligence product. We’ll be discussing how to develop your own speech-to-text system versus using a cloud-based service. We’ll be discussing the issues involved in building your own speech-to-text database versus using an open-source dataset, for instance, and we’ll be covering some of the open source tools available to build these systems. These are important topics right now in the voice technology ecosystem, because many, many companies, both consumer focused and business focused, are building products using this combination of listening to voice, converting it to text and then deciphering its content to decide what action or reporting to take. I’d like to introduce today’s guest, who’s going to help us understand all of this. It’s Eric Bolo, co-founder and CTO of Batvoice Technologies. Welcome, Eric. Tell us a bit more about Batvoice Technologies.

Batvoice is involved in conversation intelligence for customer relations. Companies that have customer support call centers, whether internalized or externalized, have difficulty understanding the experience of their customers over the phone. We built a product called Call Watch that helps companies spot pain points, problems and opportunities in the calls using a variety of techniques broadly known as speech analytics. That of course includes speech-to-text, but speech-to-text is a subset of speech analytics. It turns out you can do many other things, and perhaps we’ll touch on those things.

Yep, OK. Could you tell us about what Batvoice does and the value proposition, where you’re based, and what the team is looking like at the moment?

So there’s five of us, including a professor from the Université Pierre et Marie Curie who specializes in human-computer interaction and also social signal processing. Currently most of the members are developers or data scientists, and of course there’s my business partner and CEO, Maxime. So Batvoice is the name of the company, and our product is Call Watch. What it will do is take the thousands of calls that a company receives and handles every week, transcribe those, analyze those, and then turn them into actionable data, such as what percentage of problems were about this or that part of the customer experience. That leads to actions that the companies can take in order to streamline the experience, potentially shorten the calls and overall have better customer relations.

So you tell them what’s going on in the phone calls that their employees are making?

Or that they receive from prospective or current clients. As you can imagine, that’s a lot of data. Those calls are usually recorded, and in most companies there is some quality control done by humans. But because of the time constraints of having to listen to, imagine, 10,000 hours for a single week, it’s straight-out impossible for a human being, or even a team of human beings, to really get a sense of the overall content, of the distribution of topics, of the distribution of problems. The advantage of speech-to-text technology, obviously, is to be able to handle this massive amount of data and summarize it, and potentially also bring up data that will lead to improvements of one sort or another.

Really?

So typically, just to give you an idea of the type of client, we’re currently working with a major tourism company in France. It makes more than a billion euros in revenue every year and it handles, if I’m not mistaken, in French alone about 10,000 calls every week.

Every week?

Every week. Those are call centers that are spread out across Europe. It’s very difficult from a purely human perspective, without the aid of speech-to-text and speech analytics, to get a sense of the big topics and what the company should focus on, and that’s where we come in.

So up until this point the managers have been left in the dark about what’s going on across their organizations from a call point of view?

Well, there’s feedback that they receive from the customer support agents [unclear].

So yeah, Call Watch is our main product at the moment, and what it will do is handle massive amounts of data. Just to give you a very simple overview: we have recordings. Those are made available by whatever recording technology our client uses and made accessible to us either via a REST API that we’ve built or via FTP. They give us an FTP and we go fetch all those recordings, and then we use a stack of speech analytics technologies, most of which we’ve developed internally.

Batvoice is at its core an R&D company, and to the extent that it’s possible and that it economically makes sense, we try to develop our own technology so that we can have better control over its improvement over time. So taking all those calls, we’ll use this type of technology: very simple things like voice activity detection, but also separation of speakers if we have mono recordings, speech-to-text of course, then some natural language understanding to, for example, understand what the subjects were and also to pinpoint problems. Then there’s another side which looks more at the nonverbal aspects of speech, things like speaking turns, things like silence, things like intonation, that can clue the model into what is actually happening in terms of the dynamics and the experience, what the emotions are, that sort of thing.

And all of that is made visible in an easy-to-use interface, and then the client company can decide who has access to which parts of the information. A large group may have several brands, and it may be of interest to the IT department or the marketing department or the customer relations department or the complaints department to all have some access to this information. But I should note that we’re also dealing with sensitive data.

Sometimes we have credit card numbers, we have telephone numbers, we have addresses, and the goal of our product is not to track each individual client’s interaction. In fact we do quite the opposite: we anonymize every single call at the very beginning of the treatment pipeline.

So really the purpose of Call Watch is not to track you. It’s not to figure out what you said over the phone so that we can then send you a marketing email based on the contents of the call.

Rather, we try to understand problems, the kinds of problems that you as a customer or many other customers might encounter, and what the company can do to either solve problems more effectively, requests for information and the like, or make the experience a more pleasant one, a more welcoming one.

OK. Could you describe a typical day in your life, and be as specific as possible if you can? Or is there such a thing in a startup as a typical day?

Well, the role really changes. A startup is an impermanent, ever-changing structure, and the moment it sort of stabilizes it’s no longer a startup, it’s a company, in my view. As a result, with every phase of development the role changes. At the beginning it was mostly, I would say 70%, hardcore research and development, actually building our internal libraries for paralinguistic, nonverbal analysis as well as for speech recognition, and maybe 30% pitching, going to events, meeting prospects.

Then as we started to get into more serious proof-of-concept projects, about six months in, it became a lot of work on the product itself, and so I brought to bear some of the experience I’d had before building websites. That was a very different, more product-oriented phase.

And I think as a company grows, which it’s starting to do, the important thing is to delegate, and then I could focus on infrastructure. Infrastructure has become a topic because as we treat more data, infrastructure needs increase. That’s sort of a whole new topic that’s emerged.

So your role sounds very varied. It sounds like at the beginning you were doing a lot of reading and experimentation in the code, then it was more product-oriented coding, actually building the presentation of the product, and then it moved on to the wider issues of maintaining a platform and actually delivering the service to your first customers. Is that right?

Yes, although the R&D part, the first part, continues throughout. It’s really part of our DNA. The reason why I personally got into this project is because I’m absolutely fascinated by the technology and its potential, and we want to help improve it in our niche. So we’re constantly running experiments, but for me a lot of this is less hands-on. We have people in our team who specialize in that, but I guess as the company grows I will have to delegate more.

OK, let’s discuss the field of speech analytics. What are speech analytics, speech-to-text and audio mining, and could you give us some examples of typical applications for those technologies? What kinds of companies are using them, and for what purpose?

Speech analytics involves really a wide array of technologies, but they all have a common purpose, and that is to understand, to parse, to help machines parse human speech in a way that is intelligible, like a human being can, so they can process and then react to that information.

So you say parsing means extracting the words from the speech, and then moving on to understanding the information contained in those words?

Yeah, although the words are only part of the picture, because you also have all sorts of information that goes beyond the words, such as the tone of voice or the emotion or the social interaction. Just imagine that you walk past two people in the street who didn’t speak your language, or whose language you didn’t speak, but you saw them waving their hands and frowning, with their shoulders hunched. You would know that this conversation wasn’t going particularly well. Likewise, a machine can understand things about speech without actually understanding the words.

So part of the nonverbal communication happens in the speech as well as in the body through gesture?

Yes, absolutely. Speech analytics is itself a subset of the field of social signal processing and interaction processing. That is the field that handles trying to parse interactions between people and also human-computer interaction, so when you’re using your Siri or Alexa, this is the type of speech analytics we’re talking about.

And some of the typical applications of speech analytics: what companies are currently using this, and for what?

The first type you might think of is Alexa, or OK Google, or Siri. You ask a question to the computer. What actually happens is the computer parses that using speech recognition, turns it into words, then captures the intent of your question, tokenizes some of the elements and then brings up an answer by consulting some database or following a script. Then you also have some applications in health. Speech analytics can help with diagnosing cognitive and physical disorders such as Parkinson’s. Those are some very interesting but also very edgy topics.

And then of course there are contact center applications, such as the ones we’re building, and that basically involves taking in the calls, processing them and turning them into actionable information for a client. So from business intelligence through medical diagnosis and chatbot interaction and augmented intelligence, it really has far-reaching applications. It’s a very interesting field to be working in at the moment because the technologies are really booming. The abilities of machines to parse speech have really improved.

Yeah, I agree with that, and what excites me is the ease with which we can integrate microphones into all sorts of social situations. Compared to video, for instance, I feel it’s much more flexible as an input, and that leads to a greater range of applications.

I mean, I guess you could argue that, but you could also argue that a visual medium in a noisy context might also be something interesting. It’s really hard to predict, but there are clear cases where voice is the more intuitive. First of all, we don’t need to learn anything. You know how to speak. It’s the machines that have to understand us rather than us them. I don’t feel that need directly as a developer who works with a laptop and a desktop and still uses keyboards, but a lot of people barely interact with keyboards nowadays, and it’s mostly touchpads where typing is a bit slow, so voice becomes easier. In any case, I do think it’s a very interesting medium and there are going to be a lot of interesting applications coming out of it.

And that’s also what we’re doing with Call Watch. Basically the idea here is to understand phone conversations, dialogues about a relatively restricted set of topics that have to do with customer relations, and how speech analytics can help with this type of interaction. That is both offline, which means analyzing the calls and extracting the relevant information, and also, although we’re not quite there yet, having some sort of real-time assistant during the call that’s going to help the agent with solving whatever problem or question the client has.

So with the online/offline distinction, real time versus offline, are there different challenges?

The online challenge, I would say, is one of infrastructure. Namely, you need very powerful infrastructure, a very good internet connection, and you need that infrastructure to be identical across all your call centers, and those call centers may be spread across the world.

The key to all of these applications is the speech-to-text technology. Where does speech-to-text come from? Is it a new technology or has it been around for a long time, and what are the developments in the field?

They have been quite amazing, actually. This is very recent: you have speech-to-text systems that on some tasks have lower error rates than human beings. But it was a long road for us to get here. The technologies that are used to develop current state-of-the-art speech-to-text systems are, in some basic form, the neural networks that have been around since the 1960s in the field of AI. It’s the growing availability and diversity of data that really allowed for this explosion of technology. Earlier speech-to-text relied on Bayesian models of the HMM type, the hidden Markov model, and that worked to some degree, but nowadays recurrent neural networks, and more specifically bidirectional LSTM networks, have really shown their ability to solve complicated, challenging speech-to-text problems. For example, Baidu open-sourced its speech technology model that trained on 10,000 hours and really produced spectacular results using a recurrent neural network type of architecture.

So the combination of these architectures with the growing availability of data really has helped speech recognition reach a new level. But in spite of the results that you hear about, that some of the models surpass humans, those are relatively narrow tasks, and we’re still pretty far from having human-level speech recognition. I think you know this, you’ve experienced this if you’ve been using any of the mainstream tools today.

It’s greatly improved from where it was two years ago, but there’s still a long way to go.
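The "error rates" mentioned in this part of the conversation are usually reported as word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard metric, not Batvoice's evaluation code; the function name and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: one substitution ("are" -> "her") and one insertion ("or")
# against a 5-word reference gives 2 / 5 = 0.4.
print(word_error_rate("those calls are usually recorded",
                      "those calls her or usually recorded"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.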

RNN, LSTM: that all sounds like quite complex, cutting-edge technology that you really have to be a machine learning engineer to actually put into practice. Is that the case, or are there off-the-shelf systems that anybody can pick up and start getting results with?

You can use neural networks relatively easily without much expertise. However, the more expertise, the more knowledge you have of how they actually work, the more of a chance you have to be able to debug when a problem happens, and inevitably a problem will happen with your training, or you get some strange results, and in that case actually being able to dive into the architecture helps. But you can get acquainted with these technologies and build neural networks very, very quickly nowadays if you have even rudimentary coding skills, using libraries like Keras and TensorFlow, and you have many, many examples. If you’re tasked with something that has probably been done before, or many other companies have faced the same challenge, then you will find many templates out there.

My first question is, why not just use one of the online cloud services? I know that Google, Microsoft, IBM with their Watson product and others offer a range to choose from. We can just send out the speech data and they’ll send us back the text.

So you mean why, as a company, did we choose to spend time building speech recognition when clearly other companies have this type of technology?

Yes. If somebody was thinking of building a product that uses speech-to-text, then why wouldn’t they just use one of these online cloud services, which often provide the service for free up to some limit?

Actually you hit that limit very, very quickly if you’re dealing with the amount of data that we deal with, even with a single client. I don’t remember what the prices are, but I know it gets very, very pricey for a company like ours that specializes in doing that. Also, just consider that using something like Google Speech on the cloud would require us to route all of our client data through Google, and for our clients that makes it very difficult when dealing with sensitive data. Now you’re also talking about something that’s on the cloud, something that has latency, that depends on the quality of your connection, and that introduces another variable for maintaining your system.

As far as whether you can compete with those, I think you can in certain ways. I think that if you specialize on a particular topic then you can reach performance in terms of word error rates that are likely to be lower than those of general speech-to-text, because when you think about it, Google Speech and many other, not open, but APIs for speech, were designed to solve a general problem. We’re building speech-to-text for very specific problems, and so we want to be able to transcribe, but also to adapt to the context. Models are constructed in different ways, but in our case there’s a linguistic part and there’s an acoustic part, and both of these may change as we switch from one client to another. Vocabulary may not be the same as you go from tourism to insurance.

But also, in tourism people call from their home, whereas in insurance, I don’t know, it’s support after a car accident and people are calling next to a highway, so then you have an entirely new challenge of handling noise. It’s maybe not recommended as a general thing. By all means, if you feel like you shouldn’t build your own speech-to-text, don’t. It is a considerable investment in terms of training, in terms of time, and just in terms of expertise and gathering the data, but in our case it makes sense.

So by training your speech-to-text on client-specific data, using the same words and the same acoustic properties as that data, you get higher accuracy results than if you just went to a generalized cloud-based speech-to-text API?

OK, then the next question is: say we’ve decided to build our own speech-to-text system, to keep the cost down, to maintain privacy, and to get the lowest latency and the best quality results. How do we do that? Do we have to build our own tools from the ground up, or are there open source tools and frameworks available for us to use?

I’d recommend not reinventing the wheel but nevertheless understanding the wheel, that’s how I’d put it. There is a lot of research and development work in just trying to understand how those models work and what the state of the art is today, so I would eat papers for breakfast, basically, for months until we ran some experiments. But there are a lot of tools available today if you have the data, and I would mention Kaldi, K-A-L-D-I, an open-source platform that helps with building a pipeline for speech recognition. But it’s not like Google Speech, it’s not like using an API. It does require some effort. In many cases, though, I’m sure if you look on GitHub you may find some examples.

Again, if you’re really serious about this, you want to be able to really finely understand how your model works. For example, in our case we don’t have a model like Baidu’s that trains on 10,000 hours, because there’s no way at our size we’re going to get 10,000 hours of perfectly, manually transcribed audio, given that there’s a 1 to 6 ratio in terms of audio time and annotation time. So for 60 hours you need 360 hours of annotation. In this context we have a system that has an acoustic part and a linguistic part, and we want to be able to fine-tune the linguistic model, so that if there’s a new expression that comes up in a call, and that expression involves perhaps the brand, and it’s really important for you to capture the expression properly, you’re able to go into the language model and change that. So the difficulty is in stacking all those different parts: the acoustic model, the linguistic model, and the phonetic model that tells you how words are pronounced, essentially, that maps phonemes to words.

Could you define each of those for us, and what those terms mean?

So an acoustic model will take the audio signal, maybe transformed in some way and pre-processed in some way, and then interpret that as a sequence of phonemes.

Phonemes, so sounds. Essentially it converts a sound file into a sequence of phoneme labels?

Yep. And then the linguistic model will take sequences of phonemes and interpret what the most likely sentence corresponding to those phonemes is.

And that has a phonetic dictionary that tells you, for each word, how it’s pronounced, or maybe several possible pronunciations of that word, and then you have the statistical model that tells you essentially the frequency of a sequence of two particular words or three particular words for that language and that domain. To be very concrete, that model would say that “hello, how are you” is likely to be a very common expression, versus “apple I want to eat”. If you start speaking like Yoda, then the language model doesn’t think that that is high-frequency; it rightly does not assign a high frequency to that expression. When you stack all of those little pieces, the acoustic model, the phonetic model and the language model, you get a way to go from raw audio to a sequence of words.

What is the difference between the language model and the linguistic model?

OK, yeah, there is no difference. I use them interchangeably, the linguistic model and the language model. Basically it’s a picture of the frequency of single words, so for example the word “I” will be very frequent, and there’s a way to encode that frequency, and then the frequencies of expressions, of sequences of words like “I want”. “I want” is going to be a high-frequency sequence of words because many people say “I want” in many contexts.

And that’s where the big data comes in, because you need a large amount of data to establish the correct statistics, the frequencies for each of these phrases?

You need a whole lot of data to model language in general. I would say we’ve used models that had several million words from Wikipedia in French, for example, and we built a basic generic language model out of that.
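To make the idea of word-sequence frequencies concrete, here is a toy bigram language model with add-one smoothing. It is only a sketch of the general n-gram technique described above; the tiny English corpus, the function names and the smoothing choice are assumptions for illustration and are not taken from Batvoice's system, which was built from millions of words.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count single words and adjacent word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(unigrams, bigrams, previous, word, alpha=1.0):
    """P(word | previous) with add-alpha smoothing so unseen pairs are not zero."""
    vocabulary_size = len(unigrams)
    return (bigrams[(previous, word)] + alpha) / (unigrams[previous] + alpha * vocabulary_size)


def sentence_log_probability(unigrams, bigrams, sentence):
    """Sum of log bigram probabilities over the whole sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log(bigram_probability(unigrams, bigrams, prev, word))
               for prev, word in zip(tokens, tokens[1:]))


# Hypothetical four-sentence corpus standing in for a real text collection.
corpus = [
    "hello how are you",
    "i want to book a trip",
    "i want to eat an apple",
    "how are you today",
]
unigrams, bigrams = train_bigram_counts(corpus)

print(bigrams[("i", "want")])  # "i want" occurs twice in the toy corpus
print(sentence_log_probability(unigrams, bigrams, "hello how are you"))
print(sentence_log_probability(unigrams, bigrams, "apple i want to eat"))
```

On this toy corpus the Yoda-style word order scores lower than the common greeting, which is the behaviour described in the conversation. A production model would use far more data and a more careful smoothing scheme.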

Then every time we have a new client we adapt that, and perhaps we’ll use other corpora, for example transcripts of emails or transcripts of calls, and that will give us a sense of how people speak in this particular context.

So you use a mix of public data and client-specific data? You don’t use only client-specific data?

I mean, you could if you had a lot of it, but to start we didn’t have a lot of data at all, and so we would use whatever means was at hand. So we’ll use Wikipedia to get a very large dataset. As you can imagine, the language of Wikipedia is very different from the language of a customer support call: the style of speaking versus the style of writing, the perspective. The first-person perspective is virtually absent in Wikipedia, whereas it’s ubiquitous in customer calls. But that still gives you a sense of some words, so there are some commonalities. Then you adapt it, and you try to accumulate as much data as possible that’s as specific to your target domain as possible.

On the subject of data, there are three main options you have when you’re looking to train the model that you’re building with the frameworks we discussed before: you find some for free, or you buy it, or you build the data yourself, you actually record or write the data yourself. Could you give us an idea of the pros and cons of each of those three?

There are three options, and it really depends on your goals. There are very few free non-English databases. For our first clients, it turns out it was French. We’re a French company based in France, and most of our clients and prospects have French-speaking customers.

Of course, that’s something that native English speakers don’t often think about, because most of the resources online are in English. It’s a different story in another language.

Exactly, it’s a totally different story in another language. So that means you need the annotations, and given our model architecture we needed about 60 hours of very high-quality annotation. Then you could think, OK, what are some non-free databases for purchase? If you’re curious about that, I would recommend ELRA, which lists some databases. We didn’t find any that were suitable for us.

Like I said, these databases are audio plus, you said, annotations, but what does an annotation look like? Is it just an accompanying text file?

I mean, actually yes. Well, it depends on your model, actually, and that is something I overlooked when we discussed the evolution of speech-to-text. It used to be that you needed word-level timestamp annotation. That means for every word you need to know exactly when it was spoken, and that made the annotation process very, very long. But now technologies like connectionist temporal classification, or CTC, help with the alignment problem, figuring out when the phonemes and when the words were said. Sorry, I’m not going to go over all this, but the bottom line is all you need is the beginning and end of an utterance.

And what counts as an utterance? In our case we were using phone data, so we used voice activity detection to detect segments of uninterrupted speech that would have a lower and upper time limit, and if I recall correctly our utterances were from 8 seconds to 25 seconds long.
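The segmentation step can be sketched as follows. The conversation does not describe the actual procedure, so this is one plausible approach under stated assumptions: the voice activity detector is assumed to have already produced speech intervals as `(start, end)` pairs in seconds, neighbouring intervals separated by a short pause are merged, and only merged spans within the 8 to 25 second limits are kept. The function name and the `max_gap` parameter are hypothetical.

```python
def group_into_utterances(segments, min_len=8.0, max_len=25.0, max_gap=0.5):
    """Merge VAD speech segments into utterances bounded by min_len and max_len.

    segments: iterable of (start_s, end_s) speech intervals from a VAD (assumed input).
    max_gap:  longest pause, in seconds, still treated as the same utterance
              (a hypothetical tuning parameter, not a value from the interview).
    """
    utterances = []
    current = None
    for start, end in sorted(segments):
        if current is None:
            current = [start, end]
        elif start - current[1] <= max_gap and end - current[0] <= max_len:
            # Short pause and still within the upper limit: extend the utterance.
            current[1] = end
        else:
            utterances.append(tuple(current))
            current = [start, end]
    if current is not None:
        utterances.append(tuple(current))
    # Keep only spans that respect both the lower and the upper time limit.
    return [(s, e) for s, e in utterances if min_len <= e - s <= max_len]


# Hypothetical VAD output for one call, in seconds.
vad_segments = [(0.0, 4.0), (4.3, 10.0), (12.0, 14.0),
                (20.0, 30.0), (30.2, 44.0), (44.1, 50.0)]
print(group_into_utterances(vad_segments))  # [(0.0, 10.0), (20.0, 44.0)]
```

Spans that are too short are simply dropped here; a real pipeline might instead attach them to a neighbouring utterance or split overly long stretches of speech.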

that’s what we call an utterance in our context

so you automatically chop up your sources and then you have annotators

write the corresponding text the sentences in text transcribed from these segments of 8 to 24 seconds of audio

and a timestamp for the beginning and the end of the utterance and that’s it is that an annotation

that’s it but you need 60 hours of that

sounds pretty labor-intensive
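To make the shape of such an annotation concrete, here is one possible record layout: the transcribed text plus the start and end time of the utterance within its source recording. The field names, the file name and the JSON-lines storage are all hypothetical choices for illustration; the episode does not describe the format Batvoice actually uses.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class UtteranceAnnotation:
    audio_file: str  # hypothetical name of the source recording
    start_s: float   # start of the utterance, in seconds from the start of the file
    end_s: float     # end of the utterance, in seconds
    text: str        # manual transcription of the utterance

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


record = UtteranceAnnotation(
    audio_file="call_0001.wav",
    start_s=12.0,
    end_s=23.5,
    text="bonjour je vous appelle au sujet de ma facture",
)

# One JSON object per line is a common way to store such records.
line = json.dumps(asdict(record), ensure_ascii=False)
print(line)
print(record.duration_s)  # 11.5
```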

it is very labor-intensive and when you’re a startup you know you’re hammered with the principles of being lean and all that stuff but in our case it seemed clear that we would need to get our hands dirty and actually build our own databases because the data is scarce and I think that’s the most salient feature of the current you know level of AI basically the models are there the data is relatively scarce there is fast progress but it’s all coming from the big actors and those aren’t academics anymore those are the Googles and the Baidus of this world

I wanted to say this it feels ironic that on the one hand the technology has advanced so quickly because of the huge amount of data that’s now available but for startups the problem remains that there’s just not enough relevant data in an appropriate format to actually get good results so getting hold of data that’s open and free to access in the language that you need in the domain that you need is still a challenge

right and when you

consider it it makes total sense from a strategic perspective I mean the defensibility of an AI enterprise is mostly I think today some of the defensibility comes from the technology and the research absolutely and there’s a lot of interesting things happening but I think most of the defensibility meaning you know you can establish a business model but then also be sure to keep that business model

rests on the data so what you’re doing is a virtuous cycle where you go to clients you collect niche-level data and then you build upon that and then you’re the only one to have that sort of data so of course you have the best models and that’s sort of how it works but as far as advancing you know the progress it does have some pernicious effects namely the niche-specific data is

a lot less available than the models themselves

and then you mentioned we can just buy these databases from companies such as ELRA do they not have the domain-specific databases in the right language or is it cost is it just prohibitively expensive because the expense of building your own is quite high

actually so if you find a database but we weren’t able and I would say just as a broad statement it wasn’t very rich I think there’s not that much information and there certainly isn’t like a dataset for every sector for example or a dataset for every condition I mean we’re very very very far from that

okay so there’s always going to be a need to either build your own speech database from scratch or adapt one that you can get your hands on in some way

I wouldn’t go as far as making that prediction because and ironically I sort of hope for

the opening up of data I think that’s that

maybe we can find a way to make that data widely available to build datasets collectively and I think this is important because otherwise the competitive advantage of these giants is never going to be challenged because the data will always be on their side you know they open source the technology the entire community will help improve it but really only those that have data can benefit from the technology

and what’s available in terms of free open source data

for French there’s very little open but I wanted to mention the Mozilla Common Voice project which is trying to accumulate data

and for English you do have more data that’s available you have TED-LIUM that’s a famous dataset which is a dataset of transcribed TED Talks you have VoxForge which are audiobooks so what I was saying about the scarcity of the data is not as true for English although you know if you’re working on customer support calls neither TED Talks nor audiobooks quite correspond to that

okay so tell me a little bit about building your own custom speech database for those cases where you don’t have anything in your domain what are the main components that go into that the main considerations

well you need to be confident about your models first of all so as much as possible train your models on free data just to check that your models actually work and that you control the entire pipeline that is before transcribing your first call then you need an annotation interface and we chose to build our own because it was very difficult to find an annotation interface that suited our purposes and we wanted to be able to control it to

control its evolution to be able to improve it

so an annotation interface is a way that you can listen to a sound file from a recorded conversation and then manually transcribe the words that are in the audio file is that right

that’s exactly right

so it’s a transcription but you’re not adding the timestamps as well for instance

you don’t have to because the pre-processing step was to chunk the audio and during the chunking the beginning and end of each chunk was recorded

so it’s just the words

right exactly but they need to be orthographically perfect and you know sometimes the user may stumble because the word is an unfamiliar word to her so then you need to hire and again the 1 to 6 ratio we’re trying to improve that but that’s currently what it is today 1 to 6 sometimes the annotator really doesn’t know what the domain is about that adds difficulty bad audio quality can increase the difficulty but I think we’re going to build on our experience and hopefully improve that

so 1 to 6 that’s 6 hours of annotation for every hour of audio and

very labor-intensive

it is very labor-intensive

so that’s 360 hours of labor even for a small startup that will take the whole team weeks right

and there’s no question that you don’t want your developers your data scientists your business developers or your CEO transcribing away this is a skilled task so you want to be able to find people who can transcribe that for you all the while respecting whatever clauses you have with your clients from whom you got the data so you need to make sure that you’re not moving your data to places that you shouldn’t

and what services are available to annotate data within those constraints respecting the privacy requirements of your client
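The effort figures quoted above follow from simple arithmetic: about 60 hours of audio at roughly 6 hours of annotation per hour of audio gives 360 hours of labor. The sketch below reproduces that and converts it into calendar time; the team size and the 35-hour working week are hypothetical values chosen for illustration, not figures from the episode.

```python
def annotation_hours(audio_hours, ratio=6.0):
    """Labor needed to transcribe audio at `ratio` hours of work per hour of audio."""
    return audio_hours * ratio


def working_weeks(labor_hours, annotators, hours_per_week=35.0):
    """Calendar weeks needed when the labor is shared between several annotators."""
    return labor_hours / (annotators * hours_per_week)


labor = annotation_hours(60)  # 60 hours of audio at the quoted 1 to 6 ratio
print(labor)  # 360.0

# Hypothetical staffing: 3 annotators working 35 hours a week each.
print(round(working_weeks(labor, annotators=3), 2))  # 3.43
```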

well it depends on those constraints you can imagine people think of Mechanical Turk CrowdFlower but we chose actually personally hiring annotators who would come physically to our office and annotate that way we can directly communicate with them and we can get feedback because we consider that you know the annotation interface we’re building is in a way part of our product it’s a way of annotating you know it goes beyond speech-to-text we also use it for sentiment analysis and for many NLU natural language understanding tasks it is an endeavor but the results pay off ultimately the results do pay off because then you have a speech-to-text that really is well suited I mean just to give you a simple example if your client has a brand and the name of that brand

is not known has never been encountered by your speech-to-text model there’s no way your speech-to-text model will ever recognize it and then generally this labor-intensive process grows less labor-intensive as you progress and we’re not there yet but once we have enough databases we can measure their generalizability and potentially we’ll reach the point where we have decent generalizability and so already we’re transcribing less with every new client because we already have a base speech-to-text model which we then adapt

and how much have the results improved having built your own speech database as opposed to using something off the shelf

well that has varied tremendously it went from a 10% word error rate improvement all the way to 30% in some cases and basically a lot of that has seemed to depend on the language model

so if the language was really special or niche then off-the-shelf speech-to-text tended to perform a lot less

well than our own speech-to-text which is trained on that particular linguistic context

so the more niche the domain the more the need for a custom database of course
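Word error rate, the metric behind the 10% to 30% figures above, is commonly computed as the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition; the example sentences are invented, and the episode does not say how Batvoice computes or normalizes the metric.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Edit distance by dynamic programming, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# An unknown brand name typically shows up as substitutions and insertions.
print(word_error_rate("please call batvoice support today",
                      "please call bad voice support"))  # 0.6
```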

yeah and also the greater the advantage the competitive advantage of building your own

alright so a final few questions what’s on the horizon for Batvoice are there any new projects that you’re working on that you can talk about and what will you be focusing on for the next 6 to 12 months

well we have yes many projects underway with other clients in tourism and insurance and banking and retail and so the goal right now is we road-tested Call Watch with two large clients up until now and the goal is in September to have onboarded 3 additional clients first for the French language and that involves basically

robustifying our infrastructure making the product more easily replicable more easily configurable from one client to the next so that


it would take less than a week for the product to be installed at any client so today really we’re trying to replicate the product that we built with our first client that’s going to be you know really the top priority and then parallel to that we have some

R&D topics that we’re looking into improving our speech separation something we didn’t touch on but a lot of recordings in contact centers are mono so that’s one channel and you need to separate the voices sometimes it’s challenging and so that’s one big topic and then there’s also all sorts of other research and development projects that we’re doing on the side I mean parallel to this infrastructure effort and they have to do with the paralinguistic the non-verbal aspect so we’re improving our anger detector for example to pinpoint parts of the conversation where you know the client really gets pissed off

so paralinguistics as well as scaling your current product Call Watch across multiple clients

scaling is really the top priority today scaling and making the product as flexible as it can be without over-adapting to each client because the acquisition cost needs to remain low

so it’s a balance between being as specific as possible for one client and being as general as possible to apply to

and as a B2B startup you want to avoid the trap of becoming a consulting firm

you want a product that actually replicates I know this is completely obvious to most people in the field you’re playing anywhere I can you play

oh yeah yeah I know there’s going to be plenty of people who are getting excited at the prospect of doing precisely something very similar to what you guys are doing developing cutting-edge technology putting it in place at a client and building a startup around that what advice could you give to someone who’s at the start of this journey that you’ve just been through any routines or habits that sort of thing that you found to be helpful and that you’d like to pass on

okay without you know trying to sound too much like a self-help counselor I would say just basically sports doing a lot of sports and meditation in a startup it feels like I mean things move quick and you have to be highly reactive it can be very draining yeah you need to start my first Tango is it is exhausting so you really need to be able to take care of yourself in one way

I do sports I meditate and I try to keep times when I’m not thinking about the stuff although

I don’t always succeed as you said and there needs to be a little bit of insanity and excessiveness in any co-founder because otherwise why would you go through this harrowing ordeal but it is challenging and so

then I would say you know to me what’s really important is finding the right people to do this with

it’s a challenge you’re going to hit roadblocks

breakfast can wait and see you need to make sure that the people you’re going to do this with are people that you would enjoy spending your time with and that you

trust totally trust and that takes time so the partnership building that partnership can take months it can take a year it can take sometimes even more but it’s totally worth it once you realize that you have a partner with whom you have complementary skills and I would also say do something that’s interesting at least some part of it is really interesting to you and

that was definitely the case for me I really sort of fell in love with the technology and its potential I have many many ideas for how we can grow for possible uses that go far beyond the current application

can you give us an example

I’ve been playing with the idea of a note taker or sort of like an assistant for conversations the conversation assistant would be there it’s basically a little like an Alexa-like thing that listens in on our conversations and then as we ask for information the assistant comes up with suggestions and then it also summarizes our

meetings and sends a note that’s just one example let’s say I’m also very excited with what’s going to happen in terms of vocal bots but I think there is a huge challenge in making the conversations more fluid and really making sure that we can actually dialogue with a human being with the machine sorry and that involves you know going beyond speech-to-text it involves teaching bots how to be tactful how to be socially appropriate and how you define that and what the ethical implications of that are is a huge challenge

what I like about the proliferation of these Alexa devices and the like it’s just the fact that they’re becoming so popular and they’re capable of collecting data passively it’s both a blessing and a curse that is it’s like a huge security concern on the one hand but it also opens up a whole range of possible applications just like being able to collect what you say without you actually having to actively engage the device never mind the whole conversational aspect of actually talking to the computer like you say with the meeting note taker that can be extended in all sorts of directions to inform you about your own personal life or your interactions with people your success at work yeah I just think we’re going to see more and more of that as time goes on

yeah and generally I’m really interested in augmented intelligence projects and so anything that uses voice as a medium and because voice is so unobtrusive like you say it has all this potential for being used in a very casual way in our day-to-day and it’s extremely rich information when we speak we convey a lot more information than the words that’s one thing we relay all sorts of social signals and correctly interpreting that is really key to developing good interaction with machines I’m really interested in that you know in the context of for example autism where it’s been shown that robots can help autistic kids with

with autism like a pair of glasses that tells them when the person that they’re talking to is angry for instance just helping them recognize the emotional state of the other person

right exactly so in a way that sort of opens the door to all sorts of augmented intelligence that goes beyond let’s say analytical intelligence it’s also social intelligence and emotional intelligence and it’s really really difficult to predict where this goes but wherever this goes it’s going to be interesting and it’s going to be challenging both technically and ethically and you know that is one of the reasons why I’m so interested in the field I think it just has far-reaching implications yeah I think you know I think I could happily work you know for the next 10 years on this topic and not be bored

I’m sure many of the people listening feel the same way where can people find out more about your work about Batvoice online

batvoice.com yeah the reason why it’s bat is because bats have excellent hearing

and then yeah they navigate through space by analyzing waves we will also be at VivaTech we will be at the HP pavilion and I think that’s the 27th of May end of May

and there’s interesting work on voice there as well it’s going to be really interesting thank you so much we’ve learned a lot about speech-to-text building our own speech-to-text system it’s given us a lot more understanding of the resources that we need to start building our own systems like that

thank you for having me

you just heard from Eric Bolo the CTO at Batvoice Technologies

Eric gave us loads of useful information about speech-to-text systems

I learned that for many specific use cases and business applications the cloud-based services out there are simply insufficient and that you may need to consider building your own there are a number of open source frameworks and tools out there but the key ingredient is the data in fact the main competitive advantage in a voice-first business is not the algorithms it’s access to the right data

free and paid voice datasets are in short supply especially in non-English languages and in specific domains so it’s often necessary to build and annotate your own voice dataset so you can train your models however this is a time-consuming and expensive task that can’t be undertaken lightly all of this underlines the need for more open source voice dataset projects in order to stimulate the voice development ecosystem

you can find all the show notes with links to the resources mentioned in this episode at voicetechpodcast.com that’s voicetechpodcast.com you can also follow us on Twitter at voicetechpod that’s p o d at the end this is a brand new podcast that really needs your support so if you like the show and you want to have more episodes like that please could you head over to iTunes or Stitcher and leave us a 5-star review

another way to support the show is to just spread the word tell one friend or colleague about us or mention us on social media I really appreciate anything you can do to get the word out

lastly if you’d like to help me with the cost of producing the episodes please consider becoming a patron visit patreon.com/voicetechpodcast for as little as $2 a month

that’s all for today episode one is in the books stay tuned for more great episodes I’ve been your host Carl Robinson thank you for listening to the Voice Tech Podcast
