Signal Processing Basics for Audio – Dogac Basaran, CNRS – Voice Tech Podcast ep.008

Dogac Basaran

Dogac Basaran is a post-doctoral researcher at CNRS, the French national scientific research centre. Today, in part 1 of 2, Dogac gives us a crash course in signal processing, where we learn what signal processing is and discover some of its many applications.

Leveraging his teaching experience, Dogac uses simple language and real-world examples to explain the fundamental signal processing concepts that are used in voice technology today. He defines frequency, period, and stability, and describes how sound cards use sampling and the Nyquist theorem to convert analogue signals into digital.

He then recommends some good educational resources and software packages, so you can learn more about signal processing and get started right away on your own programming projects.

For part 2 of this conversation, visit Hum a Fingerprint, Extract a Melody.

Correction [21/08/2018]: The term ‘stationarity’ was cited in the episode, but this should have been ‘stability’.

Episode transcript

Powered by Google Cloud Speech-to-Text

This concept actually amazes me, this analogue-to-digital conversion. It has three basic steps. The first one is the sampling.

Welcome back to the Voice Tech Podcast. This is the show that brings you the latest research and developments in the field of voice technology through a series of entertaining, inspirational and informative conversations with voice technology experts. You'll hear about the latest products and concepts in voice, get new ideas for your own voice projects, and learn about the tools and techniques that will turn those ideas into reality. Today you'll hear from Dogac Basaran, a postdoctoral researcher at CNRS, the French national scientific research centre. This conversation is a bit more technical than the previous episodes, so I split it into two parts to make it easier to digest. In part 1, which you'll hear today, we cover the basics of signal processing as applied to music and voice. In part 2, which will be released shortly, we'll explore Dogac's research into audio fingerprinting

and melody extraction. We'll also discuss query by humming, and why learning how to build neural networks has become an essential skill in the field. So today, Dogac gives us a crash course in signal processing, where we learn what signal processing is and discover some of its many applications. Leveraging his teaching experience, Dogac uses simple language and real-world examples to explain the fundamental signal processing concepts that are used in voice technology today. He defines frequency, period and stationarity, and describes how sound cards use sampling and the Nyquist theorem to convert analogue signals into digital. He then recommends some good educational resources and software packages, so you can learn more about signal processing and get started right away on your own programming projects. Just a quick reminder: if you haven't done so already, go ahead and check out the Voice Tech subreddit on Reddit. It's a place for listeners like you to

gather and interact. You can ask questions and get answers, and post links to things that you find on the web: voice technology news, products, research, whatever you like. So head over to Reddit now and subscribe to the Voice Tech subreddit, or just visit voicetechpodcast.com/reddit. So with that, I bring you Dogac Basaran. So I'm here with Dogac Basaran, a postdoctoral researcher at CNRS, which is the Centre national de la recherche scientifique, or French national scientific research centre. Hello. Thanks very much for joining us. We're here in [unclear] at IRCAM itself, the research centre, looking out onto the Place Stravinsky. Yeah, it's a really nice view. A beautiful setting [unclear].

OK, so could you first of all give us a bit of an idea about your background: where you come from and what led you to where you are now? I'm from Turkey, from Antalya. I did my PhD at Boğaziçi University in Istanbul. I was a musician as well. While I was an undergraduate we had a band [unclear]. We went to record a demo once at a professional music studio, which was like a space station at the time for me, and there I figured out the relationship between electronics engineering and signal processing and music. OK. Musical signal processing in general, that I discovered that day. And after that I became a better student, started to work on signal processing, trying to understand the concepts,

and they changed my life, actually. Then I went into a master's programme, then I went into the PhD. OK, so your interest in signal processing came from your passion for music, and then you did a master's in it, then you went and did a PhD. In my master's I tried lots of signal processing algorithms, like SOLA [unclear], which are used for time stretching or pitch shifting. You take your voice, use the algorithm, shift the pitch, and now you have the third. Shift it again, and now you have the fifth, so you can make a harmony with your own voice; you can do the backing vocals yourself. In this episode we wanted to cover a bit about the core signal processing concepts, because I know it is pretty complicated and it's quite alien to most of us. I don't think many people who listen to music are thinking about the music on a signal processing level. But if you want to manipulate the sound,

would you say it is important to understand the basics, at least as to what's going on, in order to be able to use the algorithms that are available? To be able to use the algorithms, it's more of a practical point of view, so you don't really need to understand the whole thing. I mean, for example, every guitarist uses a compressor pedal on the stage. They know the sound, how it sounds; they know the result, so they can manage it, but they don't know what it is doing exactly. Sure, sure. I mean, from the practical point of view you really don't need to understand everything. But if you're interested in that stuff and trying to understand everything, then it's more enjoyable, so that's why I wanted to [unclear]. You can appreciate it on a deeper level. And it's often the 80/20 rule as well: a little knowledge goes a long way. Signal processing, in a general sense?

OK, so a signal is basically an electrical signal that you are able to record. For example, a microphone takes the air pressure that you create with your sound and turns it into an electric signal, so that you can modify it, analyse it, synthesise it, do lots of stuff on it. I mean, for example, if you record your vocals, you talk through a microphone. There's a coil inside with a magnet that shifts with the air, in a dynamic microphone like we use now. Right, so it's literally a physical system. Yeah, yeah. It changes the magnetic field so that the coil creates the electric signal, and that is your sound in the computer, actually. Then you take the sound and you do whatever you want to do. For example, you can add echo, so that your voice could sound

like in a cathedral as well, or you can add [unclear] like that, or you can add distortion on your guitar recording. You can do lots of modifications according to your task, whatever you want to do. So signal processing, basically, what it does is this: you record the signal somehow with a sensor, the microphone, the line, whatever you have. Then according to your purpose you modify it, then put it back so that you can listen to it in the sense that you want to hear it. So that is what signal processing is. Great. I had a quick look on Wikipedia and it sums up exactly what you just said: signal processing concerns the analysis, synthesis and modification of signals, which are broadly defined as functions conveying information about the behaviour or attributes of some phenomenon, such as sound, images and biological measurements.

Actually, it is a fact that when you say signal processing, most people understand image processing; they don't understand audio processing. Really? To me that's not... I wouldn't say so. For me it's [unclear]: when I say signal processing, I mean audio signal processing. But in the rest of the world, when you say signal processing, people understand image or video. Really? In my mind, when I hear signal I just think radio, I just think of electric waves moving through the air. So to think about the voice as a signal, or biological measurements, requires a little bit more abstract thought. I mean, whatever you represent in a computer with ones and zeroes [unclear] can be a signal. So signal processing is about capturing real-world phenomena and encoding them so that a computer can manipulate them. I noticed that it's

not just audio and image; there's actually a whole list of fields that signal processing is used in. I couldn't even name them off the top of my head. Again, I looked online: earthquakes, [unclear], EEG signals, ECG. They're all very similar. Financial signals. I mean, from the mathematical point of view you have a time series, and you want to manipulate it somehow and end up with another time series. Yes. From the mathematical point of view they are all the same. If you have financial data, if you have ECG or EEG data, if you have audio data, they're all the same. Fantastic. OK, so now that we've covered what signal processing is and the types of signals that we can have, let's talk about some of the basic concepts in signal processing. I understand

that you actually taught signal processing for a bit. Yeah. So you're used to explaining these quite complicated topics to people who have never come across them before, and I guess that includes most of the people listening to this programme. Well, I'm sure they have encountered it at some point, because signal processing is everywhere; you just don't notice it, you don't see it. I mean, most of my students understand the concept very well when I say: you're using your EQ in your car when you're listening to music. So, frequency, do you know what it is? It's actually, I mean, how many times a signal repeats itself in one second: that is what the frequency is. If it repeats itself 100 times in one second, then its frequency is 100 hertz. So it's the number of oscillations, of repetitions, in one second, or in a unit period of time. OK. So the frequency representation comes from the fact that

you can represent your signal with a sum of sinusoids, and each sinusoid is a single frequency. OK, so going back to maths class. Simply think that everything can be written as a sum of sinusoids, and every sinusoid has a frequency, so that in the end you have a frequency representation. That's actually what is used in EQs, because when you talk about frequency you talk about the bass, you talk about the middle frequencies and you talk about the high frequencies. So for example, if you want to hear more bass when you're in your car, then you increase the bass in your EQ, right? Right. So that is increasing the amplitude of the sinusoids which have low frequencies [unclear]

[unclear] within a period. But a low frequency, if it is a low frequency, so if it, let's say, repeats itself two times, then the period is very high. So the period is the length from one peak to the next in the wave, as it's just going up and down, up and down, and the frequency is the number of times, the number of peaks, you get within one second. Yes. OK. So, long story short, when I teach my students I like giving real-life examples. For example, there is this phenomenon called stationarity of a system. Everything around you is a system: you give the input, you take the output, right? And in between there is a system. You have an amplifier: you give an input with your guitar, you take a sound from your amplifier. So everything is a system, and there is stationarity in systems. So for example, in a

concert, when you hear this squeak, a very disturbing high-pitched noise, when they put the microphone too close to the speaker, it's a very good example of a non-stationary system. Non-stationary means that you give a logical, meaningful input to your system and it outputs something crazy [unclear]. So stationarity is about predictability: a certain input will always give [unclear]. Exactly. The feedback in concerts is actually like this: you talk into your microphone, it goes to the mixer and it is amplified, and comes to your monitor and goes to your mic again, so it kind of makes a loop, and it increases the sound, and then there's this irritating high-pitched noise you hear. OK. So one of the other

fundamentals of signal processing is the sampling theorem. Are you able to concisely explain this concept? This concept actually amazes me, this analogue-to-digital conversion. It has three basic steps. The first one is the sampling. So the idea here, which most people do not think about, is this: in real life everything is analogue. You hear analogue, you see analogue; nothing is digital. Why do we make it digital? Because we want to represent it in a computer, so that we can use the computational power there to make the modifications, because it's a very, very powerful tool that we can use. So the thing is, we somehow need to put analogue data inside our computer, right? Because computers only deal with ones and zeroes.

That is great. Yeah, yeah. So the thing is, an analogue, continuous signal mathematically means that the nearest two points have infinitely many points in between them. So there are infinitely many points in a continuous interval, and you cannot put infinitely many things in a computer, because the size is not enough. Got it. So you need a discrete representation, and sampling is actually getting that discrete representation. So there is this thing called the Nyquist theorem. The Nyquist theorem tells you that the highest frequency in your signal... OK, let me very briefly explain the Fourier series [unclear]. OK, go ahead. As I said, everything in real life can be expressed as a sum of sinusoids, OK? So everything has a frequency, right? Yes. So when I express a signal, the highest

frequency, the sinusoid that has the highest frequency, is important to me. If I can express that with two samples, just two samples, from one peak one sample and from the other peak another sample, just two samples within one period... Yes. If I do that for the highest frequency, then I will be doing an even better sampling for the rest of the frequencies, because I'll be getting lots of samples from them. OK, so that's a really important point. So as I'm talking now, this microphone is recording my voice, and in my voice there are low-frequency sounds and there are also high-frequency sounds, maybe when the air whistles through my teeth or something. And the highest frequency component of my voice is the one that you need to capture with the sampling rate. And you're saying that the sampling rate needs to be short enough, or fast enough, to be able to capture enough information

about the highest frequency component. And it's not that you're analysing the sound, decomposing it into sinusoids and then sampling. You're doing the sampling on the original signal. If you're able to sample the highest frequency, you're already sampling the rest of the frequencies well enough. It's also important because when we start talking about this collection of sinusoids, you imagine, you know, a whole array of sinusoids coming at you in the air, but it's not true. There's only one signal: there's only one oscillation of the air, there's only one oscillation of the voltage coming out of the mic. It's just that theoretically you can break it down into multiple sinusoids. And I mean, the thing is, you take two samples from one period of a sinusoid: one from the peak up and one from the peak down [unclear]. OK, right, so from the high to the low. Right. OK, so I mean, this is another very

[unclear]. The minimum, this is the minimum you can have; that is the idea, at least. So which means that if you have frequency f as the highest frequency, then you need 2f as the sampling rate, right? Because from one period you take two samples. I see. So if it repeats itself 100 times in one second, then you need 200 samples, because you take two samples from one period. So for a 100 hertz highest frequency you need to sample at 200 hertz to capture all the information. And does this exist in real life? So let's come to real life: what does it mean, what does the sound card do? OK, so the idea here is that the human hearing system cannot hear over 20 kilohertz. 20 kilohertz: 20,000 hertz. Above 20 kilohertz, you cannot hear it.
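
As a rough illustration of the two-samples-per-period rule described above, here is a minimal Python sketch using only the standard library. The function names `min_sampling_rate` and `sample_cosine` are invented for this example and are not from the episode. It takes the 100 hertz example from the conversation, computes the period and the minimum sampling rate, and shows that sampling a cosine at that rate picks up exactly one peak and one trough per period.

```python
import math


def min_sampling_rate(highest_frequency_hz):
    """Two samples per period of the highest frequency, as described in the conversation."""
    return 2 * highest_frequency_hz


def sample_cosine(frequency_hz, sampling_rate_hz, num_samples):
    """Return num_samples values of cos(2*pi*f*t) taken at t = n / sampling_rate_hz."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(num_samples)
    ]


frequency_hz = 100                         # the signal repeats 100 times per second
period_s = 1 / frequency_hz                # length of one repetition, in seconds
rate_hz = min_sampling_rate(frequency_hz)  # samples per second
samples_per_period = rate_hz * period_s

# Rounded so that tiny floating-point errors do not clutter the output.
samples = [round(s) for s in sample_cosine(frequency_hz, rate_hz, 6)]

print(period_s, rate_hz, samples_per_period, samples)
```

The samples alternate between +1 and -1, the "high" and the "low" of each period. Note that this is the limiting case: in practice the sampling rate has to be strictly greater than twice the highest frequency, because at exactly twice the frequency an unlucky phase (a sine instead of a cosine) would place every sample on a zero crossing.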

Dogs can hear, I guess, up to 25 kilohertz; their hearing is better than ours. You know what is called a dog whistle? If you want to make a dog go away... You cannot hear it, but the dog can. It's like someone screaming in its ear; it's a very loud noise to it, but you cannot hear it. I've seen, actually, that they've developed these for kids as well, to keep the adolescents away from the front of the shops. I've actually seen that: kids can hear frequencies that adults can't. When you get older, it becomes harder and harder to hear the high frequencies. For old people, always. That's why

sometimes people cannot hear you. I mean, sometimes you say their name and they cannot hear you. Yeah, sure. That's because the high frequencies are gone now. So the thing here is that, since the human hearing system cannot differentiate anything after 20 kilohertz, you don't need to get the sounds over 20 kilohertz. Yeah, you can ignore them. You can ignore them; you can just cut them off. So you apply a low-pass filter, which is called an anti-aliasing filter in signal processing, that cuts the frequencies above, for example, 20 kilohertz, and then you sample. If you cut the signal at 20 kilohertz, then your highest frequency is 20 kilohertz for sure, so then you are sure that you can sample that with 40 kilohertz. Right?

I see. So first of all you just put it through a filter that removes all the high frequencies, then you sample it. Exactly. You don't worry about missing [unclear]. It is 44.1 kilohertz, so the highest frequency in your signal is 22 kilohertz: 22,050 hertz, actually, since the number is 44,100. So 22,050 is the highest frequency that you can have in music, in a voice recording, on a CD. Which makes sense, because it's based on what we humans can hear. Exactly. Above that you cannot hear anyway [unclear]. So that is called the sampling theorem. All right, that's really, really good to know. And it applies equally to music, to sound, and to any other signal, but specifically we're talking about voice and music here.
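
To make the role of the anti-aliasing filter more concrete, the following Python sketch (standard library only; `nyquist_frequency`, `apparent_frequency` and `sample_cosine` are hypothetical helper names, not anything named in the episode) computes the 22,050 hertz limit for the 44.1 kilohertz rate mentioned above. It then shows what would happen to a tone above that limit if it were not filtered out first. The 25 kilohertz test tone is an arbitrary choice for illustration.

```python
import math

CD_SAMPLING_RATE_HZ = 44_100


def nyquist_frequency(sampling_rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return sampling_rate_hz / 2


def apparent_frequency(tone_hz, sampling_rate_hz):
    """Frequency in [0, fs/2] that a sampled pure tone is indistinguishable from."""
    folded = tone_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


def sample_cosine(frequency_hz, sampling_rate_hz, num_samples):
    """Return num_samples values of cos(2*pi*f*t) taken at t = n / sampling_rate_hz."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(num_samples)
    ]


limit_hz = nyquist_frequency(CD_SAMPLING_RATE_HZ)

# A 25 kHz tone lies above the limit. Once sampled, it produces the same
# numbers as a lower, audible tone, which is why it has to be filtered out
# before sampling rather than afterwards.
too_high_hz = 25_000
alias_hz = apparent_frequency(too_high_hz, CD_SAMPLING_RATE_HZ)

original = sample_cosine(too_high_hz, CD_SAMPLING_RATE_HZ, 50)
alias = sample_cosine(alias_hz, CD_SAMPLING_RATE_HZ, 50)
largest_difference = max(abs(a - b) for a, b in zip(original, alias))

print(limit_hz, alias_hz, largest_difference)
```

The two sample sequences agree up to floating-point rounding, so after sampling there is no way to tell the inaudible tone from the audible one. Removing everything above the limit before sampling avoids that ambiguity.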

If you're enjoying the show, head over to Reddit and subscribe to the new Voice Tech subreddit. There you'll find other listeners of the show posting links about voice tech news and product launches, asking questions and commenting on the posts. The subreddit can be easily reached at voicetechpodcast.com/reddit, so go check it out. Right, so let's move on to the short-time Fourier transform. This is important. I know it's technical, but it is really important because it's a really common representation of frequency over time, as I understand it. Could you introduce it, please? The short-time Fourier transform is a kind of representation where you want to see what's going on in frequency through time. Yeah. So there are two reasons, actually, that you use the short-time Fourier transform. So the thing is,

for speech, it is said that speech is quasi-stationary, so it is stationary in short intervals. Which means that if you zoom into a recording, OK, you record your voice and you zoom in on a vowel, let's say [unclear], you'll see a periodic signal there. OK. A periodic signal: it's just repeating in a regular fashion. It's not a sinusoid; it's some weirdly shaped but repeating thing. OK, OK. And as you talk, you have vowels and consonants and stuff, so some part of it is periodic, some part is not. When you say "sss", it's not periodic. The thing is, some part of it is periodic and some part of it is not, so it is said that the voice is changing through time, but it can

be treated as stationary, or the period remains the same, basically, inside a short interval. Interesting. You just said something I didn't think about before, which is that a constant vowel sound like "ah" is periodic if you zoom in, but a "sss" sound is not. But to the human ear you think they're both constant sounds, both regular sounds, but at the frequency level they're not. Well, I'm not sure how it is processed in your brain, but if you try to understand the phenomenon inside the computer, of course you need to observe it. So you draw it with MATLAB or Python, whatever tool you're using, and when you zoom in you see that there is a periodicity there. OK, it's just a pattern that keeps repeating. Not a regular sinusoid, but some kind of regular pattern. So you could say

this pattern repeats multiple times. So the thing is, if you want to analyse this sound, you need to use a short interval in which the stationarity is preserved. I mean, let's say you want to find out the pitch, OK, the pitch of a certain part. But as you talk, the pitch is changing. If you take this whole conversation and analyse it as a whole, you cannot find anything, because everything is changing all the time. If you use short time intervals, then it is meaningful. Right, right. So this is the idea of the short-time Fourier transform as well. You take short frames, and you take the Fourier transform of those frames. OK. You take the Fourier transform for that frame, you analyse the frequency, and you have a frequency representation for that frame. Then you take the next frame, you do the analysis, put the analysis [unclear],

then for every frame you put another one side by side. And in the end, for each frame you have the frequency analysis, and then what you have is the frequencies observed through time. Right, right. That is called the short-time Fourier transform. And that produces this nice spectrogram; it looks like a heat map. Right. In recording technologies, in older recording studios, the sound engineers use it, because as you see the changes in the frequency, you can understand which frequencies are higher, which you can cancel, or what you should do to make a better recording. For example, you're recording a guitar and some frequency is peaking all the time. You want to know that frequency to make a better sound, and with these tools you can analyse them and see what is going on. [unclear]

So to represent that, to set an image in your mind: there's a graph with an x-axis and a y-axis. The x-axis is time, from left to right, and the y-axis is frequency. And you chop it up into little blocks in the vertical, and the blocks at the bottom would be the low frequencies and the blocks at the top would be the high frequencies. It's really nice: [unclear] use colour to represent the magnitude, or the amount of that frequency, in each of those blocks. You can see that it's a two-dimensional matrix representation. Yeah. So, I mean, if you draw it as a three-dimensional image, it's kind of a mountain. Yeah, it's like a series of bar graphs stacked along [unclear], a kind of discrete mountain. [unclear] And for every time slice, when we're analysing voice and music, how big is the time slice usually? For speech it is usually

20 milliseconds. 20 milliseconds, OK. [unclear] And for music it's like ten times more, so 200 milliseconds maybe. Ah, ten times more. So you need a much smaller window for speech. Yeah. This is the frame size. Yeah. And there is this other concept called hop size. So you take one frame. Now remember, we did the sampling, right? Now we have the samples. Yeah. OK, we take the samples from the first sample to the 100th sample. So we take the first hundred samples, we analyse them, put it on our [unclear]. Then from which sample should we start? We can start from 101 and take the next 100, or we can start from the 50th sample [unclear] and take the next 100, so that they're overlapping. Yeah, yeah. So usually they use overlapping

windows, so that there is no information loss between those frames. I get it: some information can be lost at the boundary between windows, so you want another window that captures that boundary information bang in the middle of the window. OK. All right, well, that's enough on STFT. OK. It's taken me a long time to get my head around it and it's still not crystal clear, and it's definitely worth just typing STFT into Google Images and seeing the heat map yourself; it makes a lot more sense. And I wanted to mention a couple of resources if you want to actually try some of this stuff yourself, especially to get your head round the signal processing concepts. There's one course in particular, and it's free, it's on Coursera. It's made by the École polytechnique fédérale de Lausanne and it's simply called the DSP course, digital signal

processing. I'll put the link in the show notes. The people who've made this are super passionate about digital signal processing, and it breaks down every single concept in great detail, with nice graphs and even real-world examples. It's fantastic for beginners. And I wanted to talk about a couple of ways that you can actually start programming with this stuff. You said that you used MATLAB, is that right? Yeah, actually until like two years ago I was using MATLAB all the time. OK. MATLAB is very easy to use, easy to code. The problem is you need to pay for it. Yeah. I mean, if you're in a school as a student, usually your school has some agreement with MATLAB, so they can get it and you can use it on your school computers, or you can connect through a VPN to your school and use it there. [unclear] not the latest technology. I mean, every version of MATLAB [unclear]

[unclear], you know, so it's very stable, and especially it's very good at debugging. I mean, when you write large codes, not small ones but larger code especially, you need a very good debugger to be able to solve issues. Because, I mean, the first programming language I learnt was C, and I didn't know how to use a debugger. I was using print all the time to see what was going on there. It's a nightmare, it was very painful. So if you have a very nice debugger then everything is much better. MATLAB has that. And most importantly, you can draw at every step, and you can stop at any time you want in the debugger. It has sequential operations, so you can go step by step through your code. Yeah, that's how programming works, from top to bottom, left to right. Yeah, yeah. You can stop at the point, it is

the breakpoint in the debugger, so that you can see what's going on up to that point. It's very good. And Python: Python is a very good language as well. For two years I've been using Python. It's open source, everything is available. I especially recommend Anaconda: if you install Anaconda, everything is already installed. Absolutely, yeah. So there are very nice libraries in Python, which are called NumPy and SciPy; they do everything for you. As I noticed, they've got a whole signal processing library inside SciPy that does everything you want, including the STFT. And since it's very popular now, you can find lots of tutorials and example code on the internet. Right, right, and code samples on GitHub. It's actually very easy to learn. Once you know a programming language, Python is a very easy language to learn, and if you know MATLAB then it's even

easier, because it's very close, it's very similar. It's very similar, OK. Yeah, and it's not like MATLAB is [unclear]; if you're a student you can get it for about £30, and then you have to buy the signal processing toolbox on top for another £6, but I mean, that's not going to break the bank if you really want to get into this stuff. Or download Python; you get it for free.
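
For readers who want to try the framing scheme from the conversation (frames of 100 samples, with the next frame starting at sample 50 so that consecutive frames overlap), here is a minimal short-time Fourier transform sketch in plain Python. It is a teaching sketch under stated assumptions, not a reference implementation: the helper names, the 1,000 hertz sampling rate and the two-tone test signal are all invented for this example. It uses a slow, naive Fourier transform and applies no window function, whereas practical implementations usually taper each frame with a window such as Hann. SciPy, mentioned above, provides its own STFT routines in `scipy.signal`.

```python
import cmath
import math


def frame_starts(num_samples, frame_size, hop_size):
    """Index of the first sample of every full frame."""
    return list(range(0, num_samples - frame_size + 1, hop_size))


def dft_magnitudes(frame):
    """Magnitudes of a naive discrete Fourier transform, bins 0 to N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def stft_magnitudes(signal, frame_size, hop_size):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [
        dft_magnitudes(signal[start:start + frame_size])
        for start in frame_starts(len(signal), frame_size, hop_size)
    ]


def peak_frequency(magnitudes, sampling_rate_hz, frame_size):
    """Frequency in hertz of the strongest bin in one frame."""
    peak_bin = max(range(len(magnitudes)), key=lambda k: magnitudes[k])
    return peak_bin * sampling_rate_hz / frame_size


sampling_rate_hz = 1000
# Toy signal: 150 samples of a 100 Hz tone followed by 150 samples of a 200 Hz tone.
signal = [
    math.sin(2 * math.pi * (100 if n < 150 else 200) * n / sampling_rate_hz)
    for n in range(300)
]

frame_size, hop_size = 100, 50
starts = frame_starts(len(signal), frame_size, hop_size)
spectrogram = stft_magnitudes(signal, frame_size, hop_size)
peaks = [peak_frequency(row, sampling_rate_hz, frame_size) for row in spectrogram]

print(starts)
print(peaks)
```

Each row of `spectrogram` is one time slice and each column one frequency block, which is the two-dimensional matrix pictured as a heat map earlier in the conversation. The frames that lie entirely inside one tone peak at that tone's frequency, while the middle frame straddles the change and contains both, which is the kind of boundary that overlapping frames are meant to cover.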

You've just heard part one of my conversation with Dogac Basaran, a postdoctoral researcher at CNRS, the French national scientific research centre. Now that you've got some of the fundamental signal processing concepts under your belt, you're ready for part 2 of the conversation with Dogac, which will be released shortly. We'll dive into some of the more advanced topics that use the concepts we covered today, including Dogac's research into audio fingerprinting, alignment and melody extraction, so stay tuned for that. That's all for today. I hope you enjoyed listening. As always, you can find the show notes with links to resources mentioned in the episode at voicetechpodcast.com. To find out more about Dogac and his work, you can visit his personal website, dogacbasaran.com, and his GitHub page, github.com/dogacbasaran. You can also follow me on Twitter at @voicetechcarl and sign up for the monthly newsletter at voicetechpodcast.com/newsletter. To support the show,

just tell one friend or colleague about this episode, and of course don't forget to check out the new subreddit at voicetechpodcast.com/reddit. I'll be back soon with another episode. Until then, I've been your host, Carl Robinson. Thank you for listening to the Voice Tech Podcast.
