Hum a Fingerprint, Extract a Melody – Dogac Basaran, CNRS – Voice Tech Podcast ep.009

Dogac Basaran

Episode description

This is the second part of my conversation with Dogac Basaran, a post-doctoral researcher at CNRS, the French national scientific research centre. If you missed the first part, you might want to go back and listen to the previous episode on Signal Processing Basics for Audio.

Today, in part 2 of 2, we explore Dogac’s research into audio fingerprinting, alignment, and melody extraction. By analysing the magnitude of frequency peaks and their relative spacing, Dogac shows us how it’s possible to create audio fingerprints that can be used to detect and match audio recordings, even if they contain noise or are incomplete. These fingerprints have a variety of uses, including aligning multiple recordings of a single speaker/performance, and identifying a particular recording.

We also discuss query by humming, the state-of-the-art technique that takes an audio fingerprint of a person humming a melody, and matches it to a database of music recordings. Dogac also explains why learning how to build neural networks has become an essential skill in this field.

Episode transcript

Powered by Google Cloud Speech-to-Text

Nowadays in speech, the neural networks give you the state of the art.

Welcome back to the Voice Tech Podcast. This is the show that brings you the latest research and developments in the field of voice technology through a series of entertaining, inspirational and informative conversations with voice technology experts. You'll hear about the latest products and concepts in voice, get new ideas for your own voice projects, and learn about the tools and techniques that will turn those ideas into reality. Today you'll hear the second part of my conversation with Dogac Basaran, a postdoctoral researcher at CNRS, the French national scientific research centre. As this conversation was a bit more technical than the previous episodes, I split it into two parts to make it easier to digest. If you missed the first part, you might want to go back and listen to the previous episode on signal processing basics for audio. Today, in part 2, we explore Dogac's research into audio fingerprinting, alignment and melody extraction. We also discuss query by humming, and why learning how to build neural networks has become an essential skill in the field. While you listen to the podcast, I recommend you sign up for the Voice Tech Roll-Up, the monthly newsletter for this podcast. It contains the top 5 voice news tweets of the month and links to the latest episodes, as well as other unmissable goodies such as exclusive newsletter offers. It's quick, easy and free to sign up: just go to voicetechpodcast.com/newsletter. So with that, I bring you Dogac.

So that brings us on to your specialist area now. Your area of specialisation is fingerprinting

and alignment. These two concepts we're going to cover in detail. Fingerprinting first of all: what is it, and why is it important? OK, so you know the fingerprint of a human being is unique to that human being, right? The print on your finger is unique to you, and mine is unique to me, so if you take my fingerprint you can know that it is me. That has been used in forensics for a long time. The idea here is actually the same: for each specific audio, people try to create fingerprints, called audio fingerprints, such that each one uniquely belongs to that specific audio. I mean, it is independent of its content; it is just that audio. You can have

two concert recordings of an artist, of a certain song for example. Yes, but the content is not important; it is the audio itself. So you can have 10 separate recordings of the same concert, and each of the recordings would have a different audio fingerprint? Exactly. Got it, all right. And so what do we use fingerprinting for? You use fingerprinting to be able to find the metadata of a certain song. I mean, everyone knows Shazam these days. Shazam, yeah, the most popular one, I guess. There are lots of other examples as well. SoundHound? SoundHound is also very well known. There is also the fingerprint by Philips; I was using that in my PhD, actually. OK. There are a lot of ways to create

fingerprints, but the idea here is this. Let's say you have a million songs, and someone makes you listen to a piece of a song, which is probably noisy, because usually you hear it while driving your car: there is car noise, there are people talking, right? And you want your mobile phone to listen to that sound and try to identify it. Right, OK, so it's a noisy version of that particular audio. Yeah. The thing is, even if it is noisy, when you extract fingerprints from it there are lots of matches with the original audio. That's the idea. So is it unique to the audio? The original recorded audio will have one fingerprint, and the sample that you take with your mobile phone in the car, of the music playing on the radio, will have a different fingerprint. Those two fingerprints won't be identical,

but you can match them together? OK, it's not one fingerprint. From one audio you can extract lots of fingerprints, because it depends on your definition of a fingerprint. OK. So let me explain what Shazam does; it might clarify that concept. What they do is, I mean, a very easy thing. You get the spectrogram. The spectrogram is the magnitude of the STFT; it's a very easy thing to compute. So you get the spectrogram, and in the spectrogram certain frequencies are more peaky. I mean, in your music, while you're playing something or while you're singing something, some frequencies are dominant and some frequencies are not. So these are the hot areas on the STFT graph that we described earlier, the heat map as we called it, where more heat means more red, for example? Yes, and the red parts will be preserved.
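As a rough illustration of the point about peaks surviving noise, here is a minimal sketch that computes a magnitude spectrogram and keeps the strongest peaks in each frame, once for a clean signal and once for a noisy copy. Everything in it is an assumption made for the example rather than anything described in the episode: the two-tone test signal, the frame and hop sizes, the noise level, and the function names `magnitude_spectrogram` and `strongest_peaks`. It uses only NumPy's FFT.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Return |STFT| with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def strongest_peaks(spec, sample_rate, frame_len, per_frame=2):
    """For each frame, return (frame_index, frequency_hz) of the strongest local maxima."""
    peaks = []
    for t, frame in enumerate(spec):
        # A bin is a local maximum if it is larger than both of its neighbours.
        is_peak = (frame[1:-1] > frame[:-2]) & (frame[1:-1] > frame[2:])
        bins = np.flatnonzero(is_peak) + 1
        # Keep only the per_frame largest local maxima, in ascending frequency order.
        top = bins[np.argsort(frame[bins])[::-1][:per_frame]]
        peaks.extend((t, float(b * sample_rate / frame_len)) for b in sorted(top))
    return peaks

# Hypothetical test signal: one second of two sine tones, plus a noisy copy.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 500 * t) + 0.8 * np.sin(2 * np.pi * 1500 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.standard_normal(len(clean))

clean_peaks = strongest_peaks(magnitude_spectrogram(clean), sample_rate, 1024)
noisy_peaks = strongest_peaks(magnitude_spectrogram(noisy), sample_rate, 1024)
print(clean_peaks[:4])
print(clean_peaks == noisy_peaks)
```

With this amount of noise the dominant peaks stay in the same time-frequency positions, which is the property the fingerprint relies on. Real recordings need a more careful peak picker than a per-frame maximum.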

I mean, even under very noisy conditions the peaks should be preserved. Yeah. So if you define something through these peaks, then even in the noisy version the relation won't change much. So in one frame you have a peak frequency, let's say. OK. Then in the next frames you have other peaks, and you take the time difference between those peaks and the frequency difference. The peaks being at different frequencies, right? You encode it into a 32-bit hash. What does that mean? You quantize those values. For example, let's make this easy: one frequency is 200 Hz and the other is 190 Hz, so the frequency difference is 10 Hz. Yes. And the time difference is, let's say,

0.01 seconds. So you discretize those values into binary values: say you use 10 bits to represent the 10 Hz, you use 10 bits to represent the time, and you use two more bits, for example, to represent synchronisation or something else. Then you have 22 bits for one fingerprint, and you do that over your whole representation, your spectrogram. So do you calculate a fingerprint using two frequency values, over all time steps? What they do is first find the peaks in the spectrogram. You can do this in various ways, but in the end you will have a constellation type of representation. OK. So the idea is, if you extract the constellation-like representation from the noisy version, probably they will match a lot; the constellations will match a lot. This is the idea. The rest is an efficient storage mechanism. They turn them into bits and put them in a dictionary, with, if I remember correctly, 32 bits for each fingerprint. So think about it as a dictionary where every entry is 32 bits, and you have a huge dictionary. So every 32-bit entry represents the entire track? No, 32 bits represents one fingerprint of one song, and there could be lots of fingerprints from one song. OK. So when you extract a fingerprint from the noisy version, you extract 32 bits, right, and see in the dictionary which entries it

exactly matches. Exactly matching, OK. So you extract lots of fingerprints from the noisy version as well, and you see which fingerprints are matching, and hopefully with your original fingerprints there are lots of exact matches. OK, so the bits are exactly the same. You have multiple fingerprints per file, for the original and for the one recorded in the car, and you try to match as many of them as possible, and then you produce some kind of probability that says it's probably this file? Yes. When you put them in 32 bits there could be some wrong estimations as well, right, because some of the fingerprints might match other songs as well. But the peak one, I mean the song with the most matches, will be your candidate. OK, the most likely song. And it is pretty fast. Interesting, because it sounds like you have a lot of data; you've probably got millions of files in your music database.
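The hash-and-dictionary idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not Shazam's actual implementation: the 10 + 10 + 12 bit layout, the fan-out of 3, the toy peak lists and the names `pack_hash`, `fingerprints`, `build_index` and `identify` are all made up for the example. Peaks are given as (frame index, frequency bin) pairs, pairs of nearby peaks are packed into one 32-bit integer, and the song that collects the most exact hash matches wins.

```python
from collections import Counter, defaultdict

FAN_OUT = 3  # pair each peak with the next few peaks (illustrative choice)

def pack_hash(f1, f2, dt):
    """Pack two frequency bins (10 bits each) and a frame gap (12 bits) into 32 bits."""
    return ((f1 & 0x3FF) << 22) | ((f2 & 0x3FF) << 12) | (dt & 0xFFF)

def fingerprints(peaks):
    """Yield (hash, frame_index) for pairs of peaks that are close in time."""
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + FAN_OUT]:
            yield pack_hash(f1, f2, t2 - t1), t1

def build_index(songs):
    """Map every hash to the titles of the songs that contain it."""
    index = defaultdict(list)
    for title, peaks in songs.items():
        for h, _ in fingerprints(peaks):
            index[h].append(title)
    return index

def identify(index, query_peaks):
    """Return the title with the most exact hash matches, or None."""
    votes = Counter()
    for h, _ in fingerprints(query_peaks):
        votes.update(index.get(h, []))
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical constellations: (frame_index, frequency_bin), sorted by time.
songs = {
    "song_a": [(0, 40), (1, 55), (3, 40), (4, 72), (6, 61), (7, 55), (9, 80)],
    "song_b": [(0, 33), (2, 47), (3, 90), (5, 47), (6, 20), (8, 64), (9, 33)],
}
index = build_index(songs)

# A short excerpt of song_a, shifted in time, with one spurious peak (bin 99).
query = [(0, 40), (1, 72), (2, 99), (3, 61), (4, 55)]
print(identify(index, query))
```

Because each hash stores only the gap between two peaks, the excerpt matches even though it starts at a different time. Production systems typically also check that the matched hashes agree on a single time offset, which this sketch leaves out.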

There's a huge cross-reference that has to go and check millions. A dictionary allows you to do that very fast, because you already have the dictionary; all you do is search through the dictionary. And since you're searching for an exact match, it can produce a definitive answer? Yes. So what I said was not true; this is the exact thing. You have the 32 bits, and 2 to the power of 32 means about four billion possible values, right, so that is your entry, that is your index, and at that index you see the song title. The hash is the index, right, so you only have to look at that index. Right. As well, I did my PhD on audio alignment, multiple audio alignment. Most of the state-of-the-art methods are using fingerprinting. The thing is, the idea of

multiple audio alignment is this. Assume that you're at a concert and you record it. I mean, many of the people in the audience do. One starts recording from the beginning of the song until the end, and there's another person in another place in the concert who started recording from the chorus part until the end of the next song. There are lots of people recording. These people in the audience record independently of each other; they don't know when the others started recording. You don't know the offsets, the starting points. The alignment problem is trying to align these audios with respect to each other on a timeline. OK. Let's say 10 people recorded the same song, starting from different points, and you align them. Probably they recorded them with mobile phones, with video, so they have the audio, and you align them according to the audio. Then you have the multi-perspective, both audio and video perspectives, of the same song. Then you can make a clip, a video clip, of it. Yeah, you can piece it together and have a video and audio recording of the entire event from multiple angles. Fingerprinting is also useful for this, because the source is the same, right? They are recording the same audio. Yeah. So in the end, if you extract the fingerprints from these noisy recordings, most of the fingerprints will match. It has the same ground truth, it's all coming from the stage; everyone's recording the same thing. Even then, you don't know if they're coming from the same song, because the same person might record many times, right? You don't know whether a recording is of this song or that song. So you use fingerprinting, the same technology: you create fingerprints from everything, then search for each one, one by one, and try to find the matching ones, and then use the fingerprints for alignment like that. Wonderful, OK. I should point out that when we were talking about this before, I was thinking of speech: whether we can create an audio fingerprint of our own voices and then use it to identify people. But it's not a similar idea, it doesn't work like that? Yeah. It works, I mean, if people were recording you at the same time. One example I can give: while I was working on the topic I saw a work that grabbed many videos from YouTube that were recordings of a speech by President Obama at a certain place. Every recording was grabbed from the same speech, so they reproduced the speech by aligning all those recordings. OK. So I mean, that can also be

done, but as I said, the source has to be the same here. And so you can't use it to identify people based on their voice, because the utterances that they speak are different every time? Exactly, right. I mean, there are other technologies to do that. OK. Right, let's move on to melody extraction. This is what you're working on at the moment? Yeah, actually I'm doing a postdoc on it. My main topic is dominant melody extraction on jazz performances. OK, what's that, can you describe it? It's an easy concept, actually. When you listen to a song, you listen to the main melody, right: the main vocal melody, or the main guitar solo, or whatever the main melody is. This is what I'm trying to do with the computer. I want the computer to extract, to find, the notes of the main melody, whether it's a vocal melody

or an instrument melody. Right, so the aim is to find the dominant one. In my case, yes. There is also polyphonic pitch estimation, which could be harder, but finding the dominant melody is a very hard problem in itself, and it's not a solved problem. I mean, it's not like fingerprinting: in fingerprinting there are already lots of applications and very nice fingerprints, but in melody estimation we're not there yet. It's still a very active research topic. Yeah. I mean, nowadays in signal processing, or generally in machine learning, everyone is trying neural networks, and that's also what I'm doing right now. OK. But I started with something very important, actually: I was using non-negative matrix

factorization, NMF, non-negative matrix factorization. People should search for it, because it's widely used now, in source separation especially. Yeah. The spectrogram itself is a non-negative matrix, right? OK, so the magnitude of the STFT is a non-negative matrix, and you can separate it into the multiplication of two different matrices: one is called the basis, and the other is called the activation of the basis. OK. We won't go into very much detail, but just think about it like this.

In mathematics, it is the basis idea; everything is connected with this. The Fourier transform is the same thing, right? You represent things on top of some basis representation. In your coordinate system you have x and you have y, right? Every vector can be represented with values along x and y; those are the basis components. You're breaking things down into those bases, and in the Fourier transform the bases are complex exponentials. In matrix factorization, each column vector in one matrix is a basis, and each row in your activation matrix is the activation of that basis. Say these bases each represent one feature of a guitar, for example. You learn the dictionary for the guitar, and then you know exactly when it is activated, so you can tell whether the guitar is playing right now. OK, so you can separate an audio file into the parts that are just for the guitar.
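To make the basis and activation picture concrete, here is a minimal NMF sketch using the standard multiplicative update rules for the Euclidean cost. It is an illustration only, not the system discussed in the episode: the tiny 6 x 6 "spectrogram", the two made-up note spectra, the iteration count and the function name `nmf` are all assumptions for the example.

```python
import numpy as np

def nmf(V, n_components, n_iter=1000, seed=0, eps=1e-9):
    """Factorise a non-negative matrix V (freq x time) as W @ H.

    W holds one basis spectrum per column, H one activation row per basis.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy spectrogram: two "notes" with fixed spectra, active in different frames.
note_a = np.array([0.0, 1.0, 0.0, 0.5, 0.0, 0.0])
note_b = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.7])
activity = np.array([[1, 1, 0, 0, 1, 0],
                     [0, 0, 1, 1, 1, 0]], dtype=float)
V = np.outer(note_a, activity[0]) + np.outer(note_b, activity[1])

W, H = nmf(V, n_components=2)
error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(np.round(W, 2))
print(np.round(H, 2))
print(error)
```

Each column of `W` should end up resembling one of the two note spectra (up to scale and ordering), and the matching row of `H` shows in which frames that note is active, which is the "when is the guitar playing" reading described above.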

Very basic concept. Of course it's not working that beautifully, but I was using a version of it to extract the main melody. So it was a good representation in itself, but I wanted to enhance it, so I started working on some neural network solutions on top of it. I've read many papers and learnt many things about the neural approach, mostly about convolutional neural networks and recurrent neural networks, and they are very popular now in lots of MIR tasks. Actually, MIR is music information retrieval. I mean, if you're generally interested in musical signal processing, the keyword is MIR, music information retrieval. OK. That's what you need to search for. And ISMIR now is one of the more

popular conferences that you need to search for, because most of the recent advances are presented there. OK, ISMIR. It's a very good conference; I mean, most of the good papers are coming from there. The next ISMIR conference, which is an international one, is coming to Paris. It's a nice conference: the top people in the MIR field always come there, so you can visit and see them there. OK.

If you haven't done so already, head over to voicetechpodcast.com/newsletter and leave your email address. The Voice Tech Roll-Up is a monthly newsletter that contains the top 5 voice news tweets of the month by engagement and links to the latest episodes, as well as other interesting bits and pieces such as exclusive newsletter offers. It's only one email a month, so you'll always be glad to see it arrive in your inbox. Go to voicetechpodcast.com/newsletter now and enter your email address.

On melody extraction, you mentioned to me something about humming: humming detection, humming databases. What's that all about?

You know, when you extract the melody, it is in itself a very cool thing, right, so you need no other purpose than extracting the melody. But if you are able to extract a melody well enough, then you can have lots of other applications. One thing is cover detection. Let's think about two bands playing the same song. OK. They're playing it differently, probably. I mean, if you're a musician, from a musical perspective you want to change the original. Yeah, you add your own flavour, but you keep the main melody. Yeah. If you're able to extract the main melody well enough from both songs,

then you can understand that they are the same song from the melody alone. I see, so you're looking for that common thread, the main melody that's reproduced. It can easily be used for cover detection if it is good enough. And who is interested in cover detection? Why detect covers? First of all, copyright. Normally, I mean, if you're not paying for it you're not entitled to play someone's song some evening in a bar, or as a hired DJ, whatever. People do, but they shouldn't. Yeah, OK, so the record companies are really driving this technology, because they want to wring every last penny out of it. The other thing is a version of this, actually. I mean, you have the original song and you have the melody: you are trying to remember the melody of a song, but you don't remember its name.

Humming is this: you hum it, like that, and you want something to find the original song, right? This is exactly what is called query by humming, I guess. OK, query by humming. Yeah. So until now,

I don't really know the most recent technology, but they were gathering some hummed melodies already, I mean by crowdsourcing: they create a database by having 200 people sing this melody, and if you find another person trying to hum that melody, then you try to find a representation so that they're similar. Right, that's exactly it. So if you're able to extract the main melody from original songs well enough, then the humming will be, not the same, but similar. So it's about matching the humming to the melody you extracted? When you just do the humming and then you can find the original song from a database of a million, that's amazing. As for humming databases, I think some of the existing apps have this, and probably many more do. A lot of people have tried to tackle this problem, but how they gather humming databases, I'm not sure. There must be some kind of crowdsourced system online to gather hums. It's quite difficult to get that data as well, because the audio conditions are so different; you can't control the environment. Yeah. Well, as humans, what we understand is something with a reference, usually. If you don't have perfect pitch, and I play you a D or a C and then play you another note, then you can tell me what it is,

if I give you a reference. So as humans, the way we gather the information is actually relative to the reference, so we can understand the intervals very well.
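The point about references and intervals suggests one simple way to compare a hummed query with stored melodies: throw away the absolute pitch and compare only the steps between consecutive notes. The sketch below does that with an edit distance so that a few wrong notes are tolerated. It is a toy illustration rather than how any production query-by-humming system is known to work; the melodies are written as MIDI note numbers, and the names `intervals`, `edit_distance` and `best_match` are hypothetical.

```python
def intervals(midi_notes):
    """Semitone steps between consecutive notes; unchanged by transposition."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

def edit_distance(a, b):
    """Levenshtein distance between two interval sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # drop an interval from a
                           cur[j - 1] + 1,           # insert an interval from b
                           prev[j - 1] + (x != y)))  # substitute if they differ
        prev = cur
    return prev[-1]

def best_match(hummed_notes, melodies):
    """Return the title whose interval sequence is closest to the hummed one."""
    query = intervals(hummed_notes)
    return min(melodies,
               key=lambda title: edit_distance(query, intervals(melodies[title])))

# Opening notes of three well-known tunes as MIDI note numbers.
melodies = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}

# Hummed three semitones lower than the stored version, with one wrong note.
hummed = [57, 57, 64, 64, 67, 66, 64]
print(best_match(hummed, melodies))
```

Starting on a different note changes nothing here because only the differences between notes are compared, and the single wrong note costs two interval substitutions rather than a failed match.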

So when you do the humming, you probably get the intervals well enough, even if you start from a different note. I see, yeah, right. The absolute pitch might not be right, it could be off, but the intervals are probably fine. OK, so the difference between the notes is accurate. Yes. I see, so you can hum a tune at various different pitches and it still sounds right, as long as you get the pitch changes correct. OK. For the famous melodies, I mean, even with small mistakes people recognise them. So melody extraction, or melody detection, needs to take account of that as well, and really look at the differences between the notes. And if you want to do the humming detection, query by humming, on top of it, then yeah, it's even more of a challenge, an extra complication. All right, brilliant. Well, that's been fascinating. We've covered all sorts of concepts that I hadn't really considered. Well, I hope that it was not too messy. No, I'm sure a lot of these concepts can be applied to voice technologies as well, in some way or another, in voice assistants, and as audio technology advances we'll see. What I can say is that if you're starting on speech technologies, checking out the signal processing algorithms is very good. As I said, the phase vocoder, for me, changed my life, because I could do the vocals for myself. There is also SOLA, which is an overlap-and-add method; it's one of the methods with which you can do time stretching and pitch shifting. Yeah. That's a vocoder, is that right, or something similar?

So I urge you to check out those technologies too, but nowadays in speech the neural networks give you the state of the art, so definitely continue with neural network research as well; reading things like that will help you. OK, so otherwise, if you want to build cutting-edge voice and music applications now, you need to understand neural networks and start working with them. OK, excellent. Well, that's fantastic, thank you very much once again. So what's on the horizon, then? What are you going to be focusing your energies on over the next three to five months? Well, I will be mostly working on melody extraction for jazz, specifically jazz music. Yeah. Because in my project we will have a very large dataset of jazz performances. We have the improvisation parts of many

jazz artists, like Louis Armstrong, let's say, and we are trying to extract the melodies from all those improvisations. And there's another part, which is trying to find the patterns in those melodies, so that they can say: OK, Louis Armstrong played that here, but 10 years ago someone else played that in this recording. Wow. Yeah, it's a really cool project. So I'm trying to enhance my system to give better accuracy on jazz music, and I will be working on that for the next few months. OK, good luck with that. Thank you. It is pretty fun; I'm also listening to those performances and learning lots of stuff. You know, I was not listening to jazz a lot, but now, I mean, I have to listen. You know, if you're doing research on something, the key point is to understand the data. If you understand the data well enough, then you can come up with ideas. Yeah. So I need to listen to jazz. OK. Where can people find out more about what you're working on and what you're up to? I have a GitHub page, so you can find some code there. OK, great. It's from my PhD, basically, but I'm going to put up some more from my recent work as well. You can do a quick search for Dogac Basaran. OK, as it is written? With the English letters, I guess. The c with the cedilla is special, and the s in my surname is also special; they come from the Turkish alphabet. So just use the normal Roman alphabet.

I also have a website, and you can also search for my name. OK, great. Well, I'll put everything in the show notes area, so if you want to find out more, just head to the web page for this episode. So, thanks once again. That's it, thank you. Bye bye.

You just heard the second and final part of my conversation with Dogac Basaran, a postdoctoral researcher at CNRS, the French national scientific research centre. We covered some pretty complex topics over this conversation, so don't worry if you haven't absorbed it all. The DSP material that I mentioned is a great way to get started and understand the concepts better, but as always, the best way to properly learn this stuff is to roll up your sleeves and build some projects of your own. As Python is free, there's nothing stopping you from coding something up with SciPy right now: just grab a random WAV file, generate the STFT of it, then plot the magnitude spectrogram with matplotlib. It's only a few lines of code, and you'll feel much better for it. Just a couple of bonus bits before we finish. I wanted to introduce you to MIREX 2018. MIREX is the Music Information Retrieval Evaluation eXchange, which is a competition organised by the School of Information Sciences

at the University of Illinois at Urbana-Champaign, or UIUC. They host lots of competitions in various categories, including audio fingerprinting; audio classification, identifying what sound is playing; audio melody extraction, which is what you heard Dogac is actually working on; audio cover song identification, which is one of the big research focuses at the moment; drum transcription; and also music and/or speech detection. The MIREX 2018 community will hold its annual meeting as part of the 19th International Society for Music Information Retrieval Conference, or ISMIR 2018, which you heard Dogac talking about, and in 2018 this one's going to be held in Paris, France, from the 23rd to the 27th of September. So as Dogac said, most of the best research papers on this topic come out of this conference,

so it's well worth checking out. You can find all the information on it at music-ir.org/mirex; I'll put the link in the show notes, obviously. I thought it was also important to give you some real-world examples of the tech that we just discussed, and also give you the chance to use audio fingerprinting and query by humming straight away. So I recommend that you take a look at ACRCloud, who provide automatic content recognition cloud services. Based in Beijing, China, they were the champion of the MIREX audio fingerprinting task last year, and they have one of the largest music fingerprint databases in the world, with over 50 million tracks, updated every day. ACRCloud offer a range of cross-platform SDKs, for iOS, Android, C, Python, Java, etc., and also web APIs of course, which you can integrate into your project and which let you access

the range of services. Some of the services they offer: first, music recognition with third-party ID integrations, which basically means that the ACRCloud music recognition service allows developers like you to match directly to online music services such as Spotify, Deezer, YouTube, etc., and offer direct links to your users to either play or purchase the tracks instantly. Metadata is provided as well in the response. They also offer a query by humming service, which uses the audio fingerprinting technology that Dogac just described. They offer a live lyrics synchronisation service; this basically provides a timestamp of the place that you are at in the audio, which is then matched to a timestamp of the lyrics in an app, for instance. As an example, Musixmatch is a platform for users to share and search for lyrics, and they use ACRCloud to synchronise the lyrics before returning them to their users in their app. ACRCloud have just developed and released their latest feature, which is cover song detection. This has always been one of the unsolved challenges in the industry, but ACRCloud have cracked it, and it's now available for you to use. It will identify the original song that any cover song is based on, so you can try it with one of the many covers of songs like (I Can't Get No) Satisfaction by the Rolling Stones, or Imagine by John Lennon, or Yesterday by The Beatles, which according to the Guinness Book of Records was covered 7 million times in the 20th century, with official versions ranging from Frank Sinatra to Wet Wet Wet and Boyz II Men. Anyway, ACRCloud are very well

established, with 10,000 premium clients, including some very big names that you'd recognise, and they've already got 25,000 developers registered on the platform. You can try it for free for 14 days just by going to console.acrcloud.com. Sign up, log in, and you can use the full functionality of all the services that I just described. There's just a 5,000 requests a day limit, and you get a maximum of two channels for their streaming radio monitoring service, which should be more than enough to try it out. So sign up, log in, and then you'll get all the docs and tutorials that explain how to use it, as well as pricing information if you choose to continue after the 14-day free trial. I had a really nice chat with them, and I hope to get them on the show at some point to explain the technology and the functionality in more detail. OK, so that's all for today. I hope you enjoyed listening. As always, you can find the show notes with links to resources at

voicetechpodcast.com. Once again, check out Dogac's work on his website or on GitHub, follow me on Twitter at @voicetechcarl, and sign up for the monthly newsletter at voicetechpodcast.com/newsletter. I'll be back soon with another episode, but until then, I've been your host, Carl Robinson. Thank you for listening to the Voice Tech Podcast.
