Sunday 14 April 2024

Teaching Robots to Listen with Machine Learning: Giving Pepper the Ability to Hear

This is me with Pepper the robot at the STEM for Girls Camp at IBM Think 2023. My friends at IBM and I have the unique opportunity to not only build really fun machine learning techniques, but also to demonstrate them on really cool hardware like Pepper. While we were at Think, the kids absolutely had a blast. They asked Pepper all sorts of creative questions, like "What's the meaning of life?", and Pepper was able to hear the question, know what was asked, understand it, generate a response, and even vocalize it, like so: "That's a great question. The meaning of life is different for everyone, but for me it's about learning and making a positive impact on the world."

In my opinion, that experience is just magical: getting to interact with a robot that has this custom machine learning pipeline behind it. Admittedly, this was a lot more impressive before ChatGPT came out, because ever since then everybody seems to assume we're just calling ChatGPT in the back end. But there really is a custom pipeline behind the scenes that makes Pepper so natural to interact with.

As a matter of fact, there are a couple of key challenges we face specifically when deploying to Pepper. Number one, the experience has to be incredibly natural, which means we have to be really fast and really accurate, because otherwise we break immersion. If Pepper doesn't understand what you say, or takes a long time to understand it, it doesn't feel human in a way, and that kind of defeats the purpose of using Pepper to begin with. Additionally, we have really noisy data to deal with, because Pepper doesn't always give us the greatest fidelity of data. For example, the fans in Pepper's head can add a lot of noise to the data we get from its four microphones, and with that extra noise it can be really difficult to do things like speech recognition.

If you take a minute to think about what the architecture behind an application like the one you just saw might look like, a few obvious components probably come to mind. One of them figures out what the user said: speech to text. We also need the actual conversational engine that determines how to respond to the user, and we need the part responsible for vocalizing that response: text to speech. Of all these components, I know it might seem like the most complex one to put together would be the one that determines how to respond, what it is we actually want to say to the user, because that almost requires some level of thinking. But with today's language model technology, it's actually not that hard to put together a conversational agent like the one we have on Pepper. Instead, we noticed that there is a different challenge that is really hard to solve with Pepper: figuring out what exactly a user said to begin with.

Now, I know speech recognition seems like a well-established task with pretty standard solutions these days, so why exactly was it so difficult to implement for Pepper? To understand why, let's travel back in time a bit. I've been recording YouTube videos and publishing them to this channel for years now, and a couple of years ago I got the opportunity to record a short video with my mentor, IBM Fellow John Cohn. That video was recorded at IBM InterConnect 2017, I believe, and we were talking about IoT and the Watson TJBot. Let's take a listen to a very short clip of that video:
"I'm John. I've been in IBM 35 years. I'm the chief scientist for the Watson IoT division, and I'm based in Munich, helping bring up our new headquarters there... and we're going to get him here."

The reason I wanted you to watch that clip is that I wanted you to notice the audio quality. We recorded that video in a pop-up room at a conference expo, so there was a little bit of physical isolation between us and the thousands of other people making noise. I was also using a relatively high-quality microphone, and I applied noise cancellation in post. Considering all of that, the audio quality still wasn't great.

You can imagine now extending this to Pepper. Sometimes we're willing to demonstrate Pepper at conference expos with no physical isolation between us and tons of people making noise. Pepper's microphones are also lower quality than the one I used in that video, there is no real-time noise cancellation, and there's fan noise from Pepper's head. Taking all of that into account, we're dealing with some pretty challenging input data for doing speech recognition on Pepper.

Still, modern machine learning advances mean that speech recognition on this data isn't as difficult as it used to be. OpenAI's Whisper, for example, which is one of the best open-source speech recognition models available, can compensate for this kind of noise really well. The clip you just heard of me and John could easily be transcribed by Whisper, and it would even add punctuation and a sort of stylization to the text. It's really good at it. I can also confirm, having used it with Pepper, that it's really good at listening to the data coming from Pepper's mics and telling us exactly what was said by the active voice.

Despite the fact that Whisper is really accurate when we give it audio data captured from Pepper, it has one architectural quirk that makes it particularly difficult to deploy in the specific application we want to build, which is open-ended natural conversation. That quirk is that Whisper is meant to analyze 30-second audio clips at once, using what's known as global attention. What this means is that when you feed Whisper an audio file, it actually looks at the whole thing and uses information from across the clip to transcribe it from beginning to end. Even when it's writing the very first word it hears in the audio file, it has already "heard" the rest of the audio as well. Intuitively, this can be useful for things like noise cancellation and determining what the active voice is based on context across the clip, but it still makes Whisper harder to apply here.

The way Whisper works is in contrast to the way the majority of speech recognition systems for personal assistants work. Most of them do what's known as streaming, or causal, ASR (automatic speech recognition). When speech recognition is causal or streaming, it means that as new audio data shows up, the model is actively making predictions with just that data; the availability of more data in the future will not change the predictions made on data seen in the past. That is not the case for Whisper. So we effectively have a model that is really good at telling us what you said, if we can somehow figure out when you said it. Because if we can't figure out when you said it, there is no way for us to clip the audio, and if we can't clip the audio, we don't have anything for Whisper to look at. Either we don't know when you started speaking, so we never record and never get a transcription, or we just assume you are speaking the moment the application starts and then don't know when you stopped speaking, so once again we don't know when to clip the audio and what to feed to Whisper to get a transcription.
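To make that "whole clip at once" behavior concrete, here is a minimal sketch using the open-source openai-whisper package. This is purely for illustration (the file name is hypothetical, and our actual engine runs the model through TorchScript from Rust), but it shows how the audio is padded or trimmed to a fixed 30-second window before the model ever sees it, which is exactly why we need a clean, already-clipped recording to hand to it:

```python
import whisper  # the open-source openai-whisper package

# Load a small open-source checkpoint and transcribe one pre-clipped recording.
model = whisper.load_model("base")
audio = whisper.load_audio("pepper_clip.wav")   # hypothetical, already-clipped file
audio = whisper.pad_or_trim(audio)              # Whisper always sees a fixed 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Decode the whole window at once; the model uses the entire clip as context.
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```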
Most personal assistants solve problems like this using something like a wake word. If I say "OK Google" or "Alexa", I'm priming those assistants to start listening to me. We did something really similar: "Hey Pepper, what's the meaning of life?" Me saying "Hey Pepper" is what actually causes Pepper to start listening and then transcribe what I say next.

The way we did it is with a custom technique, but without training any kind of custom machine learning model, and I actually really like that. Even though I love training machine learning models, I love this pipeline precisely because we didn't need to train any. It's a wonderful example of how a little bit of creativity with existing tools can enable you to do all sorts of things you wouldn't have thought you could do at first glance, as long as you know how the tools actually work.

You see, Whisper has an encoder and a decoder. The encoder is the part of the model responsible for "hearing" your audio and encoding it into an internal representation of what that audio means to the model. The decoder is then responsible for taking that meaning from the encoder and turning it into actual words, or tokens. The decoder works autoregressively: it can look at the entire input audio sequence, but at any given step the only thing it outputs is the next word it believes was present in the audio. In the beginning it sees no words and outputs a single one. After it outputs that word, the same word is fed back into the model so it can predict the next word, and so on and so forth. We keep doing this until the model predicts end-of-sequence, effectively saying, "I've seen all these words, I think there's nothing left," and that is where we stop what's called the decoding loop. This decoding loop is what enables us to get the actual transcription of an audio file from Whisper.

If you use Whisper this way, you've effectively implemented speech recognition for your application. However, you can also use the decoder in different ways, and the one I came up with enables a different use case. Instead of speech recognition, we can do transcript likelihood scoring: given an audio file and a transcript, what is the likelihood that the transcript is in fact the correct one for that audio file? If you can do transcript likelihood scoring, then suddenly you've basically already implemented wake word detection, because now all you have to do is stream in audio and, as it arrives, continuously check the likelihood that the audio's true transcription is something like "Hey Pepper", or whatever else your wake word is. As soon as you detect a spike, a peak, or a crossing of some threshold in the probability of the transcription that represents your wake word, you know that someone has said the wake word and you can wake up, even if that is not necessarily the number one prediction from the model.
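Here is a rough sketch of what that transcript likelihood scoring can look like. I'm using the Hugging Face Transformers implementation of Whisper and a small checkpoint purely for illustration; the model name and the function below are assumptions, not the production code, which does the equivalent through a TorchScript export called from Rust:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").eval()

def transcript_log_likelihood(audio_16k, candidate: str) -> float:
    """Score how likely `candidate` is to be the transcript of `audio_16k` (mono, 16 kHz)."""
    features = processor(audio_16k, sampling_rate=16000, return_tensors="pt").input_features
    # Tokenize the candidate transcript, including Whisper's special prefix tokens.
    token_ids = processor.tokenizer(candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        # Feed the candidate's own tokens to the decoder instead of its predictions.
        logits = model(input_features=features, decoder_input_ids=token_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # The logits at position i are the model's prediction for token i + 1, so gather
    # the log-probability the model assigned to each token we already know comes next.
    next_tokens = token_ids[:, 1:]
    per_token = log_probs[:, :-1, :].gather(2, next_tokens.unsqueeze(-1)).squeeze(-1)
    # Summing log-probabilities is the same as multiplying the probabilities together.
    return per_token.sum().item()

# Higher score means a better match, e.g. comparing
# transcript_log_likelihood(chunk, "Hey Pepper") against transcript_log_likelihood(chunk, "Hey man").
```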
The way this technique works is by kind of hijacking the decoder of the model. Normally, at every step, the decoder isn't really outputting a word; it's outputting a likelihood, across its entire vocabulary, of each token being the next one. What we generally do is choose, for example, the token with the highest probability and feed it back into the model so it can make a prediction for the next token, until it finally predicts end-of-sequence, at which point the model is saying, "I'm done making predictions, this sequence is over."

However, by hijacking the decoder, and instead of feeding back its own highest-probability output, feeding it the token I already know should come next, taken from the transcript I want to score (my candidate transcript), I can force the decoder to output, at every step, the probability of that known next token being the correct one. By multiplying these probabilities together, I get one final score that represents how likely the candidate transcript is to be the correct transcript.

For example, if I said "Hey Pepper" and I were scoring "Hey Pepper", then at the first step the word "hey" would have a high probability, and at the next step the word "Pepper" would have a high probability, so multiplying them together still gives me a relatively high score. However, if I were to say "hey man", then the first step, "hey", starts me off with a high probability, but the probability for "Pepper" after I actually said "man" is very low, and multiplying the two together brings the entire score a lot lower. Similarly, if I were to say "hello Pepper", then the score for "hey" at the first step is low, so even if the next step for "Pepper" gives a high value, multiplying them together still leaves me with a low score. Effectively, I have turned the network into something capable of scoring a candidate transcript rather than having to generate a transcript from scratch.

If this all works out, then in theory we should be good to go: the wake word detection problem is solved. Let's try it out. As you can see, I've got a program here. If I run this binary, it starts listening to me in one-second increments, watching for when I say the wake word. Right now it's in background audio mode, but right as I say "Hey Pepper, how are you doing today?", as you can see, it worked: it detected the wake word in real time.
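The background listening loop in that demo is conceptually very simple. Below is a simplified sketch of it, assuming the transcript_log_likelihood function from the earlier sketch, microphone capture via the sounddevice package, and a hand-tuned threshold and window length; the real engine does this in Rust with a proper ring buffer, but the idea is the same:

```python
import numpy as np
import sounddevice as sd  # microphone capture; an assumption for this sketch

SAMPLE_RATE = 16000
WINDOW_SECONDS = 2        # rolling window roughly long enough to contain "Hey Pepper"
WAKE_THRESHOLD = -6.0     # log-likelihood threshold; an assumed, hand-tuned value

window = np.zeros(0, dtype=np.float32)
while True:
    # Grab the next one-second chunk from the default microphone.
    chunk = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    # Keep only the most recent couple of seconds of audio.
    window = np.concatenate([window, chunk[:, 0]])[-WINDOW_SECONDS * SAMPLE_RATE:]
    # Score the rolling window against the wake phrase (function from the sketch above).
    score = transcript_log_likelihood(window, "Hey Pepper")
    if score > WAKE_THRESHOLD:
        print("Wake word detected; leaving background mode.")
        break
```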
But the program doesn't know when to stop listening, and that is our next challenge: how does it know when to stop? This is what's known as endpoint detection. What's happening is that Pepper doesn't really know when you've finished speaking, so we're unable to know when to clip the audio and send it over for a real transcription. We can't use traditional voice activity detection, which is what most personal assistants do, because it is based on the frequency range of human voices, and because of that, it wouldn't work in a noisy environment where the noise is other people's voices, like a conference expo. Instead, we need to do something a little more creative, and my solution to this problem was to use speaker embeddings.

You see, there are certain kinds of models trained specifically to take clips of people speaking and embed them into a space that represents what the active speaker's voice sounds like. Effectively, such a model maps a person's voice to a point in a high-dimensional vector space, where points that are close together probably originated from audio clips with the same speaker, because, of course, they sound similar. In a way, you're mathematically fingerprinting what someone's voice sounds like.

You might be able to see where this is going. Because of the wake word detection, we have a guaranteed audio clip from the person who is actively speaking to Pepper, actively engaging with Pepper: specifically, they would have just said "Hey Pepper". And because we can detect that audio in real time, we immediately know we have an audio clip from the correct active speaker. So what if we fingerprinted that clip? Then all we have to do is watch the audio stream as it continues to come in from Pepper, and as soon as we detect that the fingerprint is no longer present for some period of time, we can assume that whoever was speaking to Pepper has finished their utterance, and we're good to clip the audio and send it over to Whisper. That's exactly what we do for Pepper.

The core engine of this entire system is written in Rust so that it's performant and runs in real time. It can accept audio from any input source, for example from Pepper or even from my Mac's microphone, so I can test it out right now. As a matter of fact, we're running Whisper and SpeechBrain's speaker embedding models locally: I've compiled them to TorchScript, and I'm calling them from the Torch C API bindings for Rust. Once that's all in place, and once we've used all those components to detect the audio clip of when someone starts and stops speaking, that clip is fed into OpenAI's Whisper API, which runs it through a much larger version of the Whisper model to give me the final transcription, making it as accurate as it can be.

But enough talking; let's actually take a look at the system in action. I'm going to run the same binary as last time (I've made a couple of modifications and recompiled). As you can see, we are immediately in background mode, but theoretically I should be able to say the wake word and an utterance, and it should transcribe it. Let's try it out: "Hey Pepper, how are you doing today?" As you can see, it was able to detect the start, it was able to detect the endpoint, it was printing out the cosine similarity of the voice (the cosine distance, rather) in real time, and it even gave me the transcript. As a matter of fact, I can continue to speak right now and it doesn't get confused: it isn't detecting the wake word, it isn't exiting background mode, so none of this is being recorded or sent over for real transcription. But once again, I could immediately say something like "Hey Pepper, could you tell me what the weather is like in Germany right now?", and just like that, as long as I pause long enough for it to detect an endpoint, it gives me the transcription of what I said. Now, in the real world, Pepper wouldn't be able to actually give me the weather in Germany, because it's based on a language model and it's not retrieving that information, but that's something we could always extend it to do in the future. That's a demo of the core speech engine that goes behind Pepper today.
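The cosine similarity values the demo prints come from exactly the fingerprint comparison described above. Here is a rough sketch of that comparison using SpeechBrain's pretrained ECAPA-TDNN speaker embedding model; the model identifier and the similarity threshold are illustrative assumptions, and the production engine calls an equivalent TorchScript export from Rust rather than this Python code:

```python
import torch
# Depending on your SpeechBrain version, this may live under speechbrain.inference instead.
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker embedding model (identifier assumed for illustration).
spk_model = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Map a mono 16 kHz waveform of shape (1, samples) to a speaker embedding vector."""
    return spk_model.encode_batch(waveform_16k).squeeze()

def same_speaker(reference_emb: torch.Tensor, chunk_wav: torch.Tensor, threshold: float = 0.55) -> bool:
    """True if the chunk still sounds like the person who said the wake word."""
    similarity = torch.nn.functional.cosine_similarity(reference_emb, embed(chunk_wav), dim=0)
    return similarity.item() > threshold  # the threshold is an assumed, hand-tuned value

# reference_emb = embed(wake_word_clip)   # fingerprint captured at wake-word time
# Once same_speaker(...) has been False for a short stretch of chunks, treat the
# utterance as finished, clip the audio, and hand the clip to Whisper.
```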
Now, I think this is a really creative combination of different machine learning techniques that weren't really even meant to do this specifically, but using them in this sort of architecture enables us to build something really cool, and this is why I say code is an art. While the implementation was certainly not straightforward (you can see what I mean in the GitHub repo linked in the description below), it wasn't the implementation that was the most difficult part. It was coming up with what to do in the first place. That is what took weeks of iterating through different designs, architectures, and techniques, and that is what really matters at the end of the day.

Thank you very much for joining today; I hope you enjoyed it. If you have any suggestions, feedback, or questions, leave them in the comments section below and we'll be happy to get back to you. Once again, thanks for joining. Goodbye!
