Have you ever wanted to change Alexa’s voice? You’re not alone. Unfortunately, the official methods leave a little something to be desired: changing Alexa’s default accent or language will get you a voice that sounds a little different, but it’s still unmistakably Alexa. You can’t decide you’d rather talk to, say, J.A.R.V.I.S.
This is a reasonable choice on Amazon’s part. They’re trying to create a specific “personality” for Alexa, and they can’t have people mucking about with the brand they’re building. Once you realize that, though, there’s a very natural question to ask if you’re a voice developer: What about the brand that I want to build?
What about that, indeed. Why should your voice experience have to sound like it’s provided by Amazon? This is where I finally have some good news for you: it doesn’t!
How to Do It
Rather than making you read a couple of sections of exposition before I get to the details, I’m going to start with step-by-step instructions for a working demo of a custom voice in an Alexa skill. This might not be great storytelling, but it’ll get a prototype up and running quickly. After that, if you’re curious about what’s really going on, read on past the instructions for the full story.
Setting Up the Skill
- Log in to the Alexa developer console.
- On the “Skills” tab (which is selected by default at the time of writing), click Create Skill.
- On the “Create a new skill” screen:
- Enter a name for your skill. Any name will do. Given the content of the skill we’ll be making, I recommend a playful fair-use take on Stuart Smalley.
- Under “Choose a model…” select “Custom” (selected by default).
- Under “Choose a method…” select “Alexa-hosted (Python)”.
- Click the “Create skill” button (you may have to scroll up to see it).
- On the next screen (“Choose a template…”), click the “Import skill” button.
- In the “Import skill” box, enter the URL of a sample repository we’ve set up just for this purpose: https://github.com/spokestack/alexa-custom-tts.
- Click “Import”.
Once you’ve clicked “Import”, Amazon will take care of copying over the code and creating a new sandbox for your skill to run in. When the import completes:
- Click on the “Code” tab to finish setup. This will open the lambda_function.py file in a code editor.
- Look for the “Customize your skill here!” section and customize it at will.
- The only things you need to change are SPOKESTACK_CLIENT_ID and SPOKESTACK_CLIENT_SECRET, replacing the default values with the credentials from your Spokestack account.
- When you’re finished making changes, click “Save” at the top of the page.
- Click “Deploy” (next to “Save“).
- Click over to the “Test” tab while you’re waiting for the deployment to finish.
- In the dropdown next to “Test is disabled for this skill” (at the top of the page), you’ll want to select “Development”. This will let you test your skill directly on the page or on any Alexa-enabled devices connected to the account you used to create this skill.
That’s it! When you invoke your skill, you should hear a dynamically generated response reading the text at the top of lambda_function.py.
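To make the customization step concrete, here’s a rough sketch of what that section of lambda_function.py might look like. Only SPOKESTACK_CLIENT_ID and SPOKESTACK_CLIENT_SECRET are named in the instructions above; the other names and values here are illustrative assumptions, not the repository’s actual contents.

```python
# Illustrative sketch of a "Customize your skill here!" section.
# Replace these placeholders with the credentials from your account.
SPOKESTACK_CLIENT_ID = "your-client-id"
SPOKESTACK_CLIENT_SECRET = "your-client-secret"

# Hypothetical: the text the custom voice reads when the skill is invoked.
WELCOME_TEXT = "I'm good enough, I'm smart enough, and doggone it, people like me."


def credentials_configured() -> bool:
    """Return True once the placeholder credentials have been replaced."""
    placeholders = {"your-client-id", "your-client-secret"}
    return (SPOKESTACK_CLIENT_ID not in placeholders
            and SPOKESTACK_CLIENT_SECRET not in placeholders)
```

A guard like credentials_configured() is just one way to fail fast with a helpful error instead of a cryptic API rejection when the credentials haven’t been set yet.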
So what’s the big deal here? Why aren’t custom voices on skills more common? Read on for a bit of background.
Peeking Behind the Curtain
Let’s start by talking about the method we’re using to skip having our response read by Alexa. Instead of handing Alexa plain text to speak, we generate the audio ourselves and use SSML markup to include it in the response via an audio element.
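As a minimal sketch of that wrapping (the URL is a placeholder, and the helper name is my own, not part of any SDK):

```python
def audio_ssml(audio_url: str) -> str:
    """Wrap a pre-generated audio URL in SSML so the device plays it
    instead of reading text in the default voice."""
    return f'<speak><audio src="{audio_url}"/></speak>'


# A placeholder URL; in the real skill this would point at the
# synthesized audio for the current response.
ssml = audio_ssml("https://example.com/tts/response.mp3")
```

The resulting string is what gets returned as the skill’s output speech, with the SSML type set so it’s interpreted as markup rather than plain text.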
To see why, let’s first go over some basic terminology. The process that turns the text you provide to Amazon (or Google, or any other smart speaker/voice platform) into audio is known as speech synthesis, or text to speech (TTS). The history of TTS technology is fascinating and goes back further than you might think. Most modern systems are powered by neural networks, which today do everything from guiding self-driving cars to generating April Fools’ pranks. In order to train a neural network to synthesize speech, you need a large collection of text and speech audio that matches it.
Oh, and lots of computing power.
Of course, Amazon and Google have all these things at their disposal, but so far they haven’t been interested in creating tools to make them accessible to us humble developers. What they have done is set rules on the format of audio that can be played in the audio element we used above; Amazon documents its requirements in the Alexa SSML reference. So that’s another thing to think about when you’re looking for a TTS system.
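As a quick illustration, you could encode those constraints in a small pre-flight check. The limits shown here (MP3, 48 kbps, a 24,000 Hz sample rate, and at most 240 seconds of audio per response) reflect Amazon’s documentation at the time of writing; verify them against the current SSML reference before relying on them.

```python
def meets_alexa_audio_rules(codec: str, bitrate_kbps: int,
                            sample_rate_hz: int, duration_s: float) -> bool:
    """Check a clip against Amazon's documented limits for SSML audio
    (as of this writing): MP3, 48 kbps, 24,000 Hz, <= 240 seconds."""
    return (codec == "mp3"
            and bitrate_kbps == 48
            and sample_rate_hz == 24000
            and duration_s <= 240)
```

Running a check like this before returning a response is cheaper than discovering a format mismatch through a silent playback failure on a device.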
Finally, we have to consider latency, or how long it takes for your model (or service) to produce audio from the text it receives. Your users will get upset if they have to wait seconds for their response, and if you take too long to send audio, the smart speakers will throw an error before they even play it. Ideally you want a system that generates an audio URL immediately and “streams” the synthesizer’s results, producing audio faster than it takes to play and delivering that audio a small chunk at a time as it’s produced so there’s no noticeable delay.
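To make the streaming idea concrete, here’s a toy sketch (not Spokestack’s implementation): a generator that hands each audio chunk to the player as soon as it’s synthesized, so playback can begin before the whole clip exists.

```python
import time
from typing import Iterable, Iterator


def stream_chunks(chunks: Iterable[bytes],
                  synth_delay: float = 0.0) -> Iterator[bytes]:
    """Simulate a streaming synthesizer: yield each audio chunk as soon
    as it's 'produced' instead of waiting for the whole clip."""
    for chunk in chunks:
        time.sleep(synth_delay)  # stand-in for per-chunk synthesis time
        yield chunk


# The player can start as soon as the first chunk arrives, so perceived
# latency is the time to the first chunk, not the whole synthesis run.
first_chunk = next(stream_chunks([b"chunk0", b"chunk1", b"chunk2"]))
```

As long as each chunk is synthesized faster than it takes to play, the player never runs dry and the user hears no gap, which is what “faster than real time” buys you.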
Luckily, Spokestack has done that work already, along with the crucial last step of keeping the trained model running and available 24/7. Their TTS API provides faster-than-real-time streaming audio that conforms to Amazon’s requirements, allowing an Alexa skill to use a completely dynamic custom voice (or more than one!). You can even use a subset of SSML or Speech Markdown to customize pronunciation, just like you can for the default smart speaker voices.
Spokestack’s free tier lets you take the service for a spin to gauge voice quality and response time. If you upgrade to the Maker tier, you can replace Spokestack’s free voice with a completely custom voice that you create yourself with their easy-to-use web interface. That’s a step that can set your experience apart from those built on off-the-shelf voices from the big vendors.
Moving Into Mobile
If you decide to take your voice experience beyond smart speakers and into an app, Spokestack has mobile libraries that will let you use the same custom voices for synthesis, as well as the wake word, speech recognition, and natural language understanding technology that the smart speakers handle for you. There’s even a Python library so you can take things full-circle and create your own smart speaker if you want, using a Raspberry Pi or similar device.
As voice experiences become more common, there’s no reason users should be locked in to hearing the same voice answering every question. The sound of an app’s voice should be a point of connection and familiarity just like its UX is. With a service like Spokestack, that’s finally possible.