How to: perform Text-To-Speech (TTS) with a Microsoft Teams bot using Bing Speech API and Teams Calls & Meetings API
Introduction
Text-to-Speech (TTS) is the ability of a system to convert a string into an audio file. This can be very useful when working with audio-based bots, such as when creating IVR solutions or other automated workflows that involve a system ‘talking’ to a user.
In Skype for Business, UCMA provided TTS capabilities, enabling a UCMA bot to dynamically ‘speak’ to a user.
Today, the Calls & Meeting API for Microsoft Teams does not include the ability to perform TTS. However, using the abilities it does have, plus the Bing Speech API, we can recreate the same functionality, enabling us to create similarly rich solutions in Microsoft Teams.
The Problem
The newly released Calls & Meeting API includes the ability for bots to play media over audio to listening humans. The media must already exist though – the input to the API is the file location of a previously recorded WAV file.
This is done via the PlayPrompt call – either via a RESTful API call or via the C# wrapper.
I’m going to concentrate on the C# wrapper, but most of what I’m going to say applies equally to the RESTful API call, as it takes the same input.
The PlayPromptAsync method takes (oddly) a list of media to play, probably because it has been built from the JSON of the HTTP call, which has an array of prompts:
ICall call = this.Client.Calls["*id of the call*"];
await call.PlayPromptAsync(*A list of media prompts to play*).ConfigureAwait(false);
Each item in that list is a POCO containing a URI, which is the location of the file to play.
This is all fine, but what if we want to do TTS – and dynamically specify what we want to say? We can’t use pre-created audio files in this case.
Intro to Microsoft Bing Speech API
The Microsoft Bing Speech API is a simple API that nicely abstracts the whole process of performing TTS. This is a really brief summary of how it works, so that when we go through the code you’ll know what’s happening.
It’s part of Azure Cognitive Services and is free to try for 7 days with a trial key. After that, you can create a key in Azure with varying pricing structures.
To get started, go to azure.microsoft.com/en-us/try/cognitive-services.
If you already have an Azure account and just want to jump straight to creating a key there, create a new Bing Speech resource, then go to Keys to copy your subscription key.
Once you have a subscription key, you have to exchange it for a Bearer token. You do this by POSTing to https://api.cognitive.microsoft.com/sts/v1.0/issueToken with your subscription key set in a header named Ocp-Apim-Subscription-Key. The response is your Bearer token.
POST /sts/v1.0/issueToken HTTP/1.1
Host: api.cognitive.microsoft.com
Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY
Cache-Control: no-cache
Once you have your Bearer token, you can POST to https://speech.platform.bing.com/synthesize, passing the token in the Authorization header. The body of the POST is Speech Synthesis Markup Language (SSML), an XML-based markup language for defining how TTS should be performed. Here’s an example of some SSML:
<speak version='1.0' xml:lang='en-US'>
  <voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>
    Hello, can you hear me?!
  </voice>
</speak>
Here’s an example of that POST:
POST /synthesize HTTP/1.1
Host: speech.platform.bing.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: riff-16khz-16bit-mono-pcm
User-Agent: Your App Name
Authorization: Bearer ey..vv8

<speak version='1.0' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>Hello, can you hear me?!</voice></speak>
The response to this POST is a stream containing a WAV audio file (the exact format depends on a header you can set; details later).
Bing Speech API + Teams API
The tricky part here is that the Teams Calls & Meeting API will only accept a pre-recorded audio file, but we want to generate one dynamically. The way I’ve worked around this is to create a class which calls the Bing Speech API, writes the returned stream to a WAV file on the server, then returns the location of that file so that the Teams API can fetch it: a sort of just-in-time file generation. The Bing Speech API is pretty fast, so that can all happen in real time without the user noticing.
Disclaimer: this is the point at which I should warn you, this is all sample code. You need to have a good think about whether this approach will work for you. It’s going to generate a new file on your server every time you call it, which may have an impact for you. There are definitely housekeeping things you can do in order to make this work at scale, but I’m not covering them here. At your own risk!
Here’s how I’ve done it. There’s a single public method which takes in the text message to be converted to speech. It returns the path of the generated WAV file, relative to the web root:
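// Requires: using System; using System.IO; using System.Net.Http; using System.Threading.Tasks;
public async Task<string> GenerateTTSFile(string message)
{
    // Exchange the subscription key for a bearer token, then request the audio.
    var accessToken = await GetAccessToken(YOUR_BING_SPEECH_API_KEY);
    var fileName = Guid.NewGuid() + ".wav";

    using (var ttsStream = await GetTTS(accessToken, message))
    using (var file = File.Create(Path.Combine("wwwroot", "audio", fileName)))
    {
        // Copy the buffered response into the file from the start.
        ttsStream.Seek(0, SeekOrigin.Begin);
        await ttsStream.CopyToAsync(file);
    }

    // Path relative to the web root, used later to build the public URI.
    return "audio/" + fileName;
}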
This method obtains an access token, calls the Bing Speech API, and then processes the resulting stream into a file, before returning the file location.
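Because every call leaves a new file behind in wwwroot/audio, some form of cleanup is likely to be needed sooner or later. The following is only a minimal sketch of one possible housekeeping step, not part of the original sample: TtsCleanup and DeleteOldFiles are hypothetical names, and deleting by file age is an assumption about what would suit your deployment.

```csharp
using System;
using System.IO;

public static class TtsCleanup
{
    // Hypothetical helper: deletes generated WAV files whose last write time
    // is older than maxAge, and returns how many files were removed.
    public static int DeleteOldFiles(string audioDirectory, TimeSpan maxAge)
    {
        int deleted = 0;
        DateTime cutoff = DateTime.UtcNow - maxAge;

        // GetFiles takes a snapshot, so deleting inside the loop is safe.
        foreach (string path in Directory.GetFiles(audioDirectory, "*.wav"))
        {
            if (File.GetLastWriteTimeUtc(path) < cutoff)
            {
                File.Delete(path);
                deleted++;
            }
        }

        return deleted;
    }
}
```

Something like TtsCleanup.DeleteOldFiles("wwwroot/audio", TimeSpan.FromHours(1)) could run on a timer. The age threshold has to be comfortably longer than the time it takes for a prompt to be fetched and played.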
You’ll notice that I’m generating a unique filename each time. One optimisation I thought about, but didn’t include, was hashing the message string, and avoiding creating the same speech file multiple times, if it was likely that the same message would need to be played repeatedly.
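The hashing idea could look something like the sketch below. It is not part of the original sample: TtsFileNames and ForMessage are hypothetical names, and SHA-256 is just one reasonable choice of hash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class TtsFileNames
{
    // Hypothetical helper: derives a stable file name from the message text,
    // so the same message always maps to the same WAV file.
    public static string ForMessage(string message)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(message));

            // Hex-encode the hash so it is safe in both a file name and a URL.
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return hex + ".wav";
        }
    }
}
```

GenerateTTSFile could then check File.Exists for that name under wwwroot/audio and skip the Bing Speech call when the file is already there. If the voice or output format can vary, they would need to be part of the hashed input too.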
The GetAccessToken method is a simple POST call, passing the subscription key and returning the bearer token:
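private async Task<string> GetAccessToken(string subscriptionKey)
{
    HttpClient client = new HttpClient();
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);

    // The POST has no body; the subscription key travels in the header.
    var response = await client.PostAsync("https://api.cognitive.microsoft.com/sts/v1.0/issueToken", null);
    response.EnsureSuccessStatusCode(); // fail fast on a rejected key

    // The response body is the bearer token itself.
    return await response.Content.ReadAsStringAsync();
}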
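GenerateTTSFile requests a fresh token on every call. Assuming an issued token stays valid for several minutes (the API reference is the place to confirm the exact lifetime), a small cache would avoid the extra round trip. This is a sketch only: CachedToken is a hypothetical class, and the delegate passed to it would wrap GetAccessToken.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CachedToken
{
    private readonly Func<Task<string>> fetchToken;
    private readonly TimeSpan lifetime;
    private readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);
    private string token;
    private DateTime expiresUtc = DateTime.MinValue;

    // lifetime is how long a fetched token is reused; it is an assumption
    // supplied by the caller and should be shorter than the real token expiry.
    public CachedToken(Func<Task<string>> fetchToken, TimeSpan lifetime)
    {
        this.fetchToken = fetchToken;
        this.lifetime = lifetime;
    }

    public async Task<string> GetAsync()
    {
        // Serialise callers so only one of them refreshes the token.
        await gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (token == null || DateTime.UtcNow >= expiresUtc)
            {
                token = await fetchToken().ConfigureAwait(false);
                expiresUtc = DateTime.UtcNow + lifetime;
            }
            return token;
        }
        finally
        {
            gate.Release();
        }
    }
}
```

A single shared instance, for example new CachedToken(() => GetAccessToken(YOUR_BING_SPEECH_API_KEY), TimeSpan.FromMinutes(9)), could then replace the direct GetAccessToken call. The nine-minute figure is an assumed value, not one taken from the sample.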
The GetTTS method does the actual call to the Bing Speech API:
private async Task<Stream> GetTTS(string accessToken, string message)
{
    // Escape the message so characters such as & or < cannot break the SSML.
    var xmlMessage = string.Format("<speak version='1.0' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'><prosody rate='+0.00%'>{0}</prosody></voice></speak>", System.Security.SecurityElement.Escape(message));

    HttpClient client = new HttpClient();
    HttpContent content = new StringContent(xmlMessage);
    content.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/ssml+xml");
    client.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");
    client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", accessToken);
    client.DefaultRequestHeaders.Add("User-Agent", "Your App Name");

    var response = await client.PostAsync("https://speech.platform.bing.com/synthesize", content);
    response.EnsureSuccessStatusCode(); // surface API errors instead of saving them as audio
    return await response.Content.ReadAsStreamAsync();
}
There are a few things to note:
- I’m hardcoding the style of the speech in the SSML, i.e. the gender, the voice name, etc. These are all changeable, and SSML also allows for changing the rate, pitch, style, etc. of the speech. There’s a hard limit of 1,024 characters for the entire SSML string though (including tags). The full SSML syntax is in the W3C specification.
- You’ll see that I’ve added an X-Microsoft-OutputFormat header. This tells the API what format to return the WAV file in. There are various kinds, but riff-16khz-16bit-mono-pcm is the only one (I’ve found) which works with the Teams Calls & Meeting API.
- I’ve also added a User-Agent header. This is required and must be less than 255 characters.
- For full details about this API call, see the Bing Text to Speech API reference.
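The SSML points above can be sketched as a small builder that makes the voice configurable, escapes the message and enforces the 1,024-character limit before any request is sent. SsmlBuilder and its parameters are hypothetical names for illustration; the defaults simply mirror the values hardcoded in GetTTS.

```csharp
using System;
using System.Security;

public static class SsmlBuilder
{
    // Limit quoted above for the whole SSML string, tags included.
    public const int MaxSsmlLength = 1024;

    // Hypothetical helper: builds the SSML document for one message.
    // voiceName, language and gender are trusted, developer-supplied values.
    public static string Build(
        string message,
        string voiceName = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)",
        string language = "en-US",
        string gender = "Female")
    {
        // Escape &, <, >, " and ' so the message cannot break the XML.
        string safeMessage = SecurityElement.Escape(message);

        string ssml = string.Format(
            "<speak version='1.0' xml:lang='{0}'><voice xml:lang='{0}' xml:gender='{1}' name='{2}'>{3}</voice></speak>",
            language, gender, voiceName, safeMessage);

        if (ssml.Length > MaxSsmlLength)
        {
            throw new ArgumentException(
                "SSML is " + ssml.Length + " characters; the limit is " + MaxSsmlLength + ".",
                nameof(message));
        }

        return ssml;
    }
}
```

Checking the length locally means an over-long message fails with a clear error rather than with whatever the API returns. Longer text would have to be split into several prompts.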
Once we have generated the file and returned its location, playing it through the Calls & Meeting API works just like playing any other file. We just need to create a new MediaInfo object to contain the file information, wrap it in a MediaPrompt, and add that to the list of MediaPrompt objects:
var audioBaseUri = new Uri("http://public_path_to_url_base/");
var tts = new TTS();
var fileName = await tts.GenerateTTSFile(message);

// Point the prompt at the publicly reachable URI of the generated file.
var ttsMedia = new MediaInfo
{
    Uri = new Uri(audioBaseUri, fileName).ToString(),
    ResourceId = Guid.NewGuid().ToString(),
};
var ttsMediaPrompt = new MediaPrompt() { MediaInfo = ttsMedia, Loop = 1 };

await this.Call.PlayPromptAsync(new List<MediaPrompt> { ttsMediaPrompt }).ConfigureAwait(false);
Just one gotcha here. Remember that this call to play the audio is converted into an HTTP call and sent to Microsoft to be processed remotely. That means the URI of your newly created audio file needs to be publicly accessible so that the API can fetch it. A local file path isn’t going to work, and neither will a localhost URL unless you expose it through a tunnel such as ngrok.
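A cheap guard against the most obvious mistakes is to validate the URI before handing it to the API. This is a minimal sketch with hypothetical names (AudioUriCheck, BuildPublicUri). It can only rule out non-HTTP and loopback addresses; it cannot prove the file is actually reachable from the internet.

```csharp
using System;

public static class AudioUriCheck
{
    // Hypothetical helper: combines the public base URI with the relative
    // file path and rejects results a remote service clearly could not fetch.
    public static Uri BuildPublicUri(Uri audioBaseUri, string relativePath)
    {
        var uri = new Uri(audioBaseUri, relativePath);

        bool isHttp = uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
        if (!isHttp || uri.IsLoopback)
        {
            throw new ArgumentException(
                "Audio URI must be an http(s) address reachable from outside this machine: " + uri,
                nameof(audioBaseUri));
        }

        return uri;
    }
}
```

Note that the base URI needs its trailing slash. Combining https://example.com/bot/ with audio/a.wav gives https://example.com/bot/audio/a.wav, whereas https://example.com/bot (no slash) gives https://example.com/audio/a.wav, because the last path segment is replaced.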
Conclusion
And that’s it! That’s how you can bring dynamic TTS to the Teams Calls & Meeting API, even though it’s not natively supported. It’s not too complicated, is fast enough to be used in real time and seems to work well.