Building a Speech Transcription and Translation Bot in Skype for Business using UCMA and Microsoft Translator API
Ever since I first saw the Presentation Translator about a month ago I wondered if it was possible to use the technology behind it in Skype for Business calls. Having both a transcript of a conversation and optional real-time translation would be huge, and a genuine advantage to businesses everywhere.
It turns out that it is possible (watch with sound) 🙂
There’s a full explanation below, but here are the big takeaways:
- I’m open-sourcing the code on GitHub here. Read the disclaimer about the code before using it – it’s NOT production ready!
- If you’re looking for something that IS production ready, or you’re interested in having something like this that’s supported, contact me to talk about options with my employer.
How it works
How it works – simple:
What you’re seeing in the video is me having an audio call with a Skype for Business bot. The bot shows in my contact list like a person but is connected to a program running on a server using an API called UCMA. The bot is listening to the audio and passing it through to the Microsoft Translator API, which is an online service hosted by Microsoft. You can send the Translator API audio, and it will send back both the transcript of what you’re saying and any translation you require. (It can optionally also send back audio of the translated voice, but I’m not using that here.)
How it works – technical:
UCMA can capture the audio of a call, but only in a very limited sense: the only supported implementation is to write the audio to a WMA file. The Translator API is exposed over a Web Socket, to which you send chunked WAV audio. However, because the WMA file is written in real time (the file starts at zero and grows during the call; it isn’t written all in one go at the end), I can write to it and read from it simultaneously. That gives me a stream of audio, which I pipe into a converter to get it into WAV, then chunk it up into pieces and send it to the Translator API. As the results come back I write them into the IM channel of the call.
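To make the “write and read at the same time” idea concrete, here is a minimal, language-neutral sketch in Python (the actual project is .NET, and this is not its code). It follows a file that another writer is still appending to and yields fixed-size chunks. The names `follow_chunks` and `demo`, the chunk size and the polling interval are all assumptions for illustration; the WMA-to-WAV conversion and the Web Socket send are left out.

```python
import os
import tempfile
import threading
import time
from typing import Callable, Iterator, List


def follow_chunks(path: str, chunk_size: int, is_finished: Callable[[], bool],
                  poll_interval: float = 0.01) -> Iterator[bytes]:
    """Yield fixed-size chunks from a file while another writer appends to it."""
    buffer = b""
    with open(path, "rb") as f:
        while True:
            # Sample the "writer is done" flag BEFORE reading, so an empty
            # read after that point really does mean there is nothing left.
            finished = is_finished()
            data = f.read(chunk_size)
            if data:
                buffer += data
                while len(buffer) >= chunk_size:
                    yield buffer[:chunk_size]
                    buffer = buffer[chunk_size:]
                continue
            if finished:
                if buffer:
                    yield buffer  # flush the final, shorter chunk
                return
            time.sleep(poll_interval)  # nothing new yet; wait for the writer


def demo() -> List[bytes]:
    """Simulate a recorder growing a file while a reader chunks it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "call_audio.bin")  # stand-in for the WMA file
        open(path, "wb").close()
        done = threading.Event()

        def writer() -> None:
            with open(path, "ab") as out:
                for i in range(5):
                    out.write(bytes([i]) * 100)  # 5 writes of 100 bytes each
                    out.flush()
                    time.sleep(0.01)
            done.set()

        thread = threading.Thread(target=writer)
        thread.start()
        chunks = list(follow_chunks(path, 128, done.is_set))
        thread.join()
        return chunks


if __name__ == "__main__":
    print([len(chunk) for chunk in demo()])
```

In a real pipeline, each yielded chunk would go into the converter and then onto a send queue rather than into a list.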
Limitations Today
There are several limitations with the code right now. I hope to iron them out over time but if you’re going to take this and do something with it then you should be aware of them. Right now, the input language is English, and the translation is German. That’s easily changed in the code by modifying the API endpoint (which takes all those options as parameters) but it would be better if the user could ‘tell’ the bot to change languages in real-time. There are also lots of opportunities to do something better with the transcript after the call, for instance, to write it to a central store, or email it to all participants etc.
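Since the languages are just parameters on the endpoint, switching them comes down to rebuilding the URL. The sketch below (Python, not the project’s .NET code) shows the idea. The host is a deliberate placeholder, and the parameter names `from` and `to` are assumptions; check the Translator API documentation for the real endpoint and options.

```python
from urllib.parse import urlencode

# Hypothetical placeholder host; substitute the real Translator endpoint.
BASE_URL = "wss://translator.example.invalid/speech/translate"


def build_endpoint(source_lang: str, target_lang: str,
                   base_url: str = BASE_URL) -> str:
    """Build a Web Socket URL with the language pair as query parameters.

    The parameter names "from" and "to" are assumptions for illustration.
    """
    query = urlencode({"from": source_lang, "to": target_lang})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # English speech in, German text out (the sample's current defaults).
    print(build_endpoint("en-US", "de"))
```

Because the language pair is fixed when the connection is opened, changing languages mid-call would presumably mean closing the socket and reconnecting with a new URL.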
There are some limitations with the technology being used as well. Because this uses UCMA, it only works with Skype for Business installed on-premise; it won’t work with Skype for Business Online. There are definitely ways around that if you wanted to make it work and were prepared to put in the effort (see end of post for contact info), but currently the cloud-based versions of these APIs don’t really support this.
Possibilities
Just this basic proof-of-concept does a good job of highlighting what’s possible with technology today. Having a bot you can pull into conversations and translate between people of different languages has huge implications for worldwide trade and partnerships. It questions the way that English is a default language for many conversations and opens up partnerships which would previously have been impossible. For compliance-heavy sectors having a written, searchable transcript of what was said might be invaluable. For everyone else, it might just be a really useful way to quickly find what was said in previous conversations.
Another interesting use case is when joining a meeting remotely, using a bad or unreliable network. Your network may mean that the audio is missing or not understandable, but the IM transcript is more likely to make it to your device unscathed as the payload is much smaller. Rather than missing important sections of conversation, you can use the transcript to fill in the gaps.
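A back-of-the-envelope calculation shows why the transcript is so much lighter than the audio. Every number in this Python sketch is an illustrative assumption (a 24 kbps audio stream, 150 spoken words per minute, about 6 bytes per word including the space), not a measurement from Skype for Business.

```python
def audio_bytes_per_second(bitrate_kbps: float) -> float:
    """Bytes per second for an audio stream at the given bitrate."""
    return bitrate_kbps * 1000 / 8


def text_bytes_per_second(words_per_minute: float, bytes_per_word: float) -> float:
    """Bytes per second needed to carry the same speech as plain text."""
    return words_per_minute * bytes_per_word / 60


if __name__ == "__main__":
    audio = audio_bytes_per_second(24)    # assumed codec bitrate
    text = text_bytes_per_second(150, 6)  # assumed speaking rate and word size
    print(f"audio: {audio:.0f} B/s, text: {text:.0f} B/s, ratio: {audio / text:.0f}x")
```

Under these assumptions the text needs roughly two hundred times less bandwidth than the audio, before counting IM protocol overhead.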
Open-source, code on GitHub
Like many of the projects I do, all the code for this is on GitHub. I encourage you to look at it, to improve it, to add features and make it better.
Code Disclaimer
To quote James Whittaker: “you might be a creative if … you are less proud of the code you wrote than the possibilities it creates.”
This code is written in a hurry. It’s written in time snatched at home after work, at weekends, in between doing chores. It’s messy. Class files are large, nulls abound, comments are scarce. The potential for an exception to break everything is high. It’s still a Console Application.
And that’s OK. Because I’ve spent the limited time I’ve had in proving a point, in doing something which (as far as I know) hasn’t been done yet. This code isn’t ready for production use; it’s meant as a learning aid, as a stepping stone to something greater. If you’re the sort of person that cares more about how the code looks than what it does, this code probably isn’t for you. Feel free to step out of the way of those of us that are happy to hold our noses in order to get stuff done. There is absolutely a time for well-written, easily maintainable code, but a POC is not it.
How to Use the Sample
You’ll need to fork or download the code, and build it in x64. Then you’ll need to create a UCMA trusted application and application endpoint (see this blog post for that).
To use the Translator API you’ll need a key. Full instructions for getting one are here: https://www.microsoft.com/en-us/translator/getstarted.aspx
Once you have the Translator API key, set it in the app.config. Whilst you’re there set the UCMAAppID key to match your UCMA Trusted Application ID.
Run the sample. The Console window will give you basic output and will tell you once the UCMA endpoint has been discovered. After that either audio call the Trusted Application endpoint directly, or drag it into a conference. Full stops in the console window represent audio chunks being sent to the server.
Warning: I’m not sure how it deals with more than one call. It might be fine, but I haven’t tried it 🙂
Credits
There’s no way I would have been able to do this project without some serious help from others. In no particular order:
- The Microsoft Translator API team, for democratizing complex AI and making language transcription and translation so easy to consume. Check out the API, it’s awesome.
- The Speech Translator demo application. I borrowed heavily from this, including the idea of chunking audio into a queue and communicating with the API. Thanks to anthonyaue, Chris Wendt, SychevIgor and anyone else involved in this I don’t know about.
- The people who maintain NAudio, which is a fantastic open-source project to make doing anything with audio easy. I used it to convert a WMA file into a WAV stream.
- The UCMA Task Extensions compiled by Michael Greenlee which make using UCMA with async/await possible.
What’s Next
I’m hoping to find the time to build this out a little bit more. Some of the things which I think would be cool are:
- ability to IM the bot a specific language and then get the translation in that language. Maybe if one user IMs “German” and another user IMs “French” then the bot opens up separate “back-channel” IM conversations with them in their own language.
- I want to try and improve the delay between something being spoken and the text being shown. I think there’s some tweaking of the various buffers between the WMA and the chunked audio that goes to the Web Socket which could improve things.
- A nice feature would be to also capture the active speaker in the conversation and append that to the transcript information.
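The first idea in the list, letting each user IM the bot a language name, could start from something like this Python sketch. It is not part of the current sample: the `LanguagePreferences` class, the `LANGUAGE_CODES` table and the user addresses are all hypothetical, and a real version would need to hook into the bot’s IM handling and the Translator connection.

```python
from typing import Dict, Optional

# Hypothetical mapping from what a user might type to a language code.
LANGUAGE_CODES: Dict[str, str] = {
    "english": "en",
    "german": "de",
    "french": "fr",
    "spanish": "es",
}


class LanguagePreferences:
    """Remember which translation language each participant asked for."""

    def __init__(self, default: str = "de") -> None:
        self._default = default
        self._by_user: Dict[str, str] = {}

    def handle_im(self, user: str, message: str) -> Optional[str]:
        """If the IM names a known language, store it and return its code."""
        code = LANGUAGE_CODES.get(message.strip().lower())
        if code is not None:
            self._by_user[user] = code
        return code

    def language_for(self, user: str) -> str:
        """Return the user's chosen language, or the default if none was set."""
        return self._by_user.get(user, self._default)


if __name__ == "__main__":
    prefs = LanguagePreferences()
    prefs.handle_im("sip:anna@example.invalid", "German")
    prefs.handle_im("sip:luc@example.invalid", " french ")
    print(prefs.language_for("sip:anna@example.invalid"))
    print(prefs.language_for("sip:luc@example.invalid"))
```

Messages that don’t name a known language return `None`, so the bot could ignore them or reply with a list of supported languages.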
If you’re interested in a more stable version of this proof-of-concept with specific features or options, contact me and we can talk through options for delivering it via my employer.
Hey Tom,
Nice post. Not really agree that its not done before though. I have a similar implementation for a proprietary system, with the exception that I am not using Microsoft Translator API, instead using ReadSpeaker TTS engines. I am not a big fan of TTS, so probably will switch to API – and thanks for showing its usability. I am sure the dev community will love this as do I. Cheers
There’s a new-ish platform from MS (Real-time media calling) which may be easier to use
https://docs.microsoft.com/en-us/bot-framework/dotnet/bot-builder-dotnet-real-time-media-concepts
This is for Skype (consumer) only though I think, not Skype for Business, and definitely not for on-premise Skype for Business.
THAT’S Amazing Tom! you are the best !