Providing efficient multilingual customer support in global companies presents significant operational challenges. Through collaboration between AWS and DXC Technology, we've developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multilingual customer interactions.
In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.
Challenge: Serving customers in multiple languages
In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for the lower-volume languages. Previously, DXC had explored several existing solutions but found limitations in each approach, from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solution Architects collaborated to:
- Define essential requirements for real-time translation
- Establish latency and accuracy benchmarks
- Create seamless integration paths with existing systems
- Develop a phased implementation strategy
- Prepare and test an initial proof of concept setup
Business impact
For DXC, this prototype was used as an enabler, allowing technical talent maximization, operational transformation, and cost improvements through:
- Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
- Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
- Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through a pay-per-use translation model
- Comparable experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in the customer's preferred language
Solution overview
The Amazon Connect V2V translation prototype uses advanced AWS speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:
- Speech recognition – The customer's spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
- Machine translation – Amazon Translate, the machine translation engine, translates the customer's transcript into the agent's preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine.
- Bidirectional translation – The process is reversed for the agent's response, translating their speech into the customer's language and delivering the translated audio to the customer.
- Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.
The prototype can be extended with other AWS AI services to further customize the translation capabilities. It's open source and ready for customization to meet your specific needs.
The following diagram illustrates the solution architecture.
The following screenshot illustrates a sample agent web application.
The user interface consists of three sections:
- Contact Control Panel – A softphone client using Amazon Connect
- Customer controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
- Agent controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice
Challenges when implementing near real-time voice translation
The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream is started. However, even with the shortest audio processing time, the user experience still doesn't match the experience of a real conversation in which both participants speak the same language. This is because of the specific pattern of the customer only hearing the agent's translated speech, and the agent only hearing the customer's translated speech. The following diagram displays that pattern.
The example workflow consists of the following steps:
- The customer starts speaking in their own language, and speaks for 10 seconds.
- Because the agent only hears the customer's translated speech, the agent first hears 10 seconds of silence.
- When the customer finishes speaking, the audio processing takes 1–2 seconds, during which time both the customer and agent hear silence.
- The customer's translated speech is streamed to the agent. During that time, the customer hears silence.
- When the customer's translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
- Because the customer only hears the agent's translated speech, the customer hears 10 seconds of silence.
- When the agent finishes speaking, the audio processing takes 1–2 seconds, during which time both the customer and agent hear silence.
- The agent's translated speech is streamed to the customer. During that time, the agent hears silence.
In this scenario, the customer hears a single block of 22–24 seconds of complete silence, from the moment they finished speaking until they hear the agent's translated voice. This creates a suboptimal experience, because the customer might not be sure what is happening during those 22–24 seconds (for instance, whether the agent was able to hear them, or whether there was a technical issue).
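The 22–24 second figure follows directly from the example's timings, which can be checked with a few lines of arithmetic (the durations below are the example's assumed values, and the translated playback is assumed to last about as long as the original utterance):

```python
def customer_silence_seconds(agent_speech_s, processing_s, playback_s):
    """Silence the customer hears between finishing speaking and hearing
    the agent's translated reply (baseline flow, no add-ons).

    Sequence from the customer's perspective:
      processing of their speech (processing_s)
      + their translation played to the agent (playback_s)
      + the agent speaking (agent_speech_s)
      + processing of the agent's speech (processing_s)
    """
    return processing_s + playback_s + agent_speech_s + processing_s

# With the example's numbers: 10 s utterances, 1-2 s processing each way.
best_case = customer_silence_seconds(10, processing_s=1, playback_s=10)
worst_case = customer_silence_seconds(10, processing_s=2, playback_s=10)
```

With these inputs, `best_case` is 22 seconds and `worst_case` is 24 seconds, matching the range above.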
Audio streaming add-ons
In a face-to-face conversation between two people who don't speak the same language, they might have another person act as a translator or interpreter. An example workflow consists of the following steps:
- Person A speaks in their own language, which is heard by Person B and the translator.
- The translator translates what Person A said into Person B's language. The translation is heard by Person B and Person A.
Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There's no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).
To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.
The workflow consists of the following steps:
- The customer starts speaking in their own language, and speaks for 10 seconds.
- The agent hears the customer's original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
- When the customer finishes speaking, the audio processing takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback (contact center background noise) at a very low volume (“Audio Feedback” enabled).
- The customer's translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
- When the customer's translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
- The customer hears the agent's original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
- When the agent finishes speaking, the audio processing takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback (contact center background noise) at a very low volume (“Audio Feedback” enabled).
- The agent's translated speech is then streamed to the customer. During that time, the agent hears their translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).
In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.
The audio streaming add-ons provide additional benefits, including:
- Voice characteristics – In cases when the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can't hear whether the customer was talking slowly or quickly, or whether the customer was upset or calm. The translated and synthesized speech doesn't carry over that information.
- Quality assurance – In cases when call recording is enabled, only the customer's original voice and the agent's synthesized speech are recorded, because the translation and synthesis are done on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent's original voice, the customer's original voice, and their respective translated and synthesized speech, all in a single audio file.
- Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that could improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure your brand names, character names, model names, and other unique content are transcribed and translated to the desired result.
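Streaming the original voice “at a lower volume” amounts to applying a constant gain reduction before mixing it under the translated audio. The sample project does this in the browser; the sketch below only illustrates the gain operation itself, assuming 16-bit signed little-endian PCM:

```python
import array

def duck(pcm_bytes, gain=0.3):
    """Attenuate 16-bit signed PCM by a constant gain factor, e.g. to mix
    the customer's original voice quietly under the translated speech.

    The 0.3 default is an arbitrary illustrative value, not a setting
    from the sample project.
    """
    samples = array.array("h")          # signed 16-bit samples
    samples.frombytes(pcm_bytes)
    for i, s in enumerate(samples):
        # Scale and clamp to the valid 16-bit range.
        samples[i] = max(-32768, min(32767, int(s * gain)))
    return samples.tobytes()
```

In a real client, the same effect is typically achieved with a Web Audio `GainNode` rather than per-sample arithmetic.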
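Registering a custom terminology is a single Amazon Translate API call. The sketch below shows the shape of that call via a boto3-style client; the function name, terminology name, and CSV content are illustrative assumptions, and the client is injected so the call can be exercised with a stub.

```python
def put_brand_terminology(translate_client, name, csv_bytes):
    """Register (or overwrite) an Amazon Translate custom terminology so
    brand and model names translate to the desired result.

    csv_bytes is a CSV with one column per language code, for example:
        en,es
        AnyCompany,AnyCompany
    """
    return translate_client.import_terminology(
        Name=name,
        MergeStrategy="OVERWRITE",
        TerminologyData={"File": csv_bytes, "Format": "CSV"},
    )
```

Once imported, the terminology is applied by passing `TerminologyNames=[name]` to subsequent `translate_text` calls.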
Get started with Amazon Connect V2V
Ready to transform your contact center's communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multilingual communication solutions in your own contact center, through the following key steps:
- Clone the GitHub repository.
- Test different configurations for audio streaming add-ons.
- Review the sample project's limitations in the README.
- Develop your implementation strategy:
- Implement robust security and compliance controls that meet your organization's standards.
- Collaborate with your customer experience team to define your specific use case requirements.
- Balance automation and the agent's manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons).
- Use your preferred transcription, translation, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
- Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.
Conclusion
The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!
About the Authors
Milos Cosic is a Principal Solutions Architect at AWS.
EJ Ferrell is a Senior Solutions Architect at AWS.
Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.