Speech to Speech Translation

In this class, we have talked a lot about machine translation. We have focused on MT as a task that transforms one sequence of text into another sequence of text in a different language. In this project, we'll consider translating another modality: speech. This is going to be a hackathon-style project, where you'll piece together several existing libraries to build a speech translation system. The overall goal is a system that takes speech in one language as input and outputs speech in another language. Doing this with a single model, known as end-to-end speech translation, is an active area of research with recent successes (e.g. this). We're going to use a simpler pipeline approach:

src speech --[ASR]--> src text --[MT]--> tgt text --[TTS]--> tgt speech

Training models for each of these components can be time- and resource-intensive, so we don't expect you to do that. Use existing libraries and APIs where possible, and fall back on pretrained models otherwise. If you do train your own model, it will take a long time unless you have access to a GPU. Here are some suggestions for each piece:

  1. Speech Recognition: speech_recognition
  2. Machine Translation: py-googletrans
  3. Text-to-Speech: pyttsx3
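
To make the pipeline concrete, here is a minimal sketch of how the three suggested libraries could be chained together. It is one possible starting point, not a reference solution. The function names (`speech_to_speech`, `recognize_file`, `translate_text`, `speak`) and the default language codes are made up for illustration. The sketch assumes a googletrans release whose `translate` call is synchronous; some newer releases expose it as a coroutine, in which case it would need to be awaited. It also assumes an internet connection, since `recognize_google` and googletrans both call web services.

```python
def speech_to_speech(audio_path, asr, mt, tts):
    """Run src speech -> src text -> tgt text -> tgt speech.

    Each stage is passed in as a plain function, so any stage can be
    swapped for another library, an API, or a pretrained model.
    """
    src_text = asr(audio_path)   # ASR: audio file -> source-language text
    tgt_text = mt(src_text)      # MT: source text -> target-language text
    tts(tgt_text)                # TTS: speak the target-language text
    return src_text, tgt_text


def recognize_file(audio_path, language="en-US"):
    """ASR stage (hypothetical helper) using speech_recognition."""
    import speech_recognition as sr  # imported lazily; only needed if used

    recognizer = sr.Recognizer()
    # AudioFile reads WAV, AIFF, or FLAC files.
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    # Sends the audio to Google's web speech API.
    return recognizer.recognize_google(audio, language=language)


def translate_text(text, src="en", dest="es"):
    """MT stage (hypothetical helper) using googletrans.

    Assumes a release where Translator.translate is a synchronous call.
    """
    from googletrans import Translator

    return Translator().translate(text, src=src, dest=dest).text


def speak(text):
    """TTS stage (hypothetical helper) using pyttsx3."""
    import pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()  # blocks until the utterance has been spoken
```

Calling `speech_to_speech("hello.wav", recognize_file, translate_text, speak)` on a short English recording (the file name is just an example) would then run all three stages. Passing the stages in as arguments keeps the pipeline easy to test with stand-in functions and easy to extend, for example by swapping in microphone input. Note that the default pyttsx3 voice may not suit the target language, so you may need to pick a different installed voice.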

The end result should be something like the Google Translate app: you talk to your phone, it translates what you said, and then it speaks the translation back in the target language.