Tutorial on Building an Automated Dubbing Service with AI Technologies

Creating an automated dubbing service using AI technologies like ElevenLabs and OpenAI.

Introduction

The arrival of highly advanced text-to-speech technology in recent years has opened the floodgates for innovative, cutting-edge AI-powered products. No longer are we limited to the stilted, robotic synthesized speech of older text-to-speech systems. Recently, a company called ElevenLabs has upped the ante with a suite of features centered around voice generation, from designing our own custom voices to synthesizing speech with those voices or with the pre-made voices ElevenLabs provides.

In this tutorial, we will build an automatic dubbing service using text-to-speech technology by ElevenLabs. Along the way, we will identify all the necessary steps, from retrieving a video via its YouTube link to combining the video with the generated dubs.

Introduction to ElevenLabs

ElevenLabs is a highly innovative company that offers a powerful but easy-to-use API for voice generation. Its voice generation models are trained on large swathes of audiobooks and podcasts, enabling them to generate natural-sounding, expressive speech. This makes ElevenLabs' API an ideal choice for a wide range of voice-centric products, such as audiobook narration and video voiceovers.

What is OpenAI's Whisper?

Whisper is a speech-to-text (audio transcription) model developed by OpenAI. It is reported to be trained on 680,000 hours of multilingual and multitask supervised data collected from the web, giving it improved robustness to accents, background noise, and technical language. Whisper is capable of transcribing speech in multiple languages as well as translating from non-English languages into English.

Introduction to Anthropic's Claude Model

Claude is an advanced AI model developed by Anthropic, based on their research into training helpful, honest, and harmless AI systems. It is designed to help with various language-centric use cases such as text summarization, collaborative writing, Q&A, and coding. Early reviews from various users report that Claude is much more averse to producing questionable or harmful responses, easier and more intuitive to work with, and more "steerable." Bottom line: Claude produces human-like responses, making it well suited for building services expected to deliver an excellent user experience. Because of Claude's strength in language-centric tasks, we will use it to translate our video transcript.

Prerequisites

  • Basic knowledge of Python, experience with Streamlit is a plus
  • Access to ElevenLabs' API
  • Access to Anthropic's API

Outline

  1. Identifying the Requirements
  2. Initializing the Project
  3. Adding Video Transcription Feature using OpenAI's Whisper
  4. Adding Translation Feature using Anthropic's Claude
  5. Adding Dubs Generation Feature using ElevenLabs' API
  6. Final Touch - Combining the Video with the Generated Dubs
  7. Testing the Auto Dubbing Service

Identifying the Requirements

Let’s ask ourselves again: what features does our service need as an automatic dubs generator for YouTube videos? Well, let’s trace the steps involved in generating dubs, from retrieving the video to combining the dubs with the video.

Retrieve the Video from YouTube Link

We can use the popular Python library pytube for this purpose. By passing the link to pytube's YouTube class, we can retrieve the video and audio streams, along with metadata such as the title, description, and thumbnail. For this step, we will need a text input field for the YouTube link and a button to initiate the stream download process.
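As a quick illustration, here is a minimal sketch of how pytube exposes metadata and streams (the URL and file name below are placeholders):

```python
from pytube import YouTube

# Placeholder URL; any public YouTube link works here.
yt = YouTube("https://www.youtube.com/watch?v=VIDEO_ID")

print(yt.title)          # video title
print(yt.thumbnail_url)  # thumbnail image URL

# Download the audio-only stream for transcription.
audio_stream = yt.streams.filter(only_audio=True).first()
audio_stream.download(filename="audio.mp4")
```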

Transcribe the Audio Stream

After downloading the audio stream from the YouTube video, we can start transcribing the audio with OpenAI's Whisper. For performance reasons, we will first cut the audio into one-minute chunks before transcription. After obtaining the transcript, we will display it in a DataFrame listing the start time, end time, and text spoken in each segment.
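A sketch of this chunk-and-transcribe step, assuming the audio was saved as audio.mp4; the file names and model size are our own choices, and pydub slices audio in milliseconds:

```python
import whisper
from pydub import AudioSegment

audio = AudioSegment.from_file("audio.mp4")

ONE_MINUTE = 60 * 1000  # pydub works in milliseconds
chunks = [audio[i:i + ONE_MINUTE] for i in range(0, len(audio), ONE_MINUTE)]

model = whisper.load_model("base")  # model size is a choice, not a requirement
segments = []
for idx, chunk in enumerate(chunks):
    chunk.export(f"chunk_{idx}.mp3", format="mp3")
    result = model.transcribe(f"chunk_{idx}.mp3")
    for seg in result["segments"]:
        # Offset timestamps by the chunk's position in the full audio.
        segments.append({
            "start": seg["start"] + idx * 60,
            "end": seg["end"] + idx * 60,
            "text": seg["text"],
        })
```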

Translating the Transcript

The translation process will begin immediately after obtaining the transcript. We’ll use the anthropic library to access the Claude model and prompt it to translate the transcript, which is originally in English.

Generating the Dubs

Once we receive the response from Claude, we'll proceed to generate audio using ElevenLabs’ API. By importing functions from the elevenlabs library, we can generate speech using pre-made voices and a multilingual model, which is needed for non-English translations.

Combining the Dubs with the Video

Finally, we will retrieve the video stream from the YouTube link and combine it with the generated audio. For this task, we will use a command-line tool called ffmpeg. Once the video is combined with the dubs, we will display it in a video player in our user interface!

Initializing the Project

To build our auto-dubbing service’s user interface, we will use the Streamlit library, allowing us to manage this entire project in a single Python file. Some extra steps will ensure that our project runs smoothly.

Create the project directory

Navigate to your coding projects directory in the terminal, create a project directory, and enter it:
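For example (the directory name is just a suggestion):

```bash
mkdir auto-dubbing
cd auto-dubbing
```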

Create and activate the virtual environment

Next, create and activate your virtual environment. This prevents dependency issues by ensuring our project’s dependencies don’t leak into the global environment. Once activated, the terminal prompt should show the name of your virtual environment.
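A typical sequence on macOS/Linux (on Windows, activate with venv\Scripts\activate instead):

```bash
python -m venv venv
source venv/bin/activate
```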

Installing the dependencies

Install all necessary dependencies using pip; a sample command follows the list below. This may take some time, so feel free to get a coffee! The main libraries installed include:

  • Streamlit: For building the user interface easily.
  • Anthropic: To connect to Anthropic’s API.
  • ElevenLabs: To handle requests to the ElevenLabs API.
  • Pytube: To retrieve YouTube video metadata and download streams.
  • Pydub: For slicing and managing audio files.
  • Whisper: To transcribe the downloaded audio.
  • MoviePy: For video manipulation and stream management.
  • Pydantic: Pinned to a version compatible with the ElevenLabs library.
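One possible install command; the package names are as published on PyPI at the time of writing (note that Whisper is published as openai-whisper), and the pydantic pin is an assumption for compatibility with older elevenlabs releases:

```bash
pip install streamlit anthropic elevenlabs pytube pydub openai-whisper moviepy "pydantic<2"
```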

Fixing the Pytube bug

At the time of writing, pytube ships with a bug in its cipher.py module that raises a RegexMatchError when downloading streams. To fix it, navigate to the pytube package directory inside your virtual environment, open cipher.py, and remove the trailing semicolon from the regex pattern that triggers the error.

Installing ffmpeg

ffmpeg installation differs per operating system; refer to the official ffmpeg documentation for instructions for yours.
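On common platforms, a package manager is the usual route (the commands below assume Homebrew and APT respectively):

```bash
# macOS (Homebrew)
brew install ffmpeg

# Debian/Ubuntu
sudo apt install ffmpeg
```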

Creating our Streamlit secret file

Create a secrets.toml file inside the project's .streamlit directory to manage the sensitive API keys needed for the AI services used in this application.
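A minimal sketch of .streamlit/secrets.toml; the key names are our own choice and the values are placeholders:

```toml
ANTHROPIC_API_KEY = "your-anthropic-api-key"
ELEVEN_API_KEY = "your-elevenlabs-api-key"
```

These values can then be read in the app via st.secrets["ANTHROPIC_API_KEY"] and st.secrets["ELEVEN_API_KEY"].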

Creating the autodubs.py file

Create autodubs.py and add the essential UI elements: a title, a text input for the YouTube link, and a button to kick off the process.
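A minimal sketch of the initial UI; the app title and input label are our own choices, while the "Transcribe!" button label is referenced in later steps:

```python
import streamlit as st

st.title("AutoDubs")  # hypothetical app title

link = st.text_input("YouTube video link")
transcribe_btn = st.button("Transcribe!")
```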

Adding Video Transcription Feature using OpenAI's Whisper

We begin by adding functionality to the "Transcribe!" button, using the pytube and Whisper libraries to download the audio stream and transcribe it. Afterward, the results are displayed in a pandas DataFrame.
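A hedged sketch of the wiring, reusing the link and transcribe_btn names from the UI snippet above (the one-minute chunking shown earlier is omitted here for brevity):

```python
import pandas as pd
import streamlit as st
import whisper
from pytube import YouTube

if transcribe_btn and link:
    # Download the audio-only stream of the linked video.
    yt = YouTube(link)
    yt.streams.filter(only_audio=True).first().download(filename="audio.mp4")

    # Transcribe and display the segments as a table.
    model = whisper.load_model("base")
    result = model.transcribe("audio.mp4")
    df = pd.DataFrame(
        [(s["start"], s["end"], s["text"]) for s in result["segments"]],
        columns=["start", "end", "text"],
    )
    st.dataframe(df)
```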

Adding Translation Feature using Anthropic's Claude

Next, we add a function that uses Claude to translate the transcript obtained earlier. We prompt the model to return only the translation, without any additional text.
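A minimal sketch using the anthropic Messages API; the model name, prompt wording, and secret key name are assumptions:

```python
import streamlit as st
from anthropic import Anthropic

client = Anthropic(api_key=st.secrets["ANTHROPIC_API_KEY"])

def translate(transcript: str, language: str = "Spanish") -> str:
    # Ask Claude to return only the translated text, with no preamble.
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # hypothetical model choice
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"Translate this transcript to {language}. "
                       f"Return only the translation:\n\n{transcript}",
        }],
    )
    return response.content[0].text
```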

Adding Dubs Generation Feature using ElevenLabs' API

Utilizing ElevenLabs' API, we'll generate dubs from the translated text using a pre-made voice, then write the generated speech to an mp3 file.
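A sketch assuming the pre-1.0 elevenlabs Python package, which exposed generate, save, and set_api_key as top-level functions; the voice name and secret key name are our own choices:

```python
import streamlit as st
from elevenlabs import generate, save, set_api_key

set_api_key(st.secrets["ELEVEN_API_KEY"])

# Generate speech from the translated transcript with a pre-made voice.
# The multilingual model is needed for non-English text.
audio = generate(
    text=translated_text,  # output of the translation step
    voice="Bella",
    model="eleven_multilingual_v1",
)
save(audio, "dubs.mp3")
```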

Final Touch - Combining the Video with the Generated Dubs

In this penultimate section, we will download the video stream and merge it with the generated audio using ffmpeg, finally updating the UI to show the successfully dubbed video.
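One way to sketch this step is to call ffmpeg through Python's subprocess module; the stream filter, file names, and ffmpeg flags below reflect one workable mapping (video taken from the first input, audio from the second):

```python
import subprocess

import streamlit as st
from pytube import YouTube

# Download the video-only stream; its audio is replaced by our dub.
yt = YouTube(link)
yt.streams.filter(only_video=True).first().download(filename="video.mp4")

# Mux the video with the generated dub, copying the video codec as-is.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "video.mp4",
    "-i", "dubs.mp3",
    "-map", "0:v", "-map", "1:a",
    "-c:v", "copy",
    "output.mp4",
], check=True)

st.video("output.mp4")
```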

Testing the Auto Dubbing Service

Run the application, enter a YouTube link, click the "Transcribe!" button, and wait for the entire process to finish, including the requests to the APIs and the file manipulations. Once everything is done, the dubbed video should be ready for viewing in the video player.

Conclusion

In this detailed tutorial, we explored one of the exciting uses of ElevenLabs' API to generate dubbing for YouTube videos. By combining OpenAI's Whisper for transcription, Claude for translation, and ElevenLabs for dubs, we built a full-fledged automatic dubbing service leveraging industry-leading AI technology!
