How to Create an Automated Dubbing Service with ElevenLabs and OpenAI

Introduction

The arrival of highly advanced text-to-speech technology in recent years has opened the floodgates to many innovative and cutting-edge AI-powered products. No longer are we limited to the awkward and robotic synthesized speeches generated by earlier text-to-speech technologies. A company called ElevenLabs has upped the ante by providing features centered around voice generation. From creating and designing our custom voices to synthesizing speeches using either our creations or pre-made ones, ElevenLabs offers powerful tools.

In this tutorial, we will guide you through the creation of an automatic dubbing service using ElevenLabs' text-to-speech technology. Additionally, we will identify the necessary steps, from retrieving videos via YouTube links to merging the videos with the generated dubs.

Introduction to ElevenLabs

ElevenLabs is a highly innovative company offering a user-friendly API for voice generation. Their cutting-edge voice generation API is trained on extensive audiobooks and podcasts, resulting in the ability to produce natural-sounding and expressive speeches. As a result, ElevenLabs' API can effectively serve a range of voice-centric products, including narrative story/audiobooks and video voiceovers.

Introduction to OpenAI's Whisper

Whisper is an audio transcription service, or speech-to-text module, developed by OpenAI. It's trained on an impressive 680,000 hours of multilingual and multitask supervised data collected from various online sources, ensuring improved performance in detecting accents and technical language amidst background noise. Notably, Whisper is capable of transcribing speeches in multiple languages and translating from non-English languages.

Introduction to Anthropic's Claude Model

Claude is an advanced AI model developed by Anthropic, focused on promoting helpful, honest, and safe AI systems. It excels in diverse language-centric tasks, including text summarization, collaborative writing, Q&A, and coding. Feedback from users indicates that Claude is much less likely to produce harmful responses, is user-friendly, and allows users to achieve their desired outputs with minimal effort.

Ultimately, Claude is designed to produce human-like responses, making it ideal for services aimed at delivering humane and excellent user experiences. In this tutorial, we will utilize Claude to help translate our video transcripts.

Prerequisites

Basic knowledge of Python; familiarity with Streamlit is a plus.
Access to ElevenLabs' API.
Access to Anthropic's API.

Outline

Identifying the Requirements
Initializing the Project
Adding Video Transcription Feature using OpenAI's Whisper
Adding Translation Feature using Anthropic's Claude
Adding Dubs Generation Feature using ElevenLabs' API
Final Touch - Combining the Video with the Generated Dubs
Testing the Auto Dubbing Service

Identifying the Requirements

To build an automatic dubbing service for YouTube videos, we must consider the features essential for the dubbing process: retrieving the video, transcribing audio streams, translating text, generating dubs, and finally, combining dubs with the video.

Retrieve the Video from Youtube Link

To retrieve the video and audio streams along with metadata like titles and thumbnails, we will use the pytube library. Users will input the YouTube link into a text field and click a button to trigger the download process.

Transcribe the Audio Stream

After downloading the audio stream, the next step is transcription, utilizing OpenAI's Whisper via the Whisper library. Due to performance reasons, audio will be divided into one-minute segments before transcription. A DataFrame will display the transcribed content, listing start times, end times, and spoken texts.

Translating the Transcript

Using Anthropic's Claude model, we will translate the transcript from English into the selected language. We'll execute this using an anthropic library call.

Generating the Dubs

Upon receiving the translation results from Claude, we will generate the audio using ElevenLabs' API. By importing relevant functions from ElevenLabs' library, we can synthesize speech using a pre-made voice and the multilingual model to support non-English translations.

Combining the Dubs with the Video

Finally, we'll retrieve the video stream from the previously provided YouTube link and combine it with the generated dubs. The ffmpeg command-line software will facilitate this last step. Following the combination process, a video player will be displayed in our user interface!

Initializing the Project

Streamlit will be used to create the user interface for our auto dubbing service, allowing us to consolidate our work into a single Python file. Some crucial initial steps ensure that our project runs smoothly.

Create the Project Directory

Open your terminal and navigate to your coding projects directory.
Create the project directory and enter it.

Create and Activate the Virtual Environment

Next, we'll create and activate our virtual environment, preventing dependencies from leaking into the global environment. Your terminal should indicate when the virtual environment is activated.

Installing Dependencies

Now, install the necessary dependencies via pip. This may take some time, so take a break during the process. Here is a breakdown of the libraries we're installing:

Streamlit: Used for building the user interface.
Anthropic: Connects to Anthropic's API.
ElevenLabs: Serves as a wrapper for ElevenLabs API.
Pytube: Retrievers metadata and streams from YouTube.
Pydub: Manages and edits audio files easily.
Whisper: Transcribes audio downloaded.
MoviePy: Initially intended for speedy video/audio combining but switched to ffmpeg.
Pydantic: Must install the version locked to avoid errors in ElevenLabs.

Fixing Known Issues

As you set up your project, keep in mind potential issues that could arise:

Fixing the Pytube Bug: Navigate to the Pytube library directory, locate the cipher.py file, and remove the semicolon from the problematic regex pattern on line 287.
Installing ffmpeg: Detailed installation instructions can be found online. Successful installation will allow the ffmpeg command to run in your terminal.

Creating Streamlit Secret File

Our auto-dubbing app will use various AI services that require access keys. It is best practice to manage these keys in a separate file. Streamlit utilizes a secrets.toml file to store sensitive information. After creating this file, specify your API keys accordingly.

Creating the autodubs.py File

Create a new autodubs.py file in your project directory. Begin writing your app, incorporating additional UI elements and functionalities.

Adding Video Transcription Feature using OpenAI's Whisper

Next, add a handler for our "Transcribe!" button. This will download the audio stream and transcribe it using OpenAI's Whisper. Import the necessary libraries and define functions to manage audio handling before invoking the echo function.

Adding Translation Feature using Anthropic's Claude

Develop a translation function to send prompts to Claude using the anthropic library, instructing it to get directly to the translation without preliminary remarks.

Adding Dubs Generation Feature using ElevenLabs' API

Generate dubs by defining the corresponding function, utilizing the ElevenLabs API to create an audio output from the translated text.

Final Touch - Combining the Video with the Generated Dubs

Retrieve the video stream and combine it with the generated dub audio file using ffmpeg. Ensure that this process completes successfully before displaying the merged video in your user interface.

Testing the Auto Dubbing Service

Run your app, test its functionality, and ensure that everything works correctly. The video player will appear, allowing users to enjoy the newly dubbed video.

Conclusion

Throughout this tutorial, we successfully built an automatic dubbing service for YouTube videos! With tools like OpenAI's Whisper for transcription, Anthropic's Claude for translation, and ElevenLabs for voice generation, we’ve created a streamlined, effective auto-dubbing process. Many thanks to the Streamlit library for simplifying our UI development!