Tutorial: Building an Auto Dubbing Service Using ElevenLabs

Introduction

The arrival of highly advanced text-to-speech technology in recent years has opened the floodgates for many innovative and cutting-edge AI-powered products. No longer are we limited to the awkward and robotic synthesized speeches generated by text-to-speech technology of the past. Recently, a company called ElevenLabs has upped the ante by providing us with features centered around voice generation. From creating and designing custom voices, to synthesizing speeches using the voices that we create or using pre-made voices provided by ElevenLabs.

In this tutorial, we will build an automatic dubbing service using text-to-speech technology by ElevenLabs. Not only that, we will also identify the necessary steps from retrieving a video from a YouTube link to combining the video with the generated dubs.

Introduction to ElevenLabs

ElevenLabs is a highly innovative company that offers a powerful and easy-to-use API for voice generation. They boast cutting-edge technology with their voice generation API, trained on vast collections of audiobooks and podcasts, resulting in the ability to generate natural-sounding and expressive speeches. Thus, ElevenLabs' API can serve as an ideal choice for a wide range of voice-centric products, such as story/audiobook narration and voiceover for videos.

Introduction to OpenAI's Whisper

Whisper is an audio transcription service, or speech-to-text, developed by OpenAI. It is reported to have been trained on 680,000 hours worth of multilingual and multitask supervised data collected from the web, ensuring improved consistencies in detecting accents, background noise, and technical language. Whisper is also capable of transcribing speeches in multiple languages as well as translating from non-English languages.

Introduction to Anthropic's Claude Model

Claude is an advanced AI model developed by Anthropic, based on their research into promoting and training helpful, honest, and harmless AI systems. It is designed to help with various language-centric use cases such as text summarization, collaborative writing, Q&A, and coding. Early reviews from various users reported that Claude is particularly effective at producing safe and reliable responses, making it easier to work with and more intuitive. This makes Claude ideal for building services that are expected to deliver a humane and excellent user experience. We will use Claude to help translate our video transcript.

Prerequisites

Basic knowledge of Python; experience with Streamlit is a plus.
Access to ElevenLabs' API.
Access to Anthropic's API.

Outline

Identifying the Requirements
Initializing the Project
Adding Video Transcription Feature using OpenAI's Whisper
Adding Translation Feature using Anthropic's Claude
Adding Dubs Generation Feature using ElevenLabs' API
Final Touch - Combining the Video with the Generated Dubs
Testing the Auto Dubbing Service

Discussion

Before diving into the coding part, let’s reflect on the features our automatic dubbing service should include. By considering the requirements and intended use cases, we can ensure our service delivers adequate solutions. With that in mind, let's begin!

Identifying the Requirements

To generate dubs for YouTube videos, we need to trace the steps involved:

1. Retrieve the Video from YouTube Link

We can use the popular Python library, pytube, to retrieve the video and audio streams, as well as metadata such as title, description, and thumbnail. We’ll provide a text input field for the YouTube link and a button to trigger the stream download process.

2. Transcribe the Audio Stream

Once the audio stream is downloaded, we can transcribe it using OpenAI's Whisper via the Whisper library. To improve performance, we will cut the audio into one-minute durations before transcription. The transcript will be displayed in a DataFrame.

3. Translating the Transcript

Next, we will use the anthropic library to access Claude, sending a prompt to ask for translation of the transcript, originally in English.

4. Generating the Dubs

After receiving the response from Claude, we will generate dubs using ElevenLabs' API, employing the multilingual model to accommodate non-English translations.

5. Combining the Dubs with the Video

Finally, we will retrieve the video stream from the YouTube link and combine the generated audio with the video using ffmpeg.

Initializing the Project

We will use the streamlit library to build our user interface. The project will involve creating a single Python file and following steps to ensure smooth operation:

Creating the Project Directory

First, navigate to your code projects directory and create a project directory.

Creating and Activating the Virtual Environment

Create the virtual environment and activate it to prevent dependency conflicts.

Installing the Dependencies

Using pip commands, we will install all necessary dependencies including Streamlit, Anthropic, ElevenLabs, Pytube, Pydub, Whisper, and others mentioned. Make sure to address any potential issues such as bugs in Pytube and installation of ffmpeg.

Creating the Streamlit Secret File

Store API keys and sensitive information in a secrets.toml file within the project directory for security.

Creating the auto_dubs.py File

Using a code editor, write the initial layout of the app with a title, text input, select box for languages, and a button that triggers the transcription process.

Adding Video Transcription Feature using OpenAI's Whisper

Add a handler to the "Transcribe!" button to download the audio stream, cut the audio, and process transcription with the Whisper library, displaying results in a Pandas DataFrame.

Adding Translation Feature using Anthropic's Claude

Incorporate the translation feature by creating a function that sends the transcription to Claude for translation, following up with a user prompt to direct Claude effectively.

Adding Dubs Generation Feature using ElevenLabs' API

Incorporate the dub generation feature using ElevenLabs, using the multilingual model to generate natural-sounding speech based on translated text.

Final Touch - Combining the Video with the Generated Dubs

Combine video and audio using ffmpeg, ensuring that the overall process runs smoothly before presenting it to the user.

Testing the Auto Dubbing Service

Now, let’s test our app by clicking the "Transcribe!" button and verifying that everything functions as intended, including playback of the video with the new dubs.

Conclusion

This tutorial has demonstrated an innovative approach to generating translation dubs for YouTube videos using ElevenLabs' API. The combination of OpenAI's Whisper, Anthropic's Claude, and the multilingual capabilities of ElevenLabs results in a seamless user experience. Through Streamlit, we were able to present all necessary features in a user-friendly interface.

Now, we can automatically generate dubs for YouTube videos, showcasing the potential of combining various AI services to achieve remarkable results!