AI Tutorial

ElevenLabs Tutorial: Building a Simple Word Spelling App Using Speech Synthesis

A screenshot of a simple word spelling application built using ElevenLabs and Streamlit.

Introduction

Nowadays is one of the most exciting times for software development, what with the emergence of various "generative AI" tools in the market. Just name it: cover letter generation? Check! E-mail generation? Check! Automatic code comment generation? Check! Even outside coding and software development, the use case possibilities are enormous. Now we can generate images with text prompts using various image generation models. This innovation allows us to incorporate generated assets into our various products. The next question is, how about voices? The trend of user experiences in the past few years has emphasized "voice command" as an emerging trend. It is only natural that the software we build will incorporate voices as one of the features. In this tutorial, we will showcase the "Speech Synthesis" feature offered by ElevenLabs in a simple app that generates random words and has it spell them out. To build the UI for this Python-based app, we will use Streamlit, a new UI library for sharing data science projects.

Introduction to ElevenLabs

ElevenLabs is a voice technology research company that offers a speech synthesis solution. With an easy-to-use API, it allows developers to generate high-quality speech using AI. This is possible due to an AI model that has been trained on a vast amount of audiobooks and podcasts. The training allows the AI to deliver predictable and high-quality results in speech generation. ElevenLabs features two primary capabilities: VoiceLab, where users can clone voices from recorded audio and/or existing pre-made voices, and design voices based on gender, ages, ethnicities, and races. Once users have the voices to work with, they can proceed to Speech Synthesis, allowing them to generate speeches using their designed voices or pre-made ones.

Introduction to Anthropic's Claude Model

Claude is the latest AI model developed by Anthropic, an AI research organization focused on improving the interoperability, robustness, and safety of artificial intelligence systems. The Claude model is designed to generate human-like responses, making it a powerful tool for various applications, including content creation, legal, and customer service. Unlike other AI models, Claude emphasizes safety, which enables it to refuse outputs it considers harmful or untruthful for users.

Introduction to Streamlit

Streamlit is an open-source Python library that simplifies the creation and sharing of visually appealing and customized web apps for developers and data scientists. Streamlit enables users to build and deploy fully-featured data science apps in minutes through its straightforward and intuitive API, turning data scripts into UI components.

Prerequisites

  • Basic knowledge of Python and UI development using Streamlit
  • Access to Anthropic API
  • Access to ElevenLabs API

Outline

  1. Initializing our Streamlit Project
  2. Adding Word Generation Feature using Claude Model
  3. Adding Speech Generation Feature using ElevenLabs API
  4. Testing the Word Generator App

Discussion

In this tutorial, we will navigate four essential steps. First, we need to initialize the Streamlit-based project to familiarize ourselves with developing user interfaces using Streamlit. Next, we will add more features, starting with engineering a prompt to get Claude's model to provide us with a randomized word that is commonly misspelled. We will then incorporate text-to-voice generation provided by ElevenLabs to demonstrate how the multilingual model pronounces the words. Lastly, we will test the simple app.

Initializing our Streamlit Project

Let's get into the coding action! First, create a directory for our project and enter it. This directory will serve as the foundation of our Streamlit project. Since a Streamlit project essentially comprises a Python project, we need to perform several steps to initialize our Python project, such as defining and activating our virtual environment.

Once activated, the output of our terminal should show the name of the virtual environment (env), like so:

...

Next, it's time to install the libraries required for this project! We will use the pip command to install the Streamlit, anthropic, and elevenlabs libraries. Note that we will also install a version-locked Pydantic library to prevent any Pydantic-related errors in one of the ElevenLabs functions.

With all the project's requirements set up, we can now dive into the coding part! Create a new file inside our project directory and name it randomwords_app.py.

After creating the file, open it in your preferred code editor or integrated development environment (IDE). To begin, we will build the simple app from the simplest components: a title and a caption text!

To conclude our project initialization, let's try a test run of the app. Ensure that our current working directory is still within the project and that our virtual environment is activated. Once everything is ready, use the streamlit run command to execute the app.

The app should automatically open in our default browser, displaying the title and text for now. Next, we will add the random word generation feature using Anthropic's Claude model.

One last consideration: we need to provide our app with the API keys for the services we intend to use, specifically Anthropic's Claude model and ElevenLabs' Speech Synthesis feature. As these keys are considered sensitive, we must keep them in a secure, isolated location. This time, however, we will not store them in a .env file. This is because Streamlit handles environment variables differently. According to the documentation, we need to create a secret configuration file inside a .streamlit directory. We can create the directory in our project and then create the file.

We'll then edit the TOML file we created, noting that the TOML file uses different formatting from a traditional .env file.

Adding Word Generation Feature using Claude Model

In this step, we will add a button to generate the random word, a heading element to display the generated word, and a subheading to present the meaning of the word. However, coming from a web development background, I believe UI elements should be arranged within containers. So, we will do just that.

Import Necessary Libraries

First, let's add the import statements. We need to import the anthropic library to generate our random words.

Defining the Word Generation Function

In this function, Anthropic's Claude model performs the heavy lifting (Thanks, Claude!). Our responsibility is to ensure that Claude returns the exact format consistently. To achieve this, we need to instruct Claude to "strictly follow the format" and provide an example response after our initial prompt. Finally, we can ask Claude to "Remember to only respond following the pattern." The function concludes by returning Claude's response.

Updating the UI

Next, we will edit the UI by adding a container with several elements inside it: a header, subheader to display the random word, and a text element for the meaning of the word. Additionally, we will include a hint on how to use the app and a button below it.

In Streamlit, we can declare click event handlers using conditional statements that return true when the button is clicked. In this scenario, we invoke the generate_word() function that returns the generated word and meaning, splitting the results into separate variables for clarity. Ultimately, the subheader and text element will be updated to reflect the word and meaning.

Final Form

Let’s double-check our code! It should contain the import statements, the function for generating the random word, and the updated UI with subheader and text elements as well as a button that generates the word by invoking the generate_word() function.

Testing the Word Generation Function

Let's run the app again with the same command. Alternatively, we can rerun the app by clicking the upper-right menu and selecting "Rerun" if we have it running previously.

The updated user interface should now appear.

Now, let's try clicking the Generate button!

One of the great features of Streamlit is its built-in loading functionality and loading indicators. We should see the indicator in the upper-right corner, along with the option to "stop" the operation. Neat, huh?

After a few seconds, the result should display in the UI. Perfect! Notice that the app correctly split Claude's generated text into the word and the meaning. However, if the result does not adhere to the expected format, we can always click the Generate button again.

The next step is to incorporate speech generation into our random word generator. In addition to demonstrating how to generate an audio file using the API provided by ElevenLabs, this step will also showcase the capabilities of ElevenLabs' multilingual model.

Adding Speech Generation Feature using ElevenLabs API

As expected, the first step in this section is to add more import statements! We will include functions from ElevenLabs that we’ll use for the speech generation feature.

Defining the Speech Generation Function

Next, we'll define the function responsible for speech generation. Thanks to the simplicity and readability of Python, along with ElevenLabs' easy-to-use API, we can generate speech with just this code!

The function accepts the random word we will use to generate the speech. We will also specifically use the eleven_multilingual_v1 model, which is perfect for our purpose of demonstrating the pronunciation of foreign and commonly misspelled words! We will utilize the "Bella" voice for this tutorial, one of the pre-made voices provided by ElevenLabs.

Adding Audio Player

Next, we’ll add an audio player to play the generated speech. Just below our most recent code, we will create a variable to store the generated speech and run it using the audio player provided by the st.audio function from Streamlit. At this point, our app should function as expected, only displaying the audio player when there is a random word available to read.

Curious how it works? Me too! Let’s test the app now!

Testing the Word Spelling Feature

We can run the app again using streamlit run or simply rerun it if we have it running already. It should look identical to where we left off. Now, let’s try clicking the "Generate" button this time!

Amazing! This time, the app also displays an audio player! Let’s try playing it. Using the multilingual model, the generated speech should reflect the accent and intonation relevant to the original language of the word. For example, "entrepreneur" should be pronounced with a French accent.

Conclusion

In this short tutorial, we have explored the exciting capabilities of speech generation technology offered by ElevenLabs. With the multilingual model, generating speeches designed for non-English listeners is a seamless experience. In our use case, we needed the multilingual model to assist us in finding the correct way to pronounce and spell non-English words that are commonly misspelled.

Reading next

An example of an automated social media ad generated using LLaVA and Fuyu-8B technologies.
An illustration showing the architecture of an AI Research Assistant built with AutoGPT using Flask and ReactJS.

Leave a comment

All comments are moderated before being published.

यह साइट hCaptcha से सुरक्षित है और hCaptcha से जुड़ी गोपनीयता नीति और सेवा की शर्तें लागू होती हैं.