Imagine a future where "every single movie you can watch in high quality, and every [piece of] educational content you can access yourself."
Specialising in speech and NLP, I always love to keep an eye out for start-ups in this space. ElevenLabs caught my attention as a young, hungry company looking to make an impact in the voice research field, so I felt I had to get in touch and hear more.
I reached out to one of their co-founders, Mati Staniszewski, and set up the interview you can read below.
What drove you to build your start-up?
My co-founder Piotr and I met back in high school. We spent years living together, working together, and doing weekend projects, which were a lot of fun. Having someone I enjoyed working with was a huge part of what drove us to build a company together.
After working for more than seven years at different companies, you realise there's a lot of potential. But of course, the impact of your decisions is very different when you're not the one driving the projects. In founding ElevenLabs, both Piotr and I have the ability to work on the problem we care about and actually see the impact of our changes. That signal from decision to impact is a lot better.
You’ve mentioned that there was a problem you were looking to solve with this start-up. What exactly is it that you are looking to fix?
So, there was a combination of two things that happened in 2021. One was a weekend project we were doing that focused on analysing speech. With everything now happening online, we asked ourselves: can we provide guidance on how we speak, on the pronunciations we use and the emotions we convey? We really enjoyed doing this, and through the process we learned a lot about voice research and technology in general.
The second thing followed, and looking back it's an obvious one. Piotr, my co-founder, and his girlfriend were watching a movie in Poland. They speak Polish as their common language, so they needed to watch it in Polish. But foreign movies in Poland are frequently voiced over with a single monotone voice. It's a horrible experience. Upon realising that we both grew up with this, that all our parents and grandparents rely on this process, and knowing what is possible in the voice research space, we decided to change that.
So that's how the idea was born: being able to produce content in any voice, in any language, with the same emotional tonality and distinctive voice characteristics as the original. To solve this problem you need to solve a lot of subsequent voice component problems, which is what we are onto now.
And I guess it does all of that without any human input, so you don't need different actors coming in, and it can save so much time and money with just this one product.
Exactly. Currently, the dubbing process costs over $100 per minute of dubbed audio once you include the studio space, transcription and translation, paying the voice actors, and then post-production. It's a pretty intense process and it takes time. We believe that in the future it will all be available at the click of a button, where you can use the original voices and automate the whole package from home. Of course, you could extend it to let others contribute their own voices and earn the money and royalties that come with that. But in general, it will be done at home.
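To put that per-minute figure in perspective, here is a quick back-of-the-envelope sketch. The 90-minute runtime and ten target languages are my own illustrative assumptions, not figures from the interview.

```python
# Back-of-the-envelope cost of traditional dubbing, using the
# $100-per-minute lower bound quoted above. The runtime and
# language count are illustrative assumptions.
COST_PER_MINUTE = 100    # USD, quoted lower bound
runtime_minutes = 90     # assumed feature-film length
target_languages = 10    # assumed localisation targets

per_language = COST_PER_MINUTE * runtime_minutes
total = per_language * target_languages

print(f"Per language: ${per_language:,}")                # $9,000
print(f"All {target_languages} languages: ${total:,}")   # $90,000
```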
Have you managed to estimate how much money it could save the film industry if they could use your product instead of having to pay for the whole dubbing procedure?
So, we know there are a number of problems within this industry. The first is that movies, and similar content such as audiobooks and podcasts, all frequently go through the same dubbing process to localise their content.
Within that, three things can happen. One, things are not dubbed at all; if you think about languages, the more niche the language, the less likely content will be dubbed into it. An example of this is that educational content across the world is not dubbed from its original language because there are no resources to do it. The second thing that happens is that things are poorly dubbed, so they are just horrible to listen to. The third is, of course, the professional use cases, where we can make the process simpler, quicker, and cheaper.
Within all of this, there will be different estimates. But in the creator space, there are around a hundred thousand creators who could take their content as it is and dub it into other languages. The cost of doing that through our work would be lower than the revenue it could generate. Of course, these numbers will be different for each vertical.
This is what we think will happen over the next year, and it's where we started with this component of our work. Before we can tackle dubbing at scale and quality, we are tackling long-form speech synthesis: your classic text-to-speech, but in our case trying to take the context of the words into account to make it sound extremely good.
After that is voice cloning, so taking a voice and replicating it for new text. Within that category, we are currently targeting publishers: we are working with independent authors, news agencies, and newsletter providers to help them voice their content. This will make it more accessible to people, encourage them to engage with the content more, and potentially enjoy it more as well.
That segues nicely into a question I wanted to ask: why have you chosen these specific industries to start with?
We started ElevenLabs at the beginning of the year [2022], and until September it was mostly focused on voice research. We were doing everything we could within this field, along with hiring talent, to position ourselves as a company with a novel approach to producing this research.
It was only after September that we started looking into how we could productionise this research and where the biggest problems were, based on the iterations we had run so far and conversations with a number of users.
One industry we frequently came across was publishing, where current solutions don't handle long-form content. You still need to go through the process of generating text-to-speech and then modifying the speech to make it sound better. That's where we came in, thinking about how we could scale that process and make it a lot more seamless.
We were tackling that work so we could produce speech directly from the text and understand what was being said, to produce the right intonation. Then within publishing, last month was focused on releasing our beta platform and working with users. We were considering news, newsletters, and authors, and we found that independent authors are where we can help the most. Readers are asking about accessibility options, but the solutions available are just so expensive. Producing a one-million-character book costs around $5,000, which is a huge cost, and we can do it at a tenth of that: $500 for the same kind of audiobook, one that sounds as good as a human narration, and you can have it read in the author's voice, which is incredible.
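For a sense of the unit economics implied by those numbers, here is a small sketch; the per-character rates are derived directly from the quoted figures, and the variable names are my own.

```python
# Unit-cost sketch derived from the figures quoted above:
# ~$5,000 for a 1,000,000-character audiobook traditionally,
# versus ~$500 with synthesis at a tenth of the price.
CHARS_PER_BOOK = 1_000_000
TRADITIONAL_COST = 5_000   # USD, quoted
SYNTHESIS_COST = 500       # USD, a tenth of the above

trad_per_1k = TRADITIONAL_COST / (CHARS_PER_BOOK / 1_000)
synth_per_1k = SYNTHESIS_COST / (CHARS_PER_BOOK / 1_000)

print(f"Traditional: ${trad_per_1k:.2f} per 1,000 characters")  # $5.00
print(f"Synthesis:   ${synth_per_1k:.2f} per 1,000 characters") # $0.50
```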
We are now working with these authors, and it's somewhere we feel a real urgency and pull: you can produce the audiobook more easily, your audience will enjoy it, and you don't have to go through those laborious editing processes or resort to poor-quality speech. You can have both: great quality and low cost.
And you mentioned there that up until September you were focusing mainly on research. Could you tell me a bit about your current approach to research and how it differs from other companies in the tech space?
In general, in the voice space there is an interesting phenomenon where most of the building blocks needed to start the work we are doing now appeared in 2021. One of the big developments across other areas of AI was the transformer model, which kick-started a slightly different way of thinking in these spaces. Therefore, at the beginning of the year, we could start fresh with a very different starting point and approach.
So, kudos to both my co-founder and the team, who are extremely talented people coming from the broader AI space and driving most of this research. In any new field, you rely on those few individuals to figure things out and make breakthroughs in the approach. That's what we spend most of our time doing: figuring out what's broken and what's different.
Because our patent process is ongoing, I'll talk about our approach at a high level. One part of it takes context into account. This problem is pretty hard in general: how can you make the model understand the text? For example, ChatGPT kind of understands what's being said, which is why the sentences it produces make sense. We took that approach and applied it to speech. We tried to make sure the model understands in the same manner, and because of this you get a lot of the intonation and emotion right.
Then there is compression: we compress very differently, around 100 times more than MP3, which allows us to encode the features that are important, along with a few other elements, mostly within speech generation.
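As a rough illustration of what compressing 100 times more than MP3 could mean for bitrate, here is a small calculation; the 128 kbps MP3 baseline is my own assumption for the arithmetic, not a figure from the interview.

```python
# Bitrate sketch for "100x more compression than MP3".
# The 128 kbps MP3 baseline is an illustrative assumption.
MP3_KBPS = 128
COMPRESSION_FACTOR = 100

compressed_kbps = MP3_KBPS / COMPRESSION_FACTOR
print(f"~{compressed_kbps:.2f} kbps")  # ~1.28 kbps

def minute_kb(kbps: float) -> float:
    """Kilobytes needed for one minute of audio at a given bitrate."""
    return kbps * 60 / 8  # kilobits/s * seconds / bits-per-byte

print(f"MP3:        ~{minute_kb(MP3_KBPS):.0f} KB per minute")        # ~960
print(f"Compressed: ~{minute_kb(compressed_kbps):.1f} KB per minute") # ~9.6
```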
Then for voice cloning, we also took another approach. The biggest result is that we can clone voices zero-shot: with just a minute of data, we can already produce a good resemblance to the original, which currently isn't possible elsewhere.
So, to summarise the output of what we've done: first, you have the context element influencing the speech and, because of the compression, making it sound better. Second is voice cloning, which relies on a lot less data for extremely high quality. Combining those two allows us to produce speech in a slightly different way: how you say things is represented in the generated speech. So, for example, if you are emotional when you speak, that will carry into the TTS; if you are sad, that will influence it too.
So it translates your emotions across into the text?
Exactly.
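To make the pieces above concrete, here is a minimal sketch of how a context-aware synthesiser and a zero-shot speaker embedding might fit together. Every name and structure here is a hypothetical illustration under my own assumptions, not ElevenLabs' actual system or API.

```python
# Hypothetical sketch of the described pipeline; not ElevenLabs' API.
from dataclasses import dataclass

@dataclass
class SpeakerEmbedding:
    """Voice identity derived from ~1 minute of reference audio."""
    vector: list[float]

def embed_speaker(reference_audio: bytes) -> SpeakerEmbedding:
    # A real system would run a speaker encoder over the audio;
    # here we return a fixed-size placeholder embedding.
    return SpeakerEmbedding(vector=[0.0] * 256)

def synthesise(text: str, speaker: SpeakerEmbedding) -> bytes:
    # 1. A context-aware encoder reads the whole text, so intonation
    #    can depend on meaning, not just the local word.
    # 2. It predicts a heavily compressed acoustic representation
    #    (the ~100x-over-MP3 codes mentioned above).
    # 3. A decoder renders that representation in the cloned voice.
    # Placeholder output: one second of 16 kHz, 16-bit silence.
    return b"\x00" * 32_000

voice = embed_speaker(b"")                  # would be ~1 min of audio
audio = synthesise("Hello, world.", voice)  # emotive speech, cloned voice
```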
Could you talk me through the process you've gone through building ElevenLabs, along with some of the problems you have faced building the company and your product?
That's a big one. It's a rollercoaster, definitely a fun one. Like an x·sin(x) curve: we feel it's going up, but you have those ups and downs all the time. As we shift things, something I considered a problem in the past I no longer consider one, because we're facing a whole new one. I feel like this will continue. It's like crossing a bridge, then crossing another one, and realising there's now a whole new element.
So, I think the biggest difficulty, and truly what we think will define the company, is this: can we deliver on the research, can we keep that different, innovative approach to how the research is produced? You can construct a business that's doing something, but if we are truly going to make all content available across languages at the same high quality, that will take a lot of innovation, and that's the biggest risk.
We have realised over time that one of the problems is that we don't have data for specific languages, because nobody ever dubbed anything into them accurately. To give you an example, take Polish. Polish is okay, but the baseline of what we can do is only as good as what was done in the past: we can learn from English-to-Polish dubs, and that's the highest quality you can achieve. But sometimes that's not going to be good enough. So how do you work through that?
That's a flavour of the research problems we are working with. But to give you something more concrete, the problem that is probably the hardest right now is hiring. In our case, we want a very research-heavy team, so that's what we need to build, and there's no easy boilerplate for building a research company. There's plenty of advice you can read on building a product-focused company or on acquiring users, but for research we are defining it as we go. We have a few ideas, but what's hard is making sure we hire the right people. Over the last few months we have spoken with so many candidates while trying to find the right fit, which is probably the hardest aspect of this continuous problem.
I do know what you mean. I think the first hires you make in any business are always going to be the most essential ones; these are the people who will carry your plan forward and eventually help you out by bringing their own ideas. So they are always very important.
100%, and I have one more. This one is also ongoing, and it ties in with both of these themes: how you keep the research focused. To solve automated dubbing, to solve content across languages, we need to continue with our research; we need to do at least a year of core research work. As we go along, there are a lot of temptations from other problems out there that we know we can fix.
For example, we talked about publishing, but there are different flavours around this field of work, such as call centres. We had a number of companies doing call centre work who wanted to fill in words to make that process more automatic. And we know we can do it, but that would take around a month or two of our time away from our main research. These trade-offs are what we grapple with all the time: how much of a trade-off is there if we solve a small problem, when we know there is a huge one on the horizon if we continue the research? Both can be defining for the company, because if we don't solve some of the problems along the way, we might not have enough cash to survive. But if we lose focus, we might be too late and another company will solve it.
So, it's like a delicate balance between stopping to solve something you know you can fix and ploughing ahead towards your final objective?
Exactly.
ElevenLabs is around a year old now, so quite young. Let's look forward into the future like we've been discussing: where do you see ElevenLabs in 5 years, and then in 10 years?
In five years, the dream would be to have produced a Hollywood-type movie. So, you would have seen our company help dub a movie from its original language, and it's as good as people can imagine, or better. What we track internally is a preference score: if you were a human and watched a movie dubbed by humans and one dubbed by AI, which one would you prefer? In five years, hopefully we have crossed that barrier and created something better than what humans can do. That would solve dubbing, in our minds. Every single movie you can watch in high quality, and every piece of educational content you can access yourself. To be able to do that, we would need to be THE voice AI research company; like OpenAI for voice, in some ways. If you were thinking about voice research companies, you would be thinking about us.
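As an aside, a preference score of the kind Mati describes is typically just the share of blind A/B judgments favouring the AI dub; here is a minimal sketch of that metric, with names and example numbers of my own invention.

```python
# Minimal sketch of a pairwise preference score: the fraction of
# blind A/B judgments in which raters preferred the AI dub over
# the human dub. The example numbers are invented for illustration.
def preference_score(votes_for_ai: int, total_votes: int) -> float:
    """Fraction of raters preferring the AI dub (0.0 to 1.0)."""
    if total_votes <= 0:
        raise ValueError("need at least one vote")
    return votes_for_ai / total_votes

score = preference_score(132, 250)  # e.g. 132 of 250 raters chose the AI dub
print(f"{score:.1%}")               # 52.8% -> above 50% favours the AI dub
```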
I wish there were an Oscar for dubbing, but there isn't one yet. That would be a pretty cool goal, though. I know there are some awards you can win, so maybe let's say winning an Oscar or an Emmy, or something like that, for voice dubbing.
Ten years is a long way down the road, but currently we are focusing a lot on static content: things you can pre-produce and then publish. I think in ten years there will be an increasing shift beyond just trying to preserve information. Today, you can go to a country and use Google Translate to say something in one language and then hear it in another.
In ten years, it's going to shift from just trying to convert content to trying to convey the emotions and intonations in the speaker's voice. That's where we come in: we come from voice research, and the translation element is already solved, so in ten years you could have a real-time conversation where you still sound like you. You could go to a country and communicate with anybody, which would remove that huge barrier. People talk about how everything became so globalised, but at the same time, so many things are still inaccessible to people who don't speak English. That's such a big limiting force across so many things you could otherwise access, and I think it will disappear. In ten years, we would like to be the company that makes that happen and creates an environment where information can truly be free.
So, you're almost talking about a technology that would destroy the language barrier between people. Anyone could pick up their device, put their headphones in, and just have a conversation, regardless of which languages they each know.
Exactly. And then maybe in like 50 years there will be a neural link, where you can implant a chip with a language and just speak it. But yes, 100%, I think it [the language barrier] is going to disappear.
I'd like to round out the interview with this question: could you recommend some research that's being done, or some recently published papers, that you think readers of this piece should be looking into?
I have a few thoughts crossing my mind. The first is expected reading for everyone working with voice AI in this space. It's called Tortoise TTS, a text-to-speech model. It's great for a number of reasons: it's probably one of the best, if not the best, at reproducing a given voice in extremely high quality. It's slow, but it's great at that. The cool thing about it is that it was created by one person [James Betker], almost independently, as a side project while he was working at Google in a different domain. He wasn't a researcher, which is just so beautiful: to produce something that people have worked on for years. It proves that if you are good at research, you can do it. You can have breakthroughs in personal projects, and operationalising them can lead to something huge. The recent breakthroughs really do depend on a few people. So, Tortoise is definitely recommended reading, at least in my mind.
A special thanks to Mati for taking part in this interview. Mati and his co-founder Piotr founded ElevenLabs at the start of 2022. They recently raised $2 million in their pre-seed funding round and opened their beta platform to everyone, which you can find here.
If you are interested in learning more about TTS, you can do that via the link here.