How easy is it to make deepfake audio? All I needed was six minutes online
I’d assumed it would take a couple of hours to learn how to make a convincing audio deepfake. That turned out to be a large overestimate: on my first go, it took me less than ten minutes to clone somebody’s voice. And while it wasn’t perfect, the quality was good enough to fool their fiancée.
This week James Cleverly, the home secretary, became the latest politician to warn of the dangers that deepfakes — AI-generated voices, images or videos that might be mistaken for the real thing — pose to elections.
Last month, police in Hong Kong revealed perhaps the most audacious AI-enabled heist so far, in which an employee at a multinational firm was duped into sending $25 million to fraudsters who had used deepfake technology to pose as his company’s chief financial officer.
But deepfaking somebody’s voice didn’t involve negotiating the underworld of the illicit dark web or wrestling with exotic software. Instead I went to the glossy, user-friendly website of a UK-based company that has a special deal at the moment: subscribe for a month for less than £1.
Once I had logged in, it let me replicate the speech of my colleague Darryl Morris, a Times Radio presenter.
I uploaded a 30-second recording of Darryl speaking, though as little as five seconds (which is about as long as it will take you to read this sentence) could have worked. Just a couple of mouse clicks later and the AI was primed: you could type in any words you wanted, press the “generate” button, and it would read them aloud in a pretty good approximation of Darryl’s Bolton accent.
I was not asked for proof that he had given his permission. All I had to do was click a button saying I had the “necessary rights or consents” to upload the audio file and create the clone, and agree that I would not use it to commit fraud.
Altogether, it was no more complicated than putting something up for sale on eBay — and when his fiancée, Michaela, first heard the deepfake Darryl, she assumed that it was really him.
Andrea Miotti, who works for ControlAI, a non-profit campaign group, later showed me how to combine the same synthetic voice with images, to create a deepfake video.
The process was more complex and required a video of the real Darryl as well as open-source software accessed online. But instructions on how to do it are available on YouTube and, again, it took just minutes. Viewed on a small smartphone screen at low resolution, our deepfake video could plausibly have fooled a distracted viewer. “What you must realise,” Miotti said, “is that this technology is only likely to get better.”
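We have not named the software we used, but to give a sense of how little is involved: one widely used open-source lip-syncing tool is Wav2Lip, and the sketch below shows how such a tool is typically driven from Python. The tool choice, checkpoint and file names here are illustrative assumptions, not a record of our exact steps.

```python
# A sketch of lip-syncing a genuine video to a cloned voice using the
# open-source Wav2Lip tool (https://github.com/Rudrabha/Wav2Lip).
# Assumes the repository has been cloned and its dependencies installed;
# all file names below are placeholders.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained model weights
        "--face", "source_video.mp4",       # a real video of the subject
        "--audio", "cloned_voice.wav",      # the synthetic speech to lip-sync to
        "--outfile", "deepfake_video.mp4",  # the lip-synced result
    ],
    check=True,  # raise an error if the tool fails
)
```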
The privately held company behind the voice technology we used is called ElevenLabs. Based in London and founded in 2022, in January it raised $80 million from a group of prominent Silicon Valley investors in a deal that made it one of the UK’s few technology “unicorns” — the name given to start-ups valued at $1 billion or more.
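The point-and-click flow I used in the browser is also exposed programmatically. The sketch below follows ElevenLabs’ publicly documented REST API: upload a short sample to create a voice, then generate speech as that voice. The exact parameters may change over time, and the API key, file names and text here are placeholders for illustration.

```python
# A sketch of instant voice cloning via ElevenLabs' public REST API.
# Assumes an API key is set in the ELEVENLABS_API_KEY environment variable;
# endpoint paths follow the published documentation, but treat details as
# illustrative rather than definitive.
import os
import requests

API = "https://api.elevenlabs.io/v1"
HEADERS = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}

# Step 1: upload a short recording to create a cloned voice.
with open("sample_30s.mp3", "rb") as f:
    resp = requests.post(
        f"{API}/voices/add",
        headers=HEADERS,
        data={"name": "cloned-voice"},  # a label for the new voice
        files={"files": f},             # the audio sample to clone from
    )
resp.raise_for_status()
voice_id = resp.json()["voice_id"]

# Step 2: type any words you want and generate speech in that voice.
resp = requests.post(
    f"{API}/text-to-speech/{voice_id}",
    headers=HEADERS,
    json={"text": "Hello, this is a cloned voice."},
)
resp.raise_for_status()

# The response body is the generated audio (MP3 bytes).
with open("output.mp3", "wb") as out:
    out.write(resp.content)
```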
Mati Staniszewski, the ElevenLabs chief executive and co-founder, concedes that the technology has been misused. Trolls have used it to create deepfake audio of the actress Emma Watson reading Mein Kampf and of Sir David Attenborough making racist comments.
A reporter from The Wall Street Journal was able to use an ElevenLabs deepfake of her own voice to fool an automated voice recognition system that her bank used to verify her identity over the phone.
In America the FBI has warned about blackmailers using AI-generated images and videos showing ordinary people in compromising situations to extort money. Onfido, a London-based technology company, has estimated that deepfake fraud attempts increased 30-fold last year.
But Staniszewski said he was confident that his company was doing more good than harm. His customers use its tools to add voiceovers to legitimate social media content, to produce audiobooks and to dub films, he said. It also offers a voice translation service, which can convert spoken content into another language in minutes while preserving the voice of the original speaker.
ElevenLabs has safeguards, he added. Attempts to clone leading politicians’ voices should be automatically blocked — it prevented an attempt to clone Sir Keir Starmer’s voice when we tried — and you can upload an audio clip to the ElevenLabs site to check if that is where it came from. Users must provide credit card details, which could be used to trace criminal behaviour.
Free software that can be used to produce deepfakes, and that does not include these kinds of measures, is also available online, Staniszewski said.
But others still fear that any gains delivered by this kind of technology are eclipsed by the potential downsides.
ControlAI recently ran an experiment in which 12 members of the public were given a log-in to ElevenLabs and asked to clone their own voices. None of them had trouble using the site.
They then used the fake voices to make telephone calls to loved ones. One participant called a friend and had her deepfaked voice explain that she had been hit by an unexpected tax bill that had drained her bank account. Could the friend send some money? Yes, of course, the friend told the AI.
In another example, deepfake audio was used to ask a father — who assumed he was speaking to his son — to leave a window open at home, so the son could climb in because he’d forgotten his keys.
As these sorts of stories circulate, some families are agreeing on secret codewords which they will use if they ever have to make an emergency phone call, to make sure it isn’t an AI on the line.
Meanwhile, Staniszewski said that he would welcome “more guidance and regulation, especially if it’s thoughtful regulation” on AI, to set limits for technology companies and to keep people safe.
He also called for more public education on how sophisticated these tools are becoming.
But Miotti believed that tougher measures were needed and that companies that allowed deepfakes to be made and shared should be held responsible for any damage they cause. “To stop the harm from deepfakes, legal liability must cover production and dissemination,” he said.
“We have to hold to account the companies that produce deepfake technology, create and enable deepfake content, and that allow it to spread.”