How accurately can humans detect current state-of-the-art voice synthesis models?

Highlights and results from using RealTalk to build a fake audio test

A week ago, The New York Times released their documentary on our deepfake detection and creation research. This means we can now share results from our audio synthesis work as well.

Since releasing our RealTalk model at Dessa, there have been over 350k responses to our fake audio test. The fake audio game was developed alongside our RealTalk model to show how difficult it is to tell real audio from fake at the technology's current state.

(Edit note: a previous version of this piece called the game an "audio Turing test." As Scott Niekum points out, classifying this as a pure Turing test is likely a stretch. Whereas a Turing test requires dialogue between the interrogator and the other party, the game we created simply asks listeners to hear audio recordings and guess whether they are real or fake, an important distinction.)

This project was done out of curiosity, so the rigour may not be perfect, but I hope what I outline below gives an early glimpse of how we may deal with trusting audio in a future of synthetic voices.

Methodology for fake audio test:

  • 8 questions, each with a 4–8 second audio clip of Joe Rogan talking, where the listener was asked to determine whether the clip was real or fake

  • Randomized: the question order was shuffled on every page load. This let us aggregate results by question position across the 8-question test and measure whether listeners learned to answer more accurately as they progressed

  • Immediate feedback: after each response, the listener was told whether they were correct. This was done to understand how people learn to discern real from fake audio over time

  • There were 4 audio clips that were real and 4 that were fake (as far as we're aware, Joe Rogan never said the fake sentences)

  • For the 4 fake audio clips, we wrote the text ourselves, aiming to match his usual subject matter
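The quiz mechanics described above can be sketched in a few lines. This is a hypothetical reconstruction, not our production code: the clip names and the `guess_fn` callback standing in for the listener are assumptions.

```python
import random

# Hypothetical reconstruction of the quiz flow: 8 clips (4 real, 4 fake),
# re-shuffled on every page load, with immediate feedback after each guess.
# Clip names and the guess_fn callback are illustrative assumptions.
CLIPS = [(f"real_clip_{i}", True) for i in range(4)] + \
        [(f"fake_clip_{i}", False) for i in range(4)]

def run_session(guess_fn):
    """Present the 8 clips in a fresh random order; return per-position results."""
    order = random.sample(CLIPS, len(CLIPS))          # fresh shuffle per page load
    results = []
    for position, (clip_id, is_real) in enumerate(order, start=1):
        guess = guess_fn(clip_id)                     # True = "real", False = "fake"
        correct = guess == is_real                    # shown to the listener immediately
        results.append((position, correct))
    return results
```

Because exactly half the clips are real, a listener who always guesses "real" (or always "fake") scores exactly 50%, which is why 55% on the first question reads as close to chance.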

So, what did we see in the results?

Concern: Initial accuracy is low

1) Accuracy on the first question was 55%, meaning initial listeners performed close to chance (50%). This matters because it suggests that listeners who encounter synthetic audio in the real world, outside of a test context, will likely have trouble discerning real from fake.

Optimism: Accuracy can be learned, but plateaus

2) In aggregate, listeners guessed correctly ~68.2% of the time, with accuracy rising from 55% on the first question answered to 72.6% on the 8th. Most of the improvement came within the first 4 questions; the learning curve flattened noticeably from questions 4 to 8.

Accuracy improvement, q1–4: +12.9 percentage points

Accuracy improvement, q5–8: +4.7 percentage points
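For illustration, per-position accuracies and improvement deltas like those above can be computed from a pooled response log. The log shape here, `(question_position, was_correct)` pairs across all sessions, is an assumption about the data, not a description of our actual pipeline.

```python
from collections import defaultdict

def accuracy_by_position(responses):
    """Return {position: accuracy} from an iterable of (position, was_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for position, was_correct in responses:
        total[position] += 1
        correct[position] += int(was_correct)
    return {p: correct[p] / total[p] for p in sorted(total)}

def improvement(acc, start, end):
    """Accuracy change, in percentage points, between two question positions."""
    return 100 * (acc[end] - acc[start])
```

Reporting the deltas in percentage points (rather than relative percent change) matches how the q1 (55%) and q8 (72.6%) figures are stated.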

Skepticism: trust of real audio

3) Across the test, listeners scored 68% accuracy on real audio clips, growing to 74.1% accuracy on real audio by the 8th question. One interpretation of this result is that, given the possibility of fakes, listeners may lose trust in real audio. The pattern could also be read as listeners becoming cautious or skeptical of all audio as they go through the test.

Audio in Summary

Total questions answered: 350k

Total fake audio clip accuracy: 66%

Total real audio clip accuracy: 68%

The closeness of these two averages may also indicate how good the current technology is at creating synthetic voices.

Fake audio clip that had highest accuracy (81%):

Fake audio clip that had lowest accuracy (56%):
