Oxford University researchers have created a new artificial intelligence system that can make videos from nothing more than still images of people and audio clips. The Speech2Vid method can render footage in real time, making talking faces say whatever the user wants.
Joon Son Chung is the lead author of the paper ‘You Said That?’, written with two colleagues from the Visual Geometry Group in Oxford’s Department of Engineering Science.
The researchers hope the new system will find practical applications in dubbing, since it could make it much easier to dub videos into several different languages. They hope the AI tool will one day be as widely available as today’s top video editing software suites.
How does the Speech2Vid model work?
The name says it all: the model turns speech clips into video, with the heavy lifting happening in the background. Besides the audio clip, it needs only a visual input, in any form, to start creating new footage.
This input can come from a still image, let’s say, a picture of someone, or a video. The Speech2Vid model can take either the photograph or all the frames that compose a video as the visual source for its new creation.
Then, using facial recognition algorithms, the AI system matches the face in the image to the content of the audio clip, animating the mouth so that it follows the speech. The result is a talking-face video similar to those produced by the popular mobile app Talking Tom.
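To make the data flow concrete, here is a minimal Python sketch of how a speech clip might be cut into one audio window per output video frame and paired with a single still image. It is an illustration only, not the researchers’ implementation: the sample rate, frame rate and window length are assumed values, and `frame_model` is a hypothetical stand-in for the trained network that would actually draw each frame.

```python
from typing import Callable, List, Sequence, Tuple


def audio_windows_for_frames(
    num_samples: int,
    sample_rate: int = 16000,       # assumed audio sample rate in Hz
    fps: int = 25,                  # assumed output video frame rate
    window_seconds: float = 0.35,   # assumed audio context per frame
) -> List[Tuple[int, int]]:
    """Return one (start, end) sample range per output video frame.

    Each window is centred on the middle of its frame and clipped to the
    bounds of the clip, so windows near the edges are shorter.
    """
    if num_samples <= 0:
        return []
    half_window = round(window_seconds * sample_rate / 2)
    num_frames = num_samples * fps // sample_rate
    windows = []
    for frame_index in range(num_frames):
        # Sample index at the midpoint of this frame.
        centre = (2 * frame_index + 1) * sample_rate // (2 * fps)
        start = max(0, centre - half_window)
        end = min(num_samples, centre + half_window)
        windows.append((start, end))
    return windows


def generate_talking_face(
    still_image: object,
    audio_samples: Sequence[float],
    frame_model: Callable[[object, Sequence[float]], object],
) -> list:
    """Produce one frame per audio window from a single still image.

    frame_model is hypothetical: it represents whatever trained model
    maps (face image, short audio snippet) to a new video frame.
    """
    windows = audio_windows_for_frames(len(audio_samples))
    return [frame_model(still_image, audio_samples[s:e]) for s, e in windows]
```

With these assumed settings, a one-second clip yields 25 frames, each driven by roughly a third of a second of surrounding audio. The same loop would work with a video as the visual source by passing a different reference frame on each step.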
“This work shows that there is promise in generating video data straight from an audio source. We have also shown that re-dubbing videos from a different audio source (independent of the original speaker) is possible.”
New technologies like these could have serious implications
Oxford’s team of researchers say they have high hopes for the technology, which, working on the same principles, could be used to dub entire computer-animated films.
However, advances in machine learning models will soon make it possible to create realistic-looking videos of people, rather than the puppet-like mouths the system shows today.
The Speech2Vid system could then pose threats similar to those of Adobe’s VoCo, which lets users ‘Photoshop’ audio, changing how people say things and even making them utter words they never said.
Chung’s model takes things a step further: not only would the utterances be realistic, but so would the footage showing the person saying things they never said. Video evidence could become unreliable in court, for example, if a real video could no longer be distinguished from a doctored one.
Source: Cornell University Library