Saturday, December 28, 2024

Nvidia debuts AI model that can create music, mimic speech

Nvidia (NVDA) has developed a new kind of artificial intelligence model that can create sound effects, change the way a person sounds, and generate music using natural language prompts. Called Fugatto, or Foundational Generative Audio Transformer Opus 1, the model is a research project. Nvidia says it’s not announcing any plans to release the technology, but it could have broad implications for industries ranging from music and entertainment to translation services.

“The thing that’s so exciting about [Fugatto] is that having a model that you can prompt to ask it to make sounds in certain ways really opens up the landscape of things that you can imagine doing with it,” Bryan Catanzaro, vice president of applied deep learning research at Nvidia, told Yahoo Finance.

Shares of Nvidia fell 4% on the day.

What sets Fugatto apart from other models, Catanzaro explained, is that it can perform the tasks of several other models. For instance, there are models that can synthesize speech and others that can add sound effects to music; Fugatto, however, does it all. Think of it as a kind of complement to video- and image-generating models like Stability AI’s Stable Video Diffusion or OpenAI’s Sora.

“The foundational improvement here is that … we’re able to synthesize audio using language, and that, I think, opens up new prospects for tools that people can use to create amazing audio,” Catanzaro added.

According to Nvidia, Fugatto is the first foundational model with emergent properties, which means it’s able to mix the elements it’s been trained on and follow “free-form instructions.”

Nvidia CEO Jensen Huang before a baseball game between the San Francisco Giants and the Arizona Diamondbacks in San Francisco, on Sept. 3, 2024. (AP Photo/Jeff Chiu) · ASSOCIATED PRESS

The model can generate audio via standard word prompts as well as manipulate audio files that you upload. So if you have a file of a person speaking, you could translate that person’s words to another language while still making it sound like their voice. You could also take a simple tune and make it sound like an orchestral performance or add different beats to music.

You can also upload a document and have the model read it in any voice you’d like. What’s more, you can tell the model to produce voices that carry emotional weight. Want audio of a dejected English teacher reading Edgar Allan Poe? Fugatto should be able to do it.

Catanzaro, however, warns that the model isn’t always perfect and that some results are better than others.

Like generative image and video models, Fugatto raises questions about the potential impact on artists, sound engineers, and people in related fields. Catanzaro, though, says he hopes the technology helps musicians.
