FlowSynth: Simplifying Complex Audio Generation Through Explorable Latent Spaces with Normalizing Flows

Philippe Esling, Naotake Masuda, Axel Chemla--Romeu-Santos

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Audio synthesizers are pervasive in modern music production. These highly complex audio generation functions offer a unique diversity through their large sets of parameters. However, this very richness can also make them hard and opaque to use, especially for non-expert users with no formal training in signal processing. We recently introduced a novel formalization of the problem of synthesizer control as learning an invertible mapping between an audio latent space, extracted from the audio signal, and a target parameter latent space, extracted from the synthesizer's presets, using normalizing flows. In addition to modeling a continuous representation that eases intuitive exploration of the synthesizer, this approach provides a novel method for audio-based parameter inference, vocal control, and macro-control learning. Here, we detail how these high-level features are integrated to develop new interaction schemes between a human user and the generative device: parameter inference from audio, and high-level preset visualization and interpolation, usable in both offline and real-time situations. Moreover, we leverage LeapMotion devices to control hundreds of parameters simply by moving one hand through space to explore the low-dimensional latent space, both empowering and facilitating the user's interaction with the synthesizer.
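To make the core idea concrete, the sketch below illustrates (in PyTorch) an invertible mapping built from normalizing-flow coupling layers between an audio latent vector and a synthesizer-parameter latent vector. This is an illustrative assumption, not the paper's architecture: the dimensions, layer counts, and the names AffineCoupling and LatentFlow are hypothetical.

```python
# Minimal sketch of an invertible latent-to-latent mapping with coupling layers.
# Not the authors' implementation; dimensions and layer counts are illustrative.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """RealNVP-style affine coupling: half the dimensions condition an
    affine transform of the other half, keeping the layer invertible."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)


class LatentFlow(nn.Module):
    """Stack of couplings with dimension flips, mapping an audio latent
    to a parameter latent; the inverse maps parameters back."""

    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def forward(self, z_audio):
        z = z_audio
        for layer in self.layers:
            z = layer(z).flip(dims=[1])  # flip so both halves get transformed
        return z

    def inverse(self, z_param):
        z = z_param
        for layer in reversed(self.layers):
            z = layer.inverse(z.flip(dims=[1]))
        return z


if __name__ == "__main__":
    flow = LatentFlow(dim=16)
    z_audio = torch.randn(8, 16)      # latent codes inferred from audio
    z_param = flow(z_audio)           # mapped into the parameter latent space
    recon = flow.inverse(z_param)     # invertibility check
    print(torch.allclose(z_audio, recon, atol=1e-5))
```

Because every layer is invertible, the same model supports both directions described in the abstract: audio-to-parameter inference (forward) and exploring the parameter latent space while recovering consistent audio-side codes (inverse).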
Keywords:
Machine Learning: general
Human-Computer Interactive Systems: general
Knowledge Representation and Reasoning: general