text-to-audio
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
Generate audiobooks from EPUBs, PDFs and text with synchronized captions.
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
A web UI for various audio-related neural networks
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
A family of diffusion models for text-to-audio generation.
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
[NeurIPS 2025] PyTorch implementation of ThinkSound, a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching
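The flow-matching objective that TangoFlux-style text-to-audio models train with can be illustrated in a few lines: sample a point on the straight path between noise and data, and regress a velocity field toward the constant target v = x1 - x0. This is a minimal NumPy sketch of that objective on a toy linear model, not TangoFlux's actual architecture; all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy "noise" and "data" points standing in for latent audio features.
x0 = rng.normal(size=dim)          # noise sample
x1 = rng.normal(size=dim) + 3.0    # data sample
t = rng.uniform()                  # random time on the path

def target_velocity(x0, x1):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return x1 - x0

# Toy linear velocity model v_theta(x_t, t) = W @ [x_t; t],
# a stand-in for the transformer a real model would use.
W = rng.normal(scale=0.1, size=(dim, dim + 1))

def fm_loss(W, x0, x1, t):
    """Conditional flow-matching loss at one (x0, x1, t) triple."""
    x_t = (1 - t) * x0 + t * x1
    inp = np.append(x_t, t)
    pred = W @ inp
    return np.mean((pred - target_velocity(x0, x1)) ** 2)

loss_before = fm_loss(W, x0, x1, t)

# A few plain gradient steps on this single sample.
lr = 0.05
for _ in range(200):
    x_t = (1 - t) * x0 + t * x1
    inp = np.append(x_t, t)
    residual = W @ inp - target_velocity(x0, x1)
    W -= lr * 2 * np.outer(residual, inp) / dim

loss_after = fm_loss(W, x0, x1, t)
```

At inference time, models trained this way integrate the learned velocity field from noise toward data in a handful of ODE steps, which is where the "super fast" generation comes from.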
PyTorch Implementation of Make-An-Audio (ICML'23) with a Text-to-Audio Generative Model
OpenMusic: SOTA Text-to-music (TTM) Generation
Implementation of NÜWA, a state-of-the-art attention network for text-to-video synthesis, in PyTorch
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Mustango: Toward Controllable Text-to-Music Generation
High-quality Text-to-Audio Generation with Efficient Diffusion Transformer
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Official codes and models of the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"
Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.
Subtitle to audio: generate audio from any subtitle file using Coqui-ai TTS, synchronizing the audio timing to the subtitle timestamps.
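The timing part of a subtitle-to-audio pipeline reduces to parsing cue timestamps and scheduling each synthesized clip at its cue's start. This is a stdlib-only sketch of that step; the synthesis itself (Coqui-ai TTS in the project above) is omitted, and the SRT parser handles only well-formed input. All names are illustrative.

```python
import re
from datetime import timedelta

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_time(stamp):
    """'00:01:02,500' -> timedelta of 1 min 2.5 s."""
    h, m, s, ms = map(int, SRT_TIME.match(stamp).groups())
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

def parse_srt(text):
    """Return (start, end, text) cues from SRT-formatted text."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        # lines[0] is the cue index, lines[1] the timing line
        start, end = (parse_time(t.strip()) for t in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
General Kenobi."""

cues = parse_srt(sample)
# Offsets (in seconds) at which each synthesized clip would be placed
# on the output timeline.
schedule = [(start.total_seconds(), text) for start, _, text in cues]
```

A real pipeline would then synthesize each cue's text, time-stretch or pad the clip to fit its `end - start` window, and mix the clips onto a silent track at these offsets.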
PyTorch implementation of SoundCTM