Gen4AVC Workshop | ICCV 2025

Workshop Overview

Seamless integration of audio and visual elements is crucial for creating immersive and engaging content. Audio-visual generation, the synthesis of one modality from the other or of both jointly, has become a key research area. Powered by advanced generative models, it holds significant potential to improve the quality and realism of multimedia in applications such as virtual reality, gaming, film production, and interactive media.

This workshop highlights the growing importance of audio-visual generation in modern content creation, bringing together researchers and practitioners from academia and industry to explore the latest advances, challenges, and emerging opportunities in this dynamic field.

Topics Include:

Vision-to-audio synthesis
Audio-to-vision synthesis
Joint generation of audio and video

Schedule

Morning Session, October 19th, 2025

Note: All times are in Hawaii Standard Time (HST).

8:55 - 9:00

Opening Remarks

Welcome and introduction to the workshop

9:00 - 9:30

Invited Talk 1: Danilo Comminiello

Sapienza University of Rome

"Weaving Time, Space & Semantics: Multimodal Alignment for Audio-Visual Generation"

Slides

9:30 - 10:00

Invited Talk 2: Andrew Owens

Cornell Tech

"Generating Sounds from Physical Interactions in 3D Scenes"

Slides

10:00 - 10:15

Coffee Break & Poster Setup

Authors set up posters

10:15 - 11:00

Poster Session (Exhibit Hall II)

Poster presentations - see poster list below

11:00 - 11:30

Invited Talk 3: Gunhee Kim

Seoul National University

"ViSAGe: Towards Scene-Aware Video-to-Spatial Audio Generation"

Slides

11:30 - 12:00

Invited Talk 4: Kristen Grauman

University of Texas at Austin

"Discovering and Generating Action Sounds from Video"

Slides

Poster Presentations (Venue: Exhibit Hall II)

Regular Posters

"LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters"
Poster #22
Authors: Haomin Zhang, Kristin Qi, Shuxin Yang, Zihao Chen, Chaofan Ding, and Xinhan Di

"Do State-of-the-art Audio-visual VLMs Understand Audio-video Temporal Misalignment"
Poster #23
Authors: Motonobu Kimura, Ren Ohkubo, Yue Qiu, and Yutaka Satoh

"Seeing What You Say: Expressive Image Generation from Speech"
Poster #24
Authors: Jiyoung Lee, Song Park, Sanghyuk Chun, and Soo-Whan Chung

"KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation"
Poster #25
Authors: Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, and Emad Barsoum

"Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model"
Poster #26
Authors: Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, and Kyungdon Joo

"High-Fidelity Talking Portrait Synthesis with Personalized 3D Generative Prior"
Poster #27
Authors: Jaehoon Ko, Kyusun Cho, JoungBin Lee, Heeji Yoon, and Seungryong Kim

"Dance Video Generation using Music-to-Pose Encoder Trained on Synthetic Dataset Generation Pipeline leveraging Latent Diffusion Framework"
Poster #28
Author: Nokap Tony Park

"Differentiable Room Acoustic Rendering with Multi-View Vision Priors"
Poster #29
Authors: Derong Jin and Ruohan Gao

"SpecMaskFoley: Efficient Yet Effective Synchronized Video-to-audio Synthesis via Pretraining and ControlNet"
Poster #30
Authors: Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, and Yuki Mitsufuji

"JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version I"
Poster #31
Authors: Xinhan Di and Kristin Qi

Invited Posters

"TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models"
Poster #32
Authors: Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, and Yuki Mitsufuji
ICCV 2025

"TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis"
Poster #33
Authors: Tri Ton, Ji Woo Hong, and Chang D. Yoo
ICCV 2025

"How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes"
Poster #34
Authors: Mahnoor Fatima Saad and Ziad Al-Halah
ICCV 2025

Call for Papers (Regular Track)

Overview

We welcome submissions on (but not limited to) the following topics:

Vision-to-audio synthesis
Audio-to-vision synthesis
Joint generation of audio and video
Cross-modal representation learning
Evaluation of audio-visual alignment
Datasets for audio-visual generation
Applications of audio-visual generation models

Submission Guidelines

Papers are limited to four pages in the ICCV style, excluding references and appendices.

Submissions are closed.

Important Dates

Paper Submission Deadline: July 9, 2025 23:59 AoE (Anywhere on Earth)
Decision Notification: August 8, 2025
Camera Ready Deadline: August 22, 2025 23:59 AoE (Anywhere on Earth)
Workshop Date: October 19th, 2025 (morning session)

Review Process

All submissions will undergo a double-blind review process. Please ensure that your submission does not contain any identifying information about the authors.

Publication

The workshop is non-archival. Authors of accepted papers retain full copyright of their work and are free to submit extended versions to conferences or journals.