
HiWave: Innovation In Wavelet Diffusion Generation For 4K Images Without Additional Learning
3 main points
✔️ HiWave generates 4K images from a pre-trained diffusion model with no additional training
✔️ Combines patch-wise DDIM inversion with wavelet-based frequency separation to balance global structure and local detail
✔️ In user studies, HiWave was rated above existing methods, producing high-quality images with fewer duplicated objects and structural breakdowns
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
written by Tobias Vontobel, Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
(Submitted on 25 Jun 2025)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes HiWave, a method that uses a pre-trained diffusion model to generate ultra-high resolution (e.g., 4096 x 4096) images without requiring additional training or architectural modifications.
While existing patch-based methods can enhance local detail, they are prone to breaking down the global structure and to introducing seam and duplication artifacts.
HiWave first generates a base image at low resolution, then upscales it to the target resolution and applies DDIM inversion to each patch to estimate an initial noise consistent with the image. Furthermore, in the frequency domain, the low-frequency components are constrained to preserve structure, while the high-frequency components are guided to add fine detail.
In human evaluation experiments, the proposed method was rated higher in quality than conventional methods, making it a notable new approach to high-resolution image synthesis.
Proposed Method
HiWave consists of a three-step process: "base image generation," "patch-wise DDIM inversion," and "wavelet-based detail enhancement."
First, a 1024 x 1024 base image is generated using a pre-trained diffusion model (e.g., Stable Diffusion XL), which is then enlarged by interpolation to 4096 x 4096 in image space.
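The upscaled image is then processed patch by patch. A minimal sketch of the tiling step with overlapping windows (the 1024-pixel patch size and 256-pixel overlap here are illustrative choices, not necessarily the paper's exact settings):

```python
import numpy as np

def tile_into_patches(image, patch=1024, overlap=256):
    """Split an H x W x C image into overlapping square patches.

    Returns the patches plus their top-left coordinates so the denoised
    patches can later be blended back into the full image.
    """
    h, w, _ = image.shape
    stride = patch - overlap
    coords, patches = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            coords.append((y, x))
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches), coords

# 4096 x 4096 upscaled base image (a blank stand-in here)
img = np.zeros((4096, 4096, 3), dtype=np.uint8)
patches, coords = tile_into_patches(img)
print(patches.shape)  # (25, 1024, 1024, 3)
```

With these settings a 4096 x 4096 image yields a 5 x 5 grid of patches; the overlap regions are what allow seam-free blending after per-patch denoising.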
Next, the enlarged image is divided into overlapping patches, and DDIM inversion is performed on each patch to obtain initial noise that reflects the structure of the original image.
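The inversion relies on the deterministic (eta = 0) DDIM update being exactly reversible. A minimal numpy sketch of one inversion step, with a frozen noise prediction standing in for the pretrained diffusion model's output:

```python
import numpy as np

def ddim_invert_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One DDIM inversion step: map x_t toward a noisier x_{t+1}.

    eps is the model's noise prediction for x_t. Because the eta = 0
    DDIM update is deterministic, running it backwards recovers the
    initial noise that would have produced the (upscaled) patch.
    """
    # Predicted clean sample from the current noisy sample.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise the prediction toward the next (noisier) timestep.
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1.0 - alpha_bar_next) * eps

# Toy check: with a fixed noise prediction, inverting and then taking the
# forward DDIM step (same update with the alphas swapped) round-trips exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
a_t, a_next = 0.9, 0.7
x_noisier = ddim_invert_step(x, eps, a_t, a_next)
x_back = ddim_invert_step(x_noisier, eps, a_next, a_t)
print(np.allclose(x_back, x))  # True
```

In the actual method, eps would come from the pretrained model (e.g., SDXL's UNet) evaluated on each patch at every timestep of the inversion schedule.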
Finally, a discrete wavelet transform (DWT) is used to constrain the low-frequency components to preserve structure, while the high-frequency components are enhanced via classifier-free guidance (CFG) to add detail.
By applying different guidance to each frequency band, the method achieves both global image consistency and rich detail.
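The band-selective step can be sketched with a single-level Haar DWT: keep the low-frequency (LL) band from the structure-preserving (unconditional) prediction and take the three high-frequency bands from the CFG-guided prediction. This is a simplified numpy sketch; function names are illustrative, and the actual method applies the split to the model's noise predictions during sampling:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D array with even side lengths."""
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)   # row-wise average
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)   # row-wise difference
    ll = (lo[0::2] + lo[1::2]) / np.sqrt(2)
    lh = (lo[0::2] - lo[1::2]) / np.sqrt(2)
    hl = (hi[0::2] + hi[1::2]) / np.sqrt(2)
    hh = (hi[0::2] - hi[1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    lo = np.empty((ll.shape[0] * 2, ll.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2], lo[1::2] = (ll + lh) / np.sqrt(2), (ll - lh) / np.sqrt(2)
    hi[0::2], hi[1::2] = (hl + hh) / np.sqrt(2), (hl - hh) / np.sqrt(2)
    x = np.empty((lo.shape[0], lo.shape[1] * 2))
    x[:, 0::2], x[:, 1::2] = (lo + hi) / np.sqrt(2), (lo - hi) / np.sqrt(2)
    return x

def frequency_split_guidance(eps_uncond, eps_guided):
    """Low band from the unconditional prediction (global structure),
    high bands from the CFG-guided prediction (fine detail)."""
    ll_u, _, _, _ = haar_dwt2(eps_uncond)
    _, lh_g, hl_g, hh_g = haar_dwt2(eps_guided)
    return haar_idwt2(ll_u, lh_g, hl_g, hh_g)

rng = np.random.default_rng(0)
u, g = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
out = frequency_split_guidance(u, g)
# Sanity check via perfect reconstruction: identical inputs round-trip exactly.
print(np.allclose(frequency_split_guidance(u, u), u))  # True
```

In practice, a library such as PyWavelets could replace the hand-rolled transform; the key idea is only that guidance strength differs between the LL band and the detail bands.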
Experiments
To validate HiWave's effectiveness, the paper compares it against Pixelsmith (patch-based) and HiDiffusion (direct inference).
For the evaluation, 1,000 prompts were randomly selected from the LAION2B-en-aesthetic dataset, and each method generated 4096 x 4096 resolution images.
Visual comparisons showed that HiDiffusion produced structural breakdowns and blurry textures, while Pixelsmith tended to produce duplicated objects.
In contrast, HiWave maintained structural integrity with high detail fidelity and produced noticeably fewer artifacts. In user A/B tests, HiWave was preferred over the alternatives in 81.2% of all cases.
These results confirm, both quantitatively and qualitatively, the high quality and natural high-resolution generation performance of the proposed method.