
Enhanced Diffusion Models Utilizing Constraints Of 3D Perspective Geometry

3 main points
✔️ Introduces a new geometric constraint into the training process of latent diffusion models to improve perspective accuracy.
✔️ Shows that images from models trained with this constraint look more realistic 69.6% of the time than images from models trained without it.
✔️ Demonstrates that downstream tasks that benefit from more geometrically accurate input (e.g., monocular depth estimation) improve by up to 7.03% in RMSE and 19.3% in SqRel.

Enhancing Diffusion Models with 3D Perspective Geometry Constraints
written by Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi
(Submitted on 1 Dec 2023)
Comments: Project Webpage: this http URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Perspective is well studied in art, but recent high-quality image generation methods lack perspective accuracy. The paper introduces a geometric constraint into the training process of a generative model to improve perspective accuracy. This produces more realistic images and also improves the performance of depth estimation models trained on those images.


Recent image generation technologies have opened up creative uses of text-to-image synthesis. While these models can generate paintings and photographs from a wide variety of text prompts, they are limited in their ability to respect physical constraints. Perspective geometry has long been emphasized in hand-drawn art, and it matters just as much for the photorealism of generated images. Because latent diffusion models have no built-in physical constraints, the paper introduces a new loss function that improves the physical accuracy and photorealism of the generated images. Perspective accuracy strongly affects the consistency and realism of a scene, and a model trained with the proposed perspective loss produces more realistic images than the unmodified model. Images generated with the new loss also benefit downstream tasks, improving the performance of models trained on them.

Related Research

Synthetic Image Generation

Image generation is a challenging task because the image space is high-dimensional and diverse. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are common methods; GANs can generate high-quality images but are difficult to train and prone to mode collapse. Diffusion models have recently gained attention and produce high-quality images by learning to reverse a diffusion process. Text guidance has further improved this reverse process. However, many diffusion models rely on prior distributions and text encoders, which do not guarantee physical accuracy, so this study adds 3D geometry constraints to image generation to improve quality.

A related task is edge-to-image synthesis, in which the diffusion model is conditioned on both text prompts and edge maps. In contrast, this study focuses on generating perspective-accurate images without access to edge maps, aiming for high accuracy from general and minimal inputs.

Vanishing Point in Computer Vision

Vanishing points are widely used in computer vision and play an important role in camera calibration, scene understanding, synthetic scene generation, and SLAM. Perspective is also used in computational photography to edit focal length and camera position and to reduce distortion in wide-angle images. This study builds on these ideas to improve the photorealism of image generators and their usefulness for downstream tasks.

Monocular depth estimation

Monocular depth estimation typically requires paired image-depth data, and architectures ranging from Markov random fields and convolutional neural fields to transformers have been employed from early research to the present. Training data for supervised models is difficult to collect, so synthetic datasets are often used, but this introduces a Sim2Real gap, which several methods have attempted to address. Because the data format is the same, methods developed for monocular depth estimation can also be applied to depth completion tasks.

Background on Perspective

Linear Perspective

Perspective is particularly important in art and photography and refers to techniques for accurately rendering 3D objects and scenes on a 2D surface. Linear perspective is the most common of these and relies on the property that lines that are parallel in 3D space converge to a single point, the vanishing point, on the image plane. Typically, a drawing or image has one to three vanishing points, which determine its style and view. The horizon is a horizontal line at the height of the observer's eye, and usually at least one vanishing point lies on this line. These principles are illustrated in Figure 2.
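The convergence of parallel lines can be checked numerically with a pinhole camera model. The following is a minimal sketch and is not taken from the paper: the intrinsic matrix `K`, the direction `d`, and the start points are hypothetical values chosen for illustration. It projects points far along two parallel 3D lines and compares them with the projection of the shared direction, which is the vanishing point.

```python
import numpy as np

def project(K, X):
    """Project 3D points X (N, 3), given in camera coordinates, to pixels (N, 2)."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]

# Hypothetical intrinsics: focal length 500 px, principal point (256, 256).
K = np.array([[500.0, 0.0, 256.0],
              [0.0, 500.0, 256.0],
              [0.0, 0.0, 1.0]])

# Shared 3D direction of the parallel lines (needs a nonzero z component).
d = np.array([1.0, 0.0, 2.0])

# The vanishing point is the projection of the direction itself.
v = K @ d
vanishing_point = v[:2] / v[2]

# Two parallel lines that start at different 3D points.
starts = np.array([[-1.0, 1.0, 2.0],
                   [2.0, -0.5, 3.0]])
far_points = starts + 1e6 * d          # points very far along each line
projected_far = project(K, far_points)

print("vanishing point:", vanishing_point)
print("far points project to:", projected_far)
```

Both lines start at different places, yet their far ends land on nearly the same pixel, which is why a single vanishing point summarizes a whole family of parallel lines.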

Perspective Consistency in Images

The perspective of an image can be verified using the definition of a vanishing point: it is the point on the image plane where lines that are parallel in 3D space intersect. For an image that contains a set of such parallel lines, perspective consistency can be checked by extending the lines and confirming that every pair intersects at the same point.
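The pairwise-intersection check described above can be sketched with homogeneous coordinates. This is an illustrative sketch under stated assumptions, not the authors' procedure: the function names and the "spread" score (largest distance of any pairwise intersection from their mean) are hypothetical choices, and the segments are made-up pixel coordinates.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points p and q."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they do not meet in the image plane."""
    x = np.cross(l1, l2)
    if abs(x[2]) < 1e-9:
        return None
    return x[:2] / x[2]

def vanishing_point_spread(segments):
    """Largest distance of any pairwise intersection from the mean intersection."""
    lines = [line_through(p, q) for p, q in segments]
    points = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            x = intersection(lines[i], lines[j])
            if x is not None:
                points.append(x)
    if not points:
        raise ValueError("no pair of lines intersects")
    points = np.array(points)
    center = points.mean(axis=0)
    return float(np.linalg.norm(points - center, axis=1).max())

# Three segments whose extensions all pass through (300, 200).
consistent = [((0, 0), (150, 100)),
              ((0, 400), (150, 300)),
              ((600, 100), (450, 150))]
# Same segments, but the third one is bent away from the common point.
inconsistent = consistent[:2] + [((600, 100), (450, 180))]

print(vanishing_point_spread(consistent))
print(vanishing_point_spread(inconsistent))
```

A spread near zero means all extended lines agree on one vanishing point; a large spread signals the kind of inconsistency that generated images can exhibit.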

・Natural image

Under the perspective projection of a pinhole camera, every set of lines that are parallel in 3D space (and not parallel to the image plane) converges to a common vanishing point.

・Synthetic image

Unlike natural images, synthetic images generated by deep learning may ignore perspective and other physical properties. This is because the model's loss function focuses primarily on image quality and agreement with the prompt; an example is shown in Figure 1(a).

Improving the perspective accuracy of generated images

To improve the perspective accuracy of generated images, a latent diffusion model is fine-tuned using code from [Rombach et al. 2022b] and [Pinkney 2022]. Fine-tuning uses the conventional loss function with a new term added, together with a dataset that provides ground-truth vanishing points.

The latent diffusion model performs the forward and reverse diffusion processes in a latent space. An encoder and a decoder handle the transformation to and from this latent space. On top of the standard training loss, a perspective loss term is added to impose a perspective prior.
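For reference, the standard latent diffusion objective can be written as below. The notation (encoder $\mathcal{E}$, cumulative noise schedule $\bar{\alpha}_t$, denoiser $\epsilon_\theta$, conditioning $c$) is the usual one from the diffusion literature and is an assumption here, since the article itself does not give symbols.

```latex
z_0 = \mathcal{E}(x), \qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]
```

The first line is the forward (noising) process applied to the encoded image; the second is the denoising loss that the network minimizes during training.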

At a high level, the perspective loss works by sweeping a line outward from the vanishing point of the image and summing the image gradients across that line, as shown in Figure 3. The pseudocode for this algorithm is shown in the next figure.

The new loss function measures how "edge-like" the regions along these lines are. It is introduced as the perspective loss and helps improve the quality of image reconstruction. The loss is based on the set of vanishing points in the image and is computed at each randomly selected iteration. It is implemented in PyTorch and is end-to-end differentiable.
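The sweeping idea can be illustrated with a small NumPy sketch. This is not the authors' implementation (theirs is in PyTorch and differentiable); it only shows one way to measure "edge-likeness" along rays from a vanishing point. The function names, the nearest-pixel sampling, and the choice to compare a predicted image against a target image with a squared difference are all assumptions made for illustration.

```python
import numpy as np

def edge_profile(image, vp, n_angles=90, n_samples=64):
    """For each ray leaving `vp`, sum the absolute image gradient taken across the ray."""
    h, w = image.shape
    gy, gx = np.gradient(image.astype(float))   # derivatives along rows (y) and columns (x)
    ts = np.linspace(0.0, np.hypot(h, w), n_samples)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    profile = np.zeros(n_angles)
    for k, angle in enumerate(angles):
        dx, dy = np.cos(angle), np.sin(angle)
        # Nearest-pixel samples along the ray, keeping only those inside the image.
        xs = np.rint(vp[0] + ts * dx).astype(int)
        ys = np.rint(vp[1] + ts * dy).astype(int)
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        xs, ys = xs[inside], ys[inside]
        # Gradient component along the ray normal (-dy, dx): large where the ray follows an edge.
        across = gx[ys, xs] * (-dy) + gy[ys, xs] * dx
        profile[k] = np.abs(across).sum()
    return profile

def perspective_loss_sketch(pred, target, vp):
    """Assumed comparison: mean squared difference between the two edge profiles."""
    diff = edge_profile(pred, vp) - edge_profile(target, vp)
    return float(np.mean(diff ** 2))

# Toy example: a vertical step edge passing next to the vanishing point.
target = np.zeros((64, 64))
target[:, 33:] = 1.0
flat = np.zeros((64, 64))
vp = (32.0, 32.0)
print(perspective_loss_sketch(target, target, vp))
print(perspective_loss_sketch(flat, target, vp))
```

Rays that run along an edge collect large across-ray gradients, so an image whose edges line up with the vanishing point has a distinctive profile, while an image lacking those edges does not.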


Latent Diffusion Model Training

The underlying latent diffusion model is trained on LAION-5B, a dataset of 5.85 billion image-caption pairs. In this article, this model is referred to as the baseline model.

・Dataset

The baseline model was fine-tuned on the HoliCity dataset, which contains 50,078 real images taken in London along with vanishing point information for each image. MiDaS was used to predict a depth map for each image, which then served as the conditioning input for the latent diffusion model. Captions generated for each image with the BLIP captioning model were also used during fine-tuning.

・Training Details

The fine-tuning code is based on [Rombach et al. 2022b] and is a modification of the code in [Pinkney 2022]. The loss function of the baseline model was updated with the perspective term, and the model was trained at an image resolution of 512 × 512 with a learning rate of 1e-6 and 𝜆 = 0.01. Training took approximately 12 hours on four RTX 3090 GPUs, until the perspective loss saturated. In addition to text-to-image generation, the model was applied to inpainting (repairing missing regions of an image) with the proposed constraint, and the results were evaluated with the LPIPS metric, which uses a deep neural network to measure the perceptual similarity between two images.
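The role of 𝜆 can be made explicit by writing the training objective as a weighted sum. The additive form is an assumption based on the description of a conventional loss with a new term added; $\mathcal{L}_{\mathrm{LDM}}$ and $\mathcal{L}_{\mathrm{persp}}$ are labels introduced here for the standard diffusion loss and the perspective loss.

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{LDM}} + \lambda\, \mathcal{L}_{\mathrm{persp}},
\qquad \lambda = 0.01
```

Under this reading, a small 𝜆 would let the perspective term act as a gentle regularizer on top of the denoising objective rather than dominate it.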

Monocular depth estimation model training

To test the effect on a downstream task, the monocular depth estimation models DPT-Hybrid and PixelFormer were fine-tuned on images generated by the baseline and enhanced diffusion models and then evaluated. Both depth models were originally trained on the KITTI dataset; the synthetic images were generated from depth maps in the SYNTHIA-AL and Virtual KITTI 2 (vKITTI) datasets, with captions generated using BLIP. Depth estimation models were trained either on images generated only from vKITTI or on the complete generated set. Training used a batch size of 16, 19,500 steps, and a learning rate of 5e-6 for DPT-Hybrid, and a batch size of 8, 20,800 steps, and a learning rate of 4e-6 for PixelFormer. In the results, "All Enhanced" refers to the 155,000 images generated by the enhanced model, and "All Base" refers to the complete set of images generated by the baseline model.

・Test Set

The depth estimation models, originally trained on the commonly used KITTI dataset, are evaluated on KITTI and on the outdoor subset of DIODE: the Eigen et al. test set from KITTI and 500 images from DIODE Outdoor.


Depth estimation metrics from [Ranftl et al. 2021] are used to evaluate the models. These include absolute relative error (AbsRel), squared relative error (SqRel), root mean square error (RMSE), log RMSE, and threshold accuracy at threshold 𝜏.
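These metrics have widely used definitions, sketched below in NumPy. This is a generic sketch rather than the evaluation code used in the paper: the function name `depth_metrics`, the valid-pixel mask `gt > 0`, and the default threshold `tau=1.25` are assumptions, and SqRel follows the common convention of dividing the squared error by the ground-truth depth.

```python
import numpy as np

def depth_metrics(pred, gt, tau=1.25):
    """Common monocular depth metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0                      # assumed convention: zero marks missing depth
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    # Threshold accuracy: share of pixels whose depth ratio is within a factor of tau.
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < tau)

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta": delta}

# Toy example: a prediction that overestimates every depth by a factor of two.
gt = np.array([1.0, 2.0, 4.0])
metrics = depth_metrics(2.0 * gt, gt)
print(metrics)
```

The error metrics (AbsRel, SqRel, RMSE, log RMSE) are lower-is-better, while threshold accuracy is higher-is-better.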

Human Subjective Testing Methodology

Researchers evaluated the photorealism of images generated by the fine-tuned model with a human subjective test run on the Prolific website. Participants completed a ranking task comparing the photorealism of sets of three images, one each from the baseline, ablation, and enhanced models. The images were generated from depth maps of images taken from the HoliCity dataset; 50 participants evaluated 80 randomly presented sets of images and were given up to 90 minutes to complete the task.

Ablation Studies

The researchers conducted two ablation studies to evaluate the effect of the proposed constraint. The first is a model fine-tuned from the baseline on the same dataset but without the updated loss (the no-loss, or ablation, model). The second is a model trained without the loss but with the vanishing point supplied as an additional condition. For both models, the same dataset was used to train the monocular depth estimation models. For the no-loss model, the ablation also covered the human subjective test and the inpainting task.


Fine-tuned latent diffusion model

Figure 5 shows several representative images generated by the fine-tuned model. The figure shows the depth map used as the conditioning input to the diffusion model, along with images generated by the baseline and enhanced models. The images from the baseline model show curves and distortions that affect perspective accuracy, especially in areas where high-frequency detail is difficult to generate accurately. In Figure 8, perspective lines are drawn on the images from the baseline and enhanced models.

Images from the enhanced model show more consistent perspective lines and more accurate vanishing points, with less distortion. Baseline images are more distorted and appear to deviate from the natural image distribution. Although the enhanced model was fine-tuned on an urban landscape dataset, it remains able to generate images of other scenes, such as nature, animals, and interiors. Representative images are shown in Figure 6.

These images are further evaluated quantitatively using the FID metric [Heusel et al. 2017]. The model in this paper outperforms both the baseline model and the no-loss model.
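For readers unfamiliar with FID, its standard definition is given below. It fits a Gaussian to Inception-network features of real images (mean $\mu_r$, covariance $\Sigma_r$) and of generated images ($\mu_g$, $\Sigma_g$) and measures the distance between the two; these symbols are the conventional ones and are not taken from the article.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated feature distribution is closer to that of real images.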

The inpainting (repair) performance of the three models (baseline, ablation, and enhanced) is evaluated using qualitative results (Figure 7) and quantitative results (Table 4) on both the HoliCity validation set and the landscape dataset. The LPIPS metric is used to measure perceptual similarity, with lower values indicating better restoration performance.

As can be seen from Table 4, the enhanced model consistently outperforms the baseline and ablation models, with a 7.1% improvement over baseline and a 3.6% improvement over ablation in the combined data set.
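Percentage improvements for a lower-is-better metric such as LPIPS are commonly computed relative to the reference score. The snippet below is a small sketch of that arithmetic; the article does not state the exact formula used, so this convention is an assumption, and the scores are hypothetical placeholders rather than values from Table 4.

```python
def relative_improvement(reference, candidate):
    """Percent improvement of `candidate` over `reference` for a lower-is-better metric."""
    return 100.0 * (reference - candidate) / reference

# Hypothetical LPIPS-style scores, for illustration only.
reference_score = 0.50
candidate_score = 0.45
improvement = relative_improvement(reference_score, candidate_score)
print(f"{improvement:.1f}% improvement")
```

A positive value means the candidate scored lower (better) than the reference; a negative value would mean it scored worse.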

Estimation of monocular depth

To evaluate the performance of the fine-tuned depth estimation model, both qualitative and quantitative measurements are used. A qualitative comparison is shown in Figure 9.

・DPT-Hybrid

The model fine-tuned from the original DPT-Hybrid on the generated vKITTI dataset outperformed the original DPT-Hybrid model on both the KITTI test set and the DIODE Outdoor subset. It also outperformed the model fine-tuned on images generated by the baseline model on all but one DIODE Outdoor metric (the exception being SqRel). Notably, on the DIODE Outdoor dataset, the original DPT-Hybrid model outperforms the baseline-trained model on five metrics but does not outperform the authors' model on any metric. The authors' model shows improvements of up to 7.03% in RMSE and 19.3% in SqRel, as well as a 3.4% improvement in SqRel and a 2.2% improvement in SiLog compared to the baseline model.

Figure 9 compares the original DPT-Hybrid model with a model fine-tuned on images generated by the enhanced diffusion model. Each set includes the input image, the ground-truth depth map, and error maps from both the original and enhanced models, as well as the RMSE value for each depth prediction. The authors' model captures high-frequency details more consistently and has lower RMSE values.

・PixelFormer

The base PixelFormer was fine-tuned using both the generated vKITTI dataset and the complete generated dataset, and evaluated on the DIODE Outdoor test set.

The models fine-tuned on images from the enhanced diffusion model outperformed the original model and the models fine-tuned on other training data on all metrics. In particular, the model trained on the full dataset achieved an 11.6% improvement in SiLog over the original model and a 2.4% improvement over the baseline model.

Human Subjective Tests

In the subjective test, images from the enhanced model were judged more photorealistic than those from the baseline model 69.6% of the time and more photorealistic than those from the ablation model 67.5% of the time, and their average rank was also better than that of the baseline and ablation models. The results indicate that the proposed geometric constraints contribute to the improved photorealism of the generated images.
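Pairwise preference rates and average ranks of this kind can be derived from ranking data as sketched below. The rankings here are made-up placeholders, and the aggregation (counting the trials in which one model is ranked above another) is an assumption about how such figures are typically computed, not a description of the authors' analysis.

```python
import numpy as np

# Hypothetical rankings: one row per trial, columns are (baseline, ablation, enhanced);
# rank 1 means "most photorealistic".
ranks = np.array([[2, 3, 1],
                  [3, 2, 1],
                  [1, 2, 3],
                  [2, 3, 1]])
BASELINE, ABLATION, ENHANCED = 0, 1, 2

def preference_rate(ranks, a, b):
    """Fraction of trials in which model `a` is ranked better (lower) than model `b`."""
    return float(np.mean(ranks[:, a] < ranks[:, b]))

def average_rank(ranks):
    """Mean rank of each model across trials (lower is better)."""
    return ranks.mean(axis=0)

enhanced_vs_baseline = preference_rate(ranks, ENHANCED, BASELINE)
enhanced_vs_ablation = preference_rate(ranks, ENHANCED, ABLATION)
mean_ranks = average_rank(ranks)
print(enhanced_vs_baseline, enhanced_vs_ablation, mean_ranks)
```

Reporting both numbers is useful because a pairwise rate compares two models directly, while the average rank summarizes all three at once.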

Ablation Studies

Evaluation of the proposed constraints shows consistent improvements in edges and corners throughout the comparison of the enhanced and ablation models. Quantitative comparisons have also been made, confirming that the enhanced diffusion model achieved improvements in certain depth estimation models (see Figure 10).

Experiments with the proposed constraint show that the DPT-Hybrid and PixelFormer models trained with images from the enhanced model outperform both the models that were simply fine-tuned on the training data and the no-loss models. In particular, there is an improvement of up to 16.11% in RMSE, along with improved photorealism in the human subjective test. This indicates that the gains come from the proposed constraint itself, not merely from fine-tuning on new images.

Table 5 shows that images of non-architectural scenes generated by the enhanced model outperform those from the baseline and no-loss models on the FID metric. The lower FID scores indicate an improvement in the naturalness and quality of the generated images.



The main limitations of the approach are the need for a data set containing vanishing points to fine-tune the diffusion model and its slow generation speed. Also, while subjective testing shows improvement, the accuracy of the actual image details and physical characteristics is still inadequate.

Social impact

Improvements in generative models also raise concerns. As synthetic images become more photorealistic, the risk of malicious use and abuse increases, as does the need for tools that identify synthetic images. The addition of new constraints should alleviate these concerns and reduce the potential for abuse of diffusion models.

Future Initiatives

While the current research focuses on 3D perspective geometry, other physical properties also affect the realism of generated images. Examples include the consistency of lighting and shadows and consistency with physical laws. Future research is expected to pursue such constraints and explore ways to respect the laws of physics while improving photorealism and downstream task performance.


Leon Battista Alberti, an artist of the 1400s, laid the foundation for perspective and the evolution of hand-drawn realism. This study is the first to propose geometric constraints that encode perspective into a latent diffusion model. These physics-based 3D perspective constraints were shown to improve results in both the human subjective test and monocular depth estimation.

It is interesting to see how the historical evolution of art affects image generation by AI. It will be interesting to see how the introduction of new constraints will contribute to photorealism and performance.
