Crop A Photo Like A Professional Photographer! Image Cropping Methods That Take Composition Into Account [CACNet]


3 main points
✔️ Cropping technology that automates the aesthetic composition of professional photographers.
✔️ Explicit modeling of photographic composition rules, giving interpretability to the predicted results!
✔️ Significantly improved performance, achieving accuracy on par with the state-of-the-art!

Composing Photos Like a Photographer
written by Chaoyi Hong, Shuaiyuan Du, Ke Xian, Hao Lu, Zhiguo Cao, Weicai Zhong
(Submitted on  2021)
Comments: Accepted by 
Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are either from the paper or created based on it.

Introduction

Taking a beautiful picture requires attention to many factors, along with considerable expertise, knowledge, and experience. For that reason, anyone who wants such a picture usually has to hire a professional photographer.

Automatic image cropping is a technology with the dreamlike potential to turn a photo taken by an ordinary person into one that looks like the work of a professional photographer. However, the problem involves learning the abstract concept of photographic beauty, which makes it difficult to judge the validity of the predicted results. The authors therefore propose a method that discriminatively learns "composition," an important element of photography, from images and explicitly incorporates the learned compositional cues into the model. This yields image cropping whose predictions are interpretable. The proposed method also significantly outperforms existing methods in processing speed. The details are explained in the following sections.

Photo composition

There are various types of composition; the ones shown below are typical. From left to right, they are called rule-of-thirds (tripartite) composition, center composition, horizon composition, and symmetrical composition. Professional photographers shoot with these compositions in mind to create beautiful pictures.

The authors build an image cropping technique that explicitly models these compositional rules.

Conventional methods

Conventional approaches fall broadly into two categories.

  1. Attention-Guided Image Cropping
    • The crop is estimated from a saliency map and energy functions.
    • Based on the idea that a good crop is one that keeps the noticeable objects or informative areas.
    • Commonly used for thumbnail generation and similar applications.
  2. Aesthetics-Informed Image Cropping
    • Methods based on aesthetic quality.
    • Trained on annotated aesthetic labels (classification, learning to rank, etc.).
    • Assumes that the model can acquire aesthetic composition from the labels.

Combinations of these approaches have also been proposed. As shown in the figure below, the current mainstream strategy seems to be to prepare a set of candidate crops from the image in advance, score each of them aesthetically, and adopt the one with the highest score.

However, these methods have a speed problem, because the model must run the scoring process once for every candidate crop. In addition, their predictions are difficult to interpret because they do not explicitly model the factors (e.g., composition) that contribute to the beauty of a photo. Interpretability matters for problems that deal with an ambiguous concept such as photographic beauty. Without it, it is difficult to measure the validity of the predictions, and it is also difficult to judge from experiments whether the model has improved.
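To make the cost of the candidate-based strategy concrete, here is a minimal sketch in Python. The candidate grid and the `toy_score` function are hypothetical stand-ins (a real system would call a learned aesthetic scorer); the point is only that the scorer runs once per candidate.

```python
def generate_candidates(width, height, scales=(0.5, 0.7, 0.9), steps=3):
    """Slide windows of several scales over the image (hypothetical grid)."""
    boxes = []
    for s in scales:
        w, h = width * s, height * s
        for i in range(steps):
            for j in range(steps):
                x = (width - w) * i / (steps - 1)
                y = (height - h) * j / (steps - 1)
                boxes.append((x, y, x + w, y + h))
    return boxes


def toy_score(box, width, height):
    """Placeholder for a learned aesthetic scorer: prefers large, centred crops."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    offset = abs(cx - width / 2) / width + abs(cy - height / 2) / height
    area = (box[2] - box[0]) * (box[3] - box[1]) / (width * height)
    return area - offset


def best_crop(width, height):
    """Score every candidate and keep the best one."""
    candidates = generate_candidates(width, height)
    # One scorer call per candidate: the cost grows linearly with the list.
    scored = [(toy_score(b, width, height), b) for b in candidates]
    return max(scored)[1], len(candidates)


crop, n_calls = best_crop(640, 480)
print(crop, n_calls)
```

With a neural scorer in place of `toy_score`, each of those calls is a forward pass, which is why the number of candidates dominates the running time.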

Proposed method: CACNet

We now turn to the authors' proposed method. The authors emphasize the importance of "composition" in good photographs and propose a cropping method that explicitly incorporates the identification of composition rules into the model. As a result, the crops produced by the proposed method follow one or more identified composition rules, as shown in the figure below. In addition, because the composition rules behind each crop can be traced, the interpretability of the prediction is improved.

The authors propose CACNet, shown in the figure below, as a network to achieve the above.

CACNet consists of three parts: a backbone network that extracts features from images, a Composition Branch that identifies nine types of composition rules, and a Cropping Branch that predicts the cropping position of an image. In addition, the Class Activation Maps (CAMs) of the Composition Branch are extracted to create the Key Composition Map (KCM), which will be explained later.

CACNet integrates composition identification with the image cropping that uses it, so that the final cropping position is obtained in a single feed-forward pass. Let's take a look at each of these elements.

Composition Branch

This branch learns the composition rules described earlier. The KU-PCP dataset has already been proposed as a dataset for identifying composition rules, so the Composition Branch is trained on it as an ordinary classification problem. The dataset contains nine types of composition rules, as shown below.

The upper two rows are samples selected from the dataset, and the lower two rows are actual predictions made by the Composition Branch.

Key Composition Map (KCM)

The KCM is responsible for conveying compositional cues to the Cropping Branch. When cropping an image, one needs to consider the composition rules that the image follows and how those rules combine. In the example image above, the image may follow two composition rules, rule-of-thirds (tripartite) composition and horizon composition, and the KCM can merge them and convey them to the Cropping Branch. The KCM is created as follows.

  1. Create a CAM for each composition rule
    • $M_n = \sum^{C}_{c=1} w_{c, n} \cdot F_c$
  2. Combine each CAM to create a KCM
    • $KCM = \sum^{N}_{n=1} s_n \cdot \phi (M_n)$
  3. Upsample the KCM to the input image size
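The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: $F_c$ is read as the $c$-th backbone feature map, $w_{c,n}$ as the classifier weight linking channel $c$ to composition class $n$, $s_n$ as the predicted score for class $n$, and $\phi$ as a min-max normalization. The article does not define these symbols, and the nearest-neighbour upsampling is likewise an assumed choice.

```python
import numpy as np


def class_activation_maps(features, weights):
    """M_n = sum_c w_{c,n} * F_c for every composition class n.

    features: (C, H, W) feature maps F_c.
    weights:  (C, N) classifier weights w_{c,n}.
    Returns an (N, H, W) array of CAMs.
    """
    return np.einsum("cn,chw->nhw", weights, features)


def min_max_normalize(cam, eps=1e-8):
    """Assumed form of phi: rescale one CAM to the range [0, 1]."""
    lo, hi = cam.min(), cam.max()
    return (cam - lo) / (hi - lo + eps)


def key_composition_map(features, weights, scores):
    """KCM = sum_n s_n * phi(M_n)."""
    cams = class_activation_maps(features, weights)
    normalized = np.stack([min_max_normalize(m) for m in cams])
    return np.einsum("n,nhw->hw", scores, normalized)


def upsample_nearest(kcm, factor):
    """Nearest-neighbour upsampling by an integer factor (assumed choice)."""
    return np.repeat(np.repeat(kcm, factor, axis=0), factor, axis=1)


# Toy inputs: 8 channels, 9 composition classes, a 4x4 feature grid.
rng = np.random.default_rng(0)
C, N, H, W = 8, 9, 4, 4
features = rng.random((C, H, W))
weights = rng.normal(size=(C, N))
logits = rng.normal(size=N)
scores = np.exp(logits) / np.exp(logits).sum()  # softmax over the classes

kcm = upsample_nearest(key_composition_map(features, weights, scores), factor=8)
print(kcm.shape)  # (32, 32)
```

Because the scores sum to one and each normalized CAM lies in [0, 1], the resulting map also stays in [0, 1], which makes it convenient to use as a weighting.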

Cropping Branch

Finally, the Cropping Branch (the upper part of the CACNet architecture figure) adopts an anchor-point approach and trains an anchor-point regression model. Anchor points are placed uniformly over the image with a stride of K, and the model is trained to predict, among them, the anchor point that corresponds to the cropping position. The loss function is the standard smooth L1 loss. The KCM, the input from the Composition Branch, is used to weight the anchor points. This enables composition-aware cropping.
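As a rough illustration of how anchor points, a weighting map, and the smooth L1 loss fit together, consider the sketch below. The parameterization is an assumption made for this example and is not taken from the paper: each anchor predicts offsets to the four crop edges, and the per-anchor boxes are averaged using weights that stand in for KCM values sampled at the anchors. All function names are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss averaged over the box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def anchor_grid(width, height, stride):
    """Anchor points placed uniformly every `stride` pixels."""
    return [(x, y)
            for y in range(stride // 2, height, stride)
            for x in range(stride // 2, width, stride)]


def aggregate_box(anchors, offsets, weights):
    """Weighted average of the per-anchor box predictions.

    Each anchor (ax, ay) predicts offsets (l, t, r, b) to the crop edges,
    so its box is (ax - l, ay - t, ax + r, ay + b).
    """
    total = sum(weights)
    box = [0.0, 0.0, 0.0, 0.0]
    for (ax, ay), (l, t, r, b), w in zip(anchors, offsets, weights):
        for k, v in enumerate((ax - l, ay - t, ax + r, ay + b)):
            box[k] += w / total * v
    return tuple(box)


# Toy example: every anchor predicts the target box perfectly.
target = (10.0, 12.0, 50.0, 60.0)
anchors = anchor_grid(64, 64, stride=16)
offsets = [(ax - target[0], ay - target[1], target[2] - ax, target[3] - ay)
           for ax, ay in anchors]
weights = [1.0 + (i % 4) for i in range(len(anchors))]  # stand-in for KCM values

box = aggregate_box(anchors, offsets, weights)
print(box, smooth_l1(box, target))
```

In this reading, anchors that fall on compositionally important regions receive larger weights and therefore have more say in the final crop.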

Interpretability of CACNet

This concludes the description of CACNet's components. To review its key benefit: given an input image, CACNet produces not only a prediction but also interpretable evidence for it, such as the KCM that highlights the identified composition and the one or more composition rules that the image follows. With this evidence, the validity of a prediction can be assessed, which makes the model easier to improve and operate.

This made me think that interpretability plays a very important role in tackling these kinds of ambiguous reasoning problems.

Experiment and Evaluation

To demonstrate the effectiveness of the proposed method, the authors compare it with existing methods on common image-cropping evaluation datasets (FCDB and FLMS). The results are briefly described in this section.

The table provides a quantitative comparison of crop accuracy. Boundary displacement error (BDE) and intersection over union (IoU) are used as accuracy metrics, and frames per second (FPS) is used to measure speed. In summary, the authors' proposed method (CACNet) achieves accuracy on par with the state of the art on both the FCDB and FLMS datasets, and at 155 FPS it is significantly faster than conventional methods, including those designed for speed. This efficiency comes from the fact that CACNet is a single-step regression method, which gives it a significant advantage over conventional methods that score each candidate box.
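For readers unfamiliar with the two accuracy metrics, here is a small sketch of how they are commonly computed for a predicted crop and a ground-truth crop. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, and BDE is taken as the mean displacement of the four box edges, normalized by the image width or height. The box values in the example are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bde(pred, gt, width, height):
    """Boundary displacement error: mean normalized offset of the four edges."""
    dx = (abs(pred[0] - gt[0]) + abs(pred[2] - gt[2])) / width
    dy = (abs(pred[1] - gt[1]) + abs(pred[3] - gt[3])) / height
    return (dx + dy) / 4.0


gt = (100, 50, 500, 350)    # made-up ground-truth crop
pred = (120, 50, 500, 330)  # made-up predicted crop
print(iou(pred, gt), bde(pred, gt, width=640, height=480))
```

Higher IoU is better, while lower BDE is better.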

The qualitative results are shown in the figure below; CACNet's output is quite close to the ground truth.

On the other hand, as the table shows, CACNet did not reach state-of-the-art accuracy on FCDB. The reason is that CACNet tends to fail when the ground-truth crop box is relatively small, as shown in the figure below (red for the ground truth and green for CACNet's prediction).

CACNet looks at the entire image and finds the cropping position in a single regression, whereas candidate-based methods can generate candidate boxes that include small ones. The authors state that CACNet falls short of the state of the art on FCDB because many of its ground-truth crops are relatively small.

This is one of the limitations of the current method, but we believe that it can be improved by using the CAMs before they are merged into the KCM, or by combining object detection results.


Finally, as an interesting experiment, CACNet was run on three pictures of the same scene in which the camera was fixed and only the position of the object (a potted plant) was changed, and the results were compared.

The figure above shows that the recognized composition changes considerably with the position of the potted plant. The model is not confused by the identical scene and accurately recognizes the important elements of the composition. It also shows that the model crops the image accurately with the composition taken into account.


I have the impression that image cropping is not a major research topic, but framed as "automating the aesthetic composition of photos like a professional photographer," it becomes an interesting one. From a business perspective, I think it is an interesting field as well, because there are many ways to approach it depending on the characteristics of the service being developed, for example thumbnail generation that only considers the informative part of an image.

I also feel that explicitly modeling the basis for judging beauty and quality, as was done here, is an important advantage because it would allow us not only to crop photos but also to filter and search for good ones.