Introducing The New YOLO Series, The Gold Standard In Object Detection!

Object Detection 28/08/2020

Three main points
✔️ A new model of YOLO, the gold standard in the field of object detection, is proposed
✔️ The latest techniques in the field of object detection are introduced and experimentally evaluated
✔️ Achieves 43.5% AP (65.7% AP50) on MS COCO at 65 FPS on a Tesla V100

YOLOv4: Optimal Speed and Accuracy of Object Detection
written by Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao
(Submitted on 23 Apr 2020)
Comments: Published by arXiv
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Code.

Background

This article introduces YOLOv4, the newest model in the YOLO series. YOLO is the flagship family of models in object detection, which is arguably the fastest-growing field of deep learning today.

What is object detection?

Object detection takes an image (or video) as input and detects the position and category (class) of each object in it. For example, it detects the "cat" and the "cat's position" in the following image.

The main difference from a traditional CNN used for image classification is that the model must detect both the "object type" and its "location". This means it has to output both a class prediction (probabilities) and an object location (continuous values). Specifically, we have to estimate a rectangle surrounding the object, called a bounding box.

This bounding box consists of five elements: the center coordinates (x, y), the width, the height, and the object type. These five elements are therefore what the CNN estimates.
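As a rough illustration of the five elements above, here is a minimal sketch of how such a box might be stored and converted to corner coordinates. The class name `BoundingBox` and its field names are hypothetical and are not taken from the YOLOv4 code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BoundingBox:
    """Hypothetical container for the five elements of a bounding box."""
    cx: float      # x coordinate of the box center
    cy: float      # y coordinate of the box center
    width: float   # box width
    height: float  # box height
    label: str     # object type (class)

    def to_corners(self) -> Tuple[float, float, float, float]:
        """Return (x_min, y_min, x_max, y_max), a common alternative format."""
        half_w, half_h = self.width / 2, self.height / 2
        return (self.cx - half_w, self.cy - half_h,
                self.cx + half_w, self.cy + half_h)


box = BoundingBox(cx=50, cy=40, width=20, height=10, label="cat")
print(box.label, box.to_corners())
```

The center-based form is what the text describes, while the corner form is usually more convenient for computing overlaps between boxes.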

In addition, object detection in deep learning can be framed as a model that solves the following two problems.

The problem of extracting fixed-size windows from the input image at all possible positions
The problem of feeding each extracted region (patch) into an image classifier and classifying it

Once each window has been obtained, it can be passed to classification methods used in traditional machine learning, such as SVM, and to models that estimate the object location as a regression problem, in order to obtain the five elements needed for the bounding box.

Therefore, the key to object detection is how the windows are obtained, and this choice largely determines the performance of the model. One approach is to downsample the input image to several sizes and then extract windows of a fixed size from each of them. With this method, 64 steps of downsampling are performed and windows are obtained at each step. This addresses the problems of both "size" and "position" for a variety of objects.
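The multi-scale window idea described above can be sketched as follows. This is only an illustration under assumed settings: the function names, the window size, the stride, and the downsampling factor of 2 are all hypothetical choices, and the code works on image dimensions alone rather than real pixels.

```python
from typing import Iterator, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def sliding_windows(width: int, height: int,
                    window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left corner of every fixed-size window that fits."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y


def pyramid_windows(width: int, height: int, window: int = 64,
                    stride: int = 32, scale: float = 2.0) -> Iterator[Box]:
    """Slide a fixed window over repeatedly downsampled image sizes.

    Each window is reported in the coordinates of the original image,
    so a window found on a smaller level covers a larger original area.
    """
    factor = 1.0
    w, h = width, height
    while w >= window and h >= window:
        for x, y in sliding_windows(w, h, window, stride):
            yield (x * factor, y * factor,
                   (x + window) * factor, (y + window) * factor)
        factor *= scale  # move to the next, smaller pyramid level
        w, h = int(width / factor), int(height / factor)


for box in pyramid_windows(128, 128, window=64, stride=64):
    print(box)
```

Because the window size stays fixed while the image shrinks, small objects are covered at the first levels and large objects at the later ones.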

Another issue is the aspect ratio. It varies from object to object, and even the same person has a different aspect ratio when sitting than when standing. The anchor box, first proposed in Faster R-CNN, is one solution to this problem. I won't go into it in depth in this article, so if you're interested, please look into it!
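For readers who want a quick feel for the idea, here is a minimal sketch of generating boxes with several aspect ratios around one center point. It is not the Faster R-CNN implementation: the function name `make_anchors`, the base size, and the ratio values are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (center x, center y, width, height)
Anchor = Tuple[float, float, float, float]


def make_anchors(cx: float, cy: float, base_size: float,
                 ratios: Sequence[float] = (0.25, 1.0, 4.0)) -> List[Anchor]:
    """Create one anchor per aspect ratio (width / height).

    Every anchor keeps the same area, base_size ** 2, so only the shape
    changes: tall boxes for ratios below 1, wide boxes for ratios above 1.
    """
    anchors = []
    for ratio in ratios:
        width = base_size * math.sqrt(ratio)
        height = base_size / math.sqrt(ratio)
        anchors.append((cx, cy, width, height))
    return anchors


for anchor in make_anchors(cx=100, cy=100, base_size=64):
    print(anchor)
```

A detector built this way predicts offsets relative to the anchor whose shape best matches the object, instead of predicting a box from scratch.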

There is also a much-discussed method that replaces all of the basic premises described so far with end-to-end detection. It appears at ECCV, which is being held at the time of writing, and it is explained in the AI-SCHOLAR article "Finally! Really DETR! An Innovative Paradigm for Object Detection". I am amazed at how quickly other AI engineers pick up on new work.

What is YOLO?

YOLO is one of the standard methods for real-time object detection and has been used by many users since it was proposed in 2015. Previous methods first selected candidate object regions from the entire image (selective search) and then classified each selected region using algorithms such as SVM. Although this approach can improve accuracy, it is not suitable for detecting objects in real time, because it must first run a separate step to select the candidate regions.

YOLO skips this step and uses an algorithm specialized for real-time detection. First, it divides the entire image into an S x S grid (typically S=7) and predicts N bounding boxes and their confidence for each grid cell. The confidence is calculated from the "accuracy of the bounding box" and whether the bounding box actually contains an object (irrespective of its class). YOLO combines these two factors into a single regression problem, which enables fast object detection without the need to select candidate regions.
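The grid assignment and the confidence score described above can be sketched as follows. The function names are hypothetical. The confidence formula, the probability that a box contains an object multiplied by the IoU (intersection over union) between the predicted box and the ground-truth box, follows the definition in the original YOLO paper, with IoU standing in for the "accuracy of the bounding box".

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def grid_cell(cx: float, cy: float, img_w: float, img_h: float,
              s: int = 7) -> Tuple[int, int]:
    """Return the (row, col) of the S x S grid cell containing the center."""
    col = min(s - 1, int(cx / img_w * s))  # clamp centers on the right edge
    row = min(s - 1, int(cy / img_h * s))  # clamp centers on the bottom edge
    return row, col


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes given as corner coordinates."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(p_object: float, box_iou: float) -> float:
    """Combine objectness and box accuracy into one confidence value."""
    return p_object * box_iou


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(grid_cell(320, 240, img_w=640, img_h=480))
print(confidence(1.0, iou(predicted, ground_truth)))
```

The grid cell that contains the center of an object is the one responsible for predicting that object.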

On the other hand, there are some restrictions, such as "the size of the grid is fixed," "only one class can be identified per grid cell," and "only two bounding boxes are predicted per grid cell." These restrictions reduce detection accuracy when a grid cell contains a large number of objects. The later models, such as YOLOv2 and YOLOv3, addressed these problems and improved detection accuracy.

YOLO is an acronym for "You Only Look Once," a play on the phrase "You only live once."
