EfficientDet: Towards scalable and efficient object detection

So the natural question is: how can we develop accurate and efficient object detectors that also adapt to a wide range of resource constraints?

“EfficientDet: Scalable and Efficient Object Detection”, accepted at CVPR 2020, introduces a new family of scalable and efficient object detectors. Building on previous work on scaling neural networks (EfficientNet), and incorporating a new bi-directional feature network (BiFPN) and new scaling rules, EfficientDet achieves state-of-the-art accuracy while being up to 9 times smaller and using significantly less computation than prior state-of-the-art detectors. The following figure shows the overall network architecture of the models.

Model architecture optimizations

The idea behind EfficientDet arose from our effort to improve computational efficiency by systematically examining prior state-of-the-art detection models. In general, object detectors have three main components: a backbone that extracts features from the given image; a feature network that takes multiple levels of features from the backbone as input and outputs a list of fused features that represent the salient characteristics of the image; and a final class/box network that uses the fused features to predict the class and location of each object.

After examining the design choices for these components, we identified several key optimizations to improve performance and efficiency. Previous detectors mainly rely on ResNets, ResNeXt, or AmoebaNet as backbone networks, all of which are either less powerful or less efficient than EfficientNets. Simply adopting an EfficientNet backbone already yields much better efficiency. For example, starting from a RetinaNet baseline with a ResNet-50 backbone, our ablation study shows that replacing ResNet-50 with EfficientNet-B3 improves accuracy by 3% while reducing computation by 20%. Another optimization is to improve the efficiency of the feature network. While most previous detectors use a top-down feature pyramid network (FPN), we find that a top-down FPN is inherently limited by its one-way information flow. Alternative FPNs, such as PANet, add an extra bottom-up flow at the cost of more computation.

Recent attempts to use neural architecture search (NAS) have discovered a more complex NAS-FPN architecture. However, while this network structure is effective, it is also irregular and highly optimized for a specific task, which makes it difficult to adapt to other tasks. To address these issues, we propose a new bi-directional feature network, BiFPN, which builds on the multi-level feature fusion idea from FPN/PANet/NAS-FPN and lets information flow in both the top-down and bottom-up directions while using regular and efficient connections.

A comparison between BiFPN and previous feature networks. BiFPN allows features (from the high-resolution P3 level to the low-resolution P7 level) to flow repeatedly in both the top-down and bottom-up directions.
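The bidirectional flow can be illustrated with a toy sketch. It is not the released implementation: it assumes the wiring of one BiFPN layer as we understand it from the paper (a top-down pass followed by a bottom-up pass, with a skip connection from the input at intermediate levels), uses 1-D lists in place of 2-D feature maps, omits all convolutions, and substitutes a plain average for the learned fusion. The names `upsample`, `downsample`, `fuse`, and `bifpn_layer` are hypothetical.

```python
def upsample(x):
    """Nearest-neighbour 2x upsampling of a 1-D feature map."""
    return [v for v in x for _ in range(2)]


def downsample(x):
    """2x downsampling by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def fuse(*maps):
    """Unweighted average of same-length maps (stand-in for learned fusion)."""
    return [sum(vals) / len(vals) for vals in zip(*maps)]


def bifpn_layer(inputs):
    """One bidirectional pass; inputs[0] is the highest-resolution level."""
    n = len(inputs)
    # Top-down pass: start at the coarsest level, push context to finer ones.
    td = [None] * n
    td[-1] = inputs[-1]
    for i in range(n - 2, -1, -1):
        td[i] = fuse(inputs[i], upsample(td[i + 1]))
    # Bottom-up pass: start at the finest level, push detail to coarser ones.
    out = [None] * n
    out[0] = td[0]
    for i in range(1, n):
        if i < n - 1:
            # Intermediate levels keep a skip connection from the input.
            out[i] = fuse(inputs[i], td[i], downsample(out[i - 1]))
        else:
            out[i] = fuse(inputs[i], downsample(out[i - 1]))
    return out


# Five levels (P3..P7), each half the length of the previous one.
pyramid = [[float(level)] * (32 >> level) for level in range(5)]
features = pyramid
for _ in range(2):  # layers can be stacked because shapes are preserved
    features = bifpn_layer(features)
```

Because every level keeps its input shape, the layer can be applied several times in a row, which is what makes the repeated flow in the figure possible.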

To further improve efficiency, we propose a new fast normalized fusion technique. Traditional approaches usually treat all features fed into the FPN equally, even when they have different resolutions. However, we observe that input features at different resolutions often contribute unequally to the output features. We therefore add an extra weight to each input feature and let the network learn the importance of each one. We also replace all regular convolutions with less expensive depthwise separable convolutions. With these optimizations, our BiFPN further improves accuracy by 4% while reducing the computation cost by 50%.
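A minimal sketch of the weighting idea follows. This article does not give the fusion formula, so the code assumes the form described in the EfficientDet paper: each raw weight is passed through a ReLU and then divided by the sum of all weights plus a small epsilon. The function name, the list-based features, and the default `eps` value are illustrative assumptions, and the inputs are assumed to have already been resized to a common shape.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with learned, non-negative weights.

    features: list of equal-length lists of floats (one per input).
    weights:  one raw learnable scalar per input.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per input feature")
    # ReLU keeps every weight non-negative so the normalization stays stable.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps
    # Normalize so the weights sum to (almost) 1, without a softmax.
    normalized = [w / total for w in clipped]
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(normalized, features))
        for i in range(length)
    ]


# Two inputs at a common resolution; the second is weighted three times as much.
feat_a = [1.0, 1.0, 1.0, 1.0]
feat_b = [3.0, 3.0, 3.0, 3.0]
fused = fast_normalized_fusion([feat_a, feat_b], weights=[1.0, 3.0])
```

In a real network the weights would be trainable parameters updated by backpropagation; here they are fixed numbers so the effect of the normalization is easy to inspect.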

The third optimization involves achieving a better trade-off between accuracy and efficiency under different resource constraints. Our previous work has shown that jointly scaling the depth, width, and resolution of a network can significantly improve efficiency for image recognition. Inspired by this idea, we propose a new compound scaling method for object detectors that jointly scales up resolution, depth, and width. Each network component, i.e., the backbone, the feature network, and the box/class prediction network, has a single compound scaling factor that controls all scaling dimensions using heuristic rules. This approach makes it easy to decide how to scale a model: compute the scaling factor that fits the given target resource constraints.
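As a rough illustration of how one coefficient can drive every dimension, the sketch below derives a configuration from a single value `phi`. The constants are assumptions: they follow the heuristic equations as we understand them from the EfficientDet paper and are not stated in this article. The released models also round the BiFPN width and deviate from the raw formulas at larger sizes, so these numbers should not be read as the official configurations. The `DetectorConfig` class and `compound_scale` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    backbone: str          # EfficientNet variant used as the backbone
    bifpn_width: float     # BiFPN channels, before any rounding
    bifpn_depth: int       # number of stacked BiFPN layers
    head_depth: int        # layers in the box and class heads
    input_resolution: int  # square input size in pixels


def compound_scale(phi: int) -> DetectorConfig:
    """Derive every scaling dimension from one coefficient phi >= 0."""
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return DetectorConfig(
        backbone=f"EfficientNet-B{phi}",   # assumed one-to-one pairing
        bifpn_width=64 * 1.35 ** phi,      # width grows exponentially
        bifpn_depth=3 + phi,               # depth grows linearly
        head_depth=3 + phi // 3,           # heads grow more slowly
        input_resolution=512 + 128 * phi,  # resolution grows linearly
    )


for phi in range(4):
    print(phi, compound_scale(phi))
```

The design point is that a user picks one number rather than tuning four dimensions independently, which keeps the search over model sizes one-dimensional.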

Combining the new backbone and BiFPN, we first develop a small EfficientDet-D0 baseline and then apply compound scaling to obtain EfficientDet-D1 through D7. Each successive model has a higher computational cost and a higher accuracy, and together they cover a wide range of resource constraints, from 3 billion FLOPs to 300 billion FLOPs.

Model performance

We evaluate EfficientDet on the COCO dataset, a widely used benchmark for object detection. EfficientDet-D7 achieves a mean average precision (mAP) of 52.2, which is 1.5 points higher than the previous state-of-the-art model, while using 4 times fewer parameters and 9.4 times less computation.

EfficientDet achieves a state-of-the-art 52.2 mAP on COCO test-dev, 1.5 points above the prior state of the art (not shown because it requires 3045B FLOPs) under the same settings. Under the same accuracy constraint, EfficientDet models are 4 to 9 times smaller and use 13 to 42 times less computation than previous detectors.

We also compared parameter size and CPU/GPU latency between EfficientDet and previous models. Under similar accuracy constraints, EfficientDet models run 2 to 4 times faster on GPU and 5 to 11 times faster on CPU than other detectors. Although EfficientDet models are designed mainly for object detection, we also examine their performance on other tasks, such as semantic segmentation. To perform segmentation, we slightly modify EfficientDet-D4 by replacing the detection head and loss function with a segmentation head and loss function, while keeping the same scaled backbone and BiFPN. We compare this model with prior state-of-the-art segmentation models on Pascal VOC 2012, a widely used dataset for segmentation benchmarks.

EfficientDet achieves better quality on Pascal VOC 2012 val than DeepLabV3+ with 9.8 times less computation, under the same settings and without COCO pre-training.

Given their exceptional performance, we expect that EfficientDet could serve as a new foundation for future object detection research and could make high-accuracy object detection models practical for many real-world applications. We have therefore open-sourced all of the code and pretrained model checkpoints on GitHub.
