Learning-Deep-Learning

SOLO: Segmenting Objects by Locations

April 2020

tl;dr: Single-shot instance segmentation.

Overall impression

The paper proposes a simple framework that performs instance segmentation directly. Essentially it is a YOLO-style architecture that predicts an additional HxW values at each of the SxS grid cells. The HxW values are reshaped into a mask with the same resolution as the feature map. The important trick is that the SxSx(HxW) output is rearranged into HxWx(SxS), so that each of the SxS channels holds the full mask for one grid cell. MEInst showed that it is possible to directly predict a high-dimensional mask vector per region, but doing so for the entire image mask is perhaps intractable.
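To make the index bookkeeping of that reshape concrete, here is a minimal NumPy sketch. It only illustrates how a per-cell vector of HxW values corresponds to one channel of an HxWx(SxS) tensor, with channel index k = i*S + j for cell (i, j). The function name `cells_to_mask_channels` and the toy sizes are made up for illustration, and the actual network predicts the mask tensor with a convolutional head rather than reshaping a per-cell output after the fact.

```python
import numpy as np

def cells_to_mask_channels(per_cell, H, W):
    """Rearrange (S, S, H*W) per-cell mask vectors into an (H, W, S*S) tensor.

    Channel k = i*S + j of the result is the H x W mask of grid cell (i, j).
    """
    S = per_cell.shape[0]
    assert per_cell.shape == (S, S, H * W)
    # Flatten the S x S grid into one axis and unfold each vector into H x W.
    masks = per_cell.reshape(S * S, H, W)
    # Move the cell axis last so that cells become channels.
    return masks.transpose(1, 2, 0)

# Toy sizes (assumptions for illustration only).
S, H, W = 3, 4, 5
per_cell = np.arange(S * S * H * W, dtype=np.float32).reshape(S, S, H * W)
mask_tensor = cells_to_mask_channels(per_cell, H, W)

i, j = 1, 2
k = i * S + j  # channel responsible for cell (i, j)
cell_mask = mask_tensor[:, :, k]
```

Here `cell_mask` holds exactly the HxW values that belonged to cell (1, 2) in the per-cell layout, so the two layouts carry the same numbers and differ only in which axis indexes the grid cell.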

Semantic segmentation classifies each pixel into a fixed number of categories, whereas instance segmentation has to deal with a varying number of instances. That is the biggest challenge. Instance segmentation methods can be grouped into top-down approaches such as Mask R-CNN and bottom-up approaches such as Associative Embedding.

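The decoupled SOLO idea is fabulous, and I think it is partially inspired by YOLACT: instead of SxS mask channels it predicts 2S prototype-like masks (S for the x direction and S for the y direction), and the mask for a grid cell is the element-wise product of its x map and its y map.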
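A small NumPy sketch of the decoupled idea, under stated assumptions: `X` and `Y` stand in for the two S-channel branch outputs after a sigmoid (random values in [0, 1) are used here as placeholders), and the mask for cell (i, j) is taken as the element-wise product of channel j of `X` and channel i of `Y`. The helper name `decoupled_mask` and the sizes are hypothetical.

```python
import numpy as np

def decoupled_mask(X, Y, i, j):
    """Mask for grid cell (i, j) from two S-channel branches.

    X, Y: arrays of shape (H, W, S), assumed to be sigmoid outputs.
    """
    return X[:, :, j] * Y[:, :, i]

# Toy sizes and random stand-ins for the branch outputs (assumptions).
S, H, W = 4, 6, 8
rng = np.random.default_rng(0)
X = rng.random((H, W, S))
Y = rng.random((H, W, S))

mask = decoupled_mask(X, Y, i=1, j=2)

# For comparison: materialize all S*S masks, with channel k = i*S + j.
full = np.einsum("hwi,hwj->hwij", Y, X).reshape(H, W, S * S)
num_decoupled_channels = X.shape[-1] + Y.shape[-1]  # 2S
num_full_channels = full.shape[-1]                  # S*S
```

The point of the comparison is the channel count: the head only has to output 2S maps, while any of the S*S cell masks can still be recovered on demand by one multiplication.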

This paper can be seen as an extension of anchor-free object detectors such as FCOS and CenterNet, but with the important trick of reshaping the tensor (see the discussion in TensorMask).

A direct spatial2channel (space-to-channel) mapping leads to spatial alignment that is too poor to guarantee good mask quality (see the natural representation in TensorMask). However, it should be enough to preserve the SxS ordering of the grid cells.

Key ideas

Technical details

Notes