Abstract by Pei Guo
Fine-grained Visual Categorization using PAIRS: Pose and Appearance Integration for Recognizing Subcategories
In Fine-grained Visual Categorization (FGVC), the differences between similar categories are often highly localized to a small number of object parts. To address this, we propose extracting image patches using pairs of predicted keypoint locations as anchor points. The benefits of this approach are two-fold: (1) it achieves explicit top-down visual attention on object parts, and (2) the extracted patches are pose-aligned and thus contain stable appearance features. We employ a variant of fully convolutional networks to predict keypoint locations. Anchored by these predicted keypoints, an overcomplete basis of pose-aligned patches is extracted, and a specialized appearance classification network is trained for each patch. An aggregating network then combines the patch networks' individual predictions into a final classification score. Our PAIRS algorithm achieves state-of-the-art results on two public datasets.
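To make the pose-alignment step concrete, the sketch below shows one plausible way to extract a patch anchored by a keypoint pair: a similarity transform maps the two keypoints onto fixed positions in the output patch, normalizing rotation and scale. This is an illustrative NumPy implementation under our own assumptions (nearest-neighbour sampling, the `pose_aligned_patch` helper name, and the margin/size defaults are hypothetical), not the paper's exact code.

```python
import numpy as np

def pose_aligned_patch(image, p1, p2, size=64, margin=8):
    """Extract a pose-aligned patch anchored by a keypoint pair (sketch).

    A similarity transform maps p1 and p2 onto a horizontal line inside
    the output patch, so patches are rotation- and scale-normalized.
    """
    p1 = np.asarray(p1, dtype=float)  # keypoints as (x, y)
    p2 = np.asarray(p2, dtype=float)
    # Target positions of the two anchors inside the output patch.
    q1 = np.array([margin, size / 2.0])
    q2 = np.array([size - margin, size / 2.0])
    # Similarity transform (scale + rotation) sending p1->q1, p2->q2.
    v_src, v_dst = p2 - p1, q2 - q1
    scale = np.linalg.norm(v_dst) / (np.linalg.norm(v_src) + 1e-8)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = np.cos(angle), np.sin(angle)
    R = scale * np.array([[c, -s], [s, c]])
    # Inverse-warp: map every output pixel back into the source image.
    Rinv = np.linalg.inv(R)
    ys, xs = np.mgrid[0:size, 0:size]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = (coords - q1) @ Rinv.T + p1
    src = np.rint(src).astype(int)            # nearest-neighbour sampling
    h, w = image.shape[:2]
    src[:, 0] = np.clip(src[:, 0], 0, w - 1)  # stay inside the image
    src[:, 1] = np.clip(src[:, 1], 0, h - 1)
    return image[src[:, 1], src[:, 0]].reshape(size, size, *image.shape[2:])
```

Because the two anchors always land at the same patch coordinates, the same semantic region (say, eye-to-beak) occupies a stable location across images, which is what lets a specialized appearance network learn from each patch type.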