Semi Supervised YOLOv2

Summary

This project was carried out during an internship at IRIT, Toulouse, with Axel Carlier as tutor.

The code provided is only an untested conversion of the Google Colab notebook into Python scripts.

YOLOv2

The first part of this project was about obtaining a functional and trainable object detection model based on YOLOv2 [1]. This was done by extracting code snippets from an existing implementation and adapting them to a Jupyter Notebook.

Results

On the WildLife dataset, I obtained the following quantitative results:

| | YOLOv2 (30 epochs) |
|---|---|
| validation mAP | 0.65 |
| buffalo mAP | 0.59 |
| elephant mAP | 0.59 |
| rhino mAP | 0.77 |
| zebra mAP | 0.64 |
| training mAP | 0.73 |
| buffalo mAP | 0.85 |
| elephant mAP | 0.59 |
| rhino mAP | 0.89 |
| zebra mAP | 0.61 |
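
As a quick consistency check on the numbers above, the overall mAP can be recomputed as the unweighted mean of the per-class values. This assumes the usual definition of mAP (so the per-class rows are average precisions); the function name below is only illustrative.

```python
# Per-class validation scores copied from the results above.
validation_ap = {"buffalo": 0.59, "elephant": 0.59, "rhino": 0.77, "zebra": 0.64}


def mean_average_precision(per_class_ap):
    """Unweighted mean of the per-class scores (assumed definition of mAP)."""
    return sum(per_class_ap.values()) / len(per_class_ap)


validation_map = mean_average_precision(validation_ap)
print(f"validation mAP = {validation_map:.4f}")
```

This gives about 0.6475, in line with the reported 0.65. The same computation on the training rows gives 0.735, which matches the reported 0.73 up to the rounding of the per-class values.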

And the following qualitative results:

2 zebras well detected
3 elephants well detected
Only one of the two rhinos detected

Semi-supervised Learning for Image Detection

The teacher-student semi-supervised training process.

The second part of this project was the implementation of semi-supervised training for object detection [2] on top of this functional model. This required changing the loss of the model. Initially, YOLOv2 has the following loss:

$$
\begin{align}
l_{\textrm{yolo}} &= \lambda_{\textrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{1}_{ij}^{\textrm{obj}} \left[ \left( x_{ij} - \hat x_{ij} \right)^2 + \left( y_{ij} - \hat y_{ij} \right)^2 \right] \\
&+ \lambda_{\textrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{1}_{ij}^{\textrm{obj}} \left[ \left( \sqrt{w_{ij}} - \sqrt{\hat w_{ij}} \right)^2 + \left( \sqrt{h_{ij}} - \sqrt{\hat h_{ij}} \right)^2 \right] \\
&+ \lambda_{\textrm{obj}} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{1}_{ij}^{\textrm{obj}} \left( C_{ij} - \hat C_{ij} \right)^2 \\
&+ \lambda_{\textrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{1}_{ij}^{\textrm{noobj}} \left( C_{ij} - \hat C_{ij} \right)^2 \\
&+ \lambda_{\textrm{class}} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{1}_{ij}^{\textrm{obj}} \sum_{c \in \textrm{classes}} \left( p_{ij}(c) - \hat p_{ij}(c) \right)^2
\end{align}
$$

where $\mathbb{1}_{ij}^{\textrm{noobj}} = 1 - \mathbb{1}_{ij}^{\textrm{obj}}$ for every cell $i$ and box $j$.
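
To make the five terms of this loss concrete, here is a minimal pure-Python sketch. It is not the project's implementation: the `Box` class, the flattening of the $(i, j)$ pairs into a single list of slots, and the weight values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Box:
    """One (cell i, box j) slot: geometry, confidence C, class probabilities p."""
    x: float
    y: float
    w: float  # must be non-negative because of the square root
    h: float  # must be non-negative because of the square root
    conf: float
    probs: list


# Hypothetical weights; the values used in the project are not stated.
LAMBDA_COORD, LAMBDA_OBJ, LAMBDA_NOOBJ, LAMBDA_CLASS = 5.0, 1.0, 0.5, 1.0


def yolo_loss(targets, preds, obj_mask):
    """Sum the terms of l_yolo over all slots.

    obj_mask[k] is 1 when slot k is responsible for an object, else 0.
    """
    total = 0.0
    for t, p, obj in zip(targets, preds, obj_mask):
        noobj = 1 - obj
        xy = (t.x - p.x) ** 2 + (t.y - p.y) ** 2
        wh = (sqrt(t.w) - sqrt(p.w)) ** 2 + (sqrt(t.h) - sqrt(p.h)) ** 2
        conf = (t.conf - p.conf) ** 2
        cls = sum((tc - pc) ** 2 for tc, pc in zip(t.probs, p.probs))
        total += LAMBDA_COORD * obj * (xy + wh)
        total += (LAMBDA_OBJ * obj + LAMBDA_NOOBJ * noobj) * conf
        total += LAMBDA_CLASS * obj * cls
    return total


# Slot 0 holds an object and only its confidence is off; slot 1 is empty.
targets = [Box(0.5, 0.5, 0.25, 0.25, 1.0, [1.0, 0.0]),
           Box(0.0, 0.0, 0.0, 0.0, 0.0, [0.0, 0.0])]
preds = [Box(0.5, 0.5, 0.25, 0.25, 0.5, [1.0, 0.0]),
         Box(0.3, 0.3, 0.1, 0.1, 0.2, [0.5, 0.5])]
loss = yolo_loss(targets, preds, obj_mask=[1, 0])
print(loss)
```

In this toy example only the confidence terms contribute: the empty slot is masked out of the coordinate and class terms, so its wrong geometry and class probabilities cost nothing.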

To take into account the pseudo-labels generated by the teacher, I modified the loss as follows:

$$
\begin{align}
l_{\textrm{yolo_ssl}} &= \sum_{i=0}^{S^2} \sum_{j=0}^B \left( \mathbb{1}_{ij}^s \lambda_s + \mathbb{1}_{ij}^u \lambda_u \right) l_{ij} \\
l_{ij} &= \lambda_{\textrm{coord}} \mathbb{1}_{ij}^{\textrm{obj}} \left[ \left( x_{ij} - \hat x_{ij} \right)^2 + \left( y_{ij} - \hat y_{ij} \right)^2 \right] \\
&+ \lambda_{\textrm{coord}} \mathbb{1}_{ij}^{\textrm{obj}} \left[ \left( \sqrt{w_{ij}} - \sqrt{\hat w_{ij}} \right)^2 + \left( \sqrt{h_{ij}} - \sqrt{\hat h_{ij}} \right)^2 \right] \\
&+ \left[ \lambda_{\textrm{obj}} \mathbb{1}_{ij}^{\textrm{obj}} (1 - \mathbb{1}_{ij}^{\textrm{noobj}}) + \lambda_{\textrm{noobj}} \mathbb{1}_{ij}^{\textrm{noobj}} \right] \left( C_{ij} - \hat C_{ij} \right)^2 \\
&+ \lambda_{\textrm{class}} \mathbb{1}_{ij}^{\textrm{obj}} \sum_{c \in \textrm{classes}} \left( p_{ij}(c) - \hat p_{ij}(c) \right)^2
\end{align}
$$

where $\mathbb{1}_{ij}^s$ and $\mathbb{1}_{ij}^u$ select the labeled (supervised) and unlabeled (pseudo-labeled) boxes, weighted by $\lambda_s$ and $\lambda_u$ respectively; $\mathbb{1}_{ij}^{\textrm{obj}} = 1$ for all labeled and unlabeled boxes that contain an object; and $\mathbb{1}_{ij}^{\textrm{noobj}} = 1$ for all labeled and unlabeled boxes that do not contain an object, as well as for all unlabeled boxes for which the teacher's confidence is below a certain threshold.
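
The way the masks interact is easier to see in code. The following is a minimal pure-Python sketch of the modified loss, not the project's implementation: the dictionary layout of a slot, the weight values, and the `TEACHER_THRESHOLD` value are assumptions made for illustration.

```python
from math import sqrt

# Hypothetical weights and threshold; the project's actual values are not stated.
LAMBDA_COORD, LAMBDA_OBJ, LAMBDA_NOOBJ, LAMBDA_CLASS = 5.0, 1.0, 0.5, 1.0
LAMBDA_S, LAMBDA_U = 1.0, 0.5
TEACHER_THRESHOLD = 0.5


def slot_loss(target, pred, obj, noobj):
    """l_ij for one slot; target and pred are dicts with x, y, w, h, conf, probs."""
    xy = (target["x"] - pred["x"]) ** 2 + (target["y"] - pred["y"]) ** 2
    wh = ((sqrt(target["w"]) - sqrt(pred["w"])) ** 2
          + (sqrt(target["h"]) - sqrt(pred["h"])) ** 2)
    conf = (target["conf"] - pred["conf"]) ** 2
    cls = sum((t - p) ** 2 for t, p in zip(target["probs"], pred["probs"]))
    conf_weight = LAMBDA_OBJ * obj * (1 - noobj) + LAMBDA_NOOBJ * noobj
    return (LAMBDA_COORD * obj * (xy + wh)
            + conf_weight * conf
            + LAMBDA_CLASS * obj * cls)


def masks(has_object, labeled, teacher_conf):
    """Return (obj, noobj) following the definitions given with the loss."""
    obj = 1 if has_object else 0
    # Only pseudo-labeled slots can be demoted by a low teacher confidence.
    low_conf = (not labeled) and teacher_conf < TEACHER_THRESHOLD
    noobj = 1 if (not has_object or low_conf) else 0
    return obj, noobj


def yolo_ssl_loss(slots):
    """Weight each slot's l_ij by lambda_s (labeled) or lambda_u (unlabeled)."""
    total = 0.0
    for s in slots:
        obj, noobj = masks(s["has_object"], s["labeled"], s.get("teacher_conf"))
        weight = LAMBDA_S if s["labeled"] else LAMBDA_U
        total += weight * slot_loss(s["target"], s["pred"], obj, noobj)
    return total


box = {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.25, "conf": 1.0, "probs": [1.0, 0.0]}
pred = dict(box, conf=0.5)  # same box, but predicted confidence is 0.5
slots = [
    # Ground-truth box from a labeled image.
    {"target": box, "pred": pred, "has_object": True, "labeled": True},
    # Pseudo-label from the teacher, below the confidence threshold.
    {"target": box, "pred": pred, "has_object": True, "labeled": False,
     "teacher_conf": 0.3},
]
ssl_loss = yolo_ssl_loss(slots)
print(ssl_loss)
```

With these assumed values, the low-confidence pseudo-label still contributes to the coordinate and class terms, but its confidence error is weighted by $\lambda_{\textrm{noobj}}$ instead of $\lambda_{\textrm{obj}}$, and the whole slot is further scaled by $\lambda_u$.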

Results

I obtained the following results:

| | Teacher (60 epochs) | Student (30 epochs) | Student² (30 epochs) |
|---|---|---|---|
| training mAP | 0.79 | 0.82 | 0.81 |
| validation mAP | 0.67 | 0.70 | 0.68 |

As a proof of concept, these results show that the student performs better than the teacher (validation mAP of 0.70 against 0.67), but also that using the student to train a new student (Student² here) does not seem to bring any further improvement (0.68).


  1. J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6517-6525. https://doi.org/10.1109/CVPR.2017.690

  2. K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister, “A Simple Semi-Supervised Learning Framework for Object Detection,” 2020. https://arxiv.org/abs/2005.04757

Joceran Gouneau
CS Engineering Graduate - Listening for opportunities