STN-OCR, a single semi-supervised Deep Neural Network(DNN), consist of a spatial transformer network – which is used to detected text regions in images, and a text recognition network – which recognizes the textual content of the identified text regions. STN-OCR is an end-to-end scene text recognition system, but it is not easy to train. This model is mostly able to detect text in differently arranged lines of text in images, while also recognizing the content of these words. The overview of the system is shown in Figure 1.
Compared with most of the current text recognition systems, which extract all the information from the image at once, STN-OCR behaves more like a human. First, STN-OCR localizes text regions, and then recognizes the textual content of each text region. To do this, STN-OCR consists of two stages: text detection and text recognition. The complete stages are shown in Figure 2.
Text Detection Stage
The detection stage consists of three parts. (1)A function f_loc computed by a localization network. It is used to predict the parameters θ. (2) A sampling grid created by their predicted parameters. Sampling grid is used to define which part of the input features should be mapped to the output feature map. (3) Take the generated sampling grid and produces the spatially transformed output feature map O by a differentiable interpolation method.
The Localization network takes the feature map as input, and outputs the parameters θ of the transformation. In this system, it is used to take the image as input and predict N two-dimensional affine transformation matrices, and N can be a number of characters, words or text lines.
To be specific, the N transformation matrices A_θ^n are produced by using a feed-forward CNN together with an RNN. g_loc is also a feed-forward network. Each of the N transformation matrices is computed using the hidden state h_n for each time-step of the RNN.
The authors use ResNet as the CNN network, arguing that the residual connections of the ResNet help with retaining a strong gradient down to the very first convolutional layers, which makes the system faster and better performing compared with other structures like VGGNet. Bidirectional Long-Short Term Memory (BLSTM) unit is used as an RNN network. Furthermore, Batch Normalization is used in all experiments.
The gird generator produces N regular grids G_i^n of coordinates u_i^n, v_i^n of the input feature map by using a regularly spaced grid G_0 with coordinates y_h_o, x_w_o and the affine transformation matrices A_n^θ. By doing this, N resulting grids G_i^n with bounding boxes of the text regions can be extracted from the input feature map.
The N sampling grids G_i^n are used to sample values of the feature map at their corresponding coordinates u_i^n, v_i^n. However, these points will not always perfectly align with the discrete grid of values in the input feature map. Thus, the authors extract the value at a given coordinate by bi-linearly interpolating the values of the nearest neighbors. The authors get the values of the N output feature maps O^n at the location i, j by the following formulation. It is possible to propagate error gradients to the localization network by using standard backpropagation, for this formulation is (sub-)differentiable.
Text Recognition Stage
In the text recognition stage, these N different regions which produced by the detection stage are processed independently of each other. The authors also use ResNet architecture for the recognition stage, and they argue that using a ResNet in the recognition stage is even more important than in the detection stage, for the reason that the detection stage needs to receive strong gradients from the recognition stage in order to successfully update the weights of the localization network. A probability distribution yˆ over the label space L_epsilon are predicted in this stage, where Lǫ = L ∪ ǫ,
with L = 0 – 9a – z and ǫ representing the blank label. Each softmax classifier is used to predict one character of the given word.
The authors also suggest that using Connectionist Temporal Classification (CTC) to train the network and retrieve the most probable labeling by setting yˆ to be the most probable labeling path π.
L_epsilon^T is the set of all labels that have the length T and p(π|xn) being the probability that path π ∈ L_epsilon^T is predicted by the DNN. B is a function that removes all predicted blank labels and all repeated labels.
Experiments & Results
The authors have evaluated this network architecture on several scene text detection and recognition datasets, such as the SVHN dataset, Robust reading dataset, and the French Street Name Signs (FSNS) dataset. Table 1. shows recognition accuracy on the SVHN dataset. Robust reading dataset is used to explore the performance in detecting and recognizing single characters of the model. FSNS dataset is the most challenge dataset that is used in experiments. For the reason that it contains various lines of text with varying length embedded in natural scenes, and many images which do not include the full name of the streets. Figure 3. shows that this system is able to detect a range of differently arranged text lines and also recognize the content of these words.
Conclusion and Review
The authors proposed an end-to-end scene text recognition system – a single multi-task deep neural network. It contains a detection stage and a recognition stage, and the text detection stage was trained in a semi-supervised way. This system is able to detect a range of arranged text lines and recognize the content of these words. However, it is not fully capable of detecting text in arbitrary locations in images.
From my perspective, one of the highlights of this paper is that training the single neural network in a semi-supervised method. Because of the training data issue, unsupervised and semi-supervised deep mode learning will become more and more important in the future. In this model, the authors only use images and the labels for text contained in these images as input, and the overall loss function is only based on word recognition accuracy. The localization of the text is learned by the network itself. This means that this model can learn to decide where to look. This can also be considered as a kind of visual attention mechanism. But beyond that, it could be able to solve multiple problems such as translation, rotation, and skew by using affine transform. And I hope this semi-supervised solution for training a localization network could open up a new direction for future research not only for text, but also in the domain of object detection and recognition.
Technical Analyst: Ziyun Li | Editorial Review: Haojin | Paper source: https://arxiv.org/abs/1707.08831