Progress Update: 2nd Period Evaluation

Overview

For the 2nd phase, almost everything is finished, including all the sub-modules, which I will explain shortly.

Detection

I was planning on using Mask R-CNN as the detection model; however, due to training errors and server errors I could not finish the training, so I worked on some backup plans. The first attempt used cv2 and some other modules, with frames of [this video](https://www.youtube.com/watch?v=4oml2IoZf70) as input (the image has been processed so that it is easier for the model to do detection):

input image

The simple model gave unsatisfying results, although it works in other situations where the text is very small. (See the file TextDetect.py in the repo and also http://stackoverflow.com/a/23565051)

input image detection1
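
For reference, a minimal sketch of the kind of morphology-based detection TextDetect.py follows (based on the Stack Overflow answer linked above); the kernel sizes and thresholds here are illustrative assumptions, not the project's tuned values (OpenCV 4 return signature assumed for findContours):

import cv2

def detect_text_regions(image_path):
    # Grayscale, then a morphological gradient to highlight character edges.
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)

    # Binarize and close horizontally so the letters of a word connect.
    _, bw = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    close_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
    connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, close_kernel)

    # Keep contours whose bounding box is dense enough to look like text.
    contours, _ = cv2.findContours(connected, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        fill = cv2.countNonZero(bw[y:y + h, x:x + w]) / float(w * h)
        if fill > 0.45 and w > 8 and h > 8:
            boxes.append((x, y, w, h))
    return boxes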

The second model is EAST text detection. Due to computation power limitations I did not try all the pre-trained models, but resnet_v1_50 showed a very confident result:

input image detection2

As shown above, the detection result is good but not perfect. In future work I will test all the pre-trained models and see which one works best. For now, the work-around is to expand the result horizontally with extra boundaries to cover all the text.
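
A minimal sketch of that work-around, assuming axis-aligned boxes in (x_min, y_min, x_max, y_max) form; the padding amounts are placeholders rather than the module's actual settings:

def pad_box(box, pad_x=20, pad_y=5, frame_w=None, frame_h=None):
    # Widen an EAST box, mostly horizontally, so cropped text is not cut off.
    x_min, y_min, x_max, y_max = box
    x_min, y_min = x_min - pad_x, y_min - pad_y
    x_max, y_max = x_max + pad_x, y_max + pad_y
    if frame_w is not None:
        x_min, x_max = max(0, x_min), min(frame_w, x_max)
    if frame_h is not None:
        y_min, y_max = max(0, y_min), min(frame_h, y_max)
    return x_min, y_min, x_max, y_max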

Problem with Tesseract OCR and EAST

Tesseract is a very tricky thing to use. It works very well on text that has a mono-color background, or VERY limited variation in the background. As will be shown later in the results, EAST detection was able to give boundaries of roughly correct areas, but even when the area is correct, the bounds that EAST is supposed to give according to its paper are too tight. Below are two images; on the first one Tesseract gives perfect recognition while on the other it gives an empty string. The size difference is only 20 pixels. So far my work-around is to manually add some pixel margin onto the boundary returned by EAST detection.

input image tesserect1

>>> file_to_string('./finalImage.png', 'ENG')
'PRESIDENT TRUMP SPEAKS IN THE ROSE GARDEN'

input image tesserect2

>>> file_to_string('./finalImage1.png', 'ENG')
''
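
file_to_string above is the project's small OCR helper; a minimal sketch of how such a wrapper could look, assuming pytesseract and Pillow as the backend (an assumption, not necessarily the actual implementation):

import pytesseract
from PIL import Image

def file_to_string(image_path, lang='ENG'):
    # Tesseract itself expects lowercase language codes such as 'eng'.
    text = pytesseract.image_to_string(Image.open(image_path),
                                       lang=lang.lower())
    return text.strip()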

Connecting EAST to the rest of the module

The EAST detection code is hosted on a server, which is launched locally; the main process then sends the picture to the server through a POST request and collects the detection result back.
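
A minimal sketch of that client side using the requests library; the port, endpoint, and form field name are assumptions and depend on how run_demo_server.py is set up:

import requests

EAST_SERVER = 'http://localhost:8769/'  # assumed address of the local server

def detect_text(frame_path):
    # POST the frame to the local EAST server and return its JSON result,
    # which is expected to contain the detected text boxes.
    with open(frame_path, 'rb') as f:
        response = requests.post(EAST_SERVER, files={'image': f})
    response.raise_for_status()
    return response.json()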

Using EAST in this module

To use EAST detection in this module, clone this repo and download a pretrained model (or train it yourself) into /models,

replace run_demo_server.py with the file of the same name in the repo for this project,

cd to that folder,

and run:

python3 run_demo_server.py --checkpoint_path models/east_icpr2018_resnet_v1_50_rbox_1035k/

The path depends on the folder you put the pretrained model in.

Text Merging and Text Similarity

Text similarity is calculated from two parts for each text object: position and content. Position similarity is calculated using the center coordinate of each polygon box, and content difference is calculated using inversion pairs. If the text is exactly the same, a 0.8 threshold will allow two text objects whose center coordinates differ by up to 20 pixels to be considered the same text object.
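
A minimal sketch of how the two signals could be combined; the weighting, the pixel scale, and the inversion-pair counting here are illustrative assumptions rather than the module's exact formula:

import math

def inversion_pairs(a, b):
    # Map each character of b to its first occurrence in a, then count pairs
    # that appear out of order; identical strings give zero inversions.
    positions = [a.find(ch) for ch in b if a.find(ch) != -1]
    return sum(1 for i in range(len(positions))
                 for j in range(i + 1, len(positions))
                 if positions[i] > positions[j])

def similarity(obj_a, obj_b, pixel_scale=100.0):
    # obj = {'text': str, 'center': (x, y)}; returns a score in [0, 1].
    dist = math.dist(obj_a['center'], obj_b['center'])
    position_sim = max(0.0, 1.0 - dist / pixel_scale)
    content_sim = 1.0 / (1.0 + inversion_pairs(obj_a['text'], obj_b['text']))
    return position_sim * content_sim

def same_object(obj_a, obj_b, threshold=0.8):
    # With identical text (content_sim == 1), a 20-pixel center offset still
    # passes the 0.8 threshold under this scaling.
    return similarity(obj_a, obj_b) >= threshold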

For text merging, the workflow is to first detect similarity based on content and position and then do the merging. This is to avoid situations where text is partially covered by some object in some of the frames; by doing so, it will be recognized as one text object rather than two. This also compensates for some mis-recognition in the text recognition phase.

This is done through dynamic programming and caching so that the process is fast enough.

text_merge.merge_text(['read','reaed', 'rfeaed'])
-> 'reaed'
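
For illustration, a minimal sketch of one way such a merge could work, picking the candidate most similar to all the others and caching the pairwise scores; the use of difflib here is an assumption, not the module's actual implementation:

from difflib import SequenceMatcher
from functools import lru_cache

@lru_cache(maxsize=None)
def pair_similarity(a, b):
    # Cached pairwise string similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def merge_text(candidates):
    # Return the candidate with the highest total similarity to the others.
    return max(candidates,
               key=lambda c: sum(pair_similarity(c, other)
                                 for other in candidates if other != c))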

Test Result

The results are then collected and ordered by frame. For test purposes I lower-cased all of them. The sample result from the first 500 frames is stored in result.log under the repo.
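
A minimal sketch of how per-frame recognitions could be folded into the (text, start_frame, end_frame) entries shown below; the function and its input layout are illustrative assumptions:

def build_timeline(frames):
    # frames: per-frame lists of recognized, lower-cased strings, starting at
    # frame 1. Consecutive appearances of the same string are folded into one
    # entry with a start and end frame.
    entries, open_entries = [], {}
    for frame_no, texts in enumerate(frames, start=1):
        for text in texts:
            entry = open_entries.setdefault(
                text, {'text': text, 'start_frame': frame_no})
            entry['end_frame'] = frame_no
        # Close entries for strings that did not appear in this frame.
        for text in list(open_entries):
            if text not in texts:
                entries.append(open_entries.pop(text))
    entries.extend(open_entries.values())
    return sorted(entries, key=lambda e: e['start_frame'])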

One problem that occurs is that the bounding box is not precise enough for Tesseract OCR to pick up and recognize the text, and due to the merging logic, text that does not have enough similarity to the previous frame will be considered a different object. Thus, there are situations (for now) where text that should be considered continuous is separated.

Another issue is also related to this, which I had already stated in the section “Problem with Tesseract OCR and EAST”: Tesseract is very picky about the text area. However, by manually widening the bounding box and removing the overlapped detections I got a perfect result.

[{'text': 'bloomberg', 'start_frame': 1, 'end_frame': 499}, {'text': 'president trump speaks in the rose garden', 'start_frame': 1, 'end_frame': 30}]

The result.log file also records the detected text area in each frame.
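
The overlap removal mentioned above can be done with a simple intersection-over-union check; a minimal sketch, assuming axis-aligned (x_min, y_min, x_max, y_max) boxes and an illustrative threshold:

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def remove_overlaps(boxes, threshold=0.5):
    # Keep a box only if it does not overlap an already-kept box too much;
    # larger boxes are considered first.
    kept = []
    for box in sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
                      reverse=True):
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept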

Phase 3

In phase 3, I will be covering the following work:

1. Pack the code into a Singularity container.

  • In phase 1 I had a lot of issues trying to invoke a GPU node inside a Singularity container, despite many attempts. Since the work now does not strongly require a GPU, but rather just a CPU, I have plenty of experience with that.

2. Test different models of EAST.

  • Right now the major issue is that the bounding box from EAST is not perfect. One solution is to see whether other pretrained models are better; another is to change the way similarity detection works. Right now it is based on both location and text; by emphasizing location more and making some other judgements it should also work well.

3. Test the code on different languages.

  • Theoretically, EAST and Tesseract work across all languages. This needs to be tested, and a mechanism to switch between different detection models needs to be implemented.

4. Format the results according to Red Hen's requirements.

Written on July 20, 2019