I spent some time training several cascades to detect cars in ego-view automotive videos,
and will now document what I've learned.
I used the existing OpenCV tools.
-> pos/ 1000 images containing the desired object
-> pos.info info file listing, per image, the filename, the number of objects in the frame, and their bounding boxes in the format x y width height
-> neg/ 2000 images that do not contain cars at all
-> negs.txt text file containing the filenames of all negative images
For the positive images I used tight bounding boxes. You do not actually need as many negative images as you want negative samples later on: the training script samples patches from the negative images you provide, so the number of images can be smaller than the number of negative samples.
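To make this concrete, here is a rough conceptual sketch of how many negative patches can be drawn from a single negative image. This is not OpenCV's actual implementation, and the function name `sample_negative_patches` is made up for illustration:

```python
import random

def sample_negative_patches(img_w, img_h, win_w, win_h, n, seed=0):
    """Sample n random window positions (x, y, w, h) from an image of
    size img_w x img_h -- a sketch of why few negative images can
    still yield many negative samples."""
    rng = random.Random(seed)
    patches = []
    for _ in range(n):
        # any position where the window still fits inside the image
        x = rng.randint(0, img_w - win_w)
        y = rng.randint(0, img_h - win_h)
        patches.append((x, y, win_w, win_h))
    return patches

# A single 640x480 negative image can yield many distinct 32x32 patches:
patches = sample_negative_patches(640, 480, 32, 32, 100)
print(len(patches))  # 100
```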
For some of the positive images, the bounding boxes have been annotated by hand (ground-truth data).
Some lines from the pos.info file:
-------- snip --------
pos/21_00011043.png 1 301 161 109 106
pos/21_00011044.png 1 302 161 110 106
pos/21_00011045.png 1 304 161 109 106
pos/21_00011046.png 1 305 161 109 106
pos/21_00011047.png 1 307 162 109 106
-------- snap --------
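The format of each line is easy to parse: filename, number of objects, then four integers (x y width height) per object. A small sketch, assuming the space-separated layout shown above:

```python
def parse_info_line(line):
    """Parse one line of a createsamples info file:
    <filename> <num_objects> followed by num_objects * (x y w h)."""
    parts = line.split()
    filename = parts[0]
    n = int(parts[1])
    coords = list(map(int, parts[2:2 + 4 * n]))
    # group the flat coordinate list into (x, y, w, h) tuples
    boxes = [tuple(coords[i:i + 4]) for i in range(0, 4 * n, 4)]
    return filename, boxes

filename, boxes = parse_info_line("pos/21_00011043.png 1 301 161 109 106")
print(filename, boxes)  # pos/21_00011043.png [(301, 161, 109, 106)]
```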
To prepare the data for the cascade training:
opencv_createsamples -info pos.info -num 1000 -w 32 -h 32 -vec cars_32_32.vec
-info name of the info file
-num number of positive samples to generate
-w, -h width and height of the generated samples in pixels
-vec the output vector file to which the samples will be written
To check that everything went well, you can visualize the samples stored in the vector:
opencv_createsamples -w 32 -h 32 -vec cars_32_32.vec
As already said, you do not need as many negative images as you want negative samples (-numNeg), as patches are sampled from your negative images.
The maxFalseAlarmRate seems high at first, but remember that this is a cascade: the rate has to be raised to the power of the number of stages you are using,
so assuming you are using 15 stages: 0.5^15 = 0.000030517578125.
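The same arithmetic also tells you how many stages you need for a given overall false alarm rate. A quick check (the target rate of 1e-6 below is just an example value, not from the training above):

```python
import math

max_false_alarm_rate = 0.5
num_stages = 15

# overall false alarm rate of the whole cascade
overall = max_false_alarm_rate ** num_stages
print(overall)  # 3.0517578125e-05

# stages needed to push the overall rate below a target
target = 1e-6
stages_needed = math.ceil(math.log(target) / math.log(max_false_alarm_rate))
print(stages_needed)  # 20
```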
opencv_traincascade -data cascade_32_32_20_haar -featureType HAAR -vec cars_32_32.vec -bg negs.txt -numStages 20 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -numPos 1000 -numNeg 2000 -w 32 -h 32
So, how do these measures actually look on a video sequence?
In the video below, the upper left shows the manually annotated ground-truth bounding boxes, the upper right a HAAR cascade, the lower left a HOG cascade, and the lower right an LBP cascade. All cascades have been trained on the same data (width 32, height 32, 1000 positives, 2000 negatives, maxFalseAlarmRate of 0.5, 17 stages).
Sorry for the bad quality, but I had to find some kind of compromise between image quality and file size.
The detections are quite OK, but they could be better. So what could be the issues?
Scaling most certainly is one. If you look at these three images, you can see that their appearance in pixels is quite different, even though it is the very same object. The top row shows the original size in the image; the bottom row shows them all scaled up to the same size. Since the appearance varies drastically, the feature descriptors used for classification will most probably vary as well:
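This is why detection is done over an image pyramid: the window stays fixed at 32x32, and the image is repeatedly downscaled so that larger cars eventually fit the window. A minimal sketch of how many pyramid levels a detectMultiScale-style search would visit (the 1.1 scale factor and 640x480 image size are assumed example values):

```python
# Fixed detection window; the image shrinks by scale_factor per level.
scale_factor = 1.1
img_w, img_h, win = 640, 480, 32

scales = []
s = 1.0
# keep shrinking while the window still fits into the scaled image
while img_w / s >= win and img_h / s >= win:
    scales.append(round(s, 3))
    s *= scale_factor

print(len(scales))  # number of pyramid levels searched
```

The smaller the scale factor, the more levels are searched and the better objects of in-between sizes are covered, at the cost of runtime.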
Let's have a look at the data we are using to train. I took the positive images from three different annotated sequences;
each of these sequences contains 30 to 40 different annotated cars. Of course we get a certain variation in illumination, viewpoint and appearance,
but is that enough?