Category Archives: Machine vision

Tracking Objects from Video Feed Part III: Detecting Object Displacement

In the previous part we provided one solution for detecting and identifying a stationary object of certain shape in video feed. In this part we focus on tracking the object and try to analyze a simple path of a moving object. By simple, I mean *really* simple, we try to detect the “from” and “to” positions of the object – where it started and where did it end up.

When milliseconds count

Compared to detecting objects from a static image or frame, detecting object displacement presents us a new, tough requirement: We have to analyze the frames real-time and thus, performance is the key. We cannot simply use all the methods described earlier, since, especially on mobile devices, they simply take too much time to compute. Ideally, depending on the framerate and the estimated speed, relative to our field of view (FoV), of the moving object, our operation for tracking the image should take less than 10 milliseconds per frame. It is quite obvious that the complexity of any algorithms we use is relative to the frame size – the less pixels we have to analyze, the faster the operation.

Instead of using all the methods described earlier (chroma filter, object mapping, convex hull etc.) to track the object, we utilize them to “lock” the target object. In other words, we identify the object we want to track and after that we can use far lighter methods to track its position. We don’t have to process the full frame, but only the area of the object with some margin. This helps us to reduce the resolution and run our operations much quicker.

Since our target object can be expected not to change color (unless we’re tracking a chameleon), we can do the following:

  1. Once we have detected the object from the image/frames and we know its position and size (number of pixels horizontally and vertically where the object is thickest) we can define a rectangular cropped area with the object in the center and with a margin of e.g. 15 %.
  2. Apply chroma filter to this cropped area for each frame and keep track of the position, which is defined by the intersecting point of virtual lines placed where we have most pixels horizontally and vertically. Figure 9 illustrates tracking the locked target object.
    • If the center point displacement exceeds our predefined delta value, we move to the next phase, where we analyze the object movement.

VaatturiLockedToObjectScaledFigure 9. Target object locked, and tracking limited to the region marked by the green rectangle.

It moved, but where did it go?

How do we implement the next phase then? It seems that for more accurate analysis of the object movement, we must use more complex methods than we used for detecting the initial displacement of the object. What if we record the frames for later analysis? Since we may not know or forecast when the object is going to move, depending on the frame size, the video we record might be huge! Fortunately, there is a way to store the frames while still keeping the required size fixed: A ring buffer (also known as circular buffer). In short, ring buffer is a fixed size buffer and when you reach the end, your start again from the beginning and replace the frames recorder earlier. See this article about buffering video frames by Juhana Koski to learn more. Because we observe the initial displacement of the object in real-time, we can record few more frames (the estimated time until the object exists our FoV) and then stop. After this we no longer have the real-time requirement and we can take our time analyzing what happened to the object after its initial displacement.

Let’s say that we want to get the last frame of the object until it leaves the FoV. We could use the following algorithm:

  1. Start iterating from the last recorded frame towards the frame of the initial displacement:
    1. Treat each frame as we did in the beginning when we found the desired object from the image using chroma filter, object map, convex hull and shape analysis.
    2. If we find an object satisfying our criteria, we stop expecting it to be the object we were tracking.
  2. We now have the object position from the beginning of its movement to the last known position in our FoV (see figure 10). This means we can at least calculate the angle and relative velocity of the object.

 VaatturiObjectMotionCapturedScaledFigure 10. Object (USB cannon projectile wrapped with pink sticker) motion captured.

Challenges and future development

Lighting challenges are typical with image pattern recognition solutions. Changes in lighting conditions affect the perceived color and that makes the selection of parameters (YUV value and threshold) for chroma filtering difficult. Camera hardware and its settings play a significant role here: Longer the exposure time, easier it is to detect the object properly. However, with long exposure time, it’s harder to capture the object movement. The object in motion will have a distorted shape and its color will blend with the background. Thus, it becomes more difficult find the object in the frames when it’s moving. On the other hand, if we use short exposure time, we get less light per frame and the color difference of the object compared to the background might be insufficient.

The current implementation of the solution relies on manual parameter setting for both color and threshold. In the future, we could try to at least partially automate the parameter setting. We would still have to roughly know the shape and size of the object we want to find. We could apply edge detection algorithms to boost the color filter and get more accurate results with stationary objects. Of course, when an object is moving fast, the edges may blur. However, since the current implementation provides us with the frame of the initial object displacement, we can compare that to the later and see the changes in e.g. chroma. The moving object will leave a trace even if it’s blurred with the background or distorted.

And then there was the code…

The related code project is hosted in GitHub:

See the file delivered with the project to learn more. The project is freely licensed so you can utilize any bits of the code anyway you like. Have fun!

ThatsAllFolks…or is it?

EDIT: Turns out that’s not all folks. See how everything turns out here.

Tracking Objects from Video Feed Part II: Identifying a Stationary Object

In the first part we introduced the raw image data formats, which we receive from camera. Now it’s time to get to the good stuff. First we formulate our goals and start with trying to extract an object from video feed. In order to track something we first need to find the “something”.

Think before you act?

When solving a complex problem, it may be sometimes useful to start from the desired result and work your way from the result to the beginning in the problem solving pipeline. This way we can break the problem down to several resolutions – steps in between leading to the final solution if you like. In case of our problem, this reversed pipeline would look something like in the table below.

Table 1. Problem solving pipeline reversed.

What is the result? How do we get there?
An object, with a specific shape, identified in the image (location, size, shape etc.) Find the center of mass of the two dimensional object and its outline.
Outline of the detected object. Apply convex hull algorithm to object map (extended binary image, where the background is removed).
Object map, where all significant (suspected) objects are separated and the background is removed. Individualize objects from a binary image.
Binary image in which objects have value 1 and background value 0. Apply algorithm to extract objects with some criteria from the background. E.g. chroma filter.
Chroma filter implementation to extract objects from image. Start coding!

Chroma filter

Here the word “chroma” is a bit misleading, since we also inspect luma (Y) value when filtering the image. The algorithm is as follows:

  1. Set a desired YUV value as a target. This is basically the color we want to find from the picture i.e. if a ball is blue and the background is orange, we try to set the value as close to the blue color of the ball as possible. We also need to set a threshold value, which defines the allowed difference between the target value and the blue color we accept.
  2. Iterate the image data (byte array), pixel by pixel or block by block. By block we mean the unit shown in figures 1 and 2. In the case of NV12, there are four Y values, one U and one V. We could calculate the average of the four Y values, but since the values are likely to be almost the same, for the sake of optimization it’s enough to choose just one. Then we simply compare the set target values to the measured ones, as pseudo code:

IF Difference(target_value_Y, measured_Y) < threshold AND
Difference(target_value_U, measured_U) < threshold AND
Difference(target_value_V, measured_V) < threshold
(Mark this pixel/block as selected)
(Mark this pixel/block as not selected)


 VaatturiOriginalImage2Scaled  VaatturiChromaFilterRedScaled

Figure 4. The original image on left and the image with chroma filter (red) applied on right.

In the case of NV12 we can utilize the Y plane (since it matches the full size of the image) and convert it into a virtual binary image: Y value 0 indicates 0 and Y value 255 indicates 1 (as done on the right-hand-side image in figure 4). We can then use it to map objects as explained in the next chapter.

In our code project we have a native effect, which executes the aforementioned chroma filter for a single frame: ChromaFilterEffect.

Mapping objects

The simplest source for mapping objects is a binary image e.g. a bit array where 0 denotes background and 1 an object. What we need to do is to identify objects that are not joined (their pixels don’t touch each other). We can do this by assigning each object a unique number. The end result can be, for instance, an array of integers, where 0 still denotes background, but a value greater than 0 is an ID of a certain object. Figure 5 illustrates this process from the binary image to resulting object map.

BinaryImageAndObjectMapFigure 5. Binary image (left) and object map (right).

The principle of the algorithm for mapping objects is very simple: Check each pixel of the binary image and assign a unique value to those pixels which are adjacent. If we break this down into more detail, we can have something like this, when the image is in form of an array:

  1. Create an object ID counter (let’s call the counter c), which provides a unique value for the pixels of each object. You can start with value c = 1.
  2. Go through each pixel starting from 0 to N – 1, where N is the number of pixels:
    • If the pixel has value 0 (background):
      • Do nothing.
    • If the pixel has value 1 (an object):
      1. If the previous pixel had value 0:
        • Set the object ID counter value so that the new value is guaranteed to be unique (let’s call this unique ID U), c = U, since this can be a new object. It is important that the value is truly unique – mere increment by one is not enough.
      2. If there is an adjacent non-zero pixel above (see figure 6):
        1. Get the object ID of the adjacent pixel (let’s call this value as A).
        2. Backtrack (with a separate iteration) and replace all pixels, which have the current value of counter c with value
          • You can stop backtracking, if you encounter a line with no pixels having the current value of the object ID counter.
        3. Set the value of the object ID counter to the value of the adjacent pixel (c = A) and continue the original iteration from the index before backtracking.
      3. Set the value of the current pixel to match the current value of the object ID counter (value of c).

ObjectMapCreationProcessFigure 6. Object map creation process.

See ImageProcessingUtils::createObjectMap method where the aforementioned algorithm is implemented with C++.

Note that after applying the algorithm described above, you will not end up with an ordered object map (map with ordered IDs: 1, 2, …, N). Instead, you will have a map where each object has a unique ID, but they can be arbitrary, e.g. 2, 5, …, N + 100. Sorting the map is quite trivial: Get a list of current IDs (one iteration), create a map where each existing ID has an ordered counterpart (2 -> 1, 5 -> 2, …, N + 100 -> N) and replace the values in the object map (second iteration).

Identifying object shapes

Now that we have an object map, we can easily see the size of each object. But what about the shape? It depends on the way we want to classify shapes e.g. are we interested in round shapes or squares. In addition, depending on our object extraction methods, the objects may not be perfectly extracted; some of what was part of the object, could have been interpreted as background. For instance, in figure 6, the largest object could have been a ball with a large stripe on it that was lost when our chroma filter was applied. Luckily, there is a simple algorithm for “fixing” the shape, called convex hull.

Convex hull can be considered as an elastic band wrapped around an object. If we apply the algorithm to the first object in the object map we created earlier, the stripe in the center is discarded and a circle shape emerges:

ConvexHullFigure 7. Convex hull algorithm applied to the first object in the object map.

 VaatturiConvexHullScaledFigure 8. Convex hulls (green lines around the bird and red lines around small piece on X-Wing figher side panel) drawn to surround the filtered red areas.

There are number of different convex hull algorithms. The one used in this solution is called monotone chain (see ImageProcessingUtils::createConvexHull method). For more information about monotone chain including code samples can be found from Wikipedia:

Convex hull gives us a list of points (pixels coordinates), which form the outline of an object. However, we still have to determine the shape of the object somehow. If it’s a ball shape we’re after, we could find the center of the object – e.g. center of mass where one pixel equals one mass unit – and then see how much the outline differs from a circle, where the radius is width or the height of the object divided by 2. The less the difference, the more the shape is like a circle, which of course is the shape of a ball in two dimensional presentation.

to-be-continued-back-to-the-futureNow that we have found our object let’s try to track it. See the next and the final part of this thrilling trilogy and find out if we succeeded in our goal to detect the changes in object position!

Tracking Objects from Video Feed Part I: Image Data Formats

It so happened that a couple of months ago my colleague and I were introduced an interesting challenge: Could we use a phone camera to capture and analyze the path of a moving object. There were (and still are) some wild ideas what we could do in the not-so-distant future, but somewhat excited we decided to investigate what we could do with the current resources (hardware and APIs) available. I personally prefer setting the bar high, but, of course, one needs to start from the ground up – at least with no super powers.

VaatturiLockedToObjectScaledThe might eagle will never know what hit him…

In this three part blog jamboree I’ll unveil where we got (so far) and how we did it.

Getting Started

So we had our challenge formulated: Identify an object from the video feed (the stuff that the camera on the device feeds you) while it’s stationary. Then lock into that object – watch it like a hawk – and if it moves, try to see where it went. To be precise “where it went” means finding its last position in the field of view (FOV).

It was apparent to us from the beginning that we wanted to go native to get the maximum performance. At this point we only had very vague vision of the algorithms (and their complexity), which we would likely use. Anyhow, we expected to be dealing with some heavy stuff. Luckily, another nice fellow at work hinted us to checkout this MediaCapture API sample, and it turned out it was a good starting point. Thereon we focused understanding the image data formats in order to be able process the data.

Image Data Formats

The image data received from camera hardware (the camera of a smartphone, tablet or simply an external webcam) comes in YUV color space. In short, Y (luma) defines the brightness while U and V (chroma) define the color of a pixel. The notion Y’CbCr would be more accurate as Y’UV refers to an analog encoding scheme, but for the sake of simplicity we use U to denote Cb and V to denote Cr in this article.

Depending on the hardware, typically two type of specific YUV formats are used: YUV2, which is commonly used by e.g. webcams, and NV12, commonly used by smartphone cameras. Although they are both based on YUV color space, their structure – size and order of Y, U and V bytes, are different. Thus, the frames need to be either converted to some specific format prior to processing or we have to implement separate methods to process the frames correctly based on the format.


YUV2 format has the following properties:

YUV2 can be seen as a byte array where the first and then every second value is a Y (luma) value while chromas (U and V) fill the blanks in between so that U comes first as shown in figure 1. Each chunk has the size of 4 bytes (32 bits).

YUV2Figure 1. YUV2 format.

For more details about YUV12 format, visit the following pages:


NV12 format has the following properties:

  • 4:2:0 chroma subsampling
  • 12 bits per pixel

Let’s consider a VGA image (640×480 pixels) encoded in NV12 format. The size of the Y plane covers the whole resolution a byte per pixel i.e. the Y plane is 640×480 bytes. The U/V plane, which follows, is half of the Y plane: 640×240 bytes (see the figure below).

NV12PlanesFigure 2. NV12 image format.

The figure below depicts four pixels in NV12 format. As you can see each pixel has a dedicated Y value, but only ½ of chroma (¼ of U and ¼ V values to be precise).

FourPixelsNV12Figure 3. A four pixel section in NV12 image data.

The block in figure 2 on the right-hand-side describes the corresponding location of the bytes in NV12 byte array A, where h is the height of the image and w is the width of the image (in bytes).

For more details about NV12 format, visit the following pages:

Coming next…

Check out the next part where we dive straight into the methods of extracting objects from video frames.

BatmanTrappedWill the caped crusader overcome the challenge or will he be trapped by the algorithm villains? Tune in tomorrow (or right now if you like) – same dev-time, same dev-channel!