Category Archives: C++

Finding 3D ROI With Point Clouds

I have been lucky and, as a perk of the job I do, have had the pleasure of following and working with many exciting and innovative start-ups. It gives me twice the joy if the said start-up focuses on making the world a better place – which I am happy to say happens quite often! One of those start-ups is Neuro Event Labs. Their goal is to make the treatment of epilepsy and other neurological diseases easier and more affordable. Needless to say, their work has many different moving parts, but technologies they utilize include the use of depth cameras. In case you’ve never heard about depth cameras: In addition to Microsoft Kinect, Intel has created wide range of RealSense devices, that will unlikely break your bank should you want to get one to play with. Depth cameras basically just measure the distance to what they see; the pixel values define distance, not color. While they solve bunch of complex problems, I got interested in one specific ‘bit’ of a problem and wanted to write my two cents on the matter…

Here’s the general problem statement: How to find the 3-dimensional region of interest (ROI) of cyclic motion (e.g. an object following the same path repetitively) from a set of (depth) frames?

Why cyclic motion? If the object follows a random path and/or even goes outside of the frame, what’s the point, really?

As Simple as Subtraction

In one my ‘Tracking Objects’ blog posts I wrote about calculating delta between two frames. If instead of color images, where a single pixel can have, and normally does, multiple values (e.g. R, G and B), depth frames make calculating the delta easier since one pixel corresponds to a single value. Thus, calculating the difference is but a matrix subtraction!

Delta frame calculationFigure 1. Delta frame calculation. From left to right: F1 – F2 = Fd

As seen in the example above in an ideal case (no noise) only the pixels that indicate change (motion) have a nonzero value. Of course, in real life you tend to have some level of noise. Perhaps, the most intuitive way to reduce the effect of noise is to use some threshold value. Let us indicate that value as T. By observation it is not difficult to find a good threshold value, but you can also calculate the approximate T as long as you have frames that you know for certain do not contain any motion. The latter is helpful, when the input device or the environment varies. Needless to say that noise in depth frames goes both ways (closer and farther) so T will indicate an absolute value (|t|).

If the contours in the frame are fairly smooth (gradients are gentle), more sound solution to deal with noise is to use some convolutional method, where the evaluation of a pixel depends on the other pixels surrounding it. This post is not about noise reduction, but if you want to know more, there’s no shortage in information in the internet – start by checking the Wikipedia page on the subject.

At this step we could do more than just settling on the subtracted matrix; we could, for example, replace the nonzero values with the actual distance of the object from the camera. We have the necessary data in frames 1 and 2. Then the combination of all the delta frames would trace out the path of the moving object and the resulting matrix would already give us the 2-dimensional ROI! But let’s not stop here…

Cloudy with a Chance of Points

If our object is strangely oriented and its path of motion extends in all axis, we may want to retain some of that depth information for more precise analysis. Should our delta matrix then be 3-dimensional? Well, why not! One way to represent a 3D matrix in compact form (i.e. with less bytes when the matrix contains a lot of zeros) is to use a point cloud.

When I first heard my colleague, Sachin, who specializes in machine learning (among other things), use the term “point cloud” I had to ask what it meant. And I felt stupid after getting the answer. Yet, I think “point set” would be better (because “set of points” is too long, obviously). At the time of writing this, the first sentence in Wikipedia defines a point cloud as follows: ‘A point cloud is a set of data points in some coordinate system’. Simple as that. The definition does not say whether the points are ordered or not etc. I guess the cloud bit gives one the impression that the points do not have to obey any specific shape as in being bounded by something.

But to the point (pun intended)! In the following figure we have the single delta frame (from figure 1) represented as a point cloud (the blue dots in figure 2 match the red pixels in figure 1):

Delta frame point cloud scatter plotFigure 2. Point cloud constructed based on the calculated delta frame in figure 1. Created with Plotly.

Now if we would’ve replaced the points with the actual depth values (z axis) of the object prior to populating the point cloud and input the frames to cover the entire motion cycle, we had ended up with the 3D space occupied by the ‘shell’ of the object! Neat, huh? Furthermore, this allows us to construct a 3-dimensional ROI. Sorry for not having another image to point this out, but I hope you can form the mental image.

If you’re lucky or can control the environment so that the object’s path of motion is parallel to one of the axis, you can construct the 3D ROI by simply creating a bounding box around the points in your cloud using xmin, xmax, ymin, ymax, zmin and zmax. However, if the orientation is not under your control, you can still do this after applying a linear transformation (rotation) first. The transformation that provides you with a bounding box with the least volume is the one.

Note that you can have the bounding box updated every time when you add a point to your cloud by updating min/max values (if no transformation is necessary). While this does not reduce the complexity compared to adding all the points and calculating the bounding box after – this is by the way a delusion that I tend to fall into continuously – it divides the load evenly, which may be useful.

So that this wouldn’t be all talk, here is some code too:

It’s a simple C++ point cloud implementation of my own. It’s certainly not the fastest nor the prettiest, but I believe it provides a nice description in code form that is easy to understand. I’ve played with it a bit, but if you want to put it into some serious use, I recommend running some tests just to make sure it’s rigid.


Tracking Objects from Video Feed Part IV: New Hope for Circular Shapes

So, turns out part III wasn’t the last of it – I present to you part IV, which I promise will be at least as good as The Kingdom of the Crystal Skull was as far as sequels after the third one go… In the previous chapter of this story, I mentioned few ideas to improve the reliability and robustness of various bits in the object tracking pipeline. I did, in fact, implement a very simple noise removal functionality, edge detection and chroma delta i.e. the difference between two frames presented as a binary image. I also made the system to be biased towards things, which are round, like balls (or circles if you will since we are dealing with 2D images – but I like the words ballzz better being an immature halfwit).

What’s with the noise!?

There are number of methods to remove noise from an image. Removing noise is essential especially in cases where you want to apply any sort of edge detection. One of the obvious choices is using a Gaussian filter and it really isn’t hard to find ready-made algorithms from the interwebs regardless of the coding language you chose (MUMPS and Brainfuck excluded). Did yours truly then incorporate a Gaussian filter into the project? You bet your ass he did not. Instead, I opted for, although quite similar coding-wise, a nearest neighbor smoother: For every pixel in the image calculate the average of all eight neighbors of said pixel and apply that value to the pixel in question. Super simple! And effective! However, even though this method is slightly quicker than using e.g. 5×5 Gaussian filter, it still takes its toll: I averaged approximately 350 milliseconds per 720p frame on Lumia 930[1].

Software developer enchanted by a lantern
Edges extracted from the original image with noise (center) and with noise removed (right). Hint: Click the image to enlarge.

On the edge

Canny edge detector is a pretty good choice for your edge detection needs. It’s robust, works well and has a funky name. Did I decide to go with a different solution? You betcha! Why? Well, because I’m lazy, and settled for lesser result. My implementation also performs better i.e. less milliseconds spent per frame. What I do is as follows:

  • For every pixel starting from the top left corner handling a horizontal line at a time:
    1. Calculate the difference of the current pixel to the one on the right (if one exists) – we could call this difference “gradient”
    2. Calculate the difference of the current pixel to the one below (if one exists) – let’s go crazy and call this a “gradient” too
    3. If the sum of the gradients exceeds a set threshold value, mark this pixel as part of an edge. If not, it’s not an edge.

“So, what about the pixels on the left and above?” one might ask. “I don’t give a ….” is my answer. So far my solution is sufficient for my needs and that’s all I really care about.

Chroma delta

As described in the previous part, chroma delta takes two frames and calculates their difference and presents it as a binary image. Each pixel has components with three values, whether it’s RGB or YUV, and calculating the overall difference of each of these values for each corresponding pixel in both frames and using a threshold value, we get a binary value for the corresponding pixel in the delta frame. In my case I utilize the Y-plane and set the value 0xff (255) for the pixel, which has changed significantly, or 0x0 (0) for the ones that remain fairly unchanged.

Chroma delta gone bananas
How to make a code monkey mad? Touching his/her banana, that’s how!

So to make it clear: The right-most image displays the change between the two others. See how even the noticeable change in the lighting condition still yields a result. If we know that the banana was in the place depicted in the left image, by looking at the delta image, we can deduce that it’s no longer in that position and by looking at the other shape it’s quite obvious that it moved to the right.

Them balls

When the application is in charge of choosing the object to track, it goes without saying that you have to device some sort of traits for the desired object. My life experience has thought me that round things are more likely to move than rectangular things. I suppose the first guy who realized this invented the wheel. Thus, I’m just as clever. But how do you programmatically evaluate whether a thing is round or not? Especially when despite of your more or less advanced image processing methods do not fully extract the shape of an object?

“Just call me Darth Balls… Bong.” – Jay (from Jay and Silent Bob Strike Back, 2001)

I came to a conclusion that I would have to find a way to calculate how well my neat convex hulls would match a circle. Again, I looked for an answer in the interwebs and discovered that the problem I needed the solution for is called Smallest-circle problem and luckily a sharp, Austrian dude, Emo Welzl, had already proposed a recursive algorithm, which not only solves the problem but does it in mere O(n) time. However – being a lucky bastard – since I already had my convex hulls, I could use a more straightforward solution to create my minimal enclosing circles – here’s how:

  • Find the two vertices, in the convex hull, furthest apart. Their distance is the diameter of your enclosing circle.
  • The center point of the line segment between those two vertices is also the center point of the circle (see the image below).
Minimal Enclosing Circle
Minimal enclosing circle based on the convex hull reconstructs the shape of the ball.

Now that you have the circle, you can compare the traits of the convex hull to see how well they align with the minimal enclosing circle. I tried calculating the difference of the vertices of the convex hull to the circumference of the enclosing circle and it worked out pretty well. I’m sure there are other and even better ways to evaluate the roundness (or “eccentricity” as intelligent people would say). Do also keep in mind that you need to normalize the error based on the object’s size unless you want to be biased towards smaller objects.

Teh codez

Here is the code corresponding to the scribbles in this blog post:

Note that, as of writing this, the pipeline of the Object tracking demo consists of something old, something new, something blue i.e. it’s not really functioning properly.

[1] Lumia 930 is a mobile phone designed and manufactured by a Finnish company, Nokia, which sold its mobile phone division to Microsoft on April 2014 when it decided to focus on Jedi lightsaber-combat-practice-laser-shooting-sphere-things instead.

Tracking Objects from Video Feed Part III: Detecting Object Displacement

In the previous part we provided one solution for detecting and identifying a stationary object of certain shape in video feed. In this part we focus on tracking the object and try to analyze a simple path of a moving object. By simple, I mean *really* simple, we try to detect the “from” and “to” positions of the object – where it started and where did it end up.

When milliseconds count

Compared to detecting objects from a static image or frame, detecting object displacement presents us a new, tough requirement: We have to analyze the frames real-time and thus, performance is the key. We cannot simply use all the methods described earlier, since, especially on mobile devices, they simply take too much time to compute. Ideally, depending on the framerate and the estimated speed, relative to our field of view (FoV), of the moving object, our operation for tracking the image should take less than 10 milliseconds per frame. It is quite obvious that the complexity of any algorithms we use is relative to the frame size – the less pixels we have to analyze, the faster the operation.

Instead of using all the methods described earlier (chroma filter, object mapping, convex hull etc.) to track the object, we utilize them to “lock” the target object. In other words, we identify the object we want to track and after that we can use far lighter methods to track its position. We don’t have to process the full frame, but only the area of the object with some margin. This helps us to reduce the resolution and run our operations much quicker.

Since our target object can be expected not to change color (unless we’re tracking a chameleon), we can do the following:

  1. Once we have detected the object from the image/frames and we know its position and size (number of pixels horizontally and vertically where the object is thickest) we can define a rectangular cropped area with the object in the center and with a margin of e.g. 15 %.
  2. Apply chroma filter to this cropped area for each frame and keep track of the position, which is defined by the intersecting point of virtual lines placed where we have most pixels horizontally and vertically. Figure 9 illustrates tracking the locked target object.
    • If the center point displacement exceeds our predefined delta value, we move to the next phase, where we analyze the object movement.

VaatturiLockedToObjectScaledFigure 9. Target object locked, and tracking limited to the region marked by the green rectangle.

It moved, but where did it go?

How do we implement the next phase then? It seems that for more accurate analysis of the object movement, we must use more complex methods than we used for detecting the initial displacement of the object. What if we record the frames for later analysis? Since we may not know or forecast when the object is going to move, depending on the frame size, the video we record might be huge! Fortunately, there is a way to store the frames while still keeping the required size fixed: A ring buffer (also known as circular buffer). In short, ring buffer is a fixed size buffer and when you reach the end, your start again from the beginning and replace the frames recorder earlier. See this article about buffering video frames by Juhana Koski to learn more. Because we observe the initial displacement of the object in real-time, we can record few more frames (the estimated time until the object exists our FoV) and then stop. After this we no longer have the real-time requirement and we can take our time analyzing what happened to the object after its initial displacement.

Let’s say that we want to get the last frame of the object until it leaves the FoV. We could use the following algorithm:

  1. Start iterating from the last recorded frame towards the frame of the initial displacement:
    1. Treat each frame as we did in the beginning when we found the desired object from the image using chroma filter, object map, convex hull and shape analysis.
    2. If we find an object satisfying our criteria, we stop expecting it to be the object we were tracking.
  2. We now have the object position from the beginning of its movement to the last known position in our FoV (see figure 10). This means we can at least calculate the angle and relative velocity of the object.

 VaatturiObjectMotionCapturedScaledFigure 10. Object (USB cannon projectile wrapped with pink sticker) motion captured.

Challenges and future development

Lighting challenges are typical with image pattern recognition solutions. Changes in lighting conditions affect the perceived color and that makes the selection of parameters (YUV value and threshold) for chroma filtering difficult. Camera hardware and its settings play a significant role here: Longer the exposure time, easier it is to detect the object properly. However, with long exposure time, it’s harder to capture the object movement. The object in motion will have a distorted shape and its color will blend with the background. Thus, it becomes more difficult find the object in the frames when it’s moving. On the other hand, if we use short exposure time, we get less light per frame and the color difference of the object compared to the background might be insufficient.

The current implementation of the solution relies on manual parameter setting for both color and threshold. In the future, we could try to at least partially automate the parameter setting. We would still have to roughly know the shape and size of the object we want to find. We could apply edge detection algorithms to boost the color filter and get more accurate results with stationary objects. Of course, when an object is moving fast, the edges may blur. However, since the current implementation provides us with the frame of the initial object displacement, we can compare that to the later and see the changes in e.g. chroma. The moving object will leave a trace even if it’s blurred with the background or distorted.

And then there was the code…

The related code project is hosted in GitHub:

See the file delivered with the project to learn more. The project is freely licensed so you can utilize any bits of the code anyway you like. Have fun!

ThatsAllFolks…or is it?

EDIT: Turns out that’s not all folks. See how everything turns out here.

Tracking Objects from Video Feed Part II: Identifying a Stationary Object

In the first part we introduced the raw image data formats, which we receive from camera. Now it’s time to get to the good stuff. First we formulate our goals and start with trying to extract an object from video feed. In order to track something we first need to find the “something”.

Think before you act?

When solving a complex problem, it may be sometimes useful to start from the desired result and work your way from the result to the beginning in the problem solving pipeline. This way we can break the problem down to several resolutions – steps in between leading to the final solution if you like. In case of our problem, this reversed pipeline would look something like in the table below.

Table 1. Problem solving pipeline reversed.

What is the result? How do we get there?
An object, with a specific shape, identified in the image (location, size, shape etc.) Find the center of mass of the two dimensional object and its outline.
Outline of the detected object. Apply convex hull algorithm to object map (extended binary image, where the background is removed).
Object map, where all significant (suspected) objects are separated and the background is removed. Individualize objects from a binary image.
Binary image in which objects have value 1 and background value 0. Apply algorithm to extract objects with some criteria from the background. E.g. chroma filter.
Chroma filter implementation to extract objects from image. Start coding!

Chroma filter

Here the word “chroma” is a bit misleading, since we also inspect luma (Y) value when filtering the image. The algorithm is as follows:

  1. Set a desired YUV value as a target. This is basically the color we want to find from the picture i.e. if a ball is blue and the background is orange, we try to set the value as close to the blue color of the ball as possible. We also need to set a threshold value, which defines the allowed difference between the target value and the blue color we accept.
  2. Iterate the image data (byte array), pixel by pixel or block by block. By block we mean the unit shown in figures 1 and 2. In the case of NV12, there are four Y values, one U and one V. We could calculate the average of the four Y values, but since the values are likely to be almost the same, for the sake of optimization it’s enough to choose just one. Then we simply compare the set target values to the measured ones, as pseudo code:

IF Difference(target_value_Y, measured_Y) < threshold AND
Difference(target_value_U, measured_U) < threshold AND
Difference(target_value_V, measured_V) < threshold
(Mark this pixel/block as selected)
(Mark this pixel/block as not selected)


 VaatturiOriginalImage2Scaled  VaatturiChromaFilterRedScaled

Figure 4. The original image on left and the image with chroma filter (red) applied on right.

In the case of NV12 we can utilize the Y plane (since it matches the full size of the image) and convert it into a virtual binary image: Y value 0 indicates 0 and Y value 255 indicates 1 (as done on the right-hand-side image in figure 4). We can then use it to map objects as explained in the next chapter.

In our code project we have a native effect, which executes the aforementioned chroma filter for a single frame: ChromaFilterEffect.

Mapping objects

The simplest source for mapping objects is a binary image e.g. a bit array where 0 denotes background and 1 an object. What we need to do is to identify objects that are not joined (their pixels don’t touch each other). We can do this by assigning each object a unique number. The end result can be, for instance, an array of integers, where 0 still denotes background, but a value greater than 0 is an ID of a certain object. Figure 5 illustrates this process from the binary image to resulting object map.

BinaryImageAndObjectMapFigure 5. Binary image (left) and object map (right).

The principle of the algorithm for mapping objects is very simple: Check each pixel of the binary image and assign a unique value to those pixels which are adjacent. If we break this down into more detail, we can have something like this, when the image is in form of an array:

  1. Create an object ID counter (let’s call the counter c), which provides a unique value for the pixels of each object. You can start with value c = 1.
  2. Go through each pixel starting from 0 to N – 1, where N is the number of pixels:
    • If the pixel has value 0 (background):
      • Do nothing.
    • If the pixel has value 1 (an object):
      1. If the previous pixel had value 0:
        • Set the object ID counter value so that the new value is guaranteed to be unique (let’s call this unique ID U), c = U, since this can be a new object. It is important that the value is truly unique – mere increment by one is not enough.
      2. If there is an adjacent non-zero pixel above (see figure 6):
        1. Get the object ID of the adjacent pixel (let’s call this value as A).
        2. Backtrack (with a separate iteration) and replace all pixels, which have the current value of counter c with value
          • You can stop backtracking, if you encounter a line with no pixels having the current value of the object ID counter.
        3. Set the value of the object ID counter to the value of the adjacent pixel (c = A) and continue the original iteration from the index before backtracking.
      3. Set the value of the current pixel to match the current value of the object ID counter (value of c).

ObjectMapCreationProcessFigure 6. Object map creation process.

See ImageProcessingUtils::createObjectMap method where the aforementioned algorithm is implemented with C++.

Note that after applying the algorithm described above, you will not end up with an ordered object map (map with ordered IDs: 1, 2, …, N). Instead, you will have a map where each object has a unique ID, but they can be arbitrary, e.g. 2, 5, …, N + 100. Sorting the map is quite trivial: Get a list of current IDs (one iteration), create a map where each existing ID has an ordered counterpart (2 -> 1, 5 -> 2, …, N + 100 -> N) and replace the values in the object map (second iteration).

Identifying object shapes

Now that we have an object map, we can easily see the size of each object. But what about the shape? It depends on the way we want to classify shapes e.g. are we interested in round shapes or squares. In addition, depending on our object extraction methods, the objects may not be perfectly extracted; some of what was part of the object, could have been interpreted as background. For instance, in figure 6, the largest object could have been a ball with a large stripe on it that was lost when our chroma filter was applied. Luckily, there is a simple algorithm for “fixing” the shape, called convex hull.

Convex hull can be considered as an elastic band wrapped around an object. If we apply the algorithm to the first object in the object map we created earlier, the stripe in the center is discarded and a circle shape emerges:

ConvexHullFigure 7. Convex hull algorithm applied to the first object in the object map.

 VaatturiConvexHullScaledFigure 8. Convex hulls (green lines around the bird and red lines around small piece on X-Wing figher side panel) drawn to surround the filtered red areas.

There are number of different convex hull algorithms. The one used in this solution is called monotone chain (see ImageProcessingUtils::createConvexHull method). For more information about monotone chain including code samples can be found from Wikipedia:

Convex hull gives us a list of points (pixels coordinates), which form the outline of an object. However, we still have to determine the shape of the object somehow. If it’s a ball shape we’re after, we could find the center of the object – e.g. center of mass where one pixel equals one mass unit – and then see how much the outline differs from a circle, where the radius is width or the height of the object divided by 2. The less the difference, the more the shape is like a circle, which of course is the shape of a ball in two dimensional presentation.

to-be-continued-back-to-the-futureNow that we have found our object let’s try to track it. See the next and the final part of this thrilling trilogy and find out if we succeeded in our goal to detect the changes in object position!

Tracking Objects from Video Feed Part I: Image Data Formats

It so happened that a couple of months ago my colleague and I were introduced an interesting challenge: Could we use a phone camera to capture and analyze the path of a moving object. There were (and still are) some wild ideas what we could do in the not-so-distant future, but somewhat excited we decided to investigate what we could do with the current resources (hardware and APIs) available. I personally prefer setting the bar high, but, of course, one needs to start from the ground up – at least with no super powers.

VaatturiLockedToObjectScaledThe might eagle will never know what hit him…

In this three part blog jamboree I’ll unveil where we got (so far) and how we did it.

Getting Started

So we had our challenge formulated: Identify an object from the video feed (the stuff that the camera on the device feeds you) while it’s stationary. Then lock into that object – watch it like a hawk – and if it moves, try to see where it went. To be precise “where it went” means finding its last position in the field of view (FOV).

It was apparent to us from the beginning that we wanted to go native to get the maximum performance. At this point we only had very vague vision of the algorithms (and their complexity), which we would likely use. Anyhow, we expected to be dealing with some heavy stuff. Luckily, another nice fellow at work hinted us to checkout this MediaCapture API sample, and it turned out it was a good starting point. Thereon we focused understanding the image data formats in order to be able process the data.

Image Data Formats

The image data received from camera hardware (the camera of a smartphone, tablet or simply an external webcam) comes in YUV color space. In short, Y (luma) defines the brightness while U and V (chroma) define the color of a pixel. The notion Y’CbCr would be more accurate as Y’UV refers to an analog encoding scheme, but for the sake of simplicity we use U to denote Cb and V to denote Cr in this article.

Depending on the hardware, typically two type of specific YUV formats are used: YUV2, which is commonly used by e.g. webcams, and NV12, commonly used by smartphone cameras. Although they are both based on YUV color space, their structure – size and order of Y, U and V bytes, are different. Thus, the frames need to be either converted to some specific format prior to processing or we have to implement separate methods to process the frames correctly based on the format.


YUV2 format has the following properties:

YUV2 can be seen as a byte array where the first and then every second value is a Y (luma) value while chromas (U and V) fill the blanks in between so that U comes first as shown in figure 1. Each chunk has the size of 4 bytes (32 bits).

YUV2Figure 1. YUV2 format.

For more details about YUV12 format, visit the following pages:


NV12 format has the following properties:

  • 4:2:0 chroma subsampling
  • 12 bits per pixel

Let’s consider a VGA image (640×480 pixels) encoded in NV12 format. The size of the Y plane covers the whole resolution a byte per pixel i.e. the Y plane is 640×480 bytes. The U/V plane, which follows, is half of the Y plane: 640×240 bytes (see the figure below).

NV12PlanesFigure 2. NV12 image format.

The figure below depicts four pixels in NV12 format. As you can see each pixel has a dedicated Y value, but only ½ of chroma (¼ of U and ¼ V values to be precise).

FourPixelsNV12Figure 3. A four pixel section in NV12 image data.

The block in figure 2 on the right-hand-side describes the corresponding location of the bytes in NV12 byte array A, where h is the height of the image and w is the width of the image (in bytes).

For more details about NV12 format, visit the following pages:

Coming next…

Check out the next part where we dive straight into the methods of extracting objects from video frames.

BatmanTrappedWill the caped crusader overcome the challenge or will he be trapped by the algorithm villains? Tune in tomorrow (or right now if you like) – same dev-time, same dev-channel!