It so happened that a couple of months ago my colleague and I were introduced an interesting challenge: Could we use a phone camera to capture and analyze the path of a moving object. There were (and still are) some wild ideas what we could do in the not-so-distant future, but somewhat excited we decided to investigate what we could do with the current resources (hardware and APIs) available. I personally prefer setting the bar high, but, of course, one needs to start from the ground up – at least with no super powers.
In this three part blog jamboree I’ll unveil where we got (so far) and how we did it.
So we had our challenge formulated: Identify an object from the video feed (the stuff that the camera on the device feeds you) while it’s stationary. Then lock into that object – watch it like a hawk – and if it moves, try to see where it went. To be precise “where it went” means finding its last position in the field of view (FOV).
It was apparent to us from the beginning that we wanted to go native to get the maximum performance. At this point we only had very vague vision of the algorithms (and their complexity), which we would likely use. Anyhow, we expected to be dealing with some heavy stuff. Luckily, another nice fellow at work hinted us to checkout this MediaCapture API sample, and it turned out it was a good starting point. Thereon we focused understanding the image data formats in order to be able process the data.
Image Data Formats
The image data received from camera hardware (the camera of a smartphone, tablet or simply an external webcam) comes in YUV color space. In short, Y (luma) defines the brightness while U and V (chroma) define the color of a pixel. The notion Y’CbCr would be more accurate as Y’UV refers to an analog encoding scheme, but for the sake of simplicity we use U to denote Cb and V to denote Cr in this article.
Depending on the hardware, typically two type of specific YUV formats are used: YUV2, which is commonly used by e.g. webcams, and NV12, commonly used by smartphone cameras. Although they are both based on YUV color space, their structure – size and order of Y, U and V bytes, are different. Thus, the frames need to be either converted to some specific format prior to processing or we have to implement separate methods to process the frames correctly based on the format.
YUV2 format has the following properties:
- 4:2:2 chroma subsampling
- 16 bits per pixel
YUV2 can be seen as a byte array where the first and then every second value is a Y (luma) value while chromas (U and V) fill the blanks in between so that U comes first as shown in figure 1. Each chunk has the size of 4 bytes (32 bits).
For more details about YUV12 format, visit the following pages:
NV12 format has the following properties:
- 4:2:0 chroma subsampling
- 12 bits per pixel
Let’s consider a VGA image (640×480 pixels) encoded in NV12 format. The size of the Y plane covers the whole resolution a byte per pixel i.e. the Y plane is 640×480 bytes. The U/V plane, which follows, is half of the Y plane: 640×240 bytes (see the figure below).
The figure below depicts four pixels in NV12 format. As you can see each pixel has a dedicated Y value, but only ½ of chroma (¼ of U and ¼ V values to be precise).
The block in figure 2 on the right-hand-side describes the corresponding location of the bytes in NV12 byte array A, where h is the height of the image and w is the width of the image (in bytes).
For more details about NV12 format, visit the following pages:
Check out the next part where we dive straight into the methods of extracting objects from video frames.