Long story short, we're doing some human safety systems work where we need to detect where a human is in an environment. I'm not directly involved with that part of the effort, but the team that is uses depth cameras (similar to a Kinect) to evaluate the safety systems. Our role, and mine specifically, is to provide visualization elements to meld reality and simulation, and our first step is to generate some depth data for analysis.

We started with our Ogre3D visualization, where a co-worker got a depth-of-field sample shader working. This first image shows a typical view in our Ogre visualization. The scene has some basic elements in world space (the floor, frame and man) and others in local space (the floating boxes) that we can test against.

A sample scene, showing the typical camera view. The upper-right cut-in is a depth preview.

Two color channel depth output image.

The math for this is surprisingly easy. You take the depth value straight from the shader and normalize it to a range of 0 to 1, with 1 being the max depth. Since we are using two 8-bit channels, there are 65536 (256 x 256) possible values, 0 through 65535, so just take the depth and multiply by 65536. (A depth of exactly 1 would land on 65536 and wrap back around to 0, so that one case needs to be clamped to 65535.) Splitting the scaled value into the 256 levels of each channel is pretty easy too, using the modulus. (A quick explanation of the modulus: it's what a clock does when the numbers wrap around, like 13:00 being 1pm. The modulus of 13 by 12 is 1, and so is the modulus of 25 by 12. You could also consider it the remainder after division.) So the red channel is the modulus of the scaled depth by 256. The green channel is done similarly, but in this case it is the modulus of the scaled depth/256 by 256. In the formulas that follow, depth means this scaled value.

red channel = modulus(depth, 256)

green channel = modulus(depth/256, 256)

Here's an example. Let's say the depth is 0.9. That gives a scaled value of 58982.4 (0.9 * 65536). The red channel would be the modulus of 58982.4 by 256, which is 102.4, or 102 once the fraction is dropped. The green channel would be the modulus of 58982.4/256 (which is 230.4) by 256, or 230 once the fraction is dropped.
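The same arithmetic can be sketched in a few lines of Python. This is only an illustration, not the actual shader: the function name `encode_depth` is made up, it works with integer byte values rather than the 0 to 1 color values a shader would write, and the clamping is my own assumption about how to handle a depth of exactly 1.

```python
def encode_depth(depth):
    """Pack a normalized depth (0..1) into two 8-bit channels (red, green)."""
    # Scale to 16 bits and drop the fraction.
    value = int(depth * 65536)
    # Assumption: clamp so that depth == 1.0 does not wrap around to 0.
    value = max(0, min(value, 65535))
    red = value % 256             # low byte
    green = (value // 256) % 256  # high byte
    return red, green


print(encode_depth(0.9))  # (102, 230)
```

Red carries the fine detail and green the coarse steps, which is why the two-channel image shows green changing slowly with distance while red cycles quickly.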

With that done, I save out the image representing the depth with two channels as I illustrate above.

Next I calculate the position from the image and depth information. This particular aspect caused me lots of headaches because I was over-complicating my calculations with unnecessary trigonometry. It also requires that you know some basic information about the image. First off, it has to have a symmetric view frustum. Next, you need to know your fields of view, both horizontal and vertical, or at least one of them and the aspect ratio. From there it's pretty easy, so long as you realize the depth is flat (measured straight out along the view direction, not curved like the distance from a real camera). Many of the samples out there that cover this sort of thing assume the far clip is the cutoff, but in my case I designed the depth to be a function of a specified depth.

I know the depth by taking the color of a pixel in the image and reversing the process I outlined above. To find the x and y positions in the scene, I take a pixel's image position as a percentage (like a UV coordinate, for instance), then express that position relative to the center of the image. This is really quite easy, though it may sound confusing. For example, take pixel 700, 120 in a 1000 x 1000 pixel image. The position is 0.70, 0.12. The position based on center is 0.40, -0.76. That means the pixel is 40% right of center, and 76% of the way from center toward the y = 0 edge of the image (down if 0,0 is the bottom-left corner, up if it is the top-left). The easiest way to calculate it is to double the value and then subtract 1.

pixelx = pixelx * 2 - 1

pixely = pixely * 2 - 1
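As a rough sketch of these two steps in Python (again, not the code I actually ran): `decode_depth` reverses the two-channel packing and `to_centered` turns a pixel index into the centered -1 to 1 coordinates. Both function names and the `max_depth` parameter are hypothetical, with `max_depth` standing in for whatever specified depth the 0 to 1 range was scaled against.

```python
def decode_depth(red, green, max_depth=1.0):
    """Rebuild the depth from the two 8-bit channels."""
    value = green * 256 + red          # back to the 0..65535 scaled value
    return value / 65536 * max_depth   # back to scene units


def to_centered(pixel_x, pixel_y, width, height):
    """Convert a pixel index to coordinates relative to the image center."""
    u = pixel_x / width    # 0..1 across the image, like a UV coordinate
    v = pixel_y / height
    return u * 2 - 1, v * 2 - 1


print(decode_depth(102, 230))             # close to 0.9, minus the dropped fraction
print(to_centered(700, 120, 1000, 1000))  # roughly (0.40, -0.76)
```

The decoded depth will not be exactly 0.9 because the fraction was thrown away when the channels were written, but the error is always smaller than 1/65536 of the max depth.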

To find the x and y positions, in coordinates local to the view, it's some easy math.

x = tan(horizontal FOV / 2) * pixelx * depth

y = tan(vertical FOV / 2) * pixely * depth

z = depth

This assumes that positive X values are on the right, and positive Y values are down (or up, depending on which corner 0,0 is in your image). Positive Z values are projected out from the view.
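Here is a minimal Python sketch of those three formulas. The function names are invented for the example, the angles are assumed to be given in degrees, and `vertical_fov` covers the case where you only know one field of view plus the aspect ratio (taken here as width divided by height).

```python
import math


def vertical_fov(hfov_deg, aspect):
    """Derive the vertical FOV from the horizontal FOV and aspect (width / height)."""
    half_h = math.radians(hfov_deg) / 2
    return math.degrees(2 * math.atan(math.tan(half_h) / aspect))


def view_position(centered_x, centered_y, depth, hfov_deg, vfov_deg):
    """Position in view-local space from centered pixel coordinates and flat depth."""
    x = math.tan(math.radians(hfov_deg) / 2) * centered_x * depth
    y = math.tan(math.radians(vfov_deg) / 2) * centered_y * depth
    z = depth
    return x, y, z


vfov = vertical_fov(90.0, 1.0)  # square image, so this comes out at about 90 degrees
print(view_position(0.40, -0.76, 10.0, 90.0, vfov))
```

With a 90 degree field of view the tangent term is 1, so the example pixel from earlier at a depth of 10 lands at approximately (4.0, -7.6, 10.0), which makes it an easy case to check by hand.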

To confirm that all my math was correct, I took a sample depth image (the one above) and calculated the xyz for each pixel, then projected those positions back into the environment. The following images are the result.

Resulting depth data to position, from the capture angle.

The position data from depth from a different angle.

The position data from depth from a different angle.
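The same kind of sanity check can be done with numbers alone, without rendering anything. The sketch below is not what produced the images above; it is a hypothetical Python round trip that projects a known view-space point to centered image coordinates, rebuilds the position with the formulas from this post, and reports the largest difference. All function names are made up for the example, and it ignores the small error from packing the depth into two channels.

```python
import math


def project(x, y, z, hfov_deg, vfov_deg):
    """View-space point to centered image coordinates in the -1..1 range."""
    cx = x / (z * math.tan(math.radians(hfov_deg) / 2))
    cy = y / (z * math.tan(math.radians(vfov_deg) / 2))
    return cx, cy


def reconstruct(cx, cy, depth, hfov_deg, vfov_deg):
    """Centered image coordinates plus flat depth back to a view-space point."""
    x = math.tan(math.radians(hfov_deg) / 2) * cx * depth
    y = math.tan(math.radians(vfov_deg) / 2) * cy * depth
    return x, y, depth


def round_trip_error(point, hfov_deg, vfov_deg):
    """Largest per-axis difference after projecting and reconstructing a point."""
    x, y, z = point
    cx, cy = project(x, y, z, hfov_deg, vfov_deg)
    rx, ry, rz = reconstruct(cx, cy, z, hfov_deg, vfov_deg)
    return max(abs(rx - x), abs(ry - y), abs(rz - z))


print(round_trip_error((1.5, -0.5, 4.0), 60.0, 45.0))  # should be essentially zero
```

If the error is not tiny, the usual suspects are mixing up degrees and radians, using the full field of view instead of half of it, or treating the depth as a curved distance instead of a flat one.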