Wednesday, November 24, 2010

Position from Depth

I've been doing some work using our visualization, shaders and some Python scripting.  I normally don't post about work for many reasons, but this project has been a lot of fun and is worth blogging about (and I have permission).  I also want to document what I did and address some of the issues I encountered.

Long story short, we're doing some human safety systems work where we need to detect where a human is in an environment.  I'm not directly involved with that part of the effort, but the team that is uses depth cameras (like Kinect in a way) to evaluate the safety systems.  Our role, and mine specifically, is to provide visualization elements to meld reality and simulation, and our first step is to generate some depth data for analysis.

We started by taking our Ogre3D visualization and a co-worker got a depth of field sample shader working.  This first image shows a typical view in our Ogre visualization.  The scene has some basic elements in world space (the floor, frame and man) and others in local space (the floating boxes) we can test against.

A sample scene, showing the typical camera view.  The upper-right cut-in is a depth preview.
The next image shows the modifications I made to the depth shader.  Instead of using a typical black and white depth image, I decided to use two channels, the red and green channels.  The blue channel is reserved for any geometry beyond the sensor's range.  Black means no depth: no geometry exists there.

Two color channel depth output image.
I decided to use two color channels for depth to improve the accuracy (strictly speaking, the depth resolution).  That's why you see color banding, because I hop around both channels.  If I only used one channel, at 8 bit, that would be 256 values.  A depth of 10 meters would mean that the accuracy would only be about 4 cm (10.0 m / 256).  By using two color channels I'm effectively using 16 bit, for a total of 65536 values (256 * 256), which increased our accuracy to about 0.15 mm (10.0 m / 65536).  In retrospect, perhaps I could have used a 16 bit image format instead.
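
The arithmetic above can be checked with a few lines of Python.  This is only a sketch: the function name depth_resolution is mine, and the 10 meter range is just the example figure from the text.

```python
def depth_resolution(max_depth_m, bits):
    """Smallest depth step (in meters) representable with the given bit count."""
    return max_depth_m / (2 ** bits)

# One 8-bit channel versus two 8-bit channels (16 bits) over a 10 m range.
one_channel = depth_resolution(10.0, 8)
two_channels = depth_resolution(10.0, 16)

print("8 bit:  %.2f cm per step" % (one_channel * 100))
print("16 bit: %.3f mm per step" % (two_channels * 1000))
```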

Doing this sort of math is surprisingly easy.  Essentially you take the depth value right from the shader and scale it to a range of 0 to 1, with 1 being the max depth.  Since we are using two channels, we want the range to cover 65536 values, so just take the depth and multiply by 65536.  Determining the 256 values for each channel is pretty easy too using the modulus.  (A quick explanation of the modulus: think of a clock, where 13:00 is 1 pm.  It's when numbers wrap around.  So the modulus of 13 by 12 is 1, for example, as is the modulus of 25 by 12.  You could also consider it the remainder after division.)  So the red channel is determined by the modulus of the scaled depth by 256.  The green channel is done similarly, but in this case is determined by the modulus of the scaled depth/256 by 256.  In both cases any fractional part is dropped.

red channel = modulus(depth, 256)
green channel = modulus(depth/256, 256)

Here's an example.  Let's say the depth is 0.9.  That would result in a scaled value of 58982.4 (0.9 * 65536).  The red channel would be the modulus of 58982.4 by 256, which is 102.4, so 102 once the fraction is dropped.  The green channel would be the modulus of 58982.4/256 (which is 230.4) by 256, so 230.
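
Here is roughly what that encoding looks like in Python rather than shader code.  Treat it as a sketch: the function name encode_depth is made up for illustration, and the clamp is my own addition (not part of the shader described above) so that a depth of exactly 1.0 doesn't wrap around to 0, 0.

```python
def encode_depth(depth, levels=65536):
    """Pack a normalized depth (0.0 to 1.0) into (red, green) 8-bit values."""
    # Scale to the 16-bit range and drop the fraction.
    # Assumption: clamp to levels - 1 so depth == 1.0 does not wrap to (0, 0).
    scaled = int(min(max(depth, 0.0) * levels, levels - 1))
    red = scaled % 256             # modulus(depth, 256)
    green = (scaled // 256) % 256  # modulus(depth/256, 256)
    return red, green

print(encode_depth(0.9))  # (102, 230)
```

Red ends up holding the fine steps and green the coarse ones, which is why the image shows banding in the red channel.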

With that done, I save out the image representing the depth with two channels as I illustrate above.

Next I calculate the position from the image and depth information.  This particular aspect caused me lots of headaches because I was over-complicating my calculations with unnecessary trigonometry.  It also requires that you know some basic information about the image.  First off, it has to be a symmetric view frustum.  Next, you need to know your fields of view, both horizontal and vertical, or at least one of them and the aspect ratio.  From there it's pretty easy, so long as you realize the depth is flat (not curved like a real camera).  Many of the samples out there that cover this sort of thing assume the far clip is the cut off, but in my case I designed the depth to be a function of a specified depth.

I know the depth by taking the color of a pixel in the image and reversing the process I outlined above.  To find the x and y positions in the scene I take a pixel's image position as a percentage (like a UV coordinate for instance), then determine that position relative to the center of the image.  This is really quite easy, though it may sound confusing.  For example, take pixel 700, 120 in a 1000 x 1000 pixel image.  As a fraction of the image size the position is 0.70, 0.12.  Relative to the center it is 0.40, -0.76.  That means the pixel is 40% of the way from the center to the right edge, and 76% of the way from the center to the top or bottom edge (bottom if 0,0 is the bottom-left corner of your image, top if it is the top-left).  The easiest way to calculate it is to double the value then subtract 1.

pixelx = pixelx * 2 - 1
pixely = pixely * 2 - 1
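
As a quick sanity check, the same two steps in Python (the helper name center_coordinate is just for illustration):

```python
def center_coordinate(pixel, size):
    """Map a pixel index to the range -1 to 1, with 0 at the image center."""
    fraction = pixel / size  # like a UV coordinate, 0 to 1
    return fraction * 2 - 1

# Pixel 700, 120 in a 1000 x 1000 image, as in the example above.
cx = center_coordinate(700, 1000)
cy = center_coordinate(120, 1000)
print(round(cx, 2), round(cy, 2))
```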


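To find the x and y positions, in coordinates local to the view, it's some easy math.  Here depth is the decoded depth converted back to scene units (the 0 to 1 value times the specified max depth).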

x = tan(horizontal FOV / 2) * pixelx * depth
y = tan(vertical FOV / 2) * pixely * depth
z = depth

This assumes that positive X values are on the right, and positive Y values are down (or up, depending on which corner 0,0 is in your image).  Positive Z values are projected out from the view.
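
Putting the pieces together, here is a minimal Python sketch of going from one pixel to a view-local position.  The function names, the 10 meter max depth and the 90 degree fields of view are all placeholder assumptions for the example, not the values from our setup.  It also ignores the blue (out of range) and black (no geometry) cases.

```python
import math

def decode_depth(red, green, max_depth, levels=65536):
    """Reverse the two-channel encoding; returns depth in scene units."""
    scaled = green * 256 + red
    return scaled / levels * max_depth

def pixel_to_position(px, py, width, height, red, green,
                      max_depth, hfov_deg, vfov_deg):
    """View-local (x, y, z) for one pixel of a symmetric-frustum depth image."""
    depth = decode_depth(red, green, max_depth)
    # Pixel position relative to the image center, in the range -1 to 1.
    cx = px / width * 2 - 1
    cy = py / height * 2 - 1
    # The depth is flat (planar), so x and y scale linearly with it.
    x = math.tan(math.radians(hfov_deg) / 2) * cx * depth
    y = math.tan(math.radians(vfov_deg) / 2) * cy * depth
    return x, y, depth

# Hypothetical values: pixel 700, 120 of a 1000 x 1000 image, color (102, 230),
# 10 m max depth, 90 degree horizontal and vertical fields of view.
print(pixel_to_position(700, 120, 1000, 1000, 102, 230, 10.0, 90.0, 90.0))
```

With those numbers the decoded depth comes out just under 9 meters, which matches the 0.9 encoding example from earlier.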

To confirm that all my math was correct I took a sample depth image (the one above) and calculated the xyz for each pixel then projected those positions back into the environment.  The following images are the result.

Resulting depth data to position, from the capture angle.
The position data from depth from a different angle.
The position data from depth from a different angle.
The results speak for themselves.  It was a learning process, but I loved it.  You may notice the frame rate drop in the later images.  That's because I represent the pixel data with a lot of planes, over 900,000.  It isn't an efficient way to display the data, but all I wanted was confirmation that the real scene and the calculated positions correspond.

