## >> Sunday, October 23, 2011

My first post on this blog dealt with using OpenCL vector functions to sort data. Thanks to the as_type function, I've improved my original code. If input is a vector containing four floats, the following code will sort its components in place:

uint4 mask1 = (uint4)(1, 0, 3, 2);
uint4 mask2 = (uint4)(2, 3, 0, 1);
uint4 mask3 = (uint4)(3, 2, 1, 0);

int4 add1 = (int4)(1, 1, 3, 3);
int4 add2 = (int4)(2, 3, 2, 3);
int4 add3 = (int4)(1, 2, 2, 3);

int4 comp = input < shuffle(input, mask1);
input = shuffle(input, as_uint4(comp + add1));
comp = input < shuffle(input, mask2);
input = shuffle(input, as_uint4(comp * 2 + add2));
comp = input < shuffle(input, mask3);
input = shuffle(input, as_uint4(comp + add3));
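
This may look like a lot of work to sort four values, but there are no if statements in this code. Therefore, unless the shuffle function is implemented with branches, the code incurs no branch-misprediction delays on a CPU and no divergent branching on a GPU.

The sorting direction can be controlled with the bitwise exclusive-OR. A true comparison produces -1 (all bits set) in the corresponding component and a false one produces 0, so XORing the result with -1 inverts every comparison. For example, if dir is set to -1, the following code will sort the elements in descending order:
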
int4 comp = input < shuffle(input, mask1) ^ dir;
input = shuffle(input, as_uint4(comp + add1));
comp = input < shuffle(input, mask2) ^ dir;
input = shuffle(input, as_uint4(comp * 2 + add2));
comp = input < shuffle(input, mask3) ^ dir;
input = shuffle(input, as_uint4(comp + add3));

If dir is set to 0, the exclusive-OR leaves each comparison unchanged and the code sorts the input elements in ascending order.
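
To check the masks and offsets without a GPU, the three stages can be emulated in plain C. This is only a sketch: the f4 and i4 structs and the shuffle4, less_xor, stage, and sort4 helpers are hypothetical stand-ins for OpenCL's float4, int4, shuffle, and vector comparison, and the emulation assumes that a true comparison yields -1 and that shuffle looks only at the two low bits of each mask element.

```c
#include <stdint.h>

/* Plain-C stand-ins for OpenCL's float4 and int4 (hypothetical names). */
typedef struct { float s[4]; } f4;
typedef struct { int32_t s[4]; } i4;

/* Emulates shuffle(): out[i] = in[mask[i]], using only the two low mask bits. */
static f4 shuffle4(f4 in, i4 mask)
{
    f4 out;
    for (int i = 0; i < 4; i++)
        out.s[i] = in.s[(uint32_t)mask.s[i] & 3u];
    return out;
}

/* Emulates (a < b) ^ dir: a true comparison gives -1, a false one gives 0. */
static i4 less_xor(f4 a, f4 b, int32_t dir)
{
    i4 r;
    for (int i = 0; i < 4; i++)
        r.s[i] = (a.s[i] < b.s[i] ? -1 : 0) ^ dir;
    return r;
}

/* One stage: compare each element with its partner, then gather the
   elements at index comp * scale + add. */
static f4 stage(f4 in, i4 partner, i4 add, int32_t scale, int32_t dir)
{
    i4 comp = less_xor(in, shuffle4(in, partner), dir);
    i4 idx;
    for (int i = 0; i < 4; i++)
        idx.s[i] = comp.s[i] * scale + add.s[i];
    return shuffle4(in, idx);
}

/* dir = 0 sorts ascending, dir = -1 sorts descending. */
f4 sort4(f4 in, int32_t dir)
{
    const i4 mask1 = {{1, 0, 3, 2}}, mask2 = {{2, 3, 0, 1}}, mask3 = {{3, 2, 1, 0}};
    const i4 add1 = {{1, 1, 3, 3}}, add2 = {{2, 3, 2, 3}}, add3 = {{1, 2, 2, 3}};

    in = stage(in, mask1, add1, 1, dir);
    in = stage(in, mask2, add2, 2, dir);
    in = stage(in, mask3, add3, 1, dir);
    return in;
}
```

Calling sort4 on (3, 1, 4, 2) with dir set to 0 and then to -1 exercises both directions. The three stages compare the pairs (0,1) and (2,3), then (0,2) and (1,3), then (1,2), which is the usual five-comparator network for four elements.
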

### Variable Casting in OpenCL with as_type

One of my favorite OpenCL tricks involves using the result of a vector comparison as the mask argument of a shuffle/shuffle2 operation. This makes it easy to sort a vector's components in place.

The problem is that vector comparisons produce signed integer vectors and the shuffle functions require their mask vectors to contain unsigned values. OpenCL C doesn't allow regular C/C++ casts between vector types, so you can't simply cast the int4 result of a comparison to a uint4.

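Earlier, I used the abs function to convert signed vectors to unsigned vectors. But OpenCL makes it possible to cast variables properly using the as_type function. Here, type represents the desired data type. For example, as_uint4(comp) reinterprets the int4 vector comp as a uint4 that shuffle will accept as a mask.

It works just as simply for scalars as it does for vectors, and it can be used for floating-point as well as signed and unsigned integer types, provided the source and destination types have the same size. Keep in mind that as_type reinterprets the bits rather than converting the value. For example, as_int(x) returns the bit pattern of a float called x as an int; to convert the value itself, you'd call convert_int(x).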
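
Plain C has no as_type, but copying the bytes with memcpy behaves in roughly the same way, which makes the difference between reinterpreting and converting easy to see. The sketch below assumes a 32-bit IEEE 754 float, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rough analogue of as_int(x): the float's 32 bits are copied unchanged. */
int32_t float_bits_as_int(float x)
{
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Rough analogue of as_uint(v): same bits, unsigned interpretation. */
uint32_t int_bits_as_uint(int32_t v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits;
}
```

With these helpers, float_bits_as_int(1.0f) gives 0x3F800000 while the ordinary cast (int)1.0f gives 1, and int_bits_as_uint(-1) gives 0xFFFFFFFF.
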

## >> Saturday, October 22, 2011

At long last, I've added collision detection to my OpenCL-OpenGL application. The application executes three OpenCL kernels at regular time intervals. The first generates n choose 2, or n(n-1)/2, active work-items, where n is the number of figures in the model. Each work-item determines whether a given pair of objects has collided, and if so, it alters their velocities according to the two equations for an elastic collision, where m1 and m2 are the objects' masses and v1 and v2 are their velocities before the collision:

• v1_new = (v1*(m1-m2) + 2*m2*v2)/(m1+m2)
• v2_new = (v2*(m2-m1) + 2*m1*v1)/(m1+m2)
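
The two equations above translate directly into code. Here is a minimal scalar sketch in C; the velocity_pair struct and the function name are hypothetical, and a real kernel would presumably apply the same arithmetic to each velocity component.

```c
/* Velocities of the two objects after the collision (hypothetical type). */
typedef struct { float v1, v2; } velocity_pair;

/* Applies the two elastic-collision equations to one velocity component.
   m1 and m2 are the masses, v1 and v2 the velocities before the collision. */
velocity_pair elastic_collision(float m1, float v1, float m2, float v2)
{
    velocity_pair out;
    float total = m1 + m2;

    out.v1 = (v1 * (m1 - m2) + 2.0f * m2 * v2) / total;
    out.v2 = (v2 * (m2 - m1) + 2.0f * m1 * v1) / total;
    return out;
}
```

Two quick sanity checks: with equal masses the objects simply exchange velocities, and in every case the total momentum m1*v1 + m2*v2 is the same before and after the call.
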

The second kernel generates n active work-items, one for each figure in the model. This kernel updates the figure's velocity and displacement with the following equations:

• velocity += acceleration * delta_t
• displacement = velocity * delta_t
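
As a rough sketch, the per-figure update looks like this in C. The figure_state struct and step_figure are hypothetical names, and scalars are used for brevity where the kernel would presumably use vectors.

```c
/* Per-figure state touched by the update (hypothetical type). */
typedef struct {
    float velocity;
    float displacement;   /* distance moved during the current time step */
} figure_state;

/* Advances one figure by delta_t: the velocity is updated first, and the
   new velocity then determines the displacement for this step. */
void step_figure(figure_state *f, float acceleration, float delta_t)
{
    f->velocity += acceleration * delta_t;
    f->displacement = f->velocity * delta_t;
}
```

Because the displacement is computed from the already-updated velocity, this amounts to a semi-implicit Euler step.
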

The third kernel generates one work-item for every vertex in the model and updates the vertex's position with the displacement computed earlier. Then the vertices are ready to be rendered.
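
Going back to the first kernel, each of its n choose 2 work-items needs some way to turn its index into a pair of figures. The post doesn't say how this is done, so the following C sketch shows one possible mapping. The function names are hypothetical, and k must satisfy 0 <= k < n(n-1)/2.

```c
/* Number of unordered pairs among n figures: n choose 2. */
int pair_count(int n)
{
    return n * (n - 1) / 2;
}

/* Maps a linear index k to the k-th pair (i, j) with i < j, in the order
   (0,1), (0,2), ..., (0,n-1), (1,2), ... Requires 0 <= k < pair_count(n). */
void pair_from_index(int k, int n, int *i, int *j)
{
    int row = 0;
    int row_len = n - 1;        /* number of pairs whose first index is row */

    while (k >= row_len) {      /* skip whole rows until k falls inside one */
        k -= row_len;
        row++;
        row_len--;
    }
    *i = row;
    *j = row + 1 + k;
}
```

In a kernel, k would be the work-item's global ID. The loop runs at most n - 1 times, which is cheap next to the collision test itself.
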

## >> Monday, October 10, 2011

I've coded a few physics-based applications with OpenGL, and the overall operation is simple: the host computes a figure's new position based on its velocity and acceleration, and sends the position delta to the vertex shader as a uniform. The vertex shader updates each vertex position with the delta and the figure moves over time.

But now I'm implementing the physics processing with OpenCL. At the moment, my goal is to execute the following loop:

for each figure in the model
    if figure collided with another
        change velocity and acceleration of both figures
    end if
end for

I'm using the GJK method to detect collisions, and it's not easy. But my main concern is this: if the OpenCL kernel computes velocity and acceleration without updating the host, the host can't set the uniform properly for the vertex shader. However, the kernel can modify the VBO data directly, thereby making the uniform unnecessary.

This raises another concern. In my earlier code, the vertex shader applied the physics update after performing the modelview-perspective transformation. But with this new method, the shader will receive VBOs that have already been updated based on velocity and acceleration, so the update is applied before the transformation instead of after it. Unless I account for this change of order, the figures may end up in the wrong positions.
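
A toy one-dimensional example in C shows why the order matters. A uniform scale factor stands in for the full transformation matrix here, which is a simplification made purely for illustration, and the function names are hypothetical.

```c
/* New order: the kernel moves the vertex, then the shader transforms it. */
float move_then_transform(float position, float delta, float scale)
{
    return scale * (position + delta);
}

/* Old order: the shader transforms the vertex, then applies the delta. */
float transform_then_move(float position, float delta, float scale)
{
    return scale * position + delta;
}
```

With position 1, delta 1, and scale 2, the first function returns 4 and the second returns 3. The two agree only when the transformation leaves the delta unchanged, as with a scale of 1.
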

I'm starting to think that the OpenCL kernel should take over the vertex shader's processing. I wonder what effect this will have on performance.

## >> Wednesday, October 5, 2011

Goodbye, Mr. Jobs, and thank you for your insight, leadership, vision, and high technological standards. R.I.P.

### Nvidia and OpenCL 1.1

Until recently, I'd thought that Nvidia didn't support OpenCL 1.1. But it's only their development drivers that don't support 1.1. Their regular drivers do. Keep a close eye on the versions. On Windows and Linux, the development drivers have major version 270 while the regular drivers have major version 280.
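
One way to keep an eye on the version from code is to read the string that clGetPlatformInfo returns for CL_PLATFORM_VERSION, which has the form "OpenCL <major>.<minor> <vendor-specific information>". The C sketch below only parses such a string; the function name is hypothetical, and the sample string in the note that follows is illustrative rather than taken from a real driver.

```c
#include <stdio.h>

/* Extracts the major and minor numbers from an OpenCL version string.
   Returns 1 on success and 0 if the string doesn't have the expected form. */
int parse_opencl_version(const char *version, int *major, int *minor)
{
    return sscanf(version, "OpenCL %d.%d", major, minor) == 2;
}
```

For a string such as "OpenCL 1.1 CUDA 4.0.1", the function stores 1 in both major and minor, so the application can check for 1.1 support before relying on it.
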

I don't know what capabilities the development drivers provide that the regular drivers don't, but after installing the regular drivers, I've successfully tested all my code on an Nvidia GPU. Life is good.

## >> Saturday, October 1, 2011

I agree with Tom Olson that the killer app for OpenCL is graphics acceleration, particularly involving physics and computational geometry. When programmers see how much they can accomplish on the GPU with OpenCL, I think they'll put aside their CPU-centric code.

To demonstrate how this works, I've coded an application that uses OpenCL to compute ray-triangle intersection. More specifically, I've implemented the Möller-Trumbore method using OpenCL-OpenGL interoperability. 3-D applications need to compute ray-triangle intersection quickly because it identifies which object in the model the user clicked.
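
For reference, here is a minimal C sketch of the Möller-Trumbore test for a single triangle. It is not the code from the application; the vec3 type, the helper functions, and the epsilon value are assumptions chosen for illustration.

```c
typedef struct { float x, y, z; } vec3;

static vec3 sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns 1 and stores the distance along the ray in *t if the ray
   orig + t * dir hits the triangle (v0, v1, v2); returns 0 otherwise. */
int ray_hits_triangle(vec3 orig, vec3 dir, vec3 v0, vec3 v1, vec3 v2, float *t)
{
    const float eps = 1e-6f;              /* assumed tolerance */
    vec3 e1 = sub(v1, v0);
    vec3 e2 = sub(v2, v0);
    vec3 p = cross(dir, e2);
    float det = dot(e1, p);

    if (det > -eps && det < eps)          /* ray is parallel to the triangle */
        return 0;

    float inv = 1.0f / det;
    vec3 s = sub(orig, v0);
    float u = dot(s, p) * inv;            /* first barycentric coordinate */
    if (u < 0.0f || u > 1.0f)
        return 0;

    vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;          /* second barycentric coordinate */
    if (v < 0.0f || u + v > 1.0f)
        return 0;

    float dist = dot(e2, q) * inv;
    if (dist < 0.0f)                      /* triangle lies behind the origin */
        return 0;

    *t = dist;
    return 1;
}
```

To find the clicked object, the test would be run against every triangle and the hit with the smallest t kept, since that triangle is closest to the viewer.
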

I've written a knol on the subject here, and it discusses the theory of the Möller-Trumbore method and how to implement it using both C and OpenCL. I've uploaded an example application here. The application reads mesh data from a COLLADA file, renders ten spheres with OpenGL, and then uses OpenCL to determine which sphere the user clicked on. So far, the application has passed every test I've given it.

I'm mostly happy, but there's one improvement I need to make. Currently, when the user clicks on the rendering, the application executes a kernel for every figure in the model. The kernel's work-items process the triangles of the figure in parallel, and this provides solid performance. But it would be better to have a single kernel execution process every triangle in the model at once. The code gets ugly because each figure's VBO needs to be made a kernel argument. But I think the performance improvement will be worth the hassle.
