Work-group Sizes

>> Tuesday, September 27, 2011

I've been experimenting with OpenCL work-group sizes, and I've learned something I should have known before: for a given kernel execution, every work-group has the same size. To test this, I configured clEnqueueNDRangeKernel to execute a kernel with 991 work-items. I set the local_size argument to NULL so that OpenCL would determine its own work-group size, and it created 991 work-groups with 1 work-item each. I've tried this with other prime numbers, and it does the same thing.

I'm surprised. How can 991 separate work-groups execute faster than, for example, four groups of 200 work-items each and one group of 191 work-items? As another example, when I specified 651 work-items, OpenCL created three groups of 217 items each. But when I specified 653 work-items, OpenCL created 653 groups containing one item each.

Now that I'm working with OpenCL-OpenGL interoperability, dealing with arbitrary data sizes is a frequent concern. My work-groups can't communicate with one another, so my host application receives one result from each group. But the host can't allocate memory for the result array if it doesn't know in advance how many work-groups will execute the kernel.

The only solution I can see is to pad the number of work-items until the total is a multiple of the maximum work-group size (or a similar number obtained through CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE). This means that if the data size requires 991 work-items, I should pad the value to 1024, a multiple of 256, which is the maximum number of items per group on my GPU.

But now there's another problem: the padded work-items execute the kernel just like the regular ones, which means they may cause an error in the result. To get around this, I'm using the following code in my kernel:

if(get_global_id(0) < max_items) {
   /* ...perform regular kernel execution... */
}

In this code, max_items is the unpadded size of the data. That is, if there are 991 data points to process, max_items is set to 991. This is not elegant code, but I can't see an alternative. I'll keep thinking.


Allocating Local Memory for an Arbitrary Number of Work-items

>> Sunday, September 25, 2011

In my opinion, the most important OpenCL function to know is clEnqueueNDRangeKernel, which enqueues a kernel-execution command. Once you understand how its parameters affect the kernel's execution, the rest of OpenCL is fairly straightforward.
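For reference, its signature takes quite a few parameters:

```c
cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel, cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)
```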

The second most important function, in my opinion, is clSetKernelArg. This configures the kernel's arguments so it has data to process while it executes. Its signature is given as follows:

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                      size_t arg_size, const void *arg_value)

When I first encountered it, I thought this function was simple. But in addition to specifying the data for a kernel argument, this function also configures which address space (global/constant, local, or private) the data will be stored in. There are three main rules:
  1. If arg_value points to a memory object, the argument's data will be stored in global or constant memory, depending on the modifier used by the kernel function.
  2. If arg_value points to something that isn't a memory object, the argument's data will be stored in private memory.
  3. If arg_value is NULL, the argument's data won't be initialized, but arg_size bytes will be allocated from local memory to store the data.
In general, work-items can access local memory much faster than they can access global memory, so it's a good idea to have each work-item store intermediate results in the local address space. If I know there are 256 work-items in a work-group and each item needs 32 bytes to store intermediate data, I'll set arg_size to 256*32 = 8192 and arg_value to NULL.

But there's a problem. Unless the total number of work-items is a multiple of the maximum number of work-items per work-group, I have no way of knowing how many work-items will be assigned to each group. I can set the local_size parameter of clEnqueueNDRangeKernel, but sometimes I don't even know the total number of work-items I'll need. This happens frequently in OpenCL-OpenGL interoperability, in which I don't know in advance how many triangles each vertex buffer object will contain.

The solution isn't too hard: allocate all the local memory the device can provide. To find out how much local memory is available, call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. The following code shows how this works:

cl_ulong local_mem_size;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);

Remember that the value returned by the function is a cl_ulong, not a size_t. I made this mistake repeatedly, and though it worked fine on my Linux system (presumably because size_t and cl_ulong happen to be the same size there), my Windows development system didn't like it at all.


Problems with OpenGL-OpenCL Interoperability

>> Saturday, September 17, 2011

I've spent the last week coding a routine for pick selection, which determines the 3-D object in the model that corresponds to the user's mouse click. Geometrically, the process involves ray-triangle intersection, and I've been focusing on the Möller-Trumbore method. 3-D applications generally don't test every triangle in the model at once, but I figured this would be a perfect way to demonstrate OpenGL-OpenCL interoperability.

It's been hard going. First of all, I made the mistake of upgrading Ubuntu 10.10 to version 11.04. When I did this, every call to clCreateFromGLBuffer returned a CL_INVALID_GL_OBJECT error. Eventually, the error went away after I reinstalled the AMD driver, updated the AMD APP SDK, and disabled the Unity interface.

But when I implemented the Möller-Trumbore method in an OpenCL kernel, I got the strangest error: work-groups 0 and 18 always produce the same selection result, even though they're analyzing different sets of triangles. The same holds true for groups 1 and 19 and groups 2 and 20. The only reason I can think of is that my GPU has 18 compute units. But this shouldn't matter; work-groups are supposed to operate independently of one another, no matter which compute unit they occupy. Weird, weird, weird...


COLLADA, TinyXML, and OpenGL

>> Saturday, September 10, 2011

When I first started working with COLLADA, the Khronos Group provided example code that demonstrated how COLLADA meshes could be rendered with OpenGL. But the source code for COLLADA RT seems to have disappeared, so I've coded my own. I've uploaded the C++ source files to a Google-hosted project here, and I've included an example application that renders a sphere. I've also written a Google knol that explains how the code works.

Simply put, the ColladaInterface class reads a COLLADA file and returns a data structure for each mesh in the scene. The COLLADA format is based on XML, so the class calls functions from the TinyXML toolset. ColladaInterface doesn't access all the information in the COLLADA file, so there's plenty of room for improvement.

At some point, I'll add a function that writes data to COLLADA files. But for the moment, I hope this will help developers who are interested in using COLLADA for digital asset storage.

