Allocating Local Memory for an Arbitrary Number of Work-items

>> Sunday, September 25, 2011

In my opinion, the most important OpenCL function to know is clEnqueueNDRangeKernel, which enqueues a kernel-execution command. Once you understand how its parameters affect the kernel's execution, the rest of OpenCL is fairly straightforward.

The second most important function, in my opinion, is clSetKernelArg. This configures the kernel's arguments so it has data to process while it executes. Its signature is given as follows:

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                      size_t arg_size, const void *arg_value)

When I first encountered it, I thought this function was simple. But in addition to specifying the data for a kernel argument, this function also configures which address space (global/constant, local, or private) the data will be stored in. There are three main rules:
  1. If arg_value points to a memory object, the argument's data will be stored in global or constant memory, depending on the modifier used by the kernel function.
  2. If arg_value points to something that isn't a memory object, the argument's data will be stored in private memory.
  3. If arg_value is NULL, the argument's data won't be initialized, but arg_size bytes will be allocated from local memory to store the data.
In general, work-items can access local memory much faster than they can access global memory, so it's a good idea to have each work-item store intermediate results in the local address space. If I know there are 256 work-items in a work-group and each item needs 32 bytes to store intermediate data, I'll set arg_size to 256*32 = 8192 and arg_value to NULL.
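The arithmetic in that example is easy to wrap in a small helper. This is only a sketch: local_bytes_for_group is a hypothetical name of my own, not part of OpenCL, and returning 0 on overflow is just one possible convention. The result is the value that would go into arg_size, with arg_value set to NULL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenCL function): computes how many bytes of
 * local memory one work-group needs when every work-item stores
 * bytes_per_item bytes of intermediate data.
 * Returns 0 if the multiplication would overflow size_t. */
size_t local_bytes_for_group(size_t items_per_group, size_t bytes_per_item)
{
    if (bytes_per_item != 0 && items_per_group > SIZE_MAX / bytes_per_item) {
        return 0; /* product does not fit in size_t */
    }
    return items_per_group * bytes_per_item;
}
```

With 256 work-items and 32 bytes each, the helper returns 8192, which would then be passed along the lines of clSetKernelArg(kernel, arg_index, 8192, NULL).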

But there's a problem. Unless the total number of work-items is a multiple of the maximum number of work-items per work-group, I have no way of knowing how many work-items will be assigned to each group. I can set the local_work_size parameter of clEnqueueNDRangeKernel, but sometimes I don't even know the total number of work-items I'll need. This happens frequently in OpenCL-OpenGL interoperability, where I don't know in advance how many triangles each vertex buffer object will contain.

The solution isn't too hard: allocate all the local memory the device can provide. To find out how much local memory is available, call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. The following code shows how this works:

cl_ulong local_mem_size;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);

Remember that the data returned by the function is a cl_ulong, not a size_t. I made this mistake repeatedly, and though it worked fine on my Linux system, my Windows development system didn't like it at all.
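Because cl_ulong is always a 64-bit unsigned integer while the width of size_t depends on the platform, the value has to be narrowed with some care before it is used as arg_size. The sketch below shows one way to do that and to work out how many work-items the local memory can serve. It uses uint64_t as a stand-in for cl_ulong so that it compiles without the OpenCL headers, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for cl_ulong, which OpenCL defines as a 64-bit unsigned integer.
 * This typedef is an assumption made so the sketch needs no OpenCL headers. */
typedef uint64_t local_mem_t;

/* Hypothetical helper: narrows the queried local memory size to size_t so it
 * can be used as arg_size. Returns 0 if the value does not fit in size_t. */
size_t local_mem_as_arg_size(local_mem_t local_mem_size)
{
    if (local_mem_size > (local_mem_t)SIZE_MAX) {
        return 0; /* too large for this platform's size_t */
    }
    return (size_t)local_mem_size;
}

/* Hypothetical helper: how many work-items can each get bytes_per_item bytes
 * when the whole local memory is allocated. Returns 0 if bytes_per_item is 0. */
size_t max_items_in_local_mem(local_mem_t local_mem_size, size_t bytes_per_item)
{
    local_mem_t items;

    if (bytes_per_item == 0) {
        return 0;
    }
    items = local_mem_size / bytes_per_item;
    return items > (local_mem_t)SIZE_MAX ? SIZE_MAX : (size_t)items;
}
```

For a device that reports 32768 bytes of local memory, 32 bytes per work-item leaves room for 1024 work-items.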

