Atomic Functions and Dot Products

>> Monday, February 28, 2011

I love OpenCL dearly, but it has two shortcomings that frustrate me to no end:

  • No atomic floating-point operations
  • No mutexes or locks

An example will show why this state of affairs is so painful. Suppose you have 1000 work-items and you want to compute the dot product of two vectors, x and y, both containing 1000 floats. At first glance, you might try something like the following:

int id = get_global_id(0);
dot_prod += x[id] * y[id];

This will compile, but you won't get the right answer because every work-item will try to read and update dot_prod at the same time, so some updates overwrite others. The result is a race condition. To get the right answer, each work-item must have exclusive access to dot_prod. But OpenCL doesn't support atomic floating-point operations or mutexes, so there's no way for one work-item to tell the others to back off.

You could process the dot product in stages. In the first stage, you could have 100 work-items perform 10 multiply-and-add operations. In the second stage, you could have 10 work-items sum the 100 results from the first stage. In the third stage, a single work-item could sum the 10 results from the second stage.
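To make the staging concrete, here is a minimal sketch in plain C rather than kernel code. It is only a serial stand-in: each iteration of an outer loop plays the role of one work-item, and the function name staged_dot and the constants are hypothetical names chosen for illustration.

```c
#include <assert.h>

enum { N = 1000, STAGE1 = 100, STAGE2 = 10 };

/* Serial stand-in for the three-stage scheme. Each iteration of an outer
   loop plays the role of one work-item; on a real device the gaps between
   the stages are where synchronization would be required. */
float staged_dot(const float *x, const float *y, int n)
{
    float partial1[STAGE1];   /* results of stage 1 */
    float partial2[STAGE2];   /* results of stage 2 */
    float result = 0.0f;

    assert(n == N);  /* this sketch only handles the 1000-element case */

    /* Stage 1: 100 "work-items", 10 multiply-and-add operations each. */
    for (int id = 0; id < STAGE1; ++id) {
        float sum = 0.0f;
        for (int k = 0; k < N / STAGE1; ++k) {
            int i = id * (N / STAGE1) + k;
            sum += x[i] * y[i];
        }
        partial1[id] = sum;
    }

    /* Stage 2: 10 "work-items" each sum 10 of the 100 stage-1 results. */
    for (int id = 0; id < STAGE2; ++id) {
        float sum = 0.0f;
        for (int k = 0; k < STAGE1 / STAGE2; ++k)
            sum += partial1[id * (STAGE1 / STAGE2) + k];
        partial2[id] = sum;
    }

    /* Stage 3: a single "work-item" sums the 10 stage-2 results. */
    for (int id = 0; id < STAGE2; ++id)
        result += partial2[id];

    return result;
}
```

Calling staged_dot(x, y, 1000) on two 1000-element arrays returns the dot product. The point of the sketch is the data flow between stages, not performance.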

But there's a problem. You need to set up a barrier after each stage to ensure that succeeding stages won't start prematurely. In my experience, it takes more time for a barrier command to execute than it takes to perform the algorithm using a single work-item.

So that's what I'm doing. To my great shame, the QR decomposition code computes dot products with one work-item:
if(get_global_id(0) == 0) {
   ...compute dot product...
}
This code doesn't use any parallelism, and would get me fired from any professional job in high-performance computing. But it's faster than every alternative I can think of.


The AMD Fusion

>> Wednesday, February 23, 2011

One of the reasons I first became interested in OpenCL was AMD's Fusion APU (Accelerated Processing Unit). Since Intel's Larrabee project collapsed, this seems to be the super-processor of the future.

Originally, I heard that the Fusion would make discrete graphics cards obsolete because it contains a CPU and a GPU on a single chip. But AMD, realizing that this would cut into their lucrative graphics card business, has backed away from that claim. In the past, they used statements like "a supercomputer in every home," but now it looks like they don't have a coherent selling strategy at all.

So I'm nervous. The marketing for the Fusion has been meager, and although AMD has provided a whitepaper, they still haven't come up with a compelling reason for people to buy the Fusion. Instead, it looks like AMD is waiting for enthusiasts/entrepreneurs to tell them why people should buy the Fusion.

AMD hasn't provided any simulators or training classes, as IBM did for the Cell processor. They haven't announced any app contests with prizes, as Google did with their Android. But they have set up the AMD Fusion Fund, calling for companies to submit their ideas. I submitted my idea for Fusion development this afternoon, so we'll see what happens.


Nvidia, CUDA, and OpenCL

>> Sunday, February 20, 2011

I started learning OpenCL because I wanted my code to run on as many vendors' devices as possible. But whenever I find GPU code on the web, it's always in CUDA, which only targets Nvidia devices.

CUDA is older than OpenCL, so it's understandable that it has a larger developer base. But still, a job search tells me there are 217 openings for "cuda -opencl" but only 79 for "opencl -cuda". So even though the OpenCL Working Group contains AMD, Intel, IBM, Apple, and Nvidia, companies are still only interested in targeting Nvidia hardware.

Not really that surprising, I suppose. Many corporations support OpenCL, but none of them are as passionate about the language as Nvidia is about CUDA. GPGPU development is still a novelty to most, but Nvidia keeps making dedicated GPGPU hardware like their Tesla servers. Nvidia has released cuBLAS, a library of BLAS-related matrix routines. OpenCL doesn't have anything like it. CUDA has three books and a certification exam. OpenCL has no books at all. Not yet, anyway.



>> Thursday, February 17, 2011

I just spent an extraordinary amount of money to have my second novel reviewed. The reviewer isn't a professional editor or agent, but a suspense novelist with quite a few published novels. I'm sure his feedback will be helpful, but I don't hold out much hope that my novel will be published. It's kind of odd.

I had higher hopes for the short story that I submitted to the 2010 Short Story Contest by Suspense Magazine. But alas, it's not on their list of the Top Ten Submissions...



>> Wednesday, February 16, 2011

I've started using PyOpenCL, which lets you code OpenCL host applications in Python. It's unbelievable how simple it is to work with. For example, let's say you want to create a kernel from a function called foo, set three arguments (a, b, and c), and execute the kernel with 20x16 work-items divided into work-groups of 5x4 work-items each. In C, the code would look like this:

foo_kernel = clCreateKernel(program, "foo", NULL);
clSetKernelArg(foo_kernel, 0, sizeof(a), &a);
clSetKernelArg(foo_kernel, 1, sizeof(b), &b);
clSetKernelArg(foo_kernel, 2, sizeof(c), &c);
size_t global_size[2] = {20, 16};
size_t local_size[2] = {5, 4};
clEnqueueNDRangeKernel(queue, foo_kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
Here's how it looks with PyOpenCL:

program.foo(queue, (20, 16), (5, 4), a, b, c)

One line of Python code accomplishes the same result as seven lines of C code, and it's so much more intuitive. It almost feels like cheating. If my intended application didn't require C++, I would definitely code with PyOpenCL.


Shuffling and Sorting

>> Sunday, February 13, 2011

Note: I've improved upon this code and you can read the discussion here.

Of all the built-in functions available for OpenCL kernels, my favorites are shuffle and shuffle2. These rearrange the elements of vectors, which is something you can't do elegantly in regular C. When I needed to code the bitonic sort using OpenCL, I thought it would be clever to come up with a compare-and-swap operation using vector operations. Here's what I arrived at:

inline void compare_and_swap(__local int2 *d1, __local int2 *d2, uint dir) {
   int2 input1 = *d1; int2 input2 = *d2;
   uint2 cmp = (uint2)(input1 > input2) ^ dir;
   uint2 mask = 2*cmp;
   mask.s1 += 1;
   *d1 = shuffle2(input1, input2, mask);
   *d2 = shuffle2(input2, input1, mask);
}
The goal is to create a mask vector that can rearrange the elements of vectors d1 and d2 in local memory. In contrast, the bitonic sorts provided by the Nvidia SDK and the AMD SDK use compare-and-swap routines that rely on scalar operations:
if((local_data[i] > local_data[i+2]) == dir ) {
   t = local_data[i];
   local_data[i] = local_data[i+2];
   local_data[i+2] = t;
}
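To see how the mask in the vector version picks elements, here is a rough emulation in plain C. It is a sketch under two assumptions taken from the OpenCL specification: a vector comparison yields all bits set for true and zero for false, and for two-element inputs shuffle2 looks only at the two low bits of each mask element (0 and 1 select from the first input, 2 and 3 from the second). The struct types and the names pick, shuffle2_emu, gt, and compare_and_swap_emu are hypothetical stand-ins, not OpenCL built-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t s0, s1; } int2;    /* stand-in for OpenCL int2 */
typedef struct { uint32_t s0, s1; } uint2;  /* stand-in for OpenCL uint2 */

/* Select one element from the concatenation (a.s0, a.s1, b.s0, b.s1),
   using only the two low bits of the mask element. */
static int32_t pick(int2 a, int2 b, uint32_t m)
{
    switch (m & 3u) {
    case 0:  return a.s0;
    case 1:  return a.s1;
    case 2:  return b.s0;
    default: return b.s1;
    }
}

/* Emulates shuffle2 for two-element inputs and a two-element mask. */
static int2 shuffle2_emu(int2 a, int2 b, uint2 mask)
{
    int2 r = { pick(a, b, mask.s0), pick(a, b, mask.s1) };
    return r;
}

/* Emulates one lane of a vector comparison: all bits set for true. */
static uint32_t gt(int32_t a, int32_t b)
{
    return a > b ? 0xFFFFFFFFu : 0u;
}

/* Same steps as the vector compare_and_swap, one statement per line. */
void compare_and_swap_emu(int2 *d1, int2 *d2, uint32_t dir)
{
    assert(d1 != NULL && d2 != NULL);

    int2 in1 = *d1, in2 = *d2;
    uint2 cmp  = { gt(in1.s0, in2.s0) ^ dir, gt(in1.s1, in2.s1) ^ dir };
    uint2 mask = { 2u * cmp.s0, 2u * cmp.s1 };  /* low bits: 0 or 2 */
    mask.s1 += 1u;                              /* lane 1 reads s1 */

    *d1 = shuffle2_emu(in1, in2, mask);
    *d2 = shuffle2_emu(in2, in1, mask);
}
```

In this emulation, with dir set to 0 the smaller value of each lane ends up in d1 and the larger in d2; with dir set to 1 the roles reverse. For example, starting from d1 = (5, 1) and d2 = (3, 4) with dir = 0, the result is d1 = (3, 1) and d2 = (5, 4).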
Sorting data is a crucial topic, and I think database acceleration will turn out to be one of the most important uses of OpenCL. To determine whether sorting is better accomplished with vectors or scalars, I coded three test kernels:
I've profiled these three kernels extensively, and the full_vector kernel swaps data faster than the full_scalar kernel. This makes sense to me, but oddly, the full_scalar kernel runs faster than the part_vector kernel. Still trying to figure this out...

