Atomic Functions and Dot Products

Monday, February 28, 2011

I love OpenCL dearly, but it has two shortcomings that frustrate me to no end:

  • No atomic floating-point operations
  • No mutexes or locks

An example will show why this state of affairs is so painful. Suppose you have 1000 work-items and you want to compute the dot product of two vectors, x and y, both containing 1000 floats. At first glance, you might try something like the following:

int id = get_global_id(0);
dot_prod += x[id] * y[id];

This will compile, but you won't get the right answer because every work-item will attempt to access dot_prod at the same time. The result is a race condition. To get the right answer, each work-item must have exclusive access to dot_prod. But OpenCL doesn't support atomic floating-point operations or mutexes, so there's no way for one work-item to tell the others to back off.

You could process the dot product in stages. In the first stage, you could have 100 work-items each perform 10 multiply-and-add operations. In the second stage, you could have 10 work-items each sum 10 of the 100 results from the first stage. In the third stage, a single work-item could sum the 10 results from the second stage.
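As a rough sketch of that staged scheme, here is a plain C simulation (not an OpenCL kernel) in which ordinary loops stand in for the work-items. The names staged_dot, stage1_item and sum_item are made up for illustration, and the sizes are hard-coded to the 1000/100/10 split described above.

```c
#include <assert.h>
#include <stddef.h>

#define N      1000  /* vector length from the example */
#define STAGE1 100   /* work-items in the first stage */
#define STAGE2 10    /* work-items in the second stage */

/* Stage 1: work-item `id` multiplies and adds its own slice of x and y. */
static float stage1_item(const float *x, const float *y, size_t id) {
    size_t chunk = N / STAGE1;
    float sum = 0.0f;
    for (size_t i = id * chunk; i < (id + 1) * chunk; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Stages 2 and 3: work-item `id` sums `chunk` consecutive partial results. */
static float sum_item(const float *in, size_t id, size_t chunk) {
    float sum = 0.0f;
    for (size_t i = id * chunk; i < (id + 1) * chunk; ++i)
        sum += in[i];
    return sum;
}

/* Hypothetical driver: each loop iteration plays the role of one work-item. */
float staged_dot(const float *x, const float *y) {
    float partial1[STAGE1];
    float partial2[STAGE2];

    assert(N % STAGE1 == 0 && STAGE1 % STAGE2 == 0);

    for (size_t id = 0; id < STAGE1; ++id)   /* stage 1: 100 work-items */
        partial1[id] = stage1_item(x, y, id);
    /* on a real device, stage 2 must not start before this point */

    for (size_t id = 0; id < STAGE2; ++id)   /* stage 2: 10 work-items */
        partial2[id] = sum_item(partial1, id, STAGE1 / STAGE2);
    /* and stage 3 must not start before this point */

    return sum_item(partial2, 0, STAGE2);    /* stage 3: one work-item */
}
```

Calling staged_dot(x, y) on two 1000-element float arrays returns their dot product. Because the loops run one after another, the simulation gets its synchronization for free; the two comments mark the points where a device would have to wait for every work-item of the previous stage.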

But there's a problem. You need to set up a barrier after each stage to ensure that succeeding stages won't start prematurely. In my experience, it takes more time for a barrier command to execute than it takes to perform the algorithm using a single work-item.

So that's what I'm doing. To my great shame, the QR decomposition code computes dot products with one work-item:

if(get_global_id(0) == 0) {
   ...compute dot product...
}

This code doesn't use any parallelism, and would get me fired from any professional job in high-performance computing. But it's faster than every alternative I can think of.
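For illustration only, the guard pattern can be mimicked in plain C. This is not the QR decomposition code itself: dot_single_item and launch_dot are hypothetical names, the global_id parameter stands in for get_global_id(0), and the loop body is just the obvious serial sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel body. Only work-item 0 does any work,
   so result has a single writer and there is nothing to race on. */
void dot_single_item(size_t global_id, const float *x, const float *y,
                     size_t n, float *result) {
    if (global_id == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[i];
        *result = sum;
    }
}

/* Simulates launching num_items work-items, one after another. */
float launch_dot(const float *x, const float *y, size_t n, size_t num_items) {
    float result = 0.0f;
    assert(num_items > 0);  /* work-item 0 must exist */
    for (size_t id = 0; id < num_items; ++id)
        dot_single_item(id, x, y, n, &result);
    return result;
}
```

Every simulated work-item other than 0 falls straight through the guard, which is exactly the wasted parallelism lamented above.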

