Shuffling and Sorting

>> Sunday, February 13, 2011

Note: I've improved upon this code and you can read the discussion here.

Of all the builtin functions available for OpenCL kernels, my favorites are shuffle and shuffle2. These rearrange the elements of vectors, which is something you can't do elegantly in regular C. When I needed to code the bitonic sort using OpenCL, I thought it would be clever to come up with a compare-and-swap operation using vector operations. Here's what I arrived at:

inline void compare_and_swap(__local int2 *d1, __local int2 *d2, uint dir) {
   int2 input1 = *d1; int2 input2 = *d2;
   uint2 cmp = (uint2)(input1 > input2) ^ dir;
   uint2 mask = 2*cmp;
   mask.s1 += 1;
   *d1 = shuffle2(input1, input2, mask);
   *d2 = shuffle2(input2, input1, mask);
The goal is to create a mask vector that can rearrange the elements of vectors d1 and d2 in local memory. In contrast, the bitonic sorts provided by the Nvidia SDK and the AMD SDK use compare-and-swap routines that rely on scalar operations:
if((local_data[i] > local_data[i+2]) == dir ) {
   t = local_data[i];
   local_data[i] = local_data[i+2];
   local_data[i+2] = t;
Sorting data is a crucial topic, and I think database acceleration will turn out to be one of the most important uses of OpenCL. To determine whether sorting is better accomplished with vectors or scalars, I coded three test kernels:
I've profiled these three kernels extensively, and the full_vector kernel swaps data faster than the full_scalar kernel. This makes sense to me, but oddly, the full_scalar kernel runs faster than the part_vector kernel. Still trying to figure this out...


Post a Comment

  © Blogger template Werd by 2009

Back to TOP