### OpenCL and the Dot Product

## Sunday, November 11, 2012

In an earlier post, I whined about OpenCL's lack of atomic functions for floating-point operations. This makes it hard to code a high-performance dot product in OpenCL, but by using vectors and local memory, we can still do pretty well.

I've coded an application that computes the dot product of two vectors, each containing 2^18 floating-point values. The source files are on GitHub, and the kernel looks like this:

    __kernel void dot_product(__global float4* a_vec, __global float4* b_vec,
                              __global float* output, __local float4* partial_dot) {

        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int group_size = get_local_size(0);

        /* Place product of global values into local memory */
        partial_dot[lid] = a_vec[gid] * b_vec[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Repeatedly add values in local memory */
        for(int i = group_size/2; i>0; i >>= 1) {
            if(lid < i) {
                partial_dot[lid] += partial_dot[lid + i];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* Transfer final result to global memory */
        if(lid == 0) {
            output[get_group_id(0)] = dot(partial_dot[0], (float4)(1.0f));
        }
    }
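To see why the halving loop leaves the group's total in element 0, here is a minimal sketch in plain C that emulates one work group serially. It uses scalars instead of `float4` for brevity, and it assumes the group size is a power of two, which the kernel's loop also relies on. The name `reduce_group` is hypothetical and not part of the source files on GitHub.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: serial emulation of the kernel's local-memory
   reduction for a single work group. partial[] plays the role of
   partial_dot, holding one product per work item. */
float reduce_group(float *partial, size_t group_size) {
    /* The halving loop only reaches every element when group_size
       is a power of two. */
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);

    for (size_t i = group_size / 2; i > 0; i >>= 1) {
        /* On the device, the work items with lid < i perform this step
           in parallel; the barrier separates one pass from the next. */
        for (size_t lid = 0; lid < i; lid++) {
            partial[lid] += partial[lid + i];
        }
    }

    /* After the last pass, element 0 holds the sum of the whole group. */
    return partial[0];
}
```

Each pass halves the number of active elements, so a group of N work items needs log2(N) passes rather than N - 1 sequential additions.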

When this kernel executes, the device doesn't compute the entire dot product. Instead, each work group writes one partial sum to the output buffer, and the host adds those partial sums to get the final result. In my tests, this ran much faster than a basic multiply-and-add algorithm. Still, I'm sure there's room for improvement.
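The host-side step might look like the following sketch. It assumes the per-group results have already been read back into a host array (for example with `clEnqueueReadBuffer`). The name `sum_partials` and its parameters are hypothetical, not code from the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add up the one-value-per-work-group results that
   the kernel wrote to its output buffer. num_groups would be the global
   size divided by the local size used for the kernel launch. */
float sum_partials(const float *group_sums, size_t num_groups) {
    /* An empty input is allowed; otherwise the array must exist. */
    assert(group_sums != NULL || num_groups == 0);

    float total = 0.0f;
    for (size_t g = 0; g < num_groups; g++) {
        total += group_sums[g];
    }
    return total;
}
```

As an illustration, 2^18 floats packed into 2^16 `float4` values with an assumed local size of 256 would leave 256 partial sums for the host to add, which is a small amount of serial work compared with the full vector.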

I've decided to open this blog for comments. If you have any thoughts on this kernel or anything else on this blog, feel free to write.

## 1 comment:

How much speedup did you achieve?
