Interesting Times

>> Tuesday, November 13, 2012

  • I managed to order a Nexus 10 this morning, but I'm concerned. Where is the OpenCL development kit for the Mali T-604? I've searched through ARM's Mali developer site, but none of their tools mention OpenCL. Roberto Mijat gave a presentation on OpenCL programming at the recent ARM TechCon, but there were no announcements related to OpenCL. Strange. Why would ARM go to the trouble of seeking OpenCL compliance for their GPU if they're not going to let anyone program it?

  • AMD has denied that they're hoping to sell the company, but there's no question that they're in deep financial trouble. It's heartbreaking. I had high hopes for the Fusion, but the performance just wasn't there. Throughout the Fusion Developer Summit, AMD's corporate officers said they were "betting the company" on OpenCL. Perhaps they lost.

  • To pressure gamers to upgrade, Microsoft has stated that their upcoming release of DirectX, version 11.1, will only be available for Windows 8. I hope that this behavior, along with Gabe Newell's efforts, will make game developers choose OpenGL over Direct3D. But Microsoft has always been very persuasive.


OpenCL and the Dot Product

>> Sunday, November 11, 2012

In an earlier post, I whined about OpenCL's lack of atomic functions for floating-point operations. This makes it hard to code a high-performance dot product in OpenCL, but by using vectors and local memory, we can still do pretty well.

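I've coded an application that computes the dot product of two vectors with 2^18 floating-point values each. Because the kernel reads the data as float4 vectors, each work item handles four values from each input. The source files are on GitHub, and the kernel looks like this: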

__kernel void dot_product(__global float4* a_vec, __global float4* b_vec,
                          __global float* output, __local float4* partial_dot) {

   int gid = get_global_id(0);
   int lid = get_local_id(0);
   int group_size = get_local_size(0);

   /* Place product of global values into local memory */
   partial_dot[lid] = a_vec[gid] * b_vec[gid];
   barrier(CLK_LOCAL_MEM_FENCE);

   /* Repeatedly add values in local memory (group size must be a power of two) */
   for(int i = group_size/2; i>0; i >>= 1) {
      if(lid < i) {
         partial_dot[lid] += partial_dot[lid + i];
      }
      /* Wait until every work item has finished this step */
      barrier(CLK_LOCAL_MEM_FENCE);
   }

   /* Transfer final result to global memory */
   if(lid == 0) {
      output[get_group_id(0)] = dot(partial_dot[0], (float4)(1.0f));
   }
}

When this kernel executes, the device doesn't compute the entire dot product. Instead, each work group writes one partial sum to global memory, and the host adds those partial sums to get the final result. In my tests, this ran much faster than a basic multiply-and-add algorithm. Still, I'm sure there's room for improvement.
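
To make the two stages concrete, here is a minimal CPU-only sketch in plain C that mimics them without any OpenCL calls. It is not the host code from the repository. The function names `group_partial_dot` and `sum_partials` are hypothetical, the `scratch` buffer stands in for local memory, and the sketch uses scalar floats instead of float4 for brevity. Like the kernel's halving loop, it assumes the group size is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: emulates what one work group does.
   scratch stands in for the kernel's local memory and must hold
   group_size floats. group_size is assumed to be a power of two. */
float group_partial_dot(const float *a, const float *b,
                        float *scratch, size_t group_size) {
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);

    /* Place the products into the scratch buffer */
    for (size_t lid = 0; lid < group_size; lid++) {
        scratch[lid] = a[lid] * b[lid];
    }

    /* Halve the active range each pass, adding the upper half
       into the lower half, until one value remains */
    for (size_t i = group_size / 2; i > 0; i >>= 1) {
        for (size_t lid = 0; lid < i; lid++) {
            scratch[lid] += scratch[lid + i];
        }
    }
    return scratch[0];
}

/* Hypothetical helper: the host-side step that adds the
   per-group partial results into the final dot product. */
float sum_partials(const float *partials, size_t num_groups) {
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; g++) {
        total += partials[g];
    }
    return total;
}
```

On a real device, the inner loop over `lid` runs in parallel across the work items of a group, which is why the kernel needs a barrier after each pass. The host-side sum stays cheap because it only touches one value per work group.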

I've decided to open this blog for comments. If you have any thoughts on this kernel or anything else on this blog, feel free to write.

