A Sense of hUMA

>> Saturday, May 4, 2013

In OpenCL, transferring data between the CPU and GPU can take a significant amount of time. This is a problem for all GPGPU applications, and to solve it, AMD has developed heterogeneous Uniform Memory Access, or hUMA. Ars Technica has a great article on the subject here.

It looks wonderful to me, particularly the cache coherence. With hUMA, the CPU and GPU share one address space, so CPU-GPU data transfer will consist only of sending pointers back and forth. And if this is available on the upcoming PS4, that will be even better.


WebCL News

>> Tuesday, April 23, 2013

Here are two items of interest related to WebCL, the browser-based implementation of OpenCL:

  1. I wrote an article that explains how to execute WebCL kernels from Firefox.
  2. The Khronos Group has a Request for Quote (PDF) for a tool capable of validating WebCL kernels. The goal is to prevent bad kernels from crashing the user's computer.

I considered submitting a quote, but I just don't have the time. And I'd imagine validating a kernel is a tricky task.


Ten Reasons Why Android Should Support OpenCL

>> Wednesday, April 3, 2013

In an earlier post, I explained why a Google engineer's grievance with OpenCL was mistaken. Now here are ten reasons why Google should make OpenCL the high-performance language of choice on Android:

10. GPU vendors are the driving force behind OpenCL and their tools provide stable, high-speed kernel execution. When GPU technology improves, OpenCL improves immediately.

9. If Android becomes a desktop OS, it should be able to access desktop GPUs and mobile GPUs. OpenCL has broad support on both platforms.

8. Aparapi makes it easy to launch OpenCL kernels from Java. It's open-source and GSS Mahadevan has successfully used it on Android.

7. In addition to CPUs and GPUs, OpenCL kernels can be executed on DSPs and FPGAs. Future high-performance devices will be more likely to support OpenCL than any other language.

6. When OpenCL devices are added to a context, they can work together to execute kernels. With OpenCL, embedded devices have the potential of accessing more powerful systems to crunch data.

5. One of OpenCL's chief advantages is OpenCL-OpenGL interoperability, which allows OpenCL kernels to process OpenGL buffer data before rendering starts. It would take a lot of work to add a similar capability for Renderscript.

4. Currently, native Android developers need to learn x86 and ARM/NEON instructions to ensure high performance. Intel and ARM are both strong supporters of OpenCL, so if Android adopted OpenCL, native developers would only have to learn one language.

3. Google has put aside O3D in favor of WebGL and the Renderscript graphics engine in favor of OpenGL. If history is any guide, Google will choose OpenCL over Renderscript computation.

2. The general-purpose GPU (GPGPU) community is small and fragmented. It's unlikely that a new, OS-specific language will attract a developer base large enough to justify its existence.

1. If iOS supports OpenCL and Android doesn't, GPU-accelerated apps will run faster on iOS. High-performance mobile computing isn't a big deal yet, but there's no telling what the future may bring.


OpenCL Image Filtering on the Nexus 10

>> Wednesday, March 27, 2013

I've coded a practical OpenCL example for the Nexus 10 and the code is on Github. The app works in six stages:

  1. The user clicks a button to take a picture.
  2. The Activity receives image data from the camera, places it in a ByteBuffer, and passes it to the native function.
  3. The native function sends the buffer data to the GPU, which executes the OpenCL kernel.
  4. The kernel performs 3x3 spatial convolution and places the result in an output buffer.
  5. The native function reads the GPU's result and places the data in a ByteBuffer.
  6. The Activity creates a bitmap from the ByteBuffer containing the filtered image.

I tried using OpenCL's image processing routines, but the Mali GPU doesn't support the CL_ARGB format. I tried using two-dimensional work-items, but that gave me an odd error. So the kernel below reads pixels from plain buffers and derives each pixel's row and column from a one-dimensional global ID:
__kernel void camera_filter(__global int *input_img,
                            __global int *out_img,
                            int width, int height) {

   int offset, red, green, blue;
   int row = get_global_id(0)/width;
   int col = get_global_id(0) - row * width;
   int4 pixels[3], color_vec;

   /* Set filter coefficients */
   int4 k0 = (int4)(-1, -1,  0,  0);
   int4 k1 = (int4)(-1,  0,  1,  0);
   int4 k2 = (int4)( 0,  1,  1,  0);

   /* Set filter denominator */
   int denom = 1;

   /* Filter interior pixels only; col stops at width-2 because each vload4 reads four pixels */
   if((row > 0) && (col > 0) && (col < width-2) && (row < height-1)) {

      /* Read input image data into private memory */
      offset = (row-1) * width + (col-1);
      pixels[0] = vload4(0, input_img + offset);
      pixels[1] = vload4(0, input_img + offset + width);
      pixels[2] = vload4(0, input_img + offset + 2*width);

      /* Compute red component */
      color_vec = ((pixels[0] & 0x00ff0000) >> 16) * k0 +
                  ((pixels[1] & 0x00ff0000) >> 16) * k1 +
                  ((pixels[2] & 0x00ff0000) >> 16) * k2;
      red = clamp((color_vec.s0 + color_vec.s1 + color_vec.s2)/denom, 0, 255);

      /* Compute green component */
      color_vec = ((pixels[0] & 0x0000ff00) >> 8) * k0 +
                  ((pixels[1] & 0x0000ff00) >> 8) * k1 +
                  ((pixels[2] & 0x0000ff00) >> 8) * k2;
      green = clamp((color_vec.s0 + color_vec.s1 + color_vec.s2)/denom, 0, 255);

      /* Compute blue component */
      color_vec = (pixels[0] & 0x000000ff) * k0 +
                  (pixels[1] & 0x000000ff) * k1 +
                  (pixels[2] & 0x000000ff) * k2;
      blue = clamp((color_vec.s0 + color_vec.s1 + color_vec.s2)/denom, 0, 255);

      /* Update output pixel in global memory */
      out_img[get_global_id(0)] = 0xff000000 + (red << 16) + (green << 8) + blue;
   }
   else {
      out_img[get_global_id(0)] = input_img[get_global_id(0)];
   }
}

The filter's operation is determined by the coefficients in the three int4 vectors. In this case, the filter gives a 3D shadow to the image. The coefficients are defined in the kernel at the moment, but ideally, they'd be set by the user at runtime. I'll see what I can do.
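When debugging a kernel like this, it can help to have a CPU version to compare against. The following is a minimal sketch in plain C, not the code from the app. The names camera_filter_ref, filter_channel, and clamp_channel are hypothetical, and passing the nine coefficients and the denominator as arguments is just one assumption about how runtime-selectable coefficients might look. It keeps the kernel's boundary test, so pixels outside the filtered region are copied unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a value to the 0..255 range of one color channel. */
static int clamp_channel(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Filter one channel of the 3x3 neighborhood centered on (row, col).
   shift selects the channel: 16 = red, 8 = green, 0 = blue. */
static int filter_channel(const uint32_t *in, int width, int row, int col,
                          const int coeffs[9], int denom, int shift) {
    int sum = 0;
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            uint32_t pixel = in[(size_t)(row + dr) * width + (col + dc)];
            int value = (int)((pixel >> shift) & 0xffu);
            sum += value * coeffs[(dr + 1) * 3 + (dc + 1)];
        }
    }
    return clamp_channel(sum / denom);
}

/* Hypothetical CPU reference for the camera_filter kernel.
   coeffs holds the 3x3 filter in row-major order. */
void camera_filter_ref(const uint32_t *in, uint32_t *out,
                       int width, int height,
                       const int coeffs[9], int denom) {
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            size_t idx = (size_t)row * width + col;
            /* Same boundary test as the kernel */
            if (row > 0 && col > 0 && col < width - 2 && row < height - 1) {
                uint32_t r = (uint32_t)filter_channel(in, width, row, col, coeffs, denom, 16);
                uint32_t g = (uint32_t)filter_channel(in, width, row, col, coeffs, denom, 8);
                uint32_t b = (uint32_t)filter_channel(in, width, row, col, coeffs, denom, 0);
                out[idx] = 0xff000000u | (r << 16) | (g << 8) | b;
            } else {
                out[idx] = in[idx];
            }
        }
    }
}
```

With the coefficients from the kernel, {-1, -1, 0, -1, 0, 1, 0, 1, 1} and a denominator of 1, this should produce the same pixels as the GPU version, which makes it a convenient check on the values read back through JNI.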

On a scale from 1 to 10, I'd give this project's difficulty a solid 10. The problem isn't OpenCL itself, but interfacing Java with OpenCL through JNI. It would be nice if Android had something like Aparapi, which makes it possible to directly execute OpenCL kernels from Java.


Google and OpenCL

>> Wednesday, March 20, 2013

As of Android 4.1, the Renderscript graphics engine has been deprecated in favor of OpenGL. Since then, I've hoped Renderscript computation would be deprecated in favor of OpenCL. But this message from a Google engineer explains why this hasn't happened.

His concern with OpenCL/CUDA is that "it gets peak performance at the expense of performance portability." He goes on to say that an application targeting one device won't execute optimally on another due to differences in work-group size, shared memory availability, and so on.

With proper coding, this isn't an issue. clGetDeviceInfo and clGetKernelWorkGroupInfo examine the device at runtime and return information about its work-group size, shared memory availability, preferred vector widths, and the sizes of its memory buffers. With these functions, applications can tailor their operation to take full advantage of the target device, no matter what its characteristics are.

For example, the following code identifies the maximum number of work-items that can be placed in each work-group when executing a given kernel on a given device:

size_t group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(group_size), &group_size, NULL);
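
Once the group size is known, the total number of work-items usually has to be adjusted to match it. Under OpenCL 1.x, when an explicit local size is passed to clEnqueueNDRangeKernel, the global size must be an exact multiple of it. Here is a minimal sketch of that adjustment in plain C. The helper name round_up_global_size is hypothetical and not part of the OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper: round the number of work-items up to the
   next multiple of the work-group size reported by the device. */
size_t round_up_global_size(size_t num_items, size_t group_size) {
    if (group_size == 0) {
        return num_items;   /* no group size available; leave unchanged */
    }
    size_t remainder = num_items % group_size;
    return remainder == 0 ? num_items : num_items + (group_size - remainder);
}
```

The rounded value would be passed as the global size and group_size as the local size. Because the rounding can create extra work-items, a kernel used this way would need to return early when get_global_id(0) is beyond the last element.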

If Google is going to stick with Renderscript for high performance, I'll grit my teeth and learn it. But I wish they'd support other options. I also wish they'd join the Khronos Group.


Conferences

>> Monday, March 11, 2013

I'm working on an app that reads image data from an Android device's camera and processes it with OpenCL. In the meantime, I thought I'd mention three upcoming conferences:

  1. AMD Developer Summit (11/11/13 in San Jose) - a conference devoted to OpenCL and AMD technology. Presentation topics can be submitted here and the deadline is March 15.
  2. LEAP Conference (5/21/13 in London) - LEAP stands for Low Energy Application Parallelism, and the conference focuses on high-performance computing with low-power hardware. The call for papers is here and the deadline is March 15.
  3. GTC (3/18/13 in San Jose) - Nvidia calls it the GPU Technology Conference, but it's really just for Nvidia GPUs.


OpenCL Device Test for Android (Nexus 10)

>> Sunday, March 3, 2013

I've received a few e-mails asking about the OpenCL-Android example. In response, I've taken two steps:

  • I've uploaded a free app called OpenCL Device Test to Google Play. It checks for compliant devices and if it finds one, it lists the device's processing characteristics.
  • I've uploaded the source code for the OpenCL Device Test to Github. This contains the entire SDK project, including the Java and JNI code.

The first task was a lot harder than I'd expected. To deploy an app on Google Play, you have to manually exclude every device that shouldn't be listed as a target. In this case, I excluded every device except the Nexus 10. This is because the app requires libGLES_mali.so, which isn't available on other platforms except the Nexus 4 (which I didn't test).

I just checked Google Play, and the app isn't available yet. This is my first time trying this, so I hope you'll bear with me.

