Goodbye, Dr. Dobb's

>> Tuesday, December 16, 2014

In college, my roommate subscribed to the print version of Dr. Dobb's Journal, and I was impressed by how intelligent and well-written it was. The print version ran its course, but the online site still provided solid content. I was flattered to be invited to write an article on OpenCL, and it was a pleasure to correspond with Andrew Binstock, the editor-in-chief.

But today, Mr. Binstock announced that the online site will cease publication of new articles. Goodbye, Dr. Dobb's. Thank you for so many fine articles.


FFT version 1.1

>> Wednesday, November 19, 2014

The good folks at Manning have uploaded the new FFT code to the book's main site. I've updated the Linux and Windows archives, but I didn't change the Mac OS archive because my MacBook White is long dead.

As far as I can tell, the new code takes care of the race condition. If anyone has any concerns, please let me know.


FFT Update

>> Saturday, November 15, 2014

Because of the comments I received, I decided to test my FFT on new systems with new hardware and new drivers. My FFT passed every test, so I wrote a self-satisfied post stating that the commenter's problem was caused by using work-groups whose sizes weren't a power of two.

Then it dawned on me. In the fft_init kernel, work items read data from bit-reversed addresses and write the processed data to unreversed addresses in the same buffer. This makes it possible for one work item to read data that has already been processed by another. This is the race condition to which the commenter was referring.

Thankfully, this problem is easy to fix. I'll add a second buffer to fft_init so that every work item reads from the first buffer and writes to the second. I'll get this coded tomorrow morning and I'll contact Manning to get it uploaded to their software site.
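The two-buffer idea can be sketched in plain C (this is an illustration only, not the book's actual OpenCL kernel; the function names are mine, and bit_reverse is the conventional scalar loop rather than my uint4 routine):

```c
#include <assert.h>

/* Reverse the lowest num_bits bits of x (conventional scalar loop). */
unsigned int bit_reverse(unsigned int x, int num_bits) {
    unsigned int ans = x & 1;
    while (--num_bits) {
        x >>= 1;
        ans <<= 1;
        ans += x & 1;
    }
    return ans;
}

/* Out-of-place bit-reversed reordering: every element is read from src
   and written to dst, so one "work item" can never read a slot that
   another has already overwritten. Reading and writing the same buffer
   is what creates the race. */
void bit_reverse_permute(const float *src, float *dst, int num_bits) {
    int n = 1 << num_bits;
    for (int i = 0; i < n; i++)
        dst[i] = src[bit_reverse((unsigned int)i, num_bits)];
}
```

With an 8-point input {0,1,2,3,4,5,6,7}, the reordered output is {0,4,2,6,1,5,3,7}, which is the standard bit-reversed ordering an FFT expects.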

I'd like to thank the commenter for his/her assistance. I'd also like to point out that my bit-reversal algorithm, while idiosyncratic, is perfectly functional.


FFT Concern

>> Saturday, November 1, 2014

Over three and a half years ago, I completed the OpenCL FFT that I discussed in Chapter 14. I tested it with data sets of varying sizes on different graphics cards and operating systems. It ran successfully every time, but recent comments make it seem likely that there's a race condition that needs to be addressed.

The problem with debugging an FFT is that it requires long stretches of concentration, which for me usually means lying on the floor and squinting up at the ceiling for hours on end. Unfortunately, I'm busy at the moment and don't have the time. But because I'm so ashamed, I'm going to take the week of 11/10 off from work and do my best to resolve the problem.

It looks like the root cause is my bit-reversal routine, and I'll explain why this is particularly jarring. If you're familiar with FFT code, then you know that many routines perform bit-reversal with code like the following:

ans = x & 1;
while(--numBits) {
   x >>= 1;
   ans <<= 1;
   ans += x & 1;
}
Rather than operate on scalars, I devised a routine that bit-reverses all four elements of a uint4 vector at the same time. I thought it was clever, but if it causes a race condition, it has to go.
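To show the idea of reversing four values at once (and only the idea; my actual routine is different), here's a plain-C sketch in which a four-element struct stands in for OpenCL's uint4 type, and the reversal loop operates on all four lanes per iteration:

```c
#include <assert.h>

/* A stand-in for OpenCL's uint4 vector type. */
typedef struct { unsigned int s[4]; } vec4u;

/* Bit-reverse the lowest num_bits bits of all four lanes "at once".
   Illustrative sketch only, not the book's actual routine. */
vec4u bit_reverse4(vec4u x, int num_bits) {
    vec4u ans;
    for (int lane = 0; lane < 4; lane++)
        ans.s[lane] = x.s[lane] & 1;
    while (--num_bits) {
        for (int lane = 0; lane < 4; lane++) {
            x.s[lane] >>= 1;
            ans.s[lane] = (ans.s[lane] << 1) | (x.s[lane] & 1);
        }
    }
    return ans;
}
```

On a GPU, the inner per-lane loop disappears: a uint4 shift or mask applies to all four lanes in a single vector operation.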

I apologize to everyone who was/is disappointed with the code. If you're still looking for a good OpenCL FFT, I recommend the clFFT project. This was once part of AMD's Accelerated Parallel Processing Math Libraries (APPML), but it looks like that's no longer supported.


Allocating Local Memory

>> Friday, October 3, 2014

Recently, someone asked me about the difference between the two methods of allocating local memory. That is, why would you use one method over the other? This isn't an easy question, and before I answer it here, I'd like to explain what the two methods are.

The first allocation method is performed inside the kernel. If a variable is declared in the kernel function, its declaration must identify which memory region contains its data. If the declaration is preceded by __local, the variable will be allocated from local memory. For example, the following declaration sets aside 1024 bytes of local memory to be accessed as an array named foo.

__local float foo[256];

The second method is performed in host code. clSetKernelArg assigns data to be passed into the kernel function as an argument. The last parameter of clSetKernelArg points to the argument's data. If the last argument points to a memory object (cl_mem), the kernel will access the data from global/constant memory. If the argument points to a primitive like an int, the kernel will access the data from private memory. But if the last argument is NULL, no data will be passed to the kernel. In this case, the purpose of clSetKernelArg is to allocate local memory for the kernel's use.

An example will help make this clear. Consider the following calls to clSetKernelArg:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &example_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &example_int);
clSetKernelArg(kernel, 2, 256 * sizeof(float), NULL);

These lines define the first three arguments of the kernel. If the kernel function's name is foo, the following code shows how foo's arguments might be accessed as function parameters:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var)

In the third call to clSetKernelArg, the data pointer is set to NULL. Because the kernel's third parameter is declared with the __local qualifier, this call allocates 256 * sizeof(float) bytes of local memory for that argument. The memory is uninitialized—the kernel will read/write to it as it performs its computation.

So, the two methods of allocating local memory are declaring a local variable in a kernel and calling clSetKernelArg with a NULL data pointer. When do you use one over the other? The answer is straightforward.

If the size of the local memory is constant, use the first method because it's simpler. But if the kernel must be executed multiple times with different sizes of local memory, use the second method. This is because the second method makes it possible to control the local memory size with a variable.
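For example, the following host-code fragment re-launches a kernel with a different local allocation each time (hypothetical sketch: kernel, queue, and the work sizes are assumed to be set up elsewhere, and error checking is omitted):

```c
/* Hypothetical fragment: execute the kernel repeatedly, allocating a
   different amount of local memory for argument 2 on each pass. */
size_t global_size = 1024;
for(int n = 64; n <= 256; n *= 2) {
   size_t local_size = n;

   /* Allocate n floats of local memory for kernel argument 2. */
   clSetKernelArg(kernel, 2, n * sizeof(float), NULL);

   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                          &global_size, &local_size, 0, NULL, NULL);
}
```

The first method can't do this, because the array size in a kernel-side __local declaration must be a compile-time constant.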


AMD and OpenCL 2.0

>> Wednesday, October 1, 2014

The good news is that AMD has released new drivers that enable OpenCL 2.0 support on existing graphics cards. This makes AMD the first company to provide real-world support for OpenCL 2.0.

The bad news is that all my attempts to install the Linux driver have failed. My target system runs 64-bit CentOS 6.5, but when I ran Catalyst, it gave me a 32-bit RPM. This made me nervous, but I uninstalled the existing driver, installed the 32-bit software dependencies, and finished the driver installation. Now when I start the computer, it hangs during the bootloading process. Grr...

I've been a fan of AMD ever since they released the Athlon 64 CPU. But dealing with fglrx is murder. I have never installed an AMD graphics driver without repeated trial and error.


State of 3-D Graphical Programming

>> Monday, September 22, 2014

Tom Dalling wrote an insightful post about the current set of 3-D graphical programming APIs: OpenGL, Mantle, Metal, and Direct3D. I can still remember how excited I was when AMD announced Mantle. But it's been a year since the announcement and AMD still hasn't released the API to the public.

The post also mentions the Next Generation OpenGL Initiative. According to the SIGGRAPH 2014 presentation (PDF), this entails a "ground up redesign" of the specification. The OpenGL API definitely needs an overhaul, but if they rewrite OpenGL from scratch, what will happen to OpenCL-OpenGL interoperability? We'll see.

