FFT version 1.1

>> Wednesday, November 19, 2014

The good folks at Manning have uploaded the new FFT code to the book's main site. I've updated the code in the Linux and Windows archives, but I didn't change the Mac OS archive because I no longer have a Mac for OpenCL testing.

As far as I can tell, the new code takes care of the race condition. If anyone has any concerns, please let me know.

FFT Update

>> Saturday, November 15, 2014

Because of the comments I received, I decided to test my FFT on newer systems with new hardware and current drivers. My FFT passed every test, so I wrote a self-satisfied post stating that the commenter's problem was caused by using work-groups whose sizes weren't a power of two.

Then it dawned on me. In the fft_init kernel, work items read data from bit-reversed addresses and write the processed data to unreversed addresses in the same buffer. This makes it possible for one work item to read data that has already been processed by another. This is the race condition to which the commenter was referring.

Thankfully, this problem is easy to fix. I'll add a second buffer to fft_init so that every work item reads from the first buffer and writes to the second. I'll get this coded tomorrow morning and I'll contact Manning to get it uploaded to their software site.
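To sketch the change (this is a simplified illustration rather than the actual fft_init code, with bit_reverse standing in for the reversal routine discussed below), every work item reads from one buffer and writes to the other:

__kernel void fft_init(__global float2 *src,   /* read-only input buffer */
                       __global float2 *dst,   /* write-only output buffer */
                       uint num_bits) {
   uint id = get_global_id(0);
   uint rev = bit_reverse(id, num_bits);   /* stand-in for the bit-reversal routine */
   /* Reads touch only src and writes touch only dst, so no work item
      can overwrite data that another work item still needs to read. */
   dst[id] = src[rev];
}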

I'd like to thank the commenter for his/her assistance. I'd also like to point out that my bit-reversal algorithm, while idiosyncratic, is perfectly functional.

FFT Concern

>> Saturday, November 1, 2014

Over three and a half years ago, I completed the OpenCL FFT that I discussed in Chapter 14. I tested it with data sets of varying sizes on different graphics cards and operating systems. It ran successfully every time, but recent comments make it seem likely that there's a race condition that needs to be addressed.

The problem with debugging an FFT is that it requires long stretches of uninterrupted concentration, which usually involves me lying on the floor and squinting up at the ceiling for hours on end. Unfortunately, I'm busy at the moment and don't have the time. But because I'm so ashamed, I'm going to take the week of 11/10 off from work and do my best to resolve the problem.

It looks like the root cause is my bit-reversal routine, and I'll explain why this is particularly jarring. If you're familiar with FFT code, then you know that many routines perform bit-reversal with code like the following:

ans = x & 1;          /* start with the lowest bit of x */
while(--numBits) {    /* repeat for each remaining bit */
   x >>= 1;           /* expose the next bit of x */
   ans <<= 1;         /* make room for it in the result */
   ans += x & 1;      /* append it to the reversed value */
}

Rather than operate on scalars, I devised a routine that bit-reverses all four elements of a uint4 vector at the same time. I thought it was clever, but if it causes a race condition, it has to go.
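Roughly speaking, the vectorized version looks like this (a simplified sketch, not the exact code from the book). OpenCL's bitwise and shift operators work component-wise on vectors, so the same loop handles all four elements at once:

uint4 bit_reverse4(uint4 x, uint numBits) {
   uint4 ans = x & 1;       /* lowest bit of each of the four elements */
   while(--numBits) {
      x >>= 1;              /* shift every element right by one */
      ans <<= 1;            /* shift every reversed value left by one */
      ans += x & 1;         /* append the next bit of each element */
   }
   return ans;
}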

I apologize to everyone who was/is disappointed with the code. If you're still looking for a good OpenCL FFT, I recommend the clFFT project. This was once part of AMD's Accelerated Parallel Processing Math Libraries (APPML), but it looks like that's no longer supported.

Allocating Local Memory

>> Friday, October 3, 2014

Recently, someone asked me about the difference between the two methods of allocating local memory. That is, why would you use one method over the other? This isn't an easy question, and before I answer it here, I'd like to explain what the two methods are.

The first allocation method is performed inside the kernel. When a variable is declared in the kernel function, its declaration can identify which memory region holds its data (private memory is the default). If the declaration is preceded by __local, the variable will be allocated from local memory. For example, the following declaration sets aside 1024 bytes of local memory to be accessed as an array named foo.

__local float foo[256];

The second method is performed in host code. clSetKernelArg assigns data to be passed into the kernel function as an argument. The last parameter of clSetKernelArg points to the argument's data. If the last argument points to a memory object (cl_mem), the kernel will access the data from global/constant memory. If the argument points to a primitive like an int, the kernel will access the data from private memory. But if the last argument is NULL, no data will be passed to the kernel. In this case, the purpose of clSetKernelArg is to allocate local memory for the kernel's use.

An example will help make this clear. Consider the following calls to clSetKernelArg:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &example_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &example_int);
clSetKernelArg(kernel, 2, 256 * sizeof(float), NULL);

These lines define the first three arguments of the kernel. If the kernel function's name is foo, the following code shows how foo's arguments might be accessed as function parameters:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var)

In the third call to clSetKernelArg, the data pointer is set to NULL. This tells the runtime to set aside 256 * sizeof(float) bytes of local memory for the kernel's third argument, which must be declared with the __local qualifier. This memory is uninitialized—the kernel will read/write to it as it performs its computation.
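For a sense of how foo might use that allocation (a hypothetical kernel body, not code from the book), each work item could stage a value in local memory, synchronize with a barrier, and then read a value written by a neighbor:

__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var) {
   uint lid = get_local_id(0);
   /* Stage one value per work item in the local allocation */
   local_var[lid] = ex_buffer[get_global_id(0)].x;
   /* Wait until every work item in the group has written its value */
   barrier(CLK_LOCAL_MEM_FENCE);
   /* Read a value written by a neighboring work item */
   float neighbor = local_var[(lid + 1) % get_local_size(0)];
   ex_buffer[get_global_id(0)].y = neighbor + ex_int;
}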

So, the two methods of allocating local memory are declaring a local variable in a kernel and calling clSetKernelArg with a NULL data pointer. When do you use one over the other? The answer is straightforward.

If the size of the local memory is constant, use the first method because it's simpler. But if the kernel must be executed multiple times with different sizes of local memory, use the second method. This is because the second method makes it possible to control the local memory size with a variable.
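For example, if the amount of scratch space depends on the work-group size chosen by the host, the allocation might be set up like this (a sketch; wg_size is whatever work-group size the host code has picked):

size_t wg_size = 128;                          /* work-group size chosen at run time */
size_t local_bytes = wg_size * sizeof(float);  /* one float per work item */
clSetKernelArg(kernel, 2, local_bytes, NULL);  /* allocate local memory for argument 2 */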

AMD and OpenCL 2.0

>> Wednesday, October 1, 2014

The good news is that AMD has released new drivers that enable OpenCL 2.0 support on existing graphics cards. This makes AMD the first company to provide real-world support for OpenCL 2.0.

The bad news is that all my attempts to install the Linux driver have failed. My target system runs 64-bit CentOS 6.5, but when I ran the Catalyst installer, it gave me a 32-bit RPM. This made me nervous, but I uninstalled the existing driver, installed the 32-bit software dependencies, and finished the driver installation. Now the computer hangs during the boot process whenever I start it. Grr...

I've been a fan of AMD ever since they released the Athlon 64 CPU. But dealing with fglrx is murder. I have never installed an AMD graphics driver without repeated trial and error.

State of 3-D Graphical Programming

>> Monday, September 22, 2014

Tom Dalling wrote an insightful post about the current set of 3-D graphical programming APIs: OpenGL, Mantle, Metal, and Direct3D. I can still remember how excited I was when AMD announced Mantle. But it's been a year since the announcement and AMD still hasn't released the API to the public.

The post also mentions the Next Generation OpenGL Initiative. According to the SIGGRAPH 2014 presentation (PDF), this entails a "ground up redesign" of the specification. The OpenGL API definitely needs an overhaul, but if they rewrite OpenGL from scratch, what will happen to OpenCL-OpenGL interoperability? We'll see.

State of OpenCL

>> Sunday, August 24, 2014

EDIT: AMD released its OpenCL 2.0 drivers for existing graphics cards on 9/30/2014, thereby becoming the first company to provide real-world support for OpenCL 2.0.

The SIGGRAPH conference was held in Vancouver on August 13. OpenCL was one of the topics discussed and the slides from the Khronos Group can be downloaded here.

Looking through the OpenCL BOF slides, four points caught my eye:

  • AMD was the first company to support OpenCL 1.2, but Intel will be the first to support OpenCL 2.0. Their new Broadwell GPU architecture complies with the OpenCL 2.0 spec and AnandTech has a great article on it here.
  • The presentation states that "The Future is Mobile" and I agree. But iOS and Android are nowhere near supporting the execution of OpenCL kernels. This is a shame, as many mobile GPU vendors are working hard to provide OpenCL SDKs and drivers.
  • Version 2.0 of SPIR (Standard Portable Intermediate Representation) has been released. This format makes it possible to exchange (i.e. sell) device-agnostic OpenCL programs without giving away the source code. I need to learn more about it.
  • The OpenCL 1.2 spec was released nearly four years ago, but Nvidia still doesn't support it. And as I've learned from experience, they don't make it easy to get their OpenCL 1.1 library.

This last point gives an idea of how much (or more precisely, how little) Nvidia cares for OpenCL. Which makes it all the stranger that Neil Trevett, a Senior Vice President at Nvidia, is leading the OpenCL Working Group. I'm sure he's a fine person and a devoted technologist, but if his company has given up on OpenCL, why is he in charge? It's like putting Bill Gates at the head of the Free Software Foundation.
