Goodbye, Dr. Dobb's

>> Tuesday, December 16, 2014

In college, my roommate subscribed to the print version of Dr. Dobb's Journal. I didn't understand much of it at the time, but I was impressed by how intelligent and well-written it was. The print version ran its course, but the online site continued to provide insightful content. I was flattered to be invited to write an article on OpenCL, and I enjoyed corresponding with Andrew Binstock, the editor-in-chief.

But today, Mr. Binstock announced that the online site will cease publication of new articles. Goodbye, Dr. Dobb's. Thank you for so many fine articles.


FFT version 1.1

>> Wednesday, November 19, 2014

The good folks at Manning have uploaded the new FFT code to the book's main site. I've updated the Linux and Windows archives, but I didn't change the Mac OS archive because my MacBook White is long dead.

As far as I can tell, the new code takes care of the race condition. If anyone has any concerns, please let me know.


FFT Update

>> Saturday, November 15, 2014

Because of the comments I received, I decided to test my FFT on new systems with new hardware and new drivers. My FFT passed every test, so I wrote a self-satisfied post stating that the commenter's problem was caused by using work-groups whose sizes weren't a power of two.

Then it dawned on me. In the fft_init kernel, work items read data from bit-reversed addresses and write the processed data to unreversed addresses in the same buffer. This makes it possible for one work item to read data that has already been processed by another. This is the race condition to which the commenter was referring.

Thankfully, this problem is easy to fix. I'll add a second buffer to fft_init so that every work item reads from the first buffer and writes to the second. I'll get this coded tomorrow morning and I'll contact Manning to get it uploaded to their software site.

I'd like to thank the commenter for his/her assistance. I'd also like to point out that my bit-reversal algorithm, while idiosyncratic, is perfectly functional.


FFT Concern

>> Saturday, November 1, 2014

Over three and a half years ago, I completed the OpenCL FFT that I discussed in Chapter 14. I tested it with data sets of varying sizes on different graphics cards and operating systems. It ran successfully every time, but recent comments make it seem likely that there's a race condition that needs to be addressed.

The problem with debugging an FFT is that it requires long stretches of concentration, which usually involve me lying on the floor and squinting up at the ceiling for hours on end. Unfortunately, I'm busy at the moment and don't have the time. But because I'm so ashamed, I'm going to take the week of 11/10 off from work and do my best to resolve the problem.

It looks like the root cause is my bit-reversal routine, and I'll explain why this is particularly jarring. If you're familiar with FFT code, then you know that many routines perform bit-reversal with code like the following:

ans = x & 1;
while(--numBits) {
   x >>= 1;
   ans <<= 1;
   ans += x & 1;
}
Rather than operate on scalars, I devised a routine that bit-reverses all four elements of a uint4 vector at the same time. I thought it was clever, but if it causes a race condition, it has to go.

I apologize to everyone who was/is disappointed with the code. If you're still looking for a good OpenCL FFT, I recommend the clFFT project. This was once part of AMD's Accelerated Parallel Processing Math Libraries (APPML), but it looks like that's no longer supported.


Allocating Local Memory

>> Friday, October 3, 2014

Recently, someone asked me about the difference between the two methods of allocating local memory. That is, why would you use one method over the other? This isn't an easy question, and before I answer it here, I'd like to explain what the two methods are.

The first allocation method is performed inside the kernel. If a variable is declared in the kernel function, its declaration must identify which memory region contains its data. If the declaration is preceded by __local, the variable will be allocated from local memory. For example, the following declaration sets aside 1024 bytes of local memory to be accessed as an array named foo.

__local float foo[256];

The second method is performed in host code. clSetKernelArg assigns data to be passed into the kernel function as an argument. The last parameter of clSetKernelArg points to the argument's data. If the last argument points to a memory object (cl_mem), the kernel will access the data from global/constant memory. If the argument points to a primitive like an int, the kernel will access the data from private memory. But if the last argument is NULL, no data will be passed to the kernel. In this case, the purpose of clSetKernelArg is to allocate local memory for the kernel's use.

An example will help make this clear. Consider the following calls to clSetKernelArg:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &example_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &example_int);
clSetKernelArg(kernel, 2, 256 * sizeof(float), NULL);

These lines define the first three arguments of the kernel. If the kernel function's name is foo, the following code shows how foo's arguments might be accessed as function parameters:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var)

In the third call to clSetKernelArg, the data pointer is set to NULL. This tells OpenCL to allocate 256 * sizeof(float) bytes of local memory for the kernel's third argument, which must be declared with the __local qualifier. The memory is uninitialized; the kernel will read/write it as it performs its computation.

So, the two methods of allocating local memory are declaring a local variable in a kernel and calling clSetKernelArg with a NULL data pointer. When do you use one over the other? The answer is straightforward.

If the size of the local memory is constant, use the first method because it's simpler. But if the kernel must be executed multiple times with different sizes of local memory, use the second method. This is because the second method makes it possible to control the local memory size with a variable.


AMD and OpenCL 2.0

>> Wednesday, October 1, 2014

The good news is that AMD has released new drivers that enable OpenCL 2.0 support on existing graphics cards. This makes AMD the first company to provide real-world support for OpenCL 2.0.

The bad news is that all my attempts to install the Linux driver have failed. My target system runs 64-bit CentOS 6.5, but when I ran Catalyst, it gave me a 32-bit RPM. This made me nervous, but I uninstalled the existing driver, installed the 32-bit software dependencies, and finished the driver installation. Now when I start the computer, it hangs during the bootloading process. Grr...

I've been a fan of AMD ever since they released the Athlon 64 CPU. But dealing with fglrx is murder. I have never installed an AMD graphics driver without repeated trial and error.


State of 3-D Graphical Programming

>> Monday, September 22, 2014

Tom Dalling wrote an insightful post about the current set of 3-D graphical programming APIs: OpenGL, Mantle, Metal, and Direct3D. I can still remember how excited I was when AMD announced Mantle. But it's been a year since the announcement and AMD still hasn't released the API to the public.

The post also mentions the Next Generation OpenGL Initiative. According to the SIGGRAPH 2014 presentation (PDF), this entails a "ground up redesign" of the specification. The OpenGL API definitely needs an overhaul, but if they rewrite OpenGL from scratch, what will happen to OpenCL-OpenGL interoperability? We'll see.


State of OpenCL

>> Sunday, August 24, 2014

EDIT: AMD released its OpenCL 2.0 drivers for existing graphics cards on 9/30/2014, thereby becoming the first company to provide real-world support for OpenCL 2.0.

The SIGGRAPH conference was held in Vancouver on August 13. OpenCL was one of the topics discussed and the slides from the Khronos Group can be downloaded here.

Looking through the OpenCL BOF slides, four points caught my eye:

  • AMD was the first company to support OpenCL 1.2, but Intel will be the first to support OpenCL 2.0. The GPU in their new Broadwell architecture complies with the OpenCL 2.0 spec, and AnandTech has a great article on it here.
  • The presentation states that "The Future is Mobile" and I agree. But iOS and Android are nowhere near supporting the execution of OpenCL kernels. This is a shame, as many mobile GPU vendors are working hard to provide OpenCL SDKs and drivers.
  • Version 2.0 of SPIR (Standard Portable Intermediate Representation) has been released. This format makes it possible to exchange (i.e. sell) device-agnostic OpenCL programs without giving away the source code. I need to learn more about it.
  • The OpenCL 1.2 spec was released nearly four years ago, but Nvidia still doesn't support it. And as I've learned from experience, they don't make it easy to get their OpenCL 1.1 library.
This last point gives an idea of how much (or more precisely, how little) Nvidia cares about OpenCL, which makes it all the stranger that Neil Trevett, a Senior Vice President at Nvidia, is leading the OpenCL Working Group. I'm sure he's a fine person and a devoted technologist, but if his company has given up on OpenCL, why is he in charge? It's like putting Bill Gates at the head of the Free Software Foundation.


Intel, FPGAs, and DLC

>> Sunday, June 29, 2014

I designed FPGA circuits early in my career, and I was surprised by how difficult the work can be. The logic elements in an FPGA operate independently, so designers have to keep track of their input/output signals to make sure they're all in step. If Signal A reaches a gate before Signals B and C are valid, the element may produce errors. Timing errors are hard to detect and very difficult to debug. Tools like gdb can't help, so designers use virtual logic analyzers like those provided by ModelSim.

OpenCL can reduce the risk and difficulty of FPGA design, but given the small developer base, Intel might not allow developers to access the Xeon's integrated FPGAs. Instead, Intel could assemble a catalog of prebuilt, fully-tested FPGA designs for special tasks. If Intel's C/C++ compiler (icc) notices that an application could be accelerated with one of these designs, it could alert the developer with a friendly dialog box:

Howdy, developer! I see you're sorting database records and performing statistical analysis. If you install Intel's RapidCore on your Xeon, this application will execute 7,364 times faster.

Buy RapidCore (Y/N)?

After the purchase is completed, the compiler downloads the core from the Internet and automatically installs it on the Xeon's embedded FPGA. This way, the developer doesn't need to understand OpenCL, logic design, or timing analysis.

The principle is similar to the downloadable content (DLC) provided by game publishers. After customers buy a game, they can pay extra to make the game easier or more interesting. With Xeon DLC, developers buy the compiler, and then they can improve performance with special-purpose FPGA designs. Similar improvements could be made available to end-users.


Intel, FPGAs, and OpenCL

>> Monday, June 23, 2014

Intel has announced that upcoming releases of the Xeon processor will have integrated Field Programmable Gate Arrays (FPGAs). At first, this amazed me. The primary languages for FPGA design are Verilog and VHDL, and both are beyond the experience of most Intel programmers. In fact, the process of designing an FPGA circuit with Verilog/VHDL is completely different from that of building a C/C++ application.

Then a thought struck me. The two main FPGA vendors, Xilinx and Altera, are developing toolsets for creating FPGA designs with OpenCL. I wouldn't be surprised if the Xeon's FPGA is intended to be accessed through OpenCL, not Verilog or VHDL.

The announcement doesn't say whose FPGAs will be integrated in the Xeon, but it's noteworthy that Intel is manufacturing Altera's latest generation of FPGAs, which includes the Arria 10 and the Stratix 10. These are the first FPGAs to provide dedicated logic for floating-point DSP. Further, Altera is working hard on its OpenCL support, and I can state from experience that their SDK is functional and polished.

So here's my prediction: Intel's new Xeons will have integrated FPGAs from Altera. Developers will be able to access the FPGAs' dedicated DSP blocks using OpenCL.

This sounds fine, but I foresee three problems:

  1. No matter what language you use, compiling an FPGA design takes hours. Are developers willing to wait that long?
  2. Altera's OpenCL SDK is great, but it's not free. Also, it requires installation of Altera's Quartus II application.
  3. Despite my best efforts, the OpenCL developer community is pretty small. Integrating OpenCL-accessible FPGAs into high-end CPUs seems like a big risk.
Wait a minute. What if these Xeons are intended for Apple? Apple is a fervent believer in OpenCL and they probably know which floating-point routines need FPGA acceleration. Hmm.

ETA: I received a link to a post that accuses Intel of copying Microsoft's effort to use FPGAs to accelerate web searching. This may be the case, but I suspect Intel is trying to compete with Nvidia's high-speed number-crunching servers. We'll see...


Trial of the Century: Oracle v. Google

>> Friday, May 9, 2014

The US Court of Appeals has ruled that the Java API is copyrightable, and that Google used it improperly in its Android devices. The terrible ramifications of this decision can't be overstated. From now on, programmers will have to ask who owns the language before they start coding. Also, Oracle can (and probably will) go after everyone who develops applications with Java, which is one of the most popular programming languages in the world. And remember: while patents last 14-20 years, copyrights can last seventy years beyond the author's death.

Apple owns the trademark for OpenCL, but who owns the copyright for the API? What would happen if the owner(s) decided to sue for infringement?

If someone developed a programming language based on the English language, could they sue everyone who writes English? How would you legally distinguish a computer program from regular text?


SYCL and Other Announcements

>> Sunday, March 23, 2014

The Game Developers Conference took place last week and there were many announcements related to OpenCL. For one thing, the WebCL 1.0 standard has been released. The page says that "Security is top priority" but I haven't seen any tools or programming constructs that prevent kernels from locking up the GPU. And it doesn't look like WebGL-WebCL interoperability is a significant concern. Ah well.

The GDC announcement that I found especially interesting involves a new programming layer called SYCL. According to its Khronos page, the goal is to make OpenCL and SPIR accessible through C++. I thought Benedict Gaster did a fine job with his cl.hpp wrapper, but I look forward to using the new SYCL API.

The SYCL effort is led by Codeplay, whose CEO, Andrew Richards, discussed OpenCL in an interview I mentioned in a previous post. He appreciates the importance of OpenCL-OpenGL interoperability, and if SYCL can simplify the coding process, that will be a wonderful thing. The FAQ for SYCL is here, but it doesn't answer the burning question: What does SYCL stand for? If you're going to make up a *CL acronym, why use four letters instead of three?


Chromium and WebCL

>> Sunday, March 16, 2014

Some time ago, I was very interested in using Google's Native Client to enable OpenCL processing in Chrome. As it turns out, the Native Client doesn't allow that sort of thing, but AMD hasn't given up. They've added WebCL to Google's Chromium project, which is the open-source version of Chrome.

AMD has made the source code for Chromium-WebCL available here. I've downloaded it, and I'll give it my full attention when time permits.


Book News

>> Wednesday, February 12, 2014

A coworker pointed out that two of my book's examples don't work properly: device_ext_test and buffer_check. After testing both on my Linux/AMD system, I was forced to agree. So I fixed the code and sent the updated zip files to the publisher.

The device_ext_test application fails because of clGetDeviceIDs. This doesn't seem to work properly when you want to determine how many devices are present. To be specific, the following line of code causes the error:

clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, NULL, &num_devices);

When this executes, it returns CL_INVALID_VALUE. I have no idea why, but when I removed the call to clGetDeviceIDs, the error disappeared. (For what it's worth, the usual count-only query passes 0 for num_entries and NULL for devices.)

My second error is more interesting. buffer_check fails because of the call to clCreateSubBuffer. My original code set the sub-buffer's origin to 120. This isn't an aligned memory address, and when I wrote the application, alignment wasn't a concern. But now my call to clCreateSubBuffer produces CL_MISALIGNED_SUB_BUFFER_OFFSET, an error introduced along with sub-buffers in OpenCL 1.1. To clear this, I set the origin to 0x100. Now all is well.

In other news, Manning has made my book the Deal of the Day for February 13. Woo-hoo! That Oculus Rift book looks pretty incredible as well...


RenderScript and OpenCL

>> Monday, January 20, 2014

I decided it was time to learn RenderScript, so I spent the weekend reading through documentation and testing code on my Galaxy S4. For those unfamiliar, a RenderScript project requires at least three files:

  • a native file (*.rs) containing high-performance C code
  • a Java file automatically generated from the *.rs file
  • a Java file that calls the methods in the generated file
This is complicated, but RenderScript is much easier to deal with than the Java Native Interface.

Code in a *.rs file can operate on scalar and vector types and can call functions like dot, sin, and ilogb. The functions are declared in RenderScript headers (*.rsh) and one of the most prominent headers is rs_cl.rsh.

Looking through rs_cl.rsh, I was surprised by how similar its functions are to OpenCL's kernel functions. That's when it dawned on me—the 'cl' in rs_cl.rsh refers to OpenCL. So RenderScript isn't really competing with OpenCL. RenderScript is a Java layer on top of OpenCL's kernel API.

As I investigated further, the parallels between the two languages became more apparent. RenderScript's Allocations serve the same role as OpenCL's memory objects. In OpenCL, work-items have identifiers with one, two, or three dimensions. In RenderScript, kernels access similar dimensions as function parameters.

Of course, RenderScript differs from OpenCL in many respects. RenderScript doesn't let you choose the target device and each kernel can only access one or two Allocations (memory objects). Also, developers can't specify the usage of global, local, or private memory.

