Allocating Local Memory

>> Friday, October 3, 2014

Recently, someone asked me about the difference between the two methods of allocating local memory. That is, why would you use one method over the other? This isn't an easy question, and before I answer it here, I'd like to explain what the two methods are.

The first allocation method is performed inside the kernel. If a variable is declared in the kernel function, its declaration must identify which memory region contains its data. If the declaration is preceded by __local, the variable will be allocated from local memory. For example, the following declaration sets aside 1024 bytes of local memory to be accessed as an array named foo.

__local float foo[256];
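To show how a kernel might use a fixed-size declaration like this one, here's a minimal sketch. The kernel name reverse_group, the FOO_LEN macro, and the fits_static_array helper are hypothetical names chosen for illustration. The kernel source sits in a C string, which is the form clCreateProgramWithSource accepts, so the block compiles as ordinary C without an OpenCL SDK. It assumes a 4-byte float and a one-dimensional work-group of at most 256 work-items.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

#define FOO_LEN 256   /* hypothetical macro; must match the length in the kernel source */

/* The arithmetic from the text: 256 floats * 4 bytes = 1024 bytes of local memory. */
static_assert(FOO_LEN * sizeof(float) == 1024, "this sketch expects a 4-byte float");

/* Kernel source kept as a host-side string.
   reverse_group is an illustrative kernel, not one taken from this post. */
const char *static_local_src =
    "__kernel void reverse_group(__global float* data) {\n"
    "    __local float foo[256];            /* size fixed at compile time */\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t gid = get_global_id(0);\n"
    "    foo[lid] = data[gid];              /* stage this group's values */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);      /* wait for the whole group */\n"
    "    data[gid] = foo[get_local_size(0) - 1 - lid];\n"
    "}\n";

/* Host-side guard: a work-group larger than foo would index past its end. */
int fits_static_array(size_t work_group_size) {
    return work_group_size > 0 && work_group_size <= FOO_LEN;
}
```

Because the array length is baked into the kernel source, the host can only check that the work-group size fits; it can't change how much local memory is reserved without rebuilding the program.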

The second method is performed in host code. clSetKernelArg assigns data to be passed into the kernel function as an argument. Its third parameter gives the size of the argument in bytes, and its last parameter points to the argument's data. If the last parameter points to a memory object (cl_mem), the kernel will access the data from global/constant memory. If it points to a primitive like an int, the kernel will access the data from private memory. But if the last parameter is NULL, no data will be passed to the kernel. In this case, the purpose of clSetKernelArg is to allocate local memory for the kernel's use, and the size parameter sets how many bytes are reserved.

An example will help make this clear. Consider the following calls to clSetKernelArg:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &example_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &example_int);
clSetKernelArg(kernel, 2, 256 * sizeof(float), NULL);

These lines define the first three arguments of the kernel. If the kernel function's name is foo, the following code shows how foo's arguments might be accessed as function parameters:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var)

In the third call to clSetKernelArg, the data pointer is set to NULL. This tells the runtime that the kernel's third argument must be allocated from local memory, and the size parameter (256 * sizeof(float)) reserves 1024 bytes for it. This memory is uninitialized; the kernel reads and writes it as it performs its computation.

So, the two methods of allocating local memory are declaring a local variable in a kernel and calling clSetKernelArg with a NULL data pointer. When do you use one over the other? The answer is straightforward.

If the size of the local memory is constant, use the first method because it's simpler. But if the kernel must be executed multiple times with different sizes of local memory, use the second method. This is because the second method makes it possible to control the local memory size with a variable.
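Here's a rough sketch of what the second method might look like when the same kernel runs with several work-group sizes. The names local_bytes_for, run_with_sizes, and the USE_OPENCL build flag are hypothetical. The sketch assumes one float of local memory per work-item, that kernel arguments 0 and 1 have already been set, that argument 2 is the __local float* parameter, and that global_size is a multiple of every work-group size in the list.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory needed for one float per work-item. */
size_t local_bytes_for(size_t work_group_size) {
    assert(work_group_size > 0);
    return work_group_size * sizeof(float);
}

#ifdef USE_OPENCL   /* hypothetical flag: define it when building against an OpenCL SDK */
#include <CL/cl.h>  /* on OS X the header is <OpenCL/opencl.h> */

/* Runs the kernel once per entry in sizes, resizing its local memory each time. */
cl_int run_with_sizes(cl_command_queue queue, cl_kernel kernel,
                      size_t global_size, const size_t *sizes, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        /* NULL data pointer: reserve local memory, pass no data. */
        cl_int err = clSetKernelArg(kernel, 2, local_bytes_for(sizes[i]), NULL);
        if (err != CL_SUCCESS) return err;

        /* One-dimensional launch with sizes[i] work-items per work-group. */
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size, &sizes[i], 0, NULL, NULL);
        if (err != CL_SUCCESS) return err;
    }
    return clFinish(queue);  /* wait for every launch to complete */
}
#endif
```

Calling run_with_sizes with the sizes 64, 128, and 256 would run the kernel three times, reserving 256, 512, and 1024 bytes of local memory in turn (assuming a 4-byte float). No rebuild of the program is needed between runs.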


AMD and OpenCL 2.0

>> Wednesday, October 1, 2014

The good news is that AMD has released new drivers that enable OpenCL 2.0 support on existing graphics cards. This makes AMD the first company to provide real-world support for OpenCL 2.0.

The bad news is that all my attempts to install the Linux driver have failed. My target system runs 64-bit CentOS 6.5, but when I ran Catalyst, it gave me a 32-bit RPM. This made me nervous, but I uninstalled the existing driver, installed the 32-bit software dependencies, and finished the driver installation. Now when I start the computer, it hangs during the bootloading process. Grr...

I've been a fan of AMD ever since they released the Athlon 64 CPU. But dealing with fglrx is murder. I have never installed an AMD graphics driver without repeated trial and error.


State of 3-D Graphical Programming

>> Monday, September 22, 2014

Tom Dalling wrote an insightful post about the current set of 3-D graphical programming APIs: OpenGL, Mantle, Metal, and Direct3D. I can still remember how excited I was when AMD announced Mantle. But it's been a year since the announcement and AMD still hasn't released the API to the public.

The post also mentions the Next Generation OpenGL Initiative. According to the SIGGRAPH 2014 presentation (PDF), this entails a "ground up redesign" of the specification. The OpenGL API definitely needs an overhaul, but if they rewrite OpenGL from scratch, what will happen to OpenCL-OpenGL interoperability? We'll see.


State of OpenCL

>> Sunday, August 24, 2014

EDIT: AMD released its OpenCL 2.0 drivers for existing graphics cards on 9/30/2014, thereby becoming the first company to provide real-world support for OpenCL 2.0.

The SIGGRAPH conference was held in Vancouver on August 13. OpenCL was one of the topics discussed and the slides from the Khronos Group can be downloaded here.

Looking through the OpenCL BOF slides, four points caught my eye:

  • AMD was the first company to support OpenCL 1.2, but Intel will be the first to support OpenCL 2.0. Their new Broadwell GPU architecture complies with the OpenCL 2.0 spec and AnandTech has a great article on it here.
  • The presentation states that "The Future is Mobile" and I agree. But iOS and Android are nowhere near supporting the execution of OpenCL kernels. This is a shame, as many mobile GPU vendors are working hard to provide OpenCL SDKs and drivers.
  • Version 2.0 of SPIR (Standard Portable Intermediate Representation) has been released. This format makes it possible to exchange (i.e. sell) device-agnostic OpenCL programs without giving away the source code. I need to learn more about it.
  • The OpenCL 1.2 spec was released nearly three years ago, but Nvidia still doesn't support it. And as I've learned from experience, they don't make it easy to get their OpenCL 1.1 library.
This last point gives an idea of how much (or more precisely, how little) Nvidia cares for OpenCL. Which makes it all the stranger that Neil Trevett, a Senior Vice President at Nvidia, is leading the OpenCL Working Group. I'm sure he's a fine person and a devoted technologist, but if his company has given up on OpenCL, why is he in charge? It's like putting Bill Gates at the head of the Free Software Foundation.


Intel, FPGAs, and DLC

>> Sunday, June 29, 2014

I designed FPGA circuits early in my career and I was surprised by how difficult it can be. The logic elements in an FPGA operate independently, so designers have to keep track of their input/output signals to make sure they're all in step. If Signal A reaches a gate before B and C are valid, the element may produce errors. Timing errors are hard to detect and very difficult to debug. Tools like gdb can't help, so designers use virtual logic analyzers like those provided by ModelSim.

OpenCL can reduce the risk and difficulty of FPGA design, but given the small developer base, Intel might not allow developers to access the Xeon's integrated FPGAs. Instead, Intel could assemble a catalog of prebuilt, fully-tested FPGA designs for special tasks. If Intel's C/C++ compiler (icc) notices that an application could be accelerated with one of these designs, it could alert the developer with a friendly dialog box:

Howdy, developer! I see you're sorting database records and performing statistical analysis. If you install Intel's RapidCore on your Xeon, this application will execute 7,364 times faster.

Buy RapidCore (Y/N)?

After the purchase is completed, the compiler downloads the core from the Internet and automatically installs it on the Xeon's embedded FPGA. This way, the developer doesn't need to understand OpenCL, logic design, or timing analysis.

The principle is similar to the downloadable content (DLC) provided by game publishers. After customers buy a game, they can pay extra to make the game easier or more interesting. With Xeon DLC, developers buy the compiler, and then they can improve performance with special-purpose FPGA designs. Similar improvements could be made available to end-users.


Intel, FPGAs, and OpenCL

>> Monday, June 23, 2014

Intel has announced that upcoming releases of the Xeon processor will have integrated Field Programmable Gate Arrays (FPGAs). At first, this amazed me. The primary languages for FPGA design are Verilog and VHDL, and both are beyond the experience of most Intel programmers. In fact, the process of designing an FPGA circuit with Verilog/VHDL is completely different from that of building a C/C++ application.

Then a thought struck me. The two main FPGA vendors, Xilinx and Altera, are developing toolsets for creating FPGA designs with OpenCL. I wouldn't be surprised if the Xeon's FPGA is intended to be accessed through OpenCL, not Verilog or VHDL.

The announcement doesn't say whose FPGAs will be integrated in the Xeon, but it's noteworthy that Intel is manufacturing Altera's latest generation of FPGAs, which includes the Arria 10 and the Stratix 10. These are the first FPGAs to provide dedicated logic for floating-point DSP. Further, Altera is working hard on its OpenCL support, and I can state from experience that their SDK is functional and polished.

So here's my prediction: Intel's new Xeons will have integrated FPGAs from Altera. Developers will be able to access the FPGAs' dedicated DSP blocks using OpenCL.

This sounds fine, but I foresee three problems:

  1. No matter what language you use, compiling an FPGA design takes hours. Are developers willing to wait that long?
  2. Altera's OpenCL SDK is great, but it's not free. Also, it requires installation of Altera's Quartus II application.
  3. Despite my best efforts, the OpenCL developer community is pretty small. Integrating OpenCL-accessible FPGAs into high-end CPUs seems like a big risk.
Wait a minute. What if these Xeons are intended for Apple? Apple is a fervent believer in OpenCL and they probably know which floating-point routines need FPGA acceleration. Hmm.

ETA: I received a link to a post that accuses Intel of copying Microsoft's effort to use FPGAs to accelerate web searching. This may be the case, but I suspect Intel is trying to compete with Nvidia's high-speed number-crunching servers. We'll see...


Trial of the Century: Oracle v. Google

>> Friday, May 9, 2014

The US Court of Appeals for the Federal Circuit has ruled that the Java API is copyrightable, and that Google used it improperly in its Android devices. The terrible ramifications of this decision can't be overstated. From now on, programmers will have to ask who owns the language before they start coding. Also, Oracle can (and probably will) go after everyone who develops applications with Java, which is one of the most popular programming languages in the world. And remember: while patents last 14-20 years, copyrights last for the author's lifetime plus seventy years.

Apple owns the trademark for OpenCL, but who owns the copyright for the API? What would happen if the owner(s) decided to sue for infringement?

If someone developed a programming language based on the English language, could they sue everyone who writes English? How would you legally distinguish a computer program from regular text?

