Belated Responses

>> Tuesday, December 20, 2011

I've received a handful of e-mails over the last month and it's about time I responded:

  1. Global Barrier - One e-mail discusses the shortcomings of my beloved global barrier. My kernel used to execute properly every time I tested it, but with AMD's latest SDK, it fails on my Linux/AMD platform. And when I say it fails, I mean it locks the GPU and freezes my computer. I've added a warning to the blog post and I've removed the barrier from the book's example code.
  2. Altera - I've received two e-mails concerning Altera's initiative to convert OpenCL code into FPGA designs. OpenCL is easier for traditional programmers to work with than VHDL or Verilog, and I hope this broadens the usage of FPGAs. But it would be even better if Xilinx and Altera developed a customizable soft processor (similar to Xilinx's MicroBlaze) capable of meeting OpenCL's compliance criteria. Then developers could take advantage of FPGAs without having to deal with the FPGA toolchain.
  3. ARM - In my opinion, the most exciting news comes from ARM, whose latest embedded GPU is OpenCL-compliant. CUDA dominates GPGPU development on servers and desktop systems, but I predict OpenCL will take the lead in embedded and mobile devices. And for smartphones, OpenCL isn't just nice to have--it's ability to accelerate graphics will make it essential for development of immersive 3-D applications.

Happy holidays!



>> Monday, December 5, 2011

Google Knol is closing its doors so I've decided to host my knols elsewhere. Here's a full list of my online articles related to OpenCL, OpenGL, and C++:

Hope this helps.


GPUs and Random Number Generation

>> Sunday, November 20, 2011

I had a great time at SC11 and I learned a lot about the current state of high-performance computing. I enjoyed meeting people who work in the industry, which included academics, developers, salesmen, and a few readers of this blog (Hello John and Jamie).

One mysterious man came to my booth and asked what I knew about generating random numbers with OpenCL. I told him about implementations of the Mersenne Twister, but he wasn't impressed. He told me about a new technical paper that explains how to generate random numbers on GPUs by combining integer counters and block ciphers. In reverential tones, he said that counter-based random number generators (CBRNGs) produce numbers with greater statistical randomness than the MT and with much greater speed.

I promised to look into the matter, and I have. The paper can be downloaded here, and though I did my best, I don't have the cryptography background needed to fully grasp the theory. However, the paper was a finalist in SC11's Best Paper competition, so it's been subjected to scrutiny by folks a lot smarter than me.

In addition to writing the paper, Dr. Salmon has made his code freely available, including an implementation in OpenCL. The full code release is here and the documentation is here. I don't need to generate random numbers at the moment, but if ever I do, this is the first place I'll go.


OpenCL 1.2 Specification Released

>> Tuesday, November 15, 2011

The Khronos Group has released the OpenCL 1.2 specification (main site here). As stated on the site, the new functionalities include "seamless sharing of media and surfaces with DirectX® 9 and 11, enhanced image support, custom devices and kernels, device partitioning and separate compilation and linking of objects."

Upon hearing the news, I felt like something of an imbecile. After all, I'm here at the Supercomputing conference, purportedly an OpenCL expert, and I had no idea that a new spec was available. Ah well.


Supercomputing 2011

>> Friday, November 11, 2011

The time has come. I'm just about to start driving up to Seattle for the Supercomputing 2011 conference. In addition to the laptop, monitor, and cables, I have posters, poster stands, article printouts, business cards, markers, duct tape, and at least a million copies of OpenCL in Action. I'll be selling the books at a discount in Booth 4613 on Level 6 of the Washington State Convention Center.

I'll also be demonstrating an application that combines OpenCL and OpenGL to portray rigid bodies accelerating, decelerating, and colliding in space. If you have any questions about OpenCL-OpenGL interoperability, feel free to drop by.

As if that's not enough, I'll be giving a presentation on simulating physics with OpenCL and OpenGL. Specifically, my presentation will be 2:30-3:00 on Thursday in WSCC 613/614.



>> Thursday, November 3, 2011

To promote the book and demonstrate dynamic simulation with OpenCL and OpenGL, I've reserved a tiny booth at the Supercomputing 2011 conference in Seattle. Since making the reservation, I've been deluged with e-mails from contractors offering services such as advertising, plumbing, shipping, and lead retrieval. I've ignored them thus far, assuming that the high cost of reserving the booth would cover the services I need.

But as a saleswoman explained to me, this assumption was mistaken. You see, SC11 does not provide any electricity to the booths. So unless you're demonstrating an innovation in abacus design, you need to pay a contractor to rig an outlet for you. As you might imagine, this is not cheap. Now that I've paid for electricity, I'm concerned about the necessities I haven't paid for. Will I have light? Oxygen? Gravity?

In other news, the military announced the results of its solicitation calling for finite-difference time-domain (FDTD) simulation software. They want code capable of running on CPUs and GPUs, and because I'd worked with FDTD in grad school, I submitted a proposal explaining how OpenCL could do the job. I included plenty of kernel code for FDTD, but I didn't win and I wasn't expecting to.

However, I was very interested to find out which small business won the contract. After all, any company working on OpenCL-based electromagnetic simulation is a company I want to know about. But the military didn't give the award to anybody. Bummer.


Shuffling and Sorting, Part 2

>> Sunday, October 23, 2011

My first post on this blog dealt with using OpenCL vector functions to sort data. Thanks to the as_type function, I've improved my original code. If input is a vector containing four floats, the following code will sort its components in place:

uint4 mask1 = (uint4)(1, 0, 3, 2);
uint4 mask2 = (uint4)(2, 3, 0, 1);
uint4 mask3 = (uint4)(3, 2, 1, 0);

int4 add1 = (int4)(1, 1, 3, 3);
int4 add2 = (int4)(2, 3, 2, 3);
int4 add3 = (int4)(1, 2, 2, 3);

int4 comp = input < shuffle(input, mask1);
input = shuffle(input, as_uint4(comp + add1));
comp = input < shuffle(input, mask2);
input = shuffle(input, as_uint4(comp * 2 + add2));
comp = input < shuffle(input, mask3);
input = shuffle(input, as_uint4(comp + add3));
This may look like a lot of work to sort four values, but there are no if statements in this code. Therefore, unless the shuffle function is implemented with branches, there are no delays associated with branch misses.

The sorting direction can be controlled with the bitwise exclusive-OR. For example, if dir is set to -1, the following code will sort the elements in descending order:
int4 comp = input < shuffle(input, mask1) ^ dir;
input = shuffle(input, as_uint4(comp + add1));
comp = input < shuffle(input, mask2) ^ dir;
input = shuffle(input, as_uint4(comp * 2 + add2));
comp = input < shuffle(input, mask3) ^ dir;
input = shuffle(input, as_uint4(comp + add3));
If dir is set to 0, the code will sort the input elements in ascending order.


Variable Casting in OpenCL with as_type

One of my favorite OpenCL tricks involves using the result of a vector comparison as the mask argument of a shuffle/shuffle2 operation. This makes it easy to sort a vector's components in place.

The problem is that vector comparisons produce signed integer vectors and the shuffle functions require their mask vectors to contain unsigned values. OpenCL doesn't tolerate regular C/C++ casting, so you can't use anything like:

shuffle(input, (uint4)mask);
Earlier, I used the abs function to convert signed vectors to unsigned vectors. But OpenCL makes it possible to cast variables properly using the as_type function. Here, type represents the desired data type. An example will show how this works:
shuffle(input, as_uint4(mask));
It works just as simply for scalars as it does for vectors, and it can be used for floating-point as well as signed conversion. For example, to convert a float called x to an int, you'd call as_int(x).


Collision Detection with OpenCL

>> Saturday, October 22, 2011

At long last, I've added collision detection to my OpenCL-OpenGL application. The application executes three OpenCL kernels at regular time intervals. The first generates n choose 2 active work-items, where n is the number of figures in the model. Each work-item determines whether a given pair of objects has collided, and if so, it alters their velocities according to two equations:

  • v1_new = (v1*(m1-m2) + 2*m2*v2)/(m1+m2)
  • v2_new = (v2*(m2-m1) + 2*m1*v1)/(m1+m2)
The second kernel generates n active work-items, one for each figure in the model. This kernel updates the figure's velocity and displacement with the following equations:
  • velocity += acceleration * delta_t
  • displacement = velocity * delta_t
The third kernel generates one work-item for every vertex in the model and updates the vertex's position with the displacement computed earlier. Then the vertices are ready to be rendered.


OpenCL Kernels and Vertex Shaders

>> Monday, October 10, 2011

I've coded a few physics-based applications with OpenGL, and the overall operation is simple: the host computes a figure's new position based on its velocity and acceleration, and sends the position delta to the vertex shader as a uniform. The vertex shader updates each vertex position with the delta and the figure moves over time.

But now I'm implementing the physics processing with OpenCL. At the moment, my goal is to execute the following loop:

for each figure in the in model
   if figure collided with another
      change velocity and acceleration of both figures
   end if
end for

I'm using the GJK method to detect collisions, and it's not easy. But my main concern is this: if the OpenCL kernel computes velocity and acceleration without updating the host, the host can't set the uniform properly for the vertex shader. However, the kernel can modify the VBO data directly, thereby making the uniform unnecessary.

This raises another concern. In my earlier code, the vertex shader applied the physics update after performing the modelview-perspective transformation. But with this new method, the shader will receive VBOs that have already been updated based on velocity and acceleration. This out-of-order transformation may cause an error.

I'm starting to think that the OpenCL kernel should take over the vertex shader's processing. I wonder what effect this will have on performance.


A Sad Day in Mudville

>> Wednesday, October 5, 2011

Goodbye, Mr. Jobs, and thank you for your insight, leadership, vision, and high technological standards. R.I.P.


Nvidia and OpenCL 1.1

Until recently, I'd thought that Nvidia didn't support OpenCL 1.1. But it's only their development drivers that don't support 1.1. Their regular drivers do. Keep a close eye on the versions. On Windows and Linux, the development drivers have major version 270 while the regular drivers have major version 280.

I don't know what capabilities the development drivers provide that the regular drivers don't, but after installing the regular drivers, I've successfully tested all my code on an Nvidia GPU. Life is good.


Ray-Triangle Intersection with OpenCL and OpenGL

>> Saturday, October 1, 2011

I agree with Tom Olson that the killer app for OpenCL is graphics acceleration, particularly involving physics and computational geometry. When programmers see how much they can accomplish on the GPU with OpenCL, I think they'll put aside their CPU-centric code.

To demonstrate how this works, I've coded an application that uses OpenCL to compute ray-triangle intersection. More specifically, I've implemented the Möller-Trumbore method using OpenCL-OpenGL interoperability. 3-D applications need to compute ray-triangle intersection quickly because it identifies which object in the model the user clicked.

I've written a knol on the subject here, and it discusses the theory of the Möller-Trumbore method and how to implement it using both C and OpenCL. I've uploaded an example application here. The application reads mesh data from a COLLADA file, renders ten spheres with OpenGL, and then uses OpenCL to determine which sphere the user clicked on. So far, the application has passed every test I've given it.

I'm mostly happy, but there's one improvement I need to add. Currently, when the user clicks on the rendering, the application executes a kernel for every figure in the model. The kernel's work-items process the triangles of the figure in parallel, and this provides solid performance. But it would be better to have the kernel process every triangle in the figure at once. The code gets ugly because each figure's VBO needs to be made a kernel argument. But I think the performance improvement will be worth the hassle.


Work-group Sizes

>> Tuesday, September 27, 2011

I've been experimenting with OpenCL work-group sizes and I've learned something I should have known before. For a given kernel, work-groups are always the same size. To test this, I configured clEnqueueNDRangeKernel to execute a kernel with 991 work-items. I set the local_size argument to NULL so that OpenCL will determine its own work-group size, and it created 991 work-groups with 1 work-item each. I've tried this with other prime numbers and it does the same thing.

I'm surprised. How can 991 separate work-groups execute faster than, for example, four groups of 200 work-items each and one group of 191 work-items? As another example, when I specified 651 work-items, OpenCL creates three groups of 217 items each. But when I specify 653 work-items, OpenCL creates 653 groups containing one item each.

Now that I'm working with OpenCL-OpenGL interoperability, dealing with arbitrary data sizes is a frequent concern. My work-groups can't communicate with one another, so my host application receives one result from each group. But the host can't allocate memory for the result array if it doesn't know in advance how many work-groups will execute the kernel.

The only solution I can see is to pad the number of work-items until the total number is a multiple of the maximum work-group size (or a similar number obtained through CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE). This means that, if the data size requires 991 work-items, I should pad the value to 1024, which is a multiple of 256, which is the maximum number of items per group on my GPU.

But now there's another problem: the padded work-items execute the kernel just like the regular ones, which means they may cause an error in the result. To get around this, I'm using the following code in my kernel:

if(get_global_id(0) < max_items) {

   ...perform regular kernel execution...

In this code, max_items is the unpadded size of the data. That is, if there are 991 data points to process, max_items is set to 991. This is not elegant code, but I can't see an alternative. I'll keep thinking.


Allocating Local Memory for an Arbritrary Number of Work-items

>> Sunday, September 25, 2011

In my opinion, the most important OpenCL function to know is clEnqueueNDRangeKernel, which enqueues a kernel-execution command. Once you understand how its parameters affect the kernel's execution, the rest of OpenCL is fairly straightforward.

The second most important function, in my opinion, is clSetKernelArg. This configures the kernel's arguments so it has data to process while it executes. Its signature is given as follows:

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                      size_t arg_size, const void *arg_value)

When I first encountered it, I thought this function was simple. But in addition to specifying the data for a kernel argument, this function also configures which address space (global/constant, local, or private) the data will be stored in. There are three main rules:
  1. If arg_value points to a memory object, the argument's data will be stored in global or constant memory, depending on the modifier used by the kernel function.
  2. If arg_value points to something that isn't a memory object, the argument's data will be stored in private memory.
  3. If arg_value is NULL, the argument's data won't be initialized, but arg_size bytes will be allocated from local memory to store the data.
In general, work-items can access local memory much faster than they can access global memory, so it's a good idea to have each work-item store intermediate results in the local address space. If I know there are 256 work-items in a work-group and each item needs 32 bytes to store intermediate data, I'll set arg_size to 256*32 = 8192 and arg_value to NULL.

But there's a problem. Unless the total number of work-items is a multiple of the maximum number of work-items per work-group, I have no way of knowing how many work-items will be assigned to each group. I can set the local_size parameter of clEnqueueNDRangeKernel, but sometimes I don't even know the total number of work-items I'll need. This happens frequently in OpenCL-OpenGL interoperability, in which I don't know in advance how many triangles each vertex buffer object will contain.

The solution isn't too hard: allocate all the local memory the device can provide. To find out how much local memory is available, call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. The following code shows how this works:

cl_ulong local_mem_size;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);

Remember that the data returned by the function is a cl_ulong, not a size_t. I made this mistake repeatedly, and though it worked fine on my Linux system, my Windows development system didn't like it at all.


Problems with OpenGL-OpenCL Interoperability

>> Saturday, September 17, 2011

I've spent the last week coding a routine for pick selection, which determines which 3-D object in the model corresponds to the user's mouse click. Geometrically, the process involves ray-triangle intersection, and I've been focusing on the Möller-Trumbore method. 3-D applications generally don't test every triangle in the model at once, but I figured this would be a perfect application to demonstrate OpenGL-OpenCL interoperability.

It's been hard going. First of all, I made the mistake of upgrading Ubuntu 10.10 to version 11.4. When I did this, every call to clCreateFromGLBuffer returned a CL_INVALID_GL_OBJECT error. Eventually, the error went away after I reinstalled the AMD driver, updated the AMD APP SDK, and disabled the Unity interface.

But when I implement the Möller-Trumbore method in an OpenCL kernel, I get the strangest error. Work Groups 0 and 18 always produce the same selection result, even though they're analyzing different sets of triangles. The same holds true for Groups 1 and 19 and Groups 2 and 20. The only reason I can think of is that my GPU has 18 compute units. But this shouldn't matter; work-groups are supposed to operate independently of one another, no matter which compute unit they occupy. Weird, weird, weird...


COLLADA, TinyXML, and OpenGL

>> Saturday, September 10, 2011

When I first started working with COLLADA, the Khronos Group provided example code that demonstrated how COLLADA meshes could be rendered with OpenGL. But the source code for COLLADA RT seems to have disappeared, so I've coded my own. I've uploaded the C++ source files to a Google-hosted project here, and I've included an example application that renders a sphere. I've also written a Google knol that explains how the code works.

Simply put, the ColladaInterface class reads a COLLADA file and returns a data structure for each mesh in the scene. The COLLADA format is based on XML, so the class calls functions from the TinyXML toolset. ColladaInterface doesn't access all the information in the COLLADA file, so there's plenty of room for improvement.

At some point, I'll add a function that writes data to COLLADA files. But for the moment, I hope this will help developers who are interested in using COLLADA for digital asset storage.


Devices, Platforms, and Contexts

>> Tuesday, August 30, 2011

I've been discussing the possibility of writing an article for Intel's Visual Computing site. The proposed article involves accelerating ray-triangle intersection on GPUs with OpenCL. This computation is necessary to convert 2-D mouse clicks into 3-D object selection.

The Intel reviewer seems happy with my outline, but wants me to emphasize balancing the computational load between CPUs and GPUs. This makes sense for Intel, which doesn't make OpenCL-compliant GPUs. But there's a problem. OpenCL devices can only share buffer data, such as vertex colors and coordinates, if they belong to the same context. For an overview of devices and contexts, I recommend this article.

The OpenCL spec doesn't mention this explicitly, but OpenCL devices can only be placed in the same context if they belong to the same platform. Platforms correspond to device vendors, so if you have an Intel CPU and an AMD GPU, the devices belong to different platforms. Therefore, they can't be placed in the same context, which means they can't share buffer data. Intel doesn't make GPUs and Nvidia doesn't make CPUs, so the only way I know to place a CPU and GPU in a context is if both devices were made by AMD. When I demonstrate DynLab at SC11, that's the setup I'm going to use.

For more information, I recommend this thread on the Khronos forum. If I was on the OpenCL Working Group, my first priority for OpenCL 2.0 would involve making sure developers can create contexts with devices in different platforms.


Online OpenGL Book

>> Thursday, August 18, 2011

While searching for information on OpenGL's uniform buffer objects, I stumbled across an online book called Learning Modern 3D Graphics Programming. This is the best book on OpenGL 3.x/4.x I've encountered. It's well-written and the material proceeds gracefully from the simple to the complex. There are plenty of code samples, and unlike the examples in the OpenGL SuperBible, the code relies on the OpenGL API instead of deriving its own set of helper functions.

The book is so polished that I wish there was a PayPal link or other way to reimburse the author. I'd also like to know if he intends to finish the final chapters. But there's no e-mail link, and except for this brief biography, I can't find any information on Jason McKesson.


Processing OpenGL Vertices with OpenCL

>> Tuesday, August 16, 2011

OpenGL provides two main methods of combining vertices into primitives (lines, triangles, etc.):

  • glDrawArrays/glMultiDrawArrays - For each primitive, every vertex is sent to the GPU. If Vertex A is connected to four primitives, Vertex A will be sent to the GPU four times.
  • glDrawElements/glMultiDrawElements - Vertices are sent in one large block and an index list is used to determine which vertices belong to which primitive. If Vertex A is connected to four primitives, Vertex A will only be sent to the GPU once.
The first method is simpler, but in general, the second method provides better performance. This is because CPU-GPU data transfer is a significant time sink and the second method usually results in less data to transfer. This is particularly true when primitives are combined to form complex 3-D shapes like spheres and tetrahedra.

But I want to process vertices on the GPU using OpenCL work-items. In this case, it may be easier to have each work-item access a separate group of vertices than to have multiple work-items access the same input data through an index list. As long as each work-item processes different vertices, I don't need to synchronize them with barriers. So I'm currently using glMultiDrawArrays instead of glMultiDrawElements.

However, some operations are beyond OpenCL, and dynamic memory allocation is an important one. For example, if a model contains one thousand 3-D objects that need be processed mathematically, OpenCL is great. But if the user deletes Objects 5 and 612, OpenCL can't deallocate the memory. Instead, the CPU needs to free the memory and re-transfer the remaining vertices to the GPU.

So the question is whether the speed-up provided by OpenCL kernels is sufficient to offset the delay imposed by using glMultiDrawArrays instead of glMultiDrawElements. In the end, the only way to know is to profile both methods and go with the one that provides better performance.


News from the Khronos Group

>> Tuesday, August 9, 2011

SIGGRAPH 2011 is taking place in Vancouver, and the Khronos Group has made two interesting announcements:

  • OpenGL 4.2 has been released along with GLSL 4.20.6. The new standard supports atomic operations and read/write operations on textures. It looks like the difference in capability between OpenCL kernels and OpenGL shaders is narrowing.
  • The Khronos Group is seeking corporate participation with regard to the WebCL and StreamInput standards. The more I read about StreamInput, the more interesting it seems. I'd love to sit in on the standard deliberation process, but I don't think I'd have anything productive to contribute.


Apple and OpenCL

>> Wednesday, August 3, 2011

UPDATE: Certain versions of Mac OS do support OpenCL 1.1, but there's no official announcement. To determine if your system supports 1.1, I recommend that you compile and run the code listed here.

The book is nearly finished, but one reviewer had a significant concern: a handful of my code examples didn't compile on his Mac OS system. But my code wasn't the problem; the problem is that Mac OS doesn't support OpenCL 1.1.

What's odd about this is that Apple was the driving force behind OpenCL 1.0. As I understand it, they told their device vendors to come up with a non-proprietary toolset for accessing high-performance devices like CPUs and GPUs. In response, Nvidia, AMD, and IBM put their heads together and wrote the first draft of the OpenCL spec. But now, over a year after the release of OpenCL 1.1, Mac OS doesn't support the new standard. And I haven't found any plan to support OpenCL 1.1 in future releases.

The same goes for OpenGL, which is a far greater concern than OpenCL. OpenGL 4.1 was released in July 2010, but according to this table, Apple's latest operating system supports nothing higher than version 3.2. In other words, Apple's hardware is capable of high-performance rendering but the OS won't allow modern rendering applications to execute.

Very frustrating. One of the selling points of OpenGL/OpenCL over Microsoft technologies is that they're cross-platform. But if Apple isn't willing to support modern versions of these tools, cross-platform means Windows and Linux only.

So here's my plan regarding the book's example code. I'll release a separate archive for Mac OS users, and the code won't be any different than it is for GNU users. But I'll remove every project that requires OpenCL 1.1 and OpenGL. This may upset Mac users, but there's nothing I can do.


Article on Dr. Dobb's Journal

>> Monday, August 1, 2011

I excerpted two sections of the book and combined them into an article called A Gentle Introduction to OpenCL. OpenCL is a tough subject, so I've relied on analogies to explain how host applications and kernels work. Neither analogy is perfect, but I hope they'll help newcomers to the topic.

Regarding kernel execution, there's one point I wanted to mention that didn't get into the article. As OpenCL developers, we can control the total number of work-items generated for a kernel. We can also control the total number of work-groups. But we can't control the maximum number of work-items in a work-group. This depends on the device's resources and the resources required by the kernel. The clGetKernelWorkGroupInfo function makes it possible to determine this in code.



>> Sunday, July 31, 2011

I'm currently coding DynLab, a scientific visualization tool that uses OpenCL for geometry and physics, OpenGL for rendering, and Qt as the overall framework. Some of the theory is difficult, particularly that involving boundary representation and object collision, but I'm enjoying the work. Thankfully, there are two technologies that make life easier:

  • ODE (Open Dynamics Engine) - An open-source toolset for computing rigid body dynamics
  • COLLADA (Collaborative Design Activity) - An open-source XML format for representing 3-D objects
I'd worked with COLLADA when I wrote the Cell processor book, but that was version 1.4. The latest version of COLLADA, 1.5, supports physics and boundary representation, so that's wonderful.

And so is ODE. Not only does it provide routines related to rigid-body dynamics, it also provides a test application that demonstrates how ODE and OpenGL work together. As I port aspects of ODE to OpenCL, I'm genuinely impressed with the author's code and documentation. I'm surprised I'd never heard of this before.

One thing bothers me, though. ODE hasn't had an official update since 2009. COLLADA hasn't had a new version since 2008. Has the Khronos Group decided that COLLADA 1.5 is perfect, or have they realized that commercial users are relying on Microsoft's technology (stable and integrated) instead of their own (high-performance but decentralized).


Work-items and Local Memory

>> Sunday, July 17, 2011

I've been experimenting with my FFT code, changing the size of the work-groups and the amount of local memory each group has to work with. Here are my observations:

  • Increasing the work-group size always improves performance.
  • Decreasing the amount of local memory available to each work-group usually improves performance.
The first point didn't surprise me but the second did. My initial kernel computes one FFT for each work-group and the FFT's size fully occupies local memory. That is, if a work-group has 32kB local memory and each complex point occupies 2*sizeof(float) = 8 bytes, then each work-group can perform a 4k-point FFT. Successive kernels merge the work-groups' results until the final FFT is computed.

I'd assumed that each work-group should perform as large an FFT as possible. This means less synchronization and fewer successive stages. But when I experimented, the opposite held true. As I reduced the amount of local memory allocated for each group, the FFT performance improved.

I have a theory. The more local memory each work-group has, the more each work-item needs to read from global memory. Ideally, work-items in a work-group will combine their read requests so that the group's memory operations are performed at once. But in my FFT, the repeated iterations may end up producing staggered global memory operations, which are very time consuming. Further experiments are needed.


A Recommendation for the Khronos Group

>> Wednesday, July 13, 2011

Dear Khronos Group,

I'm a devoted fan of your technologies, from COLLADA to OpenGL to OpenCL. I applaud your commitment to open-source software and the free support you provide through your forums. Like academics and enthusiasts throughout the world, I admire all you've accomplished.

But professionals (excluding micro-entrepreneurs like myself) don't admire you. They appreciate Microsoft and the Visual Studio framework for software development. With Visual Studio, developers can not only access all of Microsoft's technologies but also incorporate them into professional applications. Microsoft's range of technologies can't compete with yours, but they always win in the end -- not because of their technology focus, but because of their developer focus.

Here's a case in point. I downloaded a set of example OpenGL applications from your site. I'm impressed with how far OpenGL has come since the disastrous 3.0 release, but there's a problem: every example requires the OpenGL Framework (GLF), which requires GLUT. This is a disgrace.

GLUT was created as a teaching tool for OpenGL, and it serves this purpose well. But its features haven't progressed to a level anyone would consider professional. I've spent a lot of time evaluating different frameworks that support OpenGL rendering, but I'm not 100% satisfied with any of them. To access your technology, I need to make trade-offs in performance and capability that no Windows developer would ever worry about.

So here's my recommendation: fork Qt. Qt is a full-featured cross-platform software framework whose developer base stretches across the world. Enthusiasts appreciate its open-source licensing and extensibility; professionals appreciate its stability and support.

And you're in luck. Nokia, Qt's primary guardian, has joined forces with Microsoft in developing their mobile platforms. This means that from now on, Nokia's smartphones will be based on Windows, not Qt. Nokia's leadership has stated that Qt support is still a priority, but I'll bet Qt's lead developers would rather work with you than with their former sponsor.

Qt has been around for decades and it has an established developer base, so you wouldn't have to put any effort into marketing or bug fixing. All you'd have to do is integrate your technologies so developers can easily code full-featured applications with them. This wouldn't be hard. Qt already provides access to OpenGL rendering and there's even a preliminary Qt library that calls OpenCL functions. But neither of these features are perfectly accessible because no one is making integration a serious priority. If you took the reins, however, that would change.

You may think these concerns are beneath your notice, Khronos Group, but if you don't pay attention to your developers' needs, developers will stop paying attention to you.



>> Tuesday, July 12, 2011

I'd heard whispers about OpenCL running in a browser, but I figured it would take months if not years to see any real code. So I was pleasantly shocked when Nokia released a WebCL implementation that runs in Firefox. I haven't figured out how this will set the world on fire, but even if no one takes advantage of it, the technology is astounding.

I found a CNET article that discusses WebCL. It doesn't say anything particularly profound, but one conclusion is clear: OpenCL is gaining momentum.


AMD Fusion Reviews

Reviews of the desktop AMD Fusion A8-3850 have been trickling in, and they're positive for the most part. Tom's Hardware and AnandTech agree that the device is great for entry-level systems but that it doesn't compare to a full CPU/GPU combo.

I'm impressed with the technical discussion of the chip at Real World Technologies. My sole interest in the Fusion is its ability to process OpenCL kernels, so the CPU-GPU integration is a major concern for me. I'm glad that the Fusion's memory bandwidth is greater than that for a discrete GPU (8 GB/s instead of 6 GB/s), but I'd hoped for better. Intel's Sandy Bridge chips provide better integration, but from what I've seen, they don't support OpenCL yet.


Exit Through the Gift Shop

>> Monday, June 27, 2011

Over the weekend, I watched the documentary Exit Through the Gift Shop and I really enjoyed it. I knew nothing about street art, but it looks like a lot of fun, and the artists (Banksy, Shepard Fairey, and Space Invader) seem like genuine, passionate, fascinating people.

This isn't a normal documentary. The film centers on a protagonist named Thierry Guetta, who transforms from a slavish camera operator into a famous, arrogant artist. Some reviewers claim the film is true to life, but others call the movie a complete hoax, saying that Banksy is thumbing his nose at the world of art.

I agree that parts of the documentary are fiction, but I disagree regarding Banksy's motives. Before I can present my interpretation, I need to explain what happens in the film:

  1. Thierry Guetta, a childlike man obsessed with his videocamera, becomes fascinated with street art. He stalks Space Invader, Shepard Fairey, Banksy, and others, following them as they work and recording their thoughts.
  2. Banksy, seeing Thierry's footage, asks him to create a documentary. The result, called Life Remote Control, is atrocious, and Banksy decides to make the documentary on his own.
  3. While Banksy becomes a filmmaker, Thierry becomes a street artist. He adopts the name Mr. Brainwash and acquires enough standing to hold his own art auction in Los Angeles.
  4. In preparation for his show, Thierry becomes a tyrant and a copycat. He does nothing on his own, and orders his subordinates to produce obviously-derivative works based on existing pop art (Marilyn Monroe with blue hair, Elvis with a machine gun, and so on).
  5. Despite mocking reviews from Banksy and others, Thierry's show is a success.

Since watching the movie, I've done some research on my own (mainly Wikipedia). I'm not certain about anything, but here's what I've gathered:
  1. Life Remote Control was a real documentary, but the director was a Swiss filmmaker named Joachim Levy, not Thierry.
  2. Banksy used the name "Mister Brainwash" in his street art long before Thierry took the name for himself.
  3. Thierry's artshow really took place, and he really succeeded.

Here's my interpretation. Banksy made this documentary on his own because he hated Life Remote Control, but rather than insult Mr. Levy, he shifted the blame to the cameraman, Thierry. He also gave Thierry a persona: arrogant, obsessive, and phony. In doing this, Banksy's goal is to contrast Thierry with the street art movement. This, Banksy is saying, is what we are not.

The reception of Thierry's art is the only aspect of the documentary that Banksy couldn't control, and I think Banksy wanted the Mr. Brainwash show to flop. I think he wanted to contrast Thierry's failure with his own success, and show people that hype alone can't sell art. Thierry's failure would lend validation to his fame and that of street art in general. But that's not what happened. As I see it, the despondency shown by Banksy and Shepard Fairey toward the end of the documentary is the most sincere part of the film.

I admire Banksy for including his failed experiment in the film, and I'm glad he didn't focus on any particular buyer of Thierry's art (except Madonna). At the same time, the documentary made me thankful that I work in technology. My line of work is 100% hype-free and that's the way I like it.


OpenCL Image Support

>> Monday, June 20, 2011

The GPUs that I use for testing all support OpenCL image objects, but not to the fullest extent of the specification. For example, I can't get linear interpolation to work and many of the image format types don't seem to be readable by the kernel.

My latest concern has been with OpenGL-OpenCL interoperability. I've had no trouble creating buffer objects with clCreateFromGLBuffer, but when I try to create image objects with clCreateFromGLTexture2D, I always get the CL_INVALID_IMAGE_FORMAT_DESCRIPTOR error. This happens no matter what image format I choose.

I spent a lot of time on this, but then I looked through the sample code in the AMD SDK. Instead of calling clCreateFromGLTexture2D, their application stores texture data using pixel buffer objects (PBOs). Very clever. PBO data can be shared with OpenCL buffer objects in the same way that VBO data can, and once the PBO is bound to the GL_PIXEL_UNPACK_BUFFER target, texture objects can read their image data from the PBO.

I'm glad there's a workaround for the clCreateFromGLTexture2D issue, but I'd be even happier if the function worked properly.


Alea Iacta Est

The final arrangements have been made and money has changed hands. I will be at the Supercomputing Conference 2011 (SC11) in November, and I'll bring my DynLab application for all to see. DynLab is a physics simulation tool that uses OpenCL for computing, OpenGL for rendering, and Qt to provide the overall application structure.

With luck, I'll also be able to distribute the first OpenCL in Action books. That will be very exciting, and I'll do everything I can to help Manning get everything ready by the deadline.



>> Friday, June 17, 2011

I've been following the emergence of WebGL with avid interest. Putting 3-D rendering in a browser sounds like a great idea and I hope this will draw more programmers over to OpenGL. But while I admire the technology, one question gnaws at me: where's the killer app? What can WebGL do for society that nothing else can?

OpenGL's primary uses are video games and CAD, but I can't picture either of those being successfully transitioned to the browser. And despite Google's enthusiastic support, their O3D project hasn't gone anywhere since I watched it demonstrated at Google IO 2009.

It doesn't help that Microsoft is dead set against WebGL. They have a good point - a WebGL shader can access the user's GPU directly, which means it can potentially lock the user's system. Someone could code a shader parser in JavaScript capable of validating a shader's safety. Hmm...

I've read comments saying that Microsoft's disinterest will drive users to browsers other than IE, but this assumes the existence of a WebGL application that will set the world on fire. What is that application?


The Bug from Hell

>> Wednesday, June 1, 2011

I just spent three hours riddling out why I was getting a segmentation fault every time I enqueued a kernel. The application uses OpenGL-OpenCL interoperability, so I assumed the problem involved the shared resources. But when I tested clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects, everything worked fine.

As it turned out, the problem had nothing to do with interoperability. I was setting the kernel argument incorrectly. Here's my original code:

clSetKernelArg(kernel, 0, sizeof(cl_mem), buffer);

Anyone familiar with OpenCL will know that I'm trying to make a buffer object into a kernel argument. But can you spot my terrible error? If not, here it is: the last argument should be a reference to the memory object, not the memory object itself. The function should be called as follows:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

That missing ampersand cost me three hours of coding. If the compiler had caught this, I could have fixed the problem in three seconds. But no, the last argument of clSetKernelArg can be set to anything, and the only way you'll know something's wrong is when the kernel execution raises a segmentation fault.


GPGPU Programming and Marketing

>> Wednesday, May 25, 2011

It's been a big week for GPU computing. Cray recently announced their first supercomputer with GPUs: the XK6. Mathworks announced that their new Parallel Computing Toolbox will support GPU-based applications. And the University of Michigan has developed a GPU-based version of R, the popular language for statistics.

This is major progress for GPGPU programming, but all three of these projects are exclusively based on CUDA, which only runs on Nvidia hardware. OpenCL runs on many more types of devices, but if anyone's doing anything exciting with it, they're doing a great job keeping it under their hat.

My hopes for OpenCL rest on AMD's Fusion processor, which should be released in the next month or so. I think this will spur interest in OpenCL, but Nvidia has already worked hard to position itself as the GPU vendor for high-performance computing. It will be hard for the Khronos Group to convince people otherwise.

The Fusion embodies a revolution in computing, but AMD isn't getting the word out. I always wonder what might have happened if Intel had developed the Fusion instead. We'd be bombarded with commercials and advertisements and sales pitches. We'd see engineers dancing in their pastel bunny suits as they made chips, and what would they be dancing to? Fusion, of course.

My favorite piece of Intel marketing came around in 1996. After making minor improvements to the Pentium P5, they released an updated device that supported additional instructions for vector computing. No existing software could execute these instructions, but that didn't bother Intel. They called the new device Pentium with MMX, and it was a huge success. I mean, who wants a vanilla Pentium when you can have a Pentium with MMX? The acronym (as I understood it) stood for MultiMedia eXtension, but it didn't matter. It was so catchy that people upgraded from Pentiums to Pentiums/MMX just because of the name.


Right Here in River City

>> Sunday, May 15, 2011

I live in California at the border of Walnut Creek and Pleasant Hill. I love this area. The weather is gorgeous, the running paths are gorgeous, and the people are as pleasant and as laid-back as could be wished.

In Pleasant Hill, there's a small shop called My Divine Skin. I always assumed it was a beautician's office, but it's actually a massage parlor. And by night, it's a police-run brothel.

And that's just the tip of the iceberg. According to the article, the local police have been involved in bribery, distribution of stolen drugs, and setting up phony DUI arrests.

My beloved city is a hotbed of vice and corruption! I'm shocked beyond my ability to express myself.

What fascinates me is that I was at the Contra Costa courthouse recently as a potential juror. The judge turned us all away because his cases had settled, but this would have been an incredible case to sit in on.

The more I read about this, the more certain I am that there's a novel in here somewhere. There has to be...



>> Sunday, May 1, 2011

I submitted my mutex-based barrier code to the OpenCL forum, and they weren't impressed. It wasn't that my code didn't work, they said, but that it didn't scale. And this is important - if your OpenCL code doesn't scale, it may as well not work.

They were absolutely right. My barrier works only as long as the number of work-groups doesn't exceed the number of compute units on the device. But, as I learned the hard way, if the number of work-groups is greater than the number of compute units, the kernel hangs. More precisely, the GPU hangs, which means I have to restart the computer.

I don't know exactly how compute units execute work-groups, but not all work-groups execute at once. Once the first set of work-groups finish their execution, then the next set can start. So here's the problem--if the first set of work-groups is waiting for all the work-groups to synchronize, they'll never stop executing and the next set of work-groups will never start. So that's why my barrier never completes.




>> Wednesday, April 27, 2011

NOTE: Though theoretically sound, this mutex code fails to execute on some systems. In particular, it fails to execute on my Linux/AMD test platform. Be warned.

I've spent some time coding a global memory barrier using OpenCL's atomic functions, and I think I've succeeded. This is different than barrier(CLK_GLOBAL_MEM_FENCE), which only synchronizes work-items within a work-group. In my tests, the code synchronizes work-items across different work-groups. Here's the code:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

#define LOCK(a) atom_cmpxchg(a, 0, 1)
#define UNLOCK(a) atom_xchg(a, 0)

__kernel void mutex_test(__global int *count, __global int *mutex) {

   if(get_local_id(0) == 0) {

      /* Increment the count */
      *count += 1;

      /* Wait for everyone else to increment the count */
      int waiting = 1;
      while(waiting) {
         if(*count == get_num_groups(0)) {
            waiting = 0;

In this code, LOCK atomically reads the value at mutex and checks to see if it's unlocked (0). If it's unlocked, LOCK sets the mutex value to its locked value (1), and then accesses the critical section. Then, UNLOCK sets the mutex value to its unlocked value (0).

The barrier works in two stages. First, Work-Item 0 in each work-group atomically increments the count variable. Then it waits to see if every other Work-Item 0 has done the same. Once count equals the result of get_num_groups(), each work-item has passed the barrier.

It's late, so I'll post this on the Khronos OpenCL forum tomorrow. I'll be interested to see if this works on other systems. If so, I think this will be a big deal. It means that it's no longer necessary to restart new kernels in order to synchronize global memory across work-groups.


Thoughts in Brief

>> Sunday, April 3, 2011

  1. I'm reading Freedom from Fear by David M. Kennedy. It's a history book that focuses on the presidency of Franklin Roosevelt, but it reads like a suspense novel. Someone could make a good television series on this--like The West Wing, but 75-odd years in the past. Not only were the crises larger than life, but so were the people: Huey Long, Father Coughlin, Eleanor Roosevelt, Lorena Hickok, and John L. Lewis.
  2. It breaks my heart to see the Borders store in the center of Pleasant Hill closing. I always liked the fact that the most impressive building in the city center was a bookstore, but soon the doors will be closing. The business model just can't be sustained. I'll miss you, Borders.


The Fast and the Fourier

>> Monday, March 28, 2011

Five days. It's taken five long, painful days, but it's done. I finished coding the Fast Fourier Transform (FFT) for Chapter 12 this evening, and it works. Not just the forward FFT either, but also the inverse. And if there's anything more agonizing than debugging an inverse FFT running on a GPU, I haven't experienced it.

This isn't the first time I've coded an FFT, but it's the first time I've coded an FFT with OpenCL's work-items, work-groups, vector operations, and memory synchronization. Unbelievable. I look like something that staggered out of a zombie movie, but as I write this, I feel a tranquility and lightness of spirit that I can't describe.

I took one break on Saturday afternoon, and rented 13 Assassins on iTunes. Fine movie. A little more characterization would have made the lengthy battle scenes more meaningful, but other than that, it's a fine addition to the men-on-a-mission genre.


Sparse Matrices and OpenCL

>> Saturday, March 12, 2011

I finally finished implementing the Conjugate Gradient (CG) algorithm in OpenCL, and judging from a casual web search, I think this is the first time it's been done. The theory isn't simple by any means, and despite Jonathan Shewchuk's excellent paper on the subject (available here), there are still a few places where I'm not satisfied.

Thankfully, Matlab provides an m-file that shows how the CG algorithm is implemented, so I was able to check my work. Last night, I tested my code with the BCSSTK05 sparse matrix from NIST's Harwell-Boeing collection, and it converged with a final residual of 0.067. Happy day.

I was also planning to implement the biconjugate gradient stabilized algorithm, also known as BiCGSTAB, in OpenCL, but when I tried the Matlab bicgstab routine with NIST's LNS_131 matrix, the algorithm didn't converge. Even after 20,000 iterations, the residual stayed above ten thousand.

This amazes me. Finite element analysis (FEA) has been around since the 1960s, so I figured all the mathematical theory had been riddled out years ago. But from what I've seen, coding sparse matrix solvers is still more of an art than a science.

GPUs excel at highly-parallel algorithms, so it might be better to have them solve sparse matrix systems using direct methods instead of iterative methods. Need to give this more thought.


Is Your Local Memory Really Local?

>> Friday, March 4, 2011

CUDA makes a clear distinction between shared memory, which is SRAM located on the GPU, and local memory, which is located in DRAM, off the GPU. Both memory types are specific to a given processing unit, but shared memory has much less latency than local memory or global memory. For this reason, CUDA routines generally copy input data from global memory to shared memory, process the data in shared memory, and write the output to global memory.

OpenCL, on the other hand, doesn't make any distinction between shared memory and local memory. Both types are referred to as local. So here's the question: how do you know if the local memory you're working with is high-speed memory on the GPU or low-speed memory off the GPU?

It turns out that the clGetDeviceInfo function has a field called CL_DEVICE_LOCAL_MEM_TYPE, which can be either CL_LOCAL or CL_GLOBAL. If the type is CL_GLOBAL, then there's no point copying data from global memory to local memory because both memory types are essentially global. But if the type is CL_LOCAL, then the memory is close to the processing unit and it's a good idea to store frequently-accessed data there.

Kind of a nuisance, isn't it? It seems like the only way to ensure high-performance is to check the local memory type of a device and send it a different kernel depending on whether it's CL_LOCAL or CL_GLOBAL.


The 83rd Annual Academy Awards

>> Wednesday, March 2, 2011

The Oscars were given out this past weekend, and I'm glad that my only serious prediction turned out to be correct: Melissa Leo won for Best Supporting Actress in The Fighter. She was the best thing about that movie, and she made everyone else look like amateurs. I'd hoped that Geoffrey Rush would win for Best Supporting Actor, but that's more because he's my favorite actor than because of his performance.

I enjoyed The King's Speech, partly for the story and partly for the performances. Guy Pearce and Helena Bonham Carter were wonderful. It took me a while to recognize Jennifer Ehle, but I hadn't seen her since the last time she'd starred with Colin Firth. They even cast Anthony Andrews, who had played King Edward VIII, to play Stanley Baldwin. Ha.

I had two problems with the movie, though:

  • It should have ended with the speech's conclusion. The ten minutes of smug back-clapping weren't necessary, and given the gravity of the subject, it seemed odd that everyone was so happy. And Winston Churchill saying "I couldn't have said it better myself" was just silly.
  • If you're going to cast Claire Bloom as Queen Mary, you should give her some lines. I saw her in Limelight some time ago, and I've been a fan ever since.
I knew that one of the two little girls was Queen Elizabeth II, but it wasn't until I walked out of the theater that I realized Helena Bonham Carter's character was the Queen Mum. I just read her Wikipedia entry, and she was a fascinating woman.

I should see Limelight again. I wonder if I'll enjoy it as much as I once did.


Atomic Functions and Dot Products

>> Monday, February 28, 2011

I love OpenCL dearly, but it has two shortcomings that frustrate me to no end:

  • No atomic floating-point operations
  • No mutexes or locks
An example will show why this state of affairs is so painful. Suppose you have 1000 work-items and you want to compute the dot product of two vectors, x and y, both containing 1000 floats. At first glance, you might try something like the following:
int id = get_global_id(0);
dot_prod += x[id] * y[id];

This will compile, but you won't get the right answer because every work-item will attempt to access dot_prod at the same time. The result is a race condition. To get the right answer, each work-item must have exclusive access to dot_prod. But OpenCL doesn't support atomic floating-point operations or mutexes, so there's no way for one work-item to tell the others to back off.

You could process the dot product in stages. In the first stage, you could have 100 work-items perform 10 multiply-and-add operations. In the second stage, you could have 10 work-items sum the 100 results from the first stage. In the third stage, a single work-item could sum the 10 results from the second stage.

But there's a problem. You need to set up a barrier after each stage to ensure that succeeding stages won't start prematurely. In my experience, it takes more time for a barrier command to execute than it takes to perform the algorithm using a single work-item.

So that's what I'm doing. To my great shame, the QR decomposition code computes dot products with one work-item:
if(get_global_id(0) == 0) {
   ...compute dot product...
This code doesn't use any parallelism, and would get me fired from any professional job in high-performance computing. But it's faster than every alternative I can think of.


The AMD Fusion

>> Wednesday, February 23, 2011

One of the reasons I first became interested in OpenCL was AMD's Fusion APU (Accelerated Processor Unit). Since Intel's Larrabee project collapsed, this seems to be the super-processor of the future.

Originally, I heard that the Fusion would make discrete graphics cards obsolete because it contains a CPU and a GPU on a single chip. But AMD, realizing that this would cut into their lucrative graphics card business, has forsworn this. In the past, they used statements like "a supercomputer in every home," but now it looks like they don't have a coherent selling strategy at all.

So I'm nervous. The marketing for the Fusion has been meager, and although AMD has provided a whitepaper, they still haven't come up with a compelling reason for people to buy the Fusion. Instead, it looks like AMD is waiting for enthusiasts/entrepreneurs to tell them why people should buy the Fusion.

AMD hasn't provided any simulators or training classes, as IBM did for the Cell processor. They haven't announced any app contests with prizes, as Google did with their Android. But they have set up the AMD Fusion Fund, calling for companies to submit their ideas. I submitted my idea for Fusion development this afternoon, so we'll see what happens.


Nvidia, CUDA, and OpenCL

>> Sunday, February 20, 2011

I started learning OpenCL because I wanted my code to run on as many vendors' devices as possible. But whenever I find GPU code on the web, it's always in CUDA, which only targets Nvidia devices.

CUDA is older than OpenCL, so it's understandable that it has a larger developer base. But still, a search on tells me there are 217 job openings for "cuda -opencl" but only 79 job openings for "opencl -cuda". So even though the OpenCL Working Group contains AMD, Intel, IBM, Apple, and Nvidia, companies are still only interested in targeting Nvidia hardware.

Not really that surprising, I suppose. Many corporations support OpenCL, but none of them are as passionate about the language as Nvidia is about CUDA. GPGPU development is still a novelty to most, but Nvidia keeps making dedicated GPGPU hardware like their Tesla servers. Nvidia has released cuBlas, a library of BLAS-related matrix routines. OpenCL doesn't have anything like it. CUDA has three books and a certification exam. OpenCL has no books at all. Not yet, anyway.



>> Thursday, February 17, 2011

I just spent an extraordinary amount of money to have my second novel reviewed. The reviewer isn't a professional editor or agent, but a suspense novelist with quite a few published novels. I'm sure his feedback will be helpful, but I don't hold out much hope that my novel will be published. It's kind of odd.

I had higher hopes for the short story that I submitted to the 2010 Short Story Contest by Suspense Magazine. But alas, it's not on their list of the Top Ten Submissions...



>> Wednesday, February 16, 2011

I've started using PyOpenCL, which lets you code OpenCL host applications in Python. It's unbelievable how simple it is to work with. For example, let's say you want to create a kernel from a function called foo, set three arguments (a, b, and c), and execute the kernel with 20x16 work-items divided into work-groups of 5x4 work-items each. In C, the code would look like this:

foo_kernel = clCreateKernel(program, "foo", NULL);
clSetKernelArg(foo_kernel, 0, sizeof(a), &a);
clSetKernelArg(foo_kernel, 1, sizeof(b), &b);
clSetKernelArg(foo_kernel, 2, sizeof(c), &c);
size_t global_size[2] = {20, 16};
size_t local_size[2] = {5, 4};
clEnqueueNDRangeKernel(queue, foo_kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
Here's how it looks with PyOpenCL:, (20, 16), (5, 4), a, b, c)
One line of Python code accomplishes the same result as seven lines of C code, and it's so much more intuitive. It almost feels like cheating. If my intended application didn't require C++, I would definitely code with PyOpenCL.


Shuffling and Sorting

>> Sunday, February 13, 2011

Note: I've improved upon this code and you can read the discussion here.

Of all the builtin functions available for OpenCL kernels, my favorites are shuffle and shuffle2. These rearrange the elements of vectors, which is something you can't do elegantly in regular C. When I needed to code the bitonic sort using OpenCL, I thought it would be clever to come up with a compare-and-swap operation using vector operations. Here's what I arrived at:

inline void compare_and_swap(__local int2 *d1, __local int2 *d2, uint dir) {
   int2 input1 = *d1; int2 input2 = *d2;
   uint2 cmp = (uint2)(input1 > input2) ^ dir;
   uint2 mask = 2*cmp;
   mask.s1 += 1;
   *d1 = shuffle2(input1, input2, mask);
   *d2 = shuffle2(input2, input1, mask);
The goal is to create a mask vector that can rearrange the elements of vectors d1 and d2 in local memory. In contrast, the bitonic sorts provided by the Nvidia SDK and the AMD SDK use compare-and-swap routines that rely on scalar operations:
if((local_data[i] > local_data[i+2]) == dir ) {
   t = local_data[i];
   local_data[i] = local_data[i+2];
   local_data[i+2] = t;
Sorting data is a crucial topic, and I think database acceleration will turn out to be one of the most important uses of OpenCL. To determine whether sorting is better accomplished with vectors or scalars, I coded three test kernels:
I've profiled these three kernels extensively, and the full_vector kernel swaps data faster than the full_scalar kernel. This makes sense to me, but oddly, the full_scalar kernel runs faster than the part_vector kernel. Still trying to figure this out...


  © Blogger template Werd by 2009

Back to TOP