Friday, March 4, 2011
CUDA makes a clear distinction between shared memory, which is SRAM located on the GPU chip itself, and local memory, which lives in off-chip DRAM. Both memory types are specific to a given processing unit, but shared memory has much lower latency than local or global memory. For this reason, CUDA routines generally copy input data from global memory to shared memory, process it there, and write the output back to global memory.
OpenCL, on the other hand, doesn't make this distinction: it uses the single term local memory. So here's the question: how do you know whether the local memory you're working with is high-speed memory on the GPU chip or low-speed memory off it?
It turns out that the clGetDeviceInfo function accepts a parameter named CL_DEVICE_LOCAL_MEM_TYPE, whose value is either CL_LOCAL or CL_GLOBAL. If the type is CL_GLOBAL, then there's no point copying data from global memory to local memory, because both memory types are essentially global. But if the type is CL_LOCAL, then the memory is close to the processing unit and it's a good idea to store frequently accessed data there.
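Here's a minimal sketch of that query, assuming a system with at least one OpenCL platform and GPU device (error checking is trimmed for brevity):

```c
/* Query whether a device's local memory is dedicated on-chip SRAM
   (CL_LOCAL) or carved out of global DRAM (CL_GLOBAL). */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_device_local_mem_type mem_type;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(mem_type), &mem_type, NULL);

    printf("Local memory is %s\n",
           mem_type == CL_LOCAL ? "dedicated (on-chip)"
                                : "part of global memory");
    return 0;
}
```

On most discrete GPUs this reports CL_LOCAL; a CPU device, by contrast, typically reports CL_GLOBAL, since it has no separate local memory bank.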
Kind of a nuisance, isn't it? It seems like the only way to ensure high performance is to check the local memory type of a device and send it a different kernel depending on whether the type is CL_LOCAL or CL_GLOBAL.
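One way to do that without maintaining two separate kernel files is to guard the local-memory path with a preprocessor macro and pass a -D option to clBuildProgram based on the query. A sketch, assuming a hypothetical kernel named scale and a made-up USE_LOCAL_MEM macro (error checking omitted):

```c
/* Build one kernel source two ways depending on the device's
   local memory type. The kernel and macro name are hypothetical. */
#include <stdio.h>
#include <CL/cl.h>

static const char *source =
    "__kernel void scale(__global float *data, float factor) {\n"
    "#ifdef USE_LOCAL_MEM\n"
    "    /* stage data in fast on-chip local memory first */\n"
    "    __local float tile[64];\n"
    "    size_t lid = get_local_id(0);\n"
    "    tile[lid] = data[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    data[get_global_id(0)] = tile[lid] * factor;\n"
    "#else\n"
    "    /* local memory is really global here, so skip the copy */\n"
    "    data[get_global_id(0)] *= factor;\n"
    "#endif\n"
    "}\n";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_device_local_mem_type mem_type;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(mem_type), &mem_type, NULL);

    cl_context context =
        clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, NULL);

    /* Enable the staging path only when local memory is on-chip. */
    const char *options = (mem_type == CL_LOCAL) ? "-DUSE_LOCAL_MEM" : "";
    clBuildProgram(program, 1, &device, options, NULL, NULL);

    cl_kernel kernel = clCreateKernel(program, "scale", NULL);
    /* ... set arguments, enqueue, and read back results here ... */

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseContext(context);
    return 0;
}
```

This keeps a single source file, and the build options do the device-specific specialization at runtime.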