Work-items and Local Memory

>> Sunday, July 17, 2011

I've been experimenting with my FFT code, changing the size of the work-groups and the amount of local memory each group has to work with. Here are my observations:

  • Increasing the work-group size always improves performance.
  • Decreasing the amount of local memory available to each work-group usually improves performance.

The first point didn't surprise me but the second did. My initial kernel computes one FFT per work-group, and the FFT is sized to fill the group's local memory completely. That is, if a work-group has 32 KB of local memory and each complex point occupies 2*sizeof(float) = 8 bytes, then each work-group can perform a 4096-point (4k) FFT. Successive kernels merge the work-groups' results until the final FFT is computed.
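
To make the sizing arithmetic concrete, here is a rough Python sketch (not my actual host code). The function names are made up for illustration, and the merge-stage count assumes each successive kernel combines results pairwise (radix-2), which may not match every FFT implementation.

```python
BYTES_PER_COMPLEX_POINT = 2 * 4  # real and imaginary parts, one 32-bit float each


def max_fft_points(local_mem_bytes):
    """Largest power-of-two FFT whose points fit in the given local memory."""
    points = local_mem_bytes // BYTES_PER_COMPLEX_POINT
    if points < 1:
        raise ValueError("local memory too small for a single complex point")
    # Round down to a power of two, since the FFT size is assumed to be one.
    return 1 << (points.bit_length() - 1)


def merge_stages(total_points, points_per_group):
    """Pairwise (radix-2) merge stages needed after the per-group FFTs.

    Assumes both arguments are powers of two.
    """
    if total_points <= points_per_group:
        return 0
    return (total_points // points_per_group).bit_length() - 1
```

With 32 KB per group, `max_fft_points(32 * 1024)` gives 4096. Halving local memory to 16 KB gives 2048 points per group, and a 2^20-point transform then needs 9 merge stages instead of 8. That extra stage is why I expected smaller groups to be slower.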

I'd assumed that each work-group should perform as large an FFT as possible. This means less synchronization and fewer successive stages. But when I experimented, the opposite held true. As I reduced the amount of local memory allocated for each group, the FFT performance improved.

I have a theory. The more local memory each work-group has, the more data each work-item needs to read from global memory. Ideally, the work-items in a work-group combine (coalesce) their read requests so that the group's memory operations are performed at once. But in my FFT, the repeated iterations may end up producing staggered global memory operations that can't be combined, and those are very time-consuming. Further experiments are needed.
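
As a way to picture what "staggered" could mean here, the following toy model counts how many aligned memory segments a group of work-items touches when each one reads a single complex point at a fixed stride. It is only a sketch: the 128-byte segment size and the `segments_touched` helper are assumptions for illustration, and real devices apply their own coalescing rules.

```python
def segments_touched(num_work_items, stride, element_bytes=8, segment_bytes=128):
    """Count distinct aligned segments hit when work-item i reads element i * stride.

    element_bytes=8 matches one complex point (two floats); segment_bytes=128
    is an assumed transaction size, not a figure taken from any device.
    """
    addresses = (i * stride * element_bytes for i in range(num_work_items))
    return len({addr // segment_bytes for addr in addresses})
```

In this model, 32 work-items reading adjacent points (`stride=1`) touch 2 segments, while the same 32 work-items reading every 16th point touch 32 segments, one per work-item. If my kernel's access pattern looks more like the second case as the per-group FFT grows, that would fit what I measured.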

