Sunday, July 17, 2011
I've been experimenting with my FFT code, changing the size of the work-groups and the amount of local memory each group has to work with. Here are my observations:
- In every configuration I tried, increasing the work-group size improved performance.
- Decreasing the amount of local memory available to each work-group usually improves performance.
I'd assumed that each work-group should perform as large an FFT as its local memory allows, since a larger per-group FFT means less synchronization and fewer successive stages. But when I experimented, the opposite held true: as I reduced the amount of local memory allocated to each group, the FFT performance improved.
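To put that original assumption in numbers, here is a minimal sketch. It assumes a radix-2 FFT of 2^log2_n points in which each work-group transforms 2^log2_m points in local memory per pass, so every pass covers log2_m butterfly stages. The function name `global_passes` and both parameters are hypothetical and are not taken from my FFT code.

```c
/* Hypothetical helper, not part of the actual FFT code.
 * Estimates how many passes over global memory a radix-2 FFT of
 * 2^log2_n points needs when each work-group handles 2^log2_m
 * points in local memory, i.e. log2_m butterfly stages per pass.
 * Returns 0 if log2_m is 0, which is not a valid configuration. */
unsigned global_passes(unsigned log2_n, unsigned log2_m)
{
    if (log2_m == 0) {
        return 0;
    }
    /* Integer ceiling of log2_n / log2_m. */
    return (log2_n + log2_m - 1) / log2_m;
}
```

Under these assumptions, a 2^20-point FFT needs two passes with 2^10 points per group and three passes with 2^8 points per group. That is why I expected more local memory to help. The model only counts passes, though, and says nothing about how expensive each pass is.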
I have a theory. The more local memory each work-group has, the more data each work-item needs to read from global memory. Ideally, the work-items in a work-group combine (coalesce) their read requests so that the group's memory operations are performed at once. But in my FFT, the repeated iterations may end up producing staggered global memory operations that can't be combined, and those are very time-consuming. Further experiments are needed.
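One rough way to reason about this theory is a toy model that counts how many aligned memory segments a work-group touches in a single read. This is a sketch under stated assumptions, not a measurement: work-item i reads element i * stride, and the device services memory in fixed-size aligned segments. The function `count_segments`, its parameters, and the 8-byte element and 128-byte segment sizes used afterwards are all hypothetical, and real devices apply their own rules.

```c
#include <stddef.h>

/* Hypothetical model, not a measurement of any real device.
 * Counts the distinct aligned segments of segment_bytes that a
 * work-group touches when work-item i reads the element at index
 * i * stride, each element being elem_bytes wide.
 * Returns 0 for an empty group or a zero segment size. */
size_t count_segments(size_t group_size, size_t stride,
                      size_t elem_bytes, size_t segment_bytes)
{
    size_t segments = 0;
    size_t prev = 0;

    if (segment_bytes == 0) {
        return 0;
    }
    for (size_t i = 0; i < group_size; ++i) {
        size_t seg = (i * stride * elem_bytes) / segment_bytes;
        /* Addresses never decrease as i grows, so a segment that
         * differs from the previous one has not been seen before. */
        if (i == 0 || seg != prev) {
            ++segments;
            prev = seg;
        }
    }
    return segments;
}
```

With 64 work-items, 8-byte elements, and 128-byte segments, a stride of 1 touches 4 segments, while a stride of 16 touches 64, one per work-item. If the later FFT iterations read with large strides like that, the reads would be far harder to combine, which would fit the slowdown I saw.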