>> Sunday, May 1, 2011
I submitted my mutex-based barrier code to the OpenCL forum, and they weren't impressed. It wasn't that my code didn't work, they said, but that it didn't scale. And this is important: if your OpenCL code doesn't scale, it may as well not work.
They were absolutely right. My barrier works only as long as the number of work-groups doesn't exceed the number of compute units on the device. As I learned the hard way, once there are more work-groups than compute units, the kernel hangs. More precisely, the GPU hangs, which means I have to restart the computer.
I don't know exactly how compute units execute work-groups, but not all work-groups execute at once. Only once the first set of work-groups finishes executing can the next set start. So here's the problem: if the first set of work-groups is waiting for all the work-groups to synchronize, they'll never stop executing, and the next set of work-groups will never start. That's why my barrier never completes.
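To make the hang concrete, here's a toy model of that scheduling behavior in plain Python. It isn't OpenCL and it isn't my barrier code. It's a sketch that assumes each compute unit holds exactly one work-group at a time and that a work-group gives up its compute unit only after the barrier releases. Real devices may keep several work-groups resident per compute unit, so treat the numbers as illustrative. The function name `barrier_completes` is made up for this example.

```python
def barrier_completes(num_groups, num_compute_units):
    """Toy model: return True if a device-wide barrier can ever release.

    Assumption (not from any OpenCL spec): one resident work-group per
    compute unit, and a resident work-group spins at the barrier until
    every work-group in the launch has arrived.
    """
    pending = num_groups   # work-groups that have not started yet
    resident = 0           # work-groups currently occupying a compute unit
    arrived = 0            # work-groups that have reached the barrier

    while True:
        # Start as many pending work-groups as there are free compute units.
        started = min(pending, num_compute_units - resident)
        pending -= started
        resident += started
        arrived += started  # each group reaches the barrier once it runs

        if arrived == num_groups:
            return True     # everyone arrived, so the barrier releases

        if started == 0:
            # Resident groups are spinning and hold every compute unit,
            # so the pending groups can never start: deadlock.
            return False


if __name__ == "__main__":
    for groups in (2, 4, 5, 8):
        print(groups, "work-groups on 4 compute units:",
              "completes" if barrier_completes(groups, 4) else "hangs")
```

In this model the barrier completes exactly when the work-groups all fit on the device at once, which matches what I saw: fine up to the number of compute units, and a hang beyond it.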