I have exactly the same function; the only thing I changed was to reduce the number of items that need to be processed. Where I previously needed 100000 threads running on the GPU, I reorganized the data so that only 20000 threads need to run to produce the same result. I thought I was being smart, but according to the profiler, execution was faster when I was running 100000 threads! The kernel is exactly the same, yet the speed drops by almost 50% when there is less data to process. Any ideas what could be going on?
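To make the two launch sizes concrete, here is a rough sketch of the grid arithmetic for each case. The block size of 256 and the 32-thread warp are assumptions for illustration only (the actual launch parameters and GPU are not stated above), and `launch_shape` is a hypothetical helper, not part of any GPU API.

```python
import math

WARP_SIZE = 32  # assumption: NVIDIA-style 32-thread warps


def launch_shape(total_threads, block_size=256):
    """Return (blocks, warps) launched to cover total_threads.

    block_size=256 is an assumed value, not taken from the real code.
    """
    # Round up so the last, partially filled block is still launched.
    blocks = math.ceil(total_threads / block_size)
    # Every launched block occupies whole warps, even if some threads idle.
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    return blocks, blocks * warps_per_block


for n in (100_000, 20_000):
    blocks, warps = launch_shape(n)
    print(f"{n} threads -> {blocks} blocks, {warps} warps")
```

Comparing these numbers against the occupancy and memory-throughput figures the profiler reports for each run might show whether the smaller launch still keeps the GPU busy, or whether each of the 20000 threads is now doing more work or accessing memory in a less favorable pattern.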