I am curious about this output I got from the profiler when I ran my existing (unoptimized) code.
Is CPU time ~= GPU time a good thing?
Given that my code is probably memory bound, since the video data is copied into global memory (the 1st memcpy call), but I run about 86K threads in 338 blocks of 256 threads, should I expect to be able to do better than an occupancy of 66%?
To follow up, I experimented with changing the number of threads per block without changing my kernel code. There seem to be 2 sweet spots, 128 threads per block and 320 threads per block, where the occupancy reaches 83%. I can understand one sweet spot due to optimal register allocation, but 2 sweet spots?
It could be that the second sweet spot is due to running more blocks concurrently. In my experience, this usually happens with thread counts that are multiples of one another (e.g. 128 and 256), but it could still be the case with your program.
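One way to see how two unrelated block sizes can tie is a back-of-the-envelope occupancy model. The sketch below is only an approximation of what the occupancy spreadsheet does: it ignores shared memory and register allocation granularity. It assumes compute capability 1.0/1.1 limits (768 threads, 24 warps, 8 blocks and 8192 registers per multiprocessor), since the thread does not name the GPU. The names `SmLimits` and `occupancy` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits for compute capability 1.0/1.1 hardware.
// ASSUMPTION: the thread does not say which GPU was used.
struct SmLimits {
    int max_threads = 768;   // resident threads per multiprocessor
    int max_warps   = 24;    // resident warps per multiprocessor
    int max_blocks  = 8;     // resident blocks per multiprocessor
    int registers   = 8192;  // 32-bit registers per multiprocessor
    int warp_size   = 32;
};

// Simplified occupancy: resident warps divided by the warp limit.
// Ignores shared memory and register allocation granularity.
double occupancy(int threads_per_block, int regs_per_thread,
                 const SmLimits& sm = SmLimits{}) {
    assert(threads_per_block > 0 && threads_per_block <= sm.max_threads);
    assert(regs_per_thread > 0);

    // A partially filled warp still occupies a whole warp slot.
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int regs_per_block  = warps_per_block * sm.warp_size * regs_per_thread;

    // Each resource caps the number of resident blocks; the tightest cap wins.
    int by_warps = sm.max_warps / warps_per_block;
    int by_regs  = sm.registers / regs_per_block;
    int blocks   = std::min({sm.max_blocks, by_warps, by_regs});

    return static_cast<double>(blocks * warps_per_block) / sm.max_warps;
}
```

With a hypothetical 12 registers per thread (a value picked because it reproduces the figures in this thread, not one the poster reported), `occupancy(128, 12)` and `occupancy(320, 12)` both come out at 20/24, about 83%: five blocks of 4 warps in one case, two blocks of 10 warps in the other. `occupancy(256, 12)` gives 16/24, about 67%, because a third block of 256 threads would not fit in the register file.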
It could very well be the case, as my kernel is memory bound. I have to store all the video data in global memory because there just isn’t enough shared memory to hold it, even divided across multiprocessors. I try to read the bits I care about into registers when I can.
Experimenting with the occupancy spreadsheet suggests that, for my kernel, occupancy is mostly a function of the number of registers used, since the amount of shared memory used has much less impact on occupancy.
I benchmarked the current version of my kernel for different numbers of threads per block (16 to 512), and though the occupancy varied from 0.333 to 0.667, the kernel execution time varied by only about 2.5% (1545 - 1584 microseconds).
So in this case, I expect I am limited by the cost of global memory accesses, which I make plenty of.
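A common sanity check for a memory-bound kernel is to estimate its effective bandwidth and compare it with the card's peak. The helper below is a minimal sketch of that arithmetic. The function name is made up for this illustration, and the byte counts have to come from your own kernel, since the thread does not state how much data is moved.

```cpp
#include <cassert>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes):
// total bytes moved through global memory divided by kernel time.
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double kernel_time_us) {
    assert(kernel_time_us > 0.0);
    double seconds = kernel_time_us * 1e-6;  // microseconds to seconds
    return (bytes_read + bytes_written) / seconds / 1e9;
}
```

For example, with placeholder traffic of 50e6 bytes read and 50e6 bytes written (made-up numbers) and the measured 1545 microseconds, the function returns about 64.7 GB/s. If the real figure is close to the device's theoretical peak, raising occupancy is unlikely to help much, which would fit the flat timings above.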