I have been experimenting with CUDA's performance for kernels with low arithmetic intensity, that is, kernels that are certain to be memory bound. I wonder whether any of you have made observations on the following two points:
Writing to global memory is “fire and forget”, but there must still be some latency before the data actually arrives in memory. A typical CUDA implementation uses shared memory to cache data that is used several times, so the data flow is read global -> shared -> compute -> write global. If the writes at the end of the kernel are simply fired off, the kernel will have issued its last instruction before the write latency has elapsed. Does the kernel return at that point anyway, or does it block until the writes have completed? If it blocks, one optimization strategy would be to issue the writes as early as possible. Judging from the .ptx files, however, nvcc does not seem to make any effort to move the writes earlier.
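To make the pattern concrete, here is a minimal sketch of the kind of kernel I mean. The names `smooth3`, `run_smooth3` and `TILE` are made up for illustration, and the 3-point average just stands in for any cheap computation that reuses each input a few times. The global writes are the very last thing each thread does.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // hypothetical block size, chosen only for illustration

// Hypothetical memory-bound kernel: 3-point average with zero padding.
// Assumes it is launched with blockDim.x == TILE.
__global__ void smooth3(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];   // block's elements plus one halo cell per side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // index into tile, past left halo

    // read global -> shared
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // compute from shared, then write global as the final instruction
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Host helper (hypothetical): runs the kernel once and copies the result back.
// Returns 0 on success, 1 on any CUDA error.
int run_smooth3(const float* h_in, float* h_out, int n)
{
    float* d_in = 0;
    float* d_out = 0;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return 1; }

    cudaError_t err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        smooth3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
        // Blocking copy issued after the launch; the host reads the results
        // only once this call has returned.
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The question is what happens between the store to `out[gid]` and the point where the kernel counts as finished.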
I have put more than one kernel into a single .cu file, and so far everything has worked fine. In the .cubin file, however, the shared memory usage reported for each kernel appears to be wrong: it is the sum of the shared memory allocations of all kernels in the file. Are the numbers in the .cubin used in any way when a kernel is loaded? If so, a resource check could fail for no good reason.
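In case anyone wants to reproduce this, here is a minimal sketch of the setup: two kernels in one .cu file, each with its own static shared array. The kernel names, buffer sizes and the `staticSharedBytes` helper are made up for illustration. The helper assumes a toolkit whose runtime API provides `cudaFuncGetAttributes`, so that the per-kernel figure the runtime sees can be compared with what the .cubin lists.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel A: reverses a block through a 256-float shared buffer.
// Assumes blockDim.x <= 256.
__global__ void kernelA(float* data)
{
    __shared__ float bufA[256];   // 256 * sizeof(float) = 1024 bytes
    bufA[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = bufA[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical kernel B: same idea with a 512-float shared buffer.
// Assumes blockDim.x <= 512.
__global__ void kernelB(float* data)
{
    __shared__ float bufB[512];   // 512 * sizeof(float) = 2048 bytes
    bufB[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = bufB[blockDim.x - 1 - threadIdx.x];
}

// Returns the statically allocated shared memory (in bytes) that the runtime
// reports for one kernel, or -1 if the query fails.
long staticSharedBytes(const void* kernel)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess)
        return -1;
    return (long)attr.sharedSizeBytes;
}
```

Calling `staticSharedBytes((const void*)kernelA)` and `staticSharedBytes((const void*)kernelB)` from host code gives one number per kernel. Compiling with `nvcc --ptxas-options=-v` should also print the per-kernel resource usage at build time, which is another way to check whether the allocations are being summed.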