I am wondering whether a kernel reserves a small part of shared memory for itself when it is launched, or whether the entire 16 KB per multiprocessor is available to the programmer. I ask because I would like to use all of shared memory (down to the last byte) to store data for my calculation, but when I look at the .cubin file there seem to be 64 bytes in use that I am not declaring. Is this standard, or should I be looking for something in my code that is using shared memory without my realizing it?
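To make the numbers concrete, here is a rough host-side sketch of the budget arithmetic. It assumes the 64 bytes really are unavailable to the kernel, which is exactly what I have not confirmed. The names `kSharedBytesPerSM`, `kUnexplainedBytes`, `usable_bytes` and `elements_that_fit` are made up for illustration.

```cpp
#include <cstddef>

// Figures taken from the question above; both are assumptions about my setup.
constexpr std::size_t kSharedBytesPerSM = 16 * 1024;  // 16 KB per multiprocessor
constexpr std::size_t kUnexplainedBytes = 64;         // reported in the .cubin, not declared by me

// Bytes left for my own data if the 64 bytes are in fact reserved.
constexpr std::size_t usable_bytes() {
    return kSharedBytesPerSM - kUnexplainedBytes;
}

// Number of whole elements of a given size that fit in the remaining space.
constexpr std::size_t elements_that_fit(std::size_t element_size) {
    return usable_bytes() / element_size;
}

// Compile-time checks of the arithmetic.
static_assert(usable_bytes() == 16320, "16384 - 64");
static_assert(elements_that_fit(16) == 1020, "16-byte elements, e.g. four floats");
static_assert(elements_that_fit(12) == 1360, "12-byte elements, e.g. three floats");
```

So if the 64 bytes are off limits, a layout sized for the full 16384 bytes would be four 16-byte elements too large.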
I can shrink my shared memory use and launch the kernel more times, but I would prefer to use as much as possible. Splitting the work also seems to be slower overall: when I launch the kernel 4 times as often with each launch doing 1/4 of the work, the total run takes about twice as long, even though occupancy is better.
Another quick question: are float4s more efficient than float3s on the GPU? I converted my code from float4s to float3s to save memory, and each kernel call suddenly took twice as long to run.
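For reference, a host-side sketch of the layout difference I suspect is involved. `Vec3` and `Vec4` are hypothetical stand-ins that mirror how CUDA declares `float3` (three floats, 4-byte alignment) and `float4` (four floats, 16-byte alignment). The helper functions are likewise made up for illustration, and the sketch assumes a 4-byte `float`.

```cpp
#include <cstddef>

// Hypothetical stand-ins for CUDA's built-in vector types.
struct Vec3 { float x, y, z; };                 // like float3: 12 bytes, 4-byte aligned
struct alignas(16) Vec4 { float x, y, z, w; };  // like float4: 16 bytes, 16-byte aligned

static_assert(sizeof(Vec3) == 12 && alignof(Vec3) == 4, "assumes 4-byte float");
static_assert(sizeof(Vec4) == 16 && alignof(Vec4) == 16, "assumes 4-byte float");

// Byte offset of element i in a tightly packed array of elements of elem_size bytes.
constexpr std::size_t offset_of_element(std::size_t i, std::size_t elem_size) {
    return i * elem_size;
}

// True if the offset falls on a 16-byte boundary.
constexpr bool is_16_byte_aligned(std::size_t offset) {
    return offset % 16 == 0;
}

// Every Vec4 element starts on a 16-byte boundary; only every fourth Vec3 does.
static_assert(is_16_byte_aligned(offset_of_element(1, sizeof(Vec4))), "offset 16");
static_assert(!is_16_byte_aligned(offset_of_element(1, sizeof(Vec3))), "offset 12");
static_assert(is_16_byte_aligned(offset_of_element(4, sizeof(Vec3))), "offset 48");
```

If that is the cause, the 4 bytes saved per element cost the 16-byte alignment that lets a whole element be read as one unit, which might explain the slowdown, though I have not verified that this is what happens in my kernel.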