I’m currently running the simpleGL example from the SDK and have been playing around with different block sizes, and I’ve encountered something that I hope someone can explain to me.
The original block size in the example is 8x8 = 64 threads. When I compile with the ptxas verbose option, it reports 14 registers per thread, and the occupancy calculator gives an occupancy of 50% per multiprocessor, since only 8 blocks can be resident on a multiprocessor at a time, which gives 512 active threads. So my idea was to set the block size to 128 threads: we would still have 8 resident blocks per multiprocessor, but now 1024 active threads. I expected this to decrease the execution time of the kernel, but no…
So my question is: does anyone know why this is the case? The block dimensions are 16x8 when running 128 threads/block; can this affect the latency of memory writes in the kernel?
The mesh size is 256x256, and from the CUDA profiler I get an instruction throughput of 1.04115 for 8x8 blocks and 0.967362 for 16x8 blocks.
My graphics card: Quadro FX 1800M
Thank you in advance! =)