I have a kernel that uses a fixed number of blocks in a grid and a fixed number of threads in a block. The number of threads is 48, so I tried running the kernel with a 3 x 16 block and also with a 16 x 3 block. The strange thing is that the profiler shows 25% occupancy in both cases, yet the 16 x 3 version runs about 50% slower.
It is especially strange because Mark said in another post
That is exactly the opposite of what I’m seeing, so I’m confused about what’s going on. Any ideas?
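For reference, here is a minimal host-side sketch (plain C++, no GPU needed) of how I understand the two block shapes map threads onto warps. It uses the linear thread ID rule from the CUDA programming guide, id = x + y * Dx. The 32-thread warp and 16-thread half-warp sizes are assumptions about my hardware, and the names `Dim2`, `linearId`, `halfWarpOf` and `threadAt` are made up for illustration. It does not explain the slowdown by itself, but it shows that the two shapes group different (x, y) threads together, which could matter depending on how the kernel indexes memory.

```cpp
// Host-side sketch: how a 2D thread index maps to a linear thread ID,
// and from there to a warp and a half-warp.

struct Dim2 { int x, y; };

// Assumed sizes (true for the hardware I am thinking of, not verified here).
constexpr int kWarpSize = 32;
constexpr int kHalfWarpSize = 16;

// Linear thread ID inside a 2D block of size (block.x, block.y):
// threadIdx.x varies fastest.
constexpr int linearId(Dim2 block, Dim2 t) { return t.x + t.y * block.x; }

// Which warp / half-warp a thread belongs to within its block.
constexpr int warpOf(Dim2 block, Dim2 t) { return linearId(block, t) / kWarpSize; }
constexpr int halfWarpOf(Dim2 block, Dim2 t) { return linearId(block, t) / kHalfWarpSize; }

// Inverse mapping: the (x, y) index of the thread with a given linear ID.
constexpr Dim2 threadAt(Dim2 block, int id) { return Dim2{id % block.x, id / block.x}; }

// 16 x 3 block: each half-warp is exactly one row (constant y, x = 0..15).
static_assert(halfWarpOf(Dim2{16, 3}, Dim2{0, 1}) == 1, "row y=1 starts half-warp 1");
static_assert(halfWarpOf(Dim2{16, 3}, Dim2{15, 1}) == 1, "row y=1 ends half-warp 1");

// 3 x 16 block: a half-warp spans rows y = 0..4 plus the first thread of y = 5.
static_assert(halfWarpOf(Dim2{3, 16}, Dim2{0, 5}) == 0, "(0,5) is linear ID 15");
static_assert(halfWarpOf(Dim2{3, 16}, Dim2{1, 5}) == 1, "(1,5) is linear ID 16");
```

So with 16 x 3, consecutive threads of a half-warp differ only in x, while with 3 x 16 they wrap to a new y after every three threads. Either way the block has 48 threads, i.e. one full warp plus one half-filled warp.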
(Both cubin files show reg = 18 and smem = 4256.)
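As a sanity check on the 25% figure, here is a rough sketch of the occupancy arithmetic from those cubin numbers. The per-multiprocessor limits (16384 bytes of shared memory, 8192 registers, 24 resident warps, 8 resident blocks) are assumptions about my card, not something the profiler told me, and the real allocator may round differently. The struct and function names are hypothetical.

```cpp
// Rough occupancy estimate from per-block resource usage.
#include <algorithm>

struct SmLimits {
    int sharedBytes;  // shared memory per multiprocessor
    int registers;    // registers per multiprocessor
    int maxWarps;     // resident warps per multiprocessor
    int maxBlocks;    // resident blocks per multiprocessor
};

// ASSUMED limits for my hardware; adjust for other devices.
constexpr SmLimits kAssumedLimits{16384, 8192, 24, 8};

struct KernelUse {
    int threadsPerBlock;
    int regsPerThread;        // "reg" from the cubin
    int sharedBytesPerBlock;  // "smem" from the cubin
};

// Warps needed per block, rounding a partial warp up (assumes 32-thread warps).
int warpsPerBlock(const KernelUse& k) { return (k.threadsPerBlock + 31) / 32; }

// Blocks that fit on one multiprocessor: the tightest of all limits.
int residentBlocks(const SmLimits& sm, const KernelUse& k) {
    int bySmem = k.sharedBytesPerBlock > 0 ? sm.sharedBytes / k.sharedBytesPerBlock
                                           : sm.maxBlocks;
    int byRegs = sm.registers / (k.regsPerThread * k.threadsPerBlock);
    int byWarps = sm.maxWarps / warpsPerBlock(k);
    return std::min({bySmem, byRegs, byWarps, sm.maxBlocks});
}

// Occupancy = resident warps / maximum resident warps, in percent.
int occupancyPercent(const SmLimits& sm, const KernelUse& k) {
    return 100 * residentBlocks(sm, k) * warpsPerBlock(k) / sm.maxWarps;
}
```

With `KernelUse{48, 18, 4256}` and the assumed limits, shared memory allows 3 blocks (3 x 4256 = 12768 bytes), registers would allow 9, and 3 blocks x 2 warps = 6 of 24 warps, which gives 25%. None of this depends on whether the block is 3 x 16 or 16 x 3, so identical occupancy for both shapes is at least what I would expect.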