Occupancy weirdness... Is the calculator wrong?

We are getting some strange results from a CUDA kernel we are experimenting with. The kernel does not really depend on the number of threads per block, so we are using that as a tuning parameter to gain performance. We have noticed that our measurements do not match what the CUDA occupancy calculator tells us:

  1. When we set the number of threads per block equal to the number of actual processors per SIMD group on the card, performance improved drastically (20 ms -> 10 ms) without any other code changes, even though the occupancy calculator said we were moving from >50% occupancy to 33% occupancy. This change did not affect the amount of data being processed by the kernel.

  2. Performance improved when we increased the number of registers used in the kernel, even though the calculator said we were taking a large hit in occupancy.

  3. In a simple program, setting the number of threads per block to any number bigger than the number of processors in a single SIMD group leads to a large hit in performance, even though the kernel is not using any shared memory.

I cannot post the kernel, but I can say that these kernels are completely independent of each other (i.e., they do not depend on anything generated in any other kernel).

This leads me to some questions:

What does the occupancy calculator calculate, exactly?
How could a program with near 100% occupancy run slower than a program with 33% occupancy, assuming that it is doing the same work and is written efficiently?

The occupancy calculator is probably calculating correctly; there are just bugs in the hardware (device memory access) that lead to the above observations. See my measurements of device memory throughput vs. different thread/block configs. Increasing occupancy can cut your device memory throughput by a factor of 3 if you are unlucky.

Eric

BTW my sweet spot is 33% occupancy & 32 registers

I’ve seen performance increase once as occupancy went from 33% to 58%. That kernel is texture-bound, though, so I guess what actually happened is that the larger (and possibly better-shaped) block improved my access pattern.
For my other kernels, performance doesn’t vary much as occupancy varies. Again, most of my kernels read from texture and write in poor patterns.
So occupancy may still be worth tuning if you have special memory access patterns.

Jeff,

when you say you set the number of threads per block to the number of processors per SIMD group, do you mean your block size was 8?

Also, how are you increasing the number of registers?

When the number of registers is forced (with the compiler flag) below some app-specific threshold, registers get “spilled” into local memory, which is comparable to global memory performance-wise. Essentially, to keep the number of registers down, some values get written to local memory and are later read back, so that the register can be used for something else in the meantime. In many cases, spilling into local memory will decrease performance.

The calculator gives you the ratio of the actual number of threads that can be run concurrently on a multiprocessor to the maximum number. The maximum number is 768. So, for example, if you’re getting 50% occupancy that means that due to register/smem restrictions only 384 threads can be active on the multiprocessor.
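To make the ratio concrete, here is a rough sketch of that arithmetic in Python. It is not the calculator itself: `occupancy` and `threads_allowed_by_registers` are hypothetical helper names, the 8192-register file size is my assumption for an 8800-class multiprocessor, and the real calculator also accounts for shared memory and allocation granularity, which this ignores.

```python
MAX_THREADS_PER_SM = 768   # maximum active threads per multiprocessor, as stated above
REGISTERS_PER_SM = 8192    # ASSUMPTION: register file size of an 8800-class multiprocessor


def occupancy(active_threads, max_threads=MAX_THREADS_PER_SM):
    """Ratio of threads that can be active to the hardware maximum."""
    return active_threads / max_threads


def threads_allowed_by_registers(regs_per_thread, block_size,
                                 regs_per_sm=REGISTERS_PER_SM,
                                 max_threads=MAX_THREADS_PER_SM):
    """Active threads when only registers and the thread cap are limiting.

    Simplified: whole blocks must fit, but allocation granularity and
    shared memory are ignored.
    """
    blocks_by_regs = regs_per_sm // (regs_per_thread * block_size)
    blocks_by_threads = max_threads // block_size
    return min(blocks_by_regs, blocks_by_threads) * block_size


if __name__ == "__main__":
    # The 50% example from the text: 384 of 768 threads active.
    print(occupancy(384))
    # 32 registers per thread with 256-thread blocks: only one block fits.
    active = threads_allowed_by_registers(32, 256)
    print(active, occupancy(active))
```

Under these assumptions, 32 registers per thread with 256-thread blocks leaves room for a single block of 256 threads, which is one third of 768.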

As far as occupancy and performance are concerned, it will be kernel-specific. Generally, higher occupancy helps hide latencies (due to global memory accesses or register read-after-write dependencies, the former being the bigger concern). The idea is really simple (akin to pipelining): if you have 4 reads from gmem and each incurs a 400-cycle latency, one thread will take at least 1,600 cycles to get the data. However, if one uses 4 threads to read the data, and those threads belong to different half-warps, then all the data will be read after 403 cycles.
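The 1,600 vs. 403 cycle figures can be reproduced with a toy model. This is only a sketch of the pipelining argument, not a model of the real memory system: the function names are made up, and it assumes each request takes one cycle to issue and that requests from different threads overlap completely.

```python
def serial_read_cycles(num_reads, latency):
    """One thread issues each read only after the previous one returns."""
    return num_reads * latency


def overlapped_read_cycles(num_threads, latency, issue_cycles=1):
    """Each thread issues one read; requests overlap in flight.

    ASSUMPTION: one request is issued every `issue_cycles` cycles, so the
    last request starts (num_threads - 1) * issue_cycles late and then
    waits the full latency.
    """
    return latency + (num_threads - 1) * issue_cycles


if __name__ == "__main__":
    print(serial_read_cycles(4, 400))      # one thread, four dependent reads
    print(overlapped_read_cycles(4, 400))  # four threads, one read each
```

With 4 reads at 400 cycles each, the serial case gives 1600 cycles and the overlapped case gives 403, matching the numbers above.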

Higher occupancy should increase performance for kernels that access global memory. For example, you can try a simple kernel where each thread just writes a 32-bit word: going from 16% to 50% occupancy more than halves the time (occupancies higher than 50% do not significantly improve performance for this particular trivial example).

Paulius

Yes. We set the blocksize to 8 to match our 8800GTX.

We optimized our code a bit so that it used an int4 as a register variable instead of 16 chars, saving some space. We left the memory copy routine unchanged to make sure it was not affecting the timing. (We still did 16 reads, even though they went into an int4.)

I tried this flag, but all it did was cause the program to crash. We did not use this flag to generate our timing numbers.

I thought the maximum number of threads that can run concurrently on a multiprocessor is the number of processing elements the multiprocessor has. Do you mean that the maximum number of threads that can be scheduled (timesliced) on a multiprocessor is 768? Is it possible that the scheduling algorithm is what causes our time to increase when we use more than 8 elements per block?

How does CUDA schedule blocks? If I have a program in which a warp has to block on I/O, will the internal scheduler swap it out with another block, or with another warp from the same block? It sounds to me like block sizes should always be at least divisible by 8, if not exactly 8, to maximize usage of the SIMD groups.

I don’t think the scheduler should increase your time with more threads. I can see the time increasing if you’re synchronizing your threads with __syncthreads in your kernel, but I’ve yet to see a case where that is a performance bottleneck (not that it couldn’t be, I just haven’t come across it).

Yes, up to 768 threads can be “active” per multiprocessor. Obviously, the number of those whose instructions are being executed during a given clock tick depends on the number of processing resources within a multiprocessor. There’s another restriction: only up to 8 threadblocks can be active on a multiprocessor. So, if your block is only of size 8, the max active threads you can have is 8x8=64 (occupancy = 64/768 = 1/12th). A global-memory-intensive kernel would definitely benefit from an increase in occupancy.
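The block-count limit is easy to check with a few lines of Python. This is just a sketch of the arithmetic above: `max_active_threads` is a hypothetical helper, and it only considers the 8-block and 768-thread limits mentioned here, not registers or shared memory.

```python
from fractions import Fraction

MAX_BLOCKS_PER_SM = 8      # active threadblock limit stated above
MAX_THREADS_PER_SM = 768   # active thread limit stated above


def max_active_threads(block_size,
                       max_blocks=MAX_BLOCKS_PER_SM,
                       max_threads=MAX_THREADS_PER_SM):
    """Upper bound on active threads from the block and thread caps alone."""
    blocks = min(max_blocks, max_threads // block_size)
    return blocks * block_size


if __name__ == "__main__":
    for block_size in (8, 16, 64, 96, 256):
        active = max_active_threads(block_size)
        # Fraction keeps the ratio exact, e.g. 64/768 reduces to 1/12.
        print(block_size, active, Fraction(active, MAX_THREADS_PER_SM))
```

With these two limits alone, a block size of 8 caps out at 64 active threads (1/12), while a block size of 96 or more is needed before the 8-block limit stops being the constraint.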

Yes, once a warp blocks due to I/O (or even register dependency), it will be replaced with another warp from the “active pool” of the same multiprocessor. It doesn’t really matter (performance-wise) whether the newly swapped-in warp belongs to the same threadblock or another one.

The order of blocks is not determined, so your program should not rely on it. Neither is the order of warps within a block. Block size should be a multiple of 16, to take best advantage of coalescing when accessing global memory.
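If your natural thread count is not a multiple of 16, one common approach is to round the block size up on the host side. A minimal sketch in Python (the helper name is made up; 16 is the half-warp size referred to above):

```python
HALF_WARP = 16  # threads per half-warp, the unit relevant to coalescing here


def round_up_block_size(threads_needed, multiple=HALF_WARP):
    """Smallest multiple of `multiple` that is >= threads_needed."""
    return ((threads_needed + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    for n in (8, 16, 100):
        print(n, round_up_block_size(n))
```

Any threads added by the rounding would need a bounds check in the kernel so that they do not touch data past the end of the arrays.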

Paulius