A few performance questions: occupancy, active threads, cta_launched

Hi everyone

I’m optimizing my kernel and have a few questions. I’ve searched the forum for a long time but still can’t find the answers.

  1. Occupancy:
    What does it mean? Does 0.33 occupancy mean that, if memory bandwidth is not the bottleneck, only 1/3 of the multiprocessor’s compute power is used? So with 1.0 occupancy the computation should be 3 times faster? Or does 1.0 occupancy only help with global/local memory latency?
    Let’s say we have a kernel that mainly does computation and only occasionally uses device memory. There should be very little waiting in the threads, so the MP should be fully used. Would that give 1.0 occupancy? If not, maybe “how many threads can execute at the same time on one MP” would be a better question, as it would say how much “real work” an MP can handle.

    I’m asking because I’ve gotten confused reading the forum: some people say that memory-bound kernels don’t benefit from higher occupancy, while others say higher occupancy helps hide memory latency.
    Testing my kernel (which I think is compute bound, but I’m not sure), I get the same times in the profiler for 1.0, 0.66 and 0.33 occupancy. My kernel uses 10 registers, so to lower the occupancy I had to take up shared memory with dummy data (a simplified sketch of that trick is below, after question 2). So why is the execution time the same even though the occupancy is different?

  2. cta_launched
    Starting to worry that my kernel might be running 3 times slower than it should, I began testing different grid sizes. Normally I use over 256 blocks, each with 256 threads. The kernel is small enough that 3 blocks fit on an MP. I’ve got an 8600 GT card based on the G84, so it has 4 MPs. So why is cta_launched 128 and not 256/4 = 64? The same happens with the SDK projects. For example, “BlackScholes” uses a 256-block grid and that gives cta_launched = 128. Am I getting something wrong, or are half of my MPs taking a nap?
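
For reference, here is roughly how I lowered the occupancy for question 1: a dummy __shared__ array in each block. This is a simplified sketch, not my real kernel; PAD_BYTES and the kernel body are placeholders.

```cuda
// Sketch of the dummy-shared-memory trick: an otherwise unused
// __shared__ array makes each block claim more of the 16 KB of shared
// memory per MP, so fewer blocks are resident and occupancy drops.
// With 256-thread blocks: ~6 KB/block -> 2 blocks/MP -> 512/768 = 0.66;
// ~9 KB/block -> 1 block/MP -> 256/768 = 0.33.
#define PAD_BYTES 6000

__global__ void myKernel(float *out)
{
    __shared__ char pad[PAD_BYTES];
    if (threadIdx.x == 0)
        pad[0] = 0;        // touch the array so the compiler keeps it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * i;     // stand-in for the real computation
}
```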

I’m writing an IFS fractal generator and I’m trying to optimize it to the fullest, but I can’t tell whether I’m doing something wrong or hitting the computation or memory bandwidth limit of my card.
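
For context, the core of a chaos-game IFS kernel looks something like the sketch below. This is a generic illustration, not my actual code; the affine maps, the LCG random-number step and all the names are placeholders. Note that atomicAdd() on global memory needs compute capability 1.1, which the G84 has.

```cuda
// Generic chaos-game IFS sketch: each thread iterates a point through
// randomly chosen affine maps and scores hits into a global histogram.
// Mostly arithmetic, with occasional scattered global memory writes.
__constant__ float c_maps[2][6];   // two affine maps: a b c d e f

__global__ void ifsKernel(unsigned int *histo, int width, int height,
                          int iters, unsigned int seed)
{
    unsigned int rng = seed ^ (blockIdx.x * blockDim.x + threadIdx.x);
    float x = 0.0f, y = 0.0f;

    for (int i = 0; i < iters; ++i) {
        rng = 1664525u * rng + 1013904223u;      // LCG step
        const float *m = c_maps[rng >> 31];      // pick one of the maps
        float nx = m[0] * x + m[1] * y + m[2];
        float ny = m[3] * x + m[4] * y + m[5];
        x = nx; y = ny;

        int px = (int)((0.25f * x + 0.5f) * width);
        int py = (int)((0.25f * y + 0.5f) * height);
        if (px >= 0 && px < width && py >= 0 && py < height)
            atomicAdd(&histo[py * width + px], 1u);
    }
}
```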

Occupancy is far less interesting than it sounds. It is a measure of how many threads you have active on a multiprocessor relative to the maximum. The GeForce 8 and 9 series cards can have 768 active threads per multiprocessor. So you would have 100% occupancy if you ran 256 threads per block, and the register and shared memory usage was low enough to permit 3 blocks to run at a time per multiprocessor.

You can fully utilize the multiprocessor with fewer than 768 active threads, but more threads means more opportunities to keep the multiprocessor busy when some of those threads have to wait for global memory reads. Aiming for at least 0.5 occupancy is probably a good goal, but it isn’t going to help all kernels equally. If you are finding it makes no difference in your case, that is not too surprising.
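
To make the arithmetic concrete, here is a rough back-of-the-envelope occupancy estimate as host-side code, using the figures from the question. The per-MP limits are the usual G8x numbers (768 threads, 8 blocks, 8192 registers, 16 KB shared memory); real register allocation has granularity rules, and the shared memory figure per block is a guess, so treat this as an approximation.

```cuda
#include <stdio.h>

/* Rough occupancy estimate for one G8x multiprocessor.
 * The limits are assumed constants, not queried from the device. */
int main(void)
{
    const int maxThreads = 768, maxBlocks = 8;
    const int maxRegs = 8192, maxSmemBytes = 16384;

    const int threadsPerBlock = 256;  /* the poster's configuration  */
    const int regsPerThread   = 10;   /* as reported for the kernel  */
    const int smemPerBlock    = 64;   /* params + statics, a guess   */

    /* Resident blocks are capped by whichever resource runs out first. */
    int blocks = maxThreads / threadsPerBlock;                 /* 3 */
    int byRegs = maxRegs / (threadsPerBlock * regsPerThread);  /* 3 */
    int bySmem = maxSmemBytes / smemPerBlock;
    if (byRegs < blocks)    blocks = byRegs;
    if (bySmem < blocks)    blocks = bySmem;
    if (maxBlocks < blocks) blocks = maxBlocks;

    printf("resident blocks/MP: %d, occupancy: %.2f\n",
           blocks, (double)(blocks * threadsPerBlock) / maxThreads);
    /* prints: resident blocks/MP: 3, occupancy: 1.00 */
    return 0;
}
```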

I have the same problem with the CUDA profiler. Is there any answer to this?

Another problem I have is that the occupancy reported by the profiler is not equal to cta_launched/768… Any thoughts?

The odd thing is that this only happens sometimes, not in all cases!

Any response to this, NVIDIA?