CUDA_PROFILE shows that the occupancy of my kernel is equal to one; however, there are incoherent global loads and stores. Does that mean that even if I avoid the incoherent accesses to global memory, I will not get any performance benefit? Does an occupancy of one mean that the processors are busy all the time and no time is wasted waiting for memory accesses? Thank you.
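No. Occupancy = 1 only means that you have 1024 threads per multiprocessor (compute capability 1.3) available to hide memory latency. It does not mean that no time is wasted waiting for memory, so removing the incoherent accesses can still improve performance.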
I drew a picture of this in the thread;
maybe it is helpful.
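To make the arithmetic behind "occupancy = 1 means 1024 threads" concrete, here is a minimal sketch in Python. It assumes the compute capability 1.3 limits of 32 threads per warp and 32 resident warps per multiprocessor; the function names are made up for illustration and are not part of any CUDA API.

```python
WARP_SIZE = 32         # threads per warp
MAX_WARPS_PER_SM = 32  # resident-warp limit per SM, assumed for compute capability 1.3


def warps_per_block(threads_per_block):
    """Number of warps needed for one block (rounded up to whole warps)."""
    return -(-threads_per_block // WARP_SIZE)


def occupancy(active_warps_per_sm):
    """Ratio of active warps per SM to the maximum possible active warps."""
    return active_warps_per_sm / MAX_WARPS_PER_SM


def resident_threads(occ):
    """Threads resident on one SM at the given occupancy."""
    return int(occ * MAX_WARPS_PER_SM * WARP_SIZE)


print(resident_threads(1.0))            # 1024
print(occupancy(warps_per_block(512)))  # 0.5, one resident block of 512 threads
```

The numbers only say how many threads are resident on the SM; they say nothing about how often those threads are stalled on memory.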
According to the programming guide, occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. But which warps count as active? If some threads within a warp are waiting for data, is the warp considered active or inactive? Thanks.
“Number of active warps per multiprocessor” means how many warps are resident on the SM and can be seen by its warp scheduler.
It is a static value, fixed once the kernel is compiled and the launch configuration is chosen.
Suppose you launch 100 thread blocks on a Tesla C1060 and each block has 512 threads.
Assume the occupancy is 50%, that is, only one thread block is active per SM, or equivalently
only 16 warps are active per SM.
Initially, only 30 of the 100 blocks are scheduled, one on each of the 30 SMs. On each SM the warp scheduler chooses one of the 16 warps
to execute, and then picks the next one in round-robin order. If a warp waits for I/O (for example a global memory access), it is put into a waiting queue,
and the warp scheduler does not choose warps in the waiting queue for execution.
However, during this time the warps in the waiting queue are still counted as “active warps”.
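The scheduling described above can be sketched as a toy simulation of a single SM. This is a deliberately simplified model and not how the hardware is actually implemented: it assumes one instruction is issued per cycle, every instruction is a memory access, and a warp that has issued sits in the waiting queue for a fixed `mem_latency` cycles. The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate_sm(num_warps=16, instrs_per_warp=4, mem_latency=20):
    """Toy round-robin warp scheduler for one SM.

    Returns (issued, idle, max_waiting): instructions issued, cycles in which
    no warp was ready, and the largest size the waiting queue reached.
    """
    ready = deque(range(num_warps))  # warps the scheduler may pick from
    waiting = []                     # (wake_cycle, warp_id), stalled on memory
    remaining = [instrs_per_warp] * num_warps
    cycle = issued = idle = max_waiting = 0

    while ready or waiting:
        # Move warps whose data has arrived back to the ready queue.
        still_waiting = []
        for wake_cycle, warp in waiting:
            if wake_cycle <= cycle:
                ready.append(warp)
            else:
                still_waiting.append((wake_cycle, warp))
        waiting = still_waiting

        if ready:
            warp = ready.popleft()   # round-robin: take the oldest ready warp
            remaining[warp] -= 1
            issued += 1
            if remaining[warp] > 0:  # a finished warp does not wait again
                waiting.append((cycle + mem_latency, warp))
        else:
            idle += 1                # every resident warp is stalled

        max_waiting = max(max_waiting, len(waiting))
        cycle += 1

    return issued, idle, max_waiting


print(simulate_sm(16, 4, 20))  # (64, 12, 16)
print(simulate_sm(32, 4, 20))  # (128, 0, 20)
print(simulate_sm(32, 4, 40))  # (128, 24, 32)
```

In the first run all 16 warps sit in the waiting queue at the same time, yet they are all still resident on the SM, which is why they still count as active warps. In this toy model the last run also shows that 32 warps (occupancy 1) can still leave idle cycles when the latency is long enough.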