I am learning the producer-consumer pattern in CUDA, and I noticed this handshake (| marks a producer step, || marks a consumer step):
|wait for buffer to be ready to be filled
||signal buffer is ready to be filled
|produce data and fill the buffer
|signal buffer is filled
||wait for buffer to be filled
||consume data in filled buffer
So when a consumer warp has nothing to do except wait, and given that the occupancy (the number of warps that can be active at the same time) is fixed, will this consumer still take up one “active warp slot”? Or will it be idle and let another warp become active?
By the way, for matmul, the latest version of CUTLASS uses a producer-consumer structure for double-buffered loading. Why? Previous versions did not need this… the loading and calculation would implicitly overlap each other…
In my view, a warp slot corresponds to the specification item:
Maximum number of resident warps per SM
in this table in the programming guide.
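As a rough sketch, the same limit can also be derived at runtime from the device properties, assuming that the number of warp slots equals `maxThreadsPerMultiProcessor / warpSize`. The helper name `max_resident_warps_per_sm` is made up for illustration and is not part of the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the maximum number of resident warps per SM
// for `device`, or -1 if the device properties cannot be queried.
int max_resident_warps_per_sm(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    // Warp slots per SM = resident-thread limit divided by the warp size.
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;
}

int main() {
    int warps = max_resident_warps_per_sm(0);
    if (warps < 0) {
        printf("no usable CUDA device\n");
        return 1;
    }
    printf("max resident warps per SM: %d\n", warps);
    return 0;
}
```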
In that sense, each warp in a typical warp-specialized producer-consumer arrangement takes up a warp slot: the block it belongs to has been scheduled on an SM, so the warp occupies a slot.
IMO, the question of whether the warp will be active (i.e. have instructions that can be scheduled by the warp scheduler it is assigned to) or idle (have no instructions that can be scheduled) while it is waiting to consume work can only be answered with a code example.
However, if we use the example here, the consumer warps are the ones that executed bar.cta.sync and are waiting at a numbered barrier. Until the producer warps signal that data is available to consume, they are not released, so they would not have instructions that can be scheduled by the warp scheduler they are assigned to.
But they are considered for occupancy. They do count as occupying warp slots on the SM or SMSP.
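To make this concrete, here is a minimal sketch of a warp-specialized handshake built on numbered barriers, followed by an occupancy query. It is an illustration under assumptions, not the CUTLASS implementation: the 1-producer/4-consumer warp split, the barrier ids, the single shared buffer, and the helper names (`bar_sync`, `bar_arrive`, `pipeline`, `run_pipeline_demo`) are all made up here. The barriers are issued through inline PTX (`bar.sync` and `bar.arrive`).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kProducerThreads = 32;    // one producer warp (assumed split)
constexpr int kConsumerThreads = 128;   // four consumer warps (assumed split)
constexpr int kBlockThreads = kProducerThreads + kConsumerThreads;
constexpr int kTile = kConsumerThreads; // one element per consumer thread
constexpr int kFullBar = 1;             // barrier 0 is used by __syncthreads()
constexpr int kEmptyBar = 2;

// Wait at numbered barrier `id` until `count` threads have arrived.
__device__ void bar_sync(int id, int count) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(count) : "memory");
}
// Signal arrival at numbered barrier `id` without waiting.
__device__ void bar_arrive(int id, int count) {
    asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(count) : "memory");
}

__global__ void pipeline(const float* in, float* out, int tiles) {
    __shared__ float buf[kTile];
    const bool producer = threadIdx.x < kProducerThreads;  // uniform per warp
    for (int t = 0; t < tiles; ++t) {
        if (producer) {
            bar_sync(kEmptyBar, kBlockThreads);    // wait: buffer may be filled
            for (int i = threadIdx.x; i < kTile; i += kProducerThreads)
                buf[i] = in[t * kTile + i];        // produce
            __threadfence_block();                 // order writes before signal
            bar_arrive(kFullBar, kBlockThreads);   // signal: buffer is filled
        } else {
            const int c = threadIdx.x - kProducerThreads;
            bar_arrive(kEmptyBar, kBlockThreads);  // signal: buffer may be filled
            bar_sync(kFullBar, kBlockThreads);     // wait here until data arrives
            out[t * kTile + c] = 2.0f * buf[c];    // consume
        }
    }
}

// Runs the kernel on small test data. Returns the number of mismatching
// elements, or -1 on a CUDA error.
int run_pipeline_demo() {
    const int tiles = 8, n = tiles * kTile;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    pipeline<<<1, kBlockThreads>>>(d_in, d_out, tiles);
    cudaError_t launch = cudaGetLastError();
    cudaError_t copy = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                  cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch != cudaSuccess || copy != cudaSuccess) return -1;
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++bad;
    return bad;
}

int main() {
    // The occupancy calculation sees only the block size and resources, so
    // the producer warp is counted whether or not it is doing useful work.
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, pipeline,
                                                  kBlockThreads, 0);
    printf("resident warps per SM: %d, of which producer warps: %d\n",
           blocks * kBlockThreads / 32, blocks * kProducerThreads / 32);
    printf("mismatches: %d\n", run_pipeline_demo());
    return 0;
}
```

A consumer warp parked at `bar_sync(kFullBar, ...)` has nothing to issue until the producer arrives, yet it is still part of the `kBlockThreads` that the occupancy query accounts for.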
In that case, dedicating one warp as the producer will decrease the number of warps doing real work, right?
The benefit here is just that we can use TMA block…