How does `BAR.SYNC.DEFER_BLOCKING` get generated?

Hi All,

I’m working with cutlass and when I compile I see the BAR.SYNC.DEFER_BLOCKING SASS instruction. I don’t even expect to find out what the real meaning of this is, if anyone knows it would be really nice :)

My main question is, when does nvcc generate this? I always see bar.sync in PTX code. But sometimes it generates BAR.SYNC.DEFER_BLOCKING, sometimes it generates BAR.SYNC. Does anyone know how does that switch work?

I’m asking because according to my use of cutlass, this instruction is generated or not. If I understand what’s going on, it would be great.

Thanks in advance

Do you see any bar.arrive instructions in the corresponding PTX code? This is outside my area of expertise, but by casual observation, I think this is where BAR.SYNC.DEFER_BLOCKING may originate, and each instance appears to be paired with a BAR.SYNC elsewhere in the code.

For an explanation of how bar.arrive and bar.sync work in pairs, see the relevant section of the PTX manual.

I don’t see any bar.arrive in the ptx. I see the exact same bar.sync in both versions. The pattern is actually very similar

	cp.async.cg.shared.global.L2::128B [%r261], [%rd47], 16, %r262;

	// end inline asm
	// begin inline asm
	cp.async.commit_group;

	// end inline asm
	add.s32 	%r1845, %r492, -2;
	// begin inline asm
	cp.async.wait_group 1;

	// end inline asm
	bar.sync 	0;
	add.s32 	%r567, %r6, %r370;
	shl.b32 	%r568, %r567, 4;
	add.s32 	%r267, %r379, %r568;
	// begin inline asm
	ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%r263, %r264, %r265, %r266}, [%r267];

This is my kernel, I have a device function that use cutlass for gemm. For this kernel, nvcc generates BAR.SYNC.DEFER_BLOCKING.

__global__ void mykernel(float *lhs, float *rhs, float *res, int M, int N, int K, cutlass::gemm::GemmCoord problem_size) {
   compute_gemm_with_cutlass(lhs,rhs, res, problem_size);
}

But I want to calculate problem_size in the device. I change my kernel like below. This is the only difference between this kernel and the one that above. Now nvcc does not generate BAR.SYNC.DEFER_BLOCKING but does BAR.SYNC.

__global__ void mykernel(float *lhs, float *rhs, float *res, int M, int N, int K) {
   cutlass::gemm::GemmCoord problem_size(M,N,K);
   compute_gemm_with_cutlass(lhs, rhs, res, problem_size);
}

I mix the kernels. Now nvcc generates BAR.SYNC.DEFER_BLOCKING.

__global__ void mykernel(float *lhs, float *rhs, float *res, int M, int N, int K, cutlass::gemm::GemmCoord dummy) {
   cutlass::gemm::GemmCoord problem_size(M,N,K);
   compute_gemm_with_cutlass(lhs, rhs, res, problem_size);
}

I am totally lost.

As I said, not my area of expertise. Why does it matter whether nvcc generates bar.sync or bar.sync.defer_blocking?

Thank you very much anyway for digging in this for me. I can understand this is not CUDA, and requires what happens in the compiler.

Well, it is faster :) This instruction looks like the only major difference between my two program. defer_blocking version is typically 8-12% faster than bar.sync

Based on the observation presented it seems premature to conclude that there is a causal relationship between the occurrence of bar.sync.defer_blocking and the performance of the kernel. That may be the case, and the naming might suggest it. It is however also possible that both effects are due to an as of yet undiscovered third factor, i.e. correlation alone does not imply causation.

If I had to guess, the choice between the two machine instruction versions is likely made internally by ptxas and therefore not exposed at PTX level, and use of the defer_blocking version requires a particular combination of unknown code properties that may also allow other transformations conducive to higher performance besides use of bar.sync.defer_blocking.

Side-stepping internals (that NVIDIA is historically tight-lipped about), I wonder whether it would make sense to file an enhancement request against CUTLASS, pointing out the performance difference. Because ultimately that is the actual goal, independently of how this achieved under the hood, correct?