Can't get barrier.sync to diverge

I’m trying to do “sync all threads in block with same x”, like this

#include<stdio.h>
__global__ void ker() {
	asm("barrier.cta.sync %0,%1;"::"r"(threadIdx.x), "r"(blockDim.y*32));
}
int main() {
	ker <<< 1, dim3(2, 2, 32)>>>();
	cudaDeviceSynchronize();
	printf("%s", cudaGetErrorString(cudaGetLastError()));
}

But it keeps reporting an illegal instruction was encountered unless all instructions converge (i.e. blockDim.x=1).

According to doc, since I didn’t use the acronym bar., I didn’t put .aligned, and I’m not targeting sm_6x or below, it should allow diverge? I don’t understand what’s wrong

I don’t think it is valid to specify different barriers for the same warp.

barrier{.cta} instruction causes executing thread to wait for all non-exited threads from its warp and marks warps’ arrival at barrier.

As I understand this sentence, all active threads in a warp needs to arrive at a barrier. But your usage specifies different barriers for the same warp

1 Like

Thanks for the reply! As far as I understand, wrap is a continuous 32-thread, so if I have 32 threads in blockDim.z, all blockDim.x * blockDim.y combinations should all belong to different wrap?

A warp are continuous 32 threads, correct. But the leading dimension for thread index is x-dimension, not z-dimension.

These threads would be the first warp

x 0, y 0, z 0, warpId 0
x 1, y 0, z 0, warpId 0
x 0, y 1, z 0, warpId 0
x 1, y 1, z 0, warpId 0
x 0, y 0, z 1, warpId 0
x 1, y 0, z 1, warpId 0
x 0, y 1, z 1, warpId 0
x 1, y 1, z 1, warpId 0
x 0, y 0, z 2, warpId 0
x 1, y 0, z 2, warpId 0
x 0, y 1, z 2, warpId 0
x 1, y 1, z 2, warpId 0
x 0, y 0, z 3, warpId 0
x 1, y 0, z 3, warpId 0
x 0, y 1, z 3, warpId 0
x 1, y 1, z 3, warpId 0
x 0, y 0, z 4, warpId 0
x 1, y 0, z 4, warpId 0
x 0, y 1, z 4, warpId 0
x 1, y 1, z 4, warpId 0
x 0, y 0, z 5, warpId 0
x 1, y 0, z 5, warpId 0
x 0, y 1, z 5, warpId 0
x 1, y 1, z 5, warpId 0
x 0, y 0, z 6, warpId 0
x 1, y 0, z 6, warpId 0
x 0, y 1, z 6, warpId 0
x 1, y 1, z 6, warpId 0
x 0, y 0, z 7, warpId 0
x 1, y 0, z 7, warpId 0
x 0, y 1, z 7, warpId 0
x 1, y 1, z 7, warpId 0
#include <cstdio>

__global__
void kernel(){
        unsigned int warpId;
        asm ("mov.u32 %0, %%warpid;" : "=r"(warpId) );
        printf("x %d, y %d, z %d, warpId %d\n", threadIdx.x, threadIdx.y, threadIdx.z, warpId);
}

int main(){
        kernel<<<1,dim3(2,2,32)>>>();
        cudaDeviceSynchronize();
}
1 Like

wait what, so I always thought that “dim3 to actual thread id conversion” is like flattening array[x][y][z]. So it’s actually array[z][y][x]?

1 Like

Yes. See for example CUDA C++ Programming Guide

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy) , the thread ID of a thread of index (x, y) is (x + y Dx) ; for a three-dimensional block of size (Dx, Dy, Dz) , the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy) .

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.