Read bank conflict from SASS or PTX files

I have a simple kernel when profiled using Nsight Compute shows bank conflict, which is correct. I have generated the SASS file from it and want to know if there is any way to detect bank conflicts based on the SASS/PTX files.

The kernel and code segment is shown:

struct s32 {
    int a; int b; int c; int d; int e; int f; int g; int h;
};

#define TESTSIZE    1024

template <typename T>
__global__ void test_kernel_conflict(T* d)
{
    __shared__ T s[TESTSIZE];

    if (threadIdx.x < TESTSIZE / 2)
        s[threadIdx.x * 2] = d[threadIdx.x]; // bank conflict
    __syncthreads();
    if (threadIdx.x >= TESTSIZE / 2)
        d[threadIdx.x] = s[threadIdx.x * 2 % TESTSIZE]; // bank conflict
}

The full code can be found here.

and the SASS representation:

	.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM75 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM52)"
	.elftype	@"ET_EXEC"


//--------------------- .text._Z20test_kernel_conflictI3s36EvPT_ --------------------------
	.section	.text._Z20test_kernel_conflictI3s36EvPT_,"ax",@progbits
	.sectionflags	@"SHF_BARRIERS=1"
	.sectioninfo	@"SHI_REGISTERS=17"
	.align	128
        .global         _Z20test_kernel_conflictI3s36EvPT_
        .type           _Z20test_kernel_conflictI3s36EvPT_,@function
        .size           _Z20test_kernel_conflictI3s36EvPT_,(.L_x_12 - _Z20test_kernel_conflictI3s36EvPT_)
        .other          _Z20test_kernel_conflictI3s36EvPT_,@"STO_CUDA_ENTRY STV_DEFAULT"
_Z20test_kernel_conflictI3s36EvPT_:
.text._Z20test_kernel_conflictI3s36EvPT_:
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 61
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 65
        /*0010*/                   S2R R14, SR_TID.X ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 69
        /*0020*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x24 ;
        /*0030*/                   BMOV.32.CLEAR RZ, B0 ;
        /*0040*/                   BSSY B0, `(.L_x_0) ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 65
        /*0050*/                   ISETP.GT.U32.AND P0, PT, R14.reuse, 0x1ff, PT ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 69
        /*0060*/                   IMAD.WIDE.U32 R2, R14.reuse, R3, c[0x0][0x160] ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 68
        /*0070*/                   ISETP.GE.U32.AND P1, PT, R14, 0x200, PT ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 65
        /*0080*/               @P0 BRA `(.L_x_1) ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 66
        /*0090*/                   LDG.E.SYS R4, [R2] ;
        /*00a0*/                   LDG.E.SYS R5, [R2+0x4] ;
        /*00b0*/                   LDG.E.SYS R6, [R2+0x8] ;
        /*00c0*/                   LDG.E.SYS R7, [R2+0xc] ;
        /*00d0*/                   LDG.E.SYS R8, [R2+0x10] ;
        /*00e0*/                   LDG.E.SYS R9, [R2+0x14] ;
        /*00f0*/                   LDG.E.SYS R10, [R2+0x18] ;
        /*0100*/                   LDG.E.SYS R11, [R2+0x1c] ;
        /*0110*/                   LDG.E.SYS R12, [R2+0x20] ;
        /*0120*/                   SHF.L.U32 R0, R14, 0x1, RZ ;
        /*0130*/                   IMAD R13, R0, 0x24, RZ ;
        /*0140*/                   STS.64 [R13], R4 ;
        /*0150*/                   STS.64 [R13+0x8], R6 ;
        /*0160*/                   STS.64 [R13+0x10], R8 ;
        /*0170*/                   STS.64 [R13+0x18], R10 ;
        /*0180*/                   STS [R13+0x20], R12 ;
.L_x_1:
        /*0190*/                   BSYNC B0 ;
.L_x_0:
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 67
        /*01a0*/                   BAR.SYNC 0x0 ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 68
        /*01b0*/              @!P1 EXIT ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 69
        /*01c0*/                   SHF.L.U32 R0, R14, 0x1, RZ ;
        /*01d0*/                   LOP3.LUT R0, R0, 0x3fe, RZ, 0xc0, !PT ;
        /*01e0*/                   IMAD R0, R0, 0x24, RZ ;
        /*01f0*/                   LDS.U.64 R4, [R0] ;
        /*0200*/                   LDS.U.64 R6, [R0+0x8] ;
        /*0210*/                   LDS.U.64 R8, [R0+0x10] ;
        /*0220*/                   LDS.U.64 R10, [R0+0x18] ;
        /*0230*/                   LDS.U R13, [R0+0x20] ;
        /*0240*/                   STG.E.SYS [R2], R4 ;
        /*0250*/                   STG.E.SYS [R2+0x4], R5 ;
        /*0260*/                   STG.E.SYS [R2+0x8], R6 ;
        /*0270*/                   STG.E.SYS [R2+0xc], R7 ;
        /*0280*/                   STG.E.SYS [R2+0x10], R8 ;
        /*0290*/                   STG.E.SYS [R2+0x14], R9 ;
        /*02a0*/                   STG.E.SYS [R2+0x18], R10 ;
        /*02b0*/                   STG.E.SYS [R2+0x1c], R11 ;
        /*02c0*/                   STG.E.SYS [R2+0x20], R13 ;
	//## File "/home/sen/Documents/Development/hpctoolkit/sen_examples/Bank_conflict/bank_conflict.cu", line 70
        /*02d0*/                   EXIT ;
.L_x_2:
        /*02e0*/                   BRA `(.L_x_2);
        /*02f0*/                   NOP;
.L_x_12:

I would want to know is by looking at the SASS file, is there a way to know about the read pattern to the shared memory that is causing this bank conflict.

You can look at the Source Counters section under “Details” page.
In case of bank conflicts you should see a table for the metric “L1 Wavefronts Shared Excessive” which is the indicator for shared memory bank conflicts. It lists the source lines with the highest values for the metric.

Click on one of the source lines to view the kernel source at which the bottleneck occurs in the “Source” page. You can identify the SASS instructions which have high values for the “L1 Wavefronts Shared Excessive” metric.

Thanks. I can spot the line number from Nsight Compute. However my question was:

I would want to know by looking at the SASS file, is there a way to know about the read pattern to the shared memory that is causing this bank conflict.

How does the SASS file show which thread is accessing which particular bank?

For shared memory, the banks are arranged linearly in shared memory, that is the first 32 bits belong to bank 0, the 2nd 32 bits belong to bank 1, and so on, up to the 31st 32 bits belonging to bank 31. Then the 32nd 32 bits belong to bank 0 again, and the order repeats.

The ordering is like this:

bank:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
loc 0:  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y
loc 32: x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y  x  y
loc 64: y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y
loc 96: y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y  y

What we have done is envisioned shared memory as a 2D array of data, where the width of the array is thirty-two 32-bit quantities. Each x or y corresponds to a 32-bit quantity. When viewed this way, each bank corresponds to a column.

The reason I have used both x and y is to visualize the addressing pattern created by this:

    s[threadIdx.x * 2] =

considered for the first warp. The first warp has threads whose threadIdx.x values range from 0 to 31 (for a typical 1D grid launch). When we multiply threadIdx.x by 2, then the index value across the warp consists of the indices 0,2,4,6,8,…,62

And that set of indices corresponds exactly to the x locations in the 2D view of shared memory that I have created. Because of that, we see that for that warp each column or “bank” in shared memory that is being accessed by that warp has two locations marked with an x. Because of that, we know that this will result in 2-way bank conflicts for the first warp.

You can repeat the process for any other warp. You could also do something similar by looking at the SASS or PTX. Considering the SASS, the equivalent shared store operation is a line like this:

    /*0140*/                   STS.64 [R13], R4 ;

You would then have to observe how the R13 register is calculated. The final calculation step is here:

    /*0130*/                   IMAD R13, R0, 0x24, RZ ;

But you would have to trace each of the source registers/operands for that operation (e.g. R0) and build up the complete calculation based on that. Its far easier to do this type of work looking at the C++ source than it is looking at the SASS.

(I believe the SASS you have shown has multiple STS operations due to compiler loop unrolling. because you are doing a structure load. But that is just a guess, it is impossible to tell from the C++ source code you have posted here in the question. )

Thank you for the detailed response. Can I know how many-way bank conflict it is from Nsight Compute? I can’t see a metric called n-way bank conflict or similar in the Memory Workload Analysis section

probably best to ask that question on the nsight compute forum