Problem with a bank conflict test

I am testing bank conflicts on my A800 GPU; the kernel code is as follows:

__global__ void setRowReadRow(int *out)
{
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    __shared__ int shmem[32][32];   // 32x32 ints, row-major

    for (unsigned int m = 0; m < 1; m++) {
        out[idx] *= shmem[idx][m];  // for a fixed m, every thread reads from bank m
    }
}

The block size is 32 and the grid size is 1. I used the following ncu command to analyze the results:

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./bank_conflict 1

Here is the result:

==PROF== Connected to process 449606 (/home/xxx/bank_conflict )
==PROF== Profiling "setRowReadRow(int *)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 449606
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[449606] shared_mem_config@127.0.0.1
  setRowReadRow(int *) (1, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: Command line profiler metrics
    -------------------------------------------------------- ----------- ------------
    Metric Name                                              Metric Unit Metric Value
    -------------------------------------------------------- ----------- ------------
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                       31
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg                           0.29
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max                             31
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min                              0
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                             31
    -------------------------------------------------------- ----------- ------------

It turned out just as I expected: there are 32 banks in shared memory, all 32 threads load data from the same bank, and the loop count is 1, so the bank conflict count is 31.
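
For reference, a small host-side sketch (my own illustration, assuming the usual 4-byte words spread across 32 banks) confirms that for this access pattern every thread in the warp maps to the same bank m:

#include <cstdio>

int main() {
    // shmem[idx][m] with a 32-int row stride: 4-byte word index = idx*32 + m
    for (unsigned m = 0; m < 2; ++m) {
        printf("m = %u:", m);
        for (unsigned idx = 0; idx < 32; ++idx) {
            unsigned word = idx * 32 + m;   // word index inside the shared array
            printf(" %u", word % 32);       // bank = word index mod 32 -> always m
        }
        printf("\n");
    }
    return 0;
}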

But when I changed the loop count to 2, the result was as follows:

==PROF== Connected to process 454331 (/home/heyuanhong/shared_mem_config)
==PROF== Profiling "setRowReadRow(int *)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 454331
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[454331] shared_mem_config@127.0.0.1
  setRowReadRow(int *) (1, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: Command line profiler metrics
    -------------------------------------------------------- ----------- ------------
    Metric Name                                              Metric Unit Metric Value
    -------------------------------------------------------- ----------- ------------
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                       30
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.avg                           0.28
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.max                             30
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.min                              0
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                             30
    -------------------------------------------------------- ----------- ------------

The bank conflict count is 30, and I am puzzled by the result. When m = 0, 32 threads load data from bank 0, which should lead to 31 bank conflicts; when m = 1, 32 threads load data from bank 1, which should also lead to 31 bank conflicts.

In my understanding there should be 62 bank conflicts, so why does the ncu result show only 30?

Using godbolt.org with NVCC 12.3.1 (-arch sm_80 -lineinfo -O3) shows that the compiler switches to wider LDS instructions as the loop count increases. Additional bank conflicts will not be observed until multiple LDS instructions are required (m < 5).

The Nsight Compute Shared Memory Table and Source Table will clearly show the number of instructions executed. When trying to understand metrics, it is common to have to look at the disassembly (SASS).

m < 1

setRowReadRow(int*):
 MOV R1, c[0x0][0x28] 
 S2R R0, SR_TID.Y 
 HFMA2.MMA R5, -RZ, RZ, 0, 2.384185791015625e-07 
 ULDC.64 UR4, c[0x0][0x118] 
 S2R R3, SR_TID.X 
 IMAD R0, R0, c[0x0][0x0], R3 
 IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x160] 
 LDG.E R5, [R2.64] 
 SHF.L.U32 R0, R0, 0x7, RZ 
 LDS R0, [R0] 
 IMAD R5, R0, R5, RZ 
 STG.E [R2.64], R5 
 EXIT

m < 2

setRowReadRow(int*):
 MOV R1, c[0x0][0x28] 
 S2R R0, SR_TID.Y 
 HFMA2.MMA R5, -RZ, RZ, 0, 2.384185791015625e-07 
 ULDC.64 UR4, c[0x0][0x118] 
 S2R R3, SR_TID.X 
 IMAD R0, R0, c[0x0][0x0], R3 
 IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x160] 
 LDG.E R4, [R2.64] 
 SHF.L.U32 R0, R0, 0x7, RZ 
 LDS.64 R6, [R0]                                <<<< switched from LDS.32 to LDS.64
 IMAD R4, R4, R6, RZ 
 IMAD R7, R4, R7, RZ 
 STG.E [R2.64], R7 
 EXIT

m < 16

setRowReadRow(int*):
 MOV R1, c[0x0][0x28] 
 S2R R0, SR_TID.Y 
 HFMA2.MMA R5, -RZ, RZ, 0, 2.384185791015625e-07 
 ULDC.64 UR4, c[0x0][0x118] 
 S2R R3, SR_TID.X 
 IMAD R0, R0, c[0x0][0x0], R3 
 IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x160] 
 LDG.E R17, [R2.64] 
 SHF.L.U32 R0, R0, 0x7, RZ 
 LDS.128 R4, [R0] 
 LDS.128 R8, [R0+0x10] 
 LDS.128 R12, [R0+0x20] 
 IMAD R4, R17, R4, RZ 
 LDS.128 R16, [R0+0x30] 
 IMAD R5, R4, R5, RZ 
 IMAD R6, R5, R6, RZ 
 IMAD R7, R6, R7, RZ 
 IMAD R8, R7, R8, RZ 
 IMAD R9, R8, R9, RZ 
 IMAD R10, R9, R10, RZ 
 IMAD R11, R10, R11, RZ 
 IMAD R12, R11, R12, RZ 
 IMAD R13, R12, R13, RZ 
 IMAD R14, R13, R14, RZ 
 IMAD R15, R14, R15, RZ 
 IMAD R16, R15, R16, RZ 
 IMAD R17, R16, R17, RZ 
 IMAD R18, R17, R18, RZ 
 IMAD R19, R18, R19, RZ 
 STG.E [R2.64], R19 
 EXIT
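
One way to keep a separate LDS.32 per iteration (so the per-iteration conflicts stay visible) is to stop the compiler from combining the loads. A minimal sketch, assuming a volatile-qualified shared array prevents the 32-bit accesses from being merged into LDS.64/LDS.128:

__global__ void setRowReadRowVolatile(int *out)
{
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // volatile: each read must remain a separate memory instruction,
    // so the compiler should not widen two loads into one LDS.64
    __shared__ volatile int shmem[32][32];

    for (unsigned int m = 0; m < 2; m++) {
        out[idx] *= shmem[idx][m];   // expected: one LDS.32 per iteration
    }
}

With two independent LDS.32 loads, each iteration is its own request, and the counter should then report 31 + 31 = 62 bank conflicts for m < 2.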

@Greg Thanks for your answer, but I still don't understand why the bank conflict count is 30 when m < 2.

After the load instruction switched from LDS.32 to LDS.64, why isn't the bank conflict count 31, the same as for m < 1?

For LDS.64 the load request is split into 2 groups of wavefronts:

  • Group 0 - Threads 0 - 15
  • Group 1 - Threads 16 - 31

Group 0 will result in 1 wavefront (conflict free) + 15 bank conflict wavefronts.
Group 1 will result in 1 wavefront (conflict free) + 15 bank conflict wavefronts.

  • 1 request
  • 32 wavefronts
  • 30 bank conflicts

The Nsight Compute Source View columns L1 Wavefronts Shared Excessive, L1 Wavefronts Shared, and L1 Wavefronts Shared Ideal show the following:

  • 30 L1 Wavefronts Shared Excessive
  • 32 L1 Wavefronts Shared
  • 2 L1 Wavefronts Shared Ideal

NCU shows that the ideal for an LDS.64 is 2 wavefronts, not 1 wavefront. Shared Excessive is equivalent to bank conflicts.

For LDS.128 the load request is split into 4 groups of wavefronts in the same fashion.

  • 1 request
  • 32 wavefronts (ideal = 4)
  • 28 bank conflicts
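
To make the wavefront accounting concrete, here is a small host-side model (a simplification of my own, not the exact L1TEX algorithm): within each group, the number of wavefronts is taken as the largest number of distinct 4-byte words that any single bank has to deliver.

#include <algorithm>
#include <cstdio>
#include <set>

// Wavefronts needed by one group of threads that each load bytesPerThread
// contiguous bytes starting at &shmem[tid][0] (row stride of 32 ints).
int wavefrontsForGroup(int firstThread, int numThreads, int bytesPerThread) {
    std::set<unsigned> wordsPerBank[32];            // distinct 4-byte words per bank
    for (int t = firstThread; t < firstThread + numThreads; ++t) {
        unsigned byteAddr = t * 32u * 4u;           // &shmem[t][0]
        for (int b = 0; b < bytesPerThread; b += 4) {
            unsigned word = (byteAddr + b) / 4;     // 4-byte word index
            wordsPerBank[word % 32].insert(word);   // bank = word index mod 32
        }
    }
    size_t wavefronts = 0;
    for (int bank = 0; bank < 32; ++bank)
        wavefronts = std::max(wavefronts, wordsPerBank[bank].size());
    return (int)wavefronts;
}

int main() {
    // LDS.64: 8 bytes per thread, 2 groups of 16 threads.
    int lds64 = wavefrontsForGroup(0, 16, 8) + wavefrontsForGroup(16, 16, 8);
    printf("LDS.64 : %d wavefronts, ideal 2, excessive %d\n", lds64, lds64 - 2);

    // LDS.128: 16 bytes per thread, 4 groups of 8 threads.
    int lds128 = 0;
    for (int g = 0; g < 4; ++g) lds128 += wavefrontsForGroup(g * 8, 8, 16);
    printf("LDS.128: %d wavefronts, ideal 4, excessive %d\n", lds128, lds128 - 4);
    return 0;
}

This reproduces the numbers above: 32 wavefronts and 30 excessive (bank conflicts) for LDS.64, and 32 wavefronts and 28 excessive for LDS.128.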

@Greg Sorry, I know the concepts of warps and threads, but this is the first time I've heard of wavefronts and wavefronts shared excessive.
I'm not sure I understand your answer; where should I learn about this?

The Nsight Kernel Profiling Guide Metrics Decoder contains the definitions and descriptions listed below. For a walk-through of the concepts and metrics, please see the GTC 2022 recording How to Understand and Optimize Shared Memory Accesses using Nsight Compute for details on Nsight Compute metrics for shared memory.

Copied from Nsight Kernel Profiling Guide

  • instruction: An assembly (SASS) instruction. Each executed instruction may generate zero or more requests.
  • request: A command into a HW unit to perform some action, e.g. load data from some memory location. Each request accesses one or more sectors.
  • sector: Aligned 32 byte-chunk of memory in a cache line or device memory. An L1 or L2 cache line is four sectors, i.e. 128 bytes. Sector accesses are classified as hits if the tag is present and the sector-data is present within the cache line. Tag-misses and tag-hit-data-misses are all classified as misses.
  • tag: Unique key to a cache line. A request may look up multiple tags, if the thread addresses do not all fall within a single cache line-aligned region. The L1 and L2 both have 128 byte cache lines. Tag accesses may be classified as hits or misses.
  • wavefront: Unique “work package” generated at the end of the processing stage for requests. All work items of a wavefront are processed in parallel, while work items of different wavefronts are serialized and processed on different cycles. At least one wavefront is generated for each request.

A simplified model for the processing in L1TEX for Volta and newer architectures can be described as follows: When an SM executes a global or local memory instruction for a warp, a single request is sent to L1TEX. This request communicates the information for all participating threads of this warp (up to 32). For local and global memory, based on the access pattern and the participating threads, the request requires to access a number of cache lines, and sectors within these cache lines. The L1TEX unit has internally multiple processing stages operating in a pipeline.

A wavefront is the maximum unit that can pass through that pipeline stage per cycle. If not all cache lines or sectors can be accessed in a single wavefront, multiple wavefronts are created and sent for processing one by one, i.e. in a serialized manner. Limitations of the work within a wavefront may include the need for a consistent memory space, a maximum number of cache lines that can be accessed, as well as various other reasons. Each wavefront then flows through the L1TEX pipeline and fetches the sectors handled in that wavefront. The given relationships of the three key values in this model are requests:sectors is 1:N, wavefronts:sectors 1:N, and requests:wavefronts is 1:N.

A wavefront is described as a (work) package that can be processed at once, i.e. there is a notion of processing one wavefront per cycle in L1TEX. Wavefronts therefore represent the number of cycles required to process the requests, while the number of sectors per request is a property of the access pattern of the memory instruction for all participating threads. For example, it is possible to have a memory instruction that requires 4 sectors per request in 1 wavefront. However, you can also have a memory instruction having 4 sectors per request, but requiring 2 or more wavefronts.
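
As a concrete illustration of the requests-to-sectors relationship for global loads (my own example, using the 32-byte sectors defined above), the number of sectors one warp-level request touches depends only on the access pattern:

#include <cstdio>
#include <set>

// Sectors touched by one warp-level global load request, 4 bytes per thread,
// when consecutive lanes read int elements separated by strideElems.
int sectorsPerRequest(int strideElems) {
    std::set<unsigned> sectors;
    for (int lane = 0; lane < 32; ++lane)
        sectors.insert((lane * strideElems * 4u) / 32);   // 32-byte sector index
    return (int)sectors.size();
}

int main() {
    printf("stride 1: %d sectors/request\n", sectorsPerRequest(1));  // 4  (fully coalesced)
    printf("stride 8: %d sectors/request\n", sectorsPerRequest(8));  // 32 (one sector per lane)
    return 0;
}

Both cases are still a single request; what changes is how many sectors (and potentially wavefronts) that request needs.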
