Hello,

thank you in advance for your support.

I have been working with CUDA and I came across unexpected behaviour regarding memory coalescing.

In short, my CUDA kernel receives two parameters: (1) an array of `unsigned long long int` and (2) a char sequence.

Each thread must process a contiguous segment of 32 chars, offset by 1 char per thread. That is, thread 0 processes chars 0-31, thread 1 chars 1-32, and so on.

The thing is, I was expecting coalesced access: on each loop iteration, the threads of a warp access consecutive chars (one byte per thread, 32 threads per warp).

My kernel (the code does not do anything meaningful at the moment, because I have stripped it down to try to pinpoint the problem):

```cpp
__global__ void kernel_index_global_any_simplest(unsigned long long int *table, const char *sequence) {
    unsigned long long int k, hash = 0;
    char c;
    for (k = 0; k < 32; k++) {
        c = sequence[threadIdx.x + k + blockIdx.x * blockDim.x];
        hash += c;
    }
    table[threadIdx.x + blockIdx.x * blockDim.x] = hash;
}
```

However, when I run `nvprof -f --analysis-metrics --export-profile`, load the result into the NVIDIA Visual Profiler (version 10), and run the global memory access pattern analysis, it reports that 2 transactions are performed instead of 1. As far as I know, a transaction can cover up to 4 bytes per thread x 32 threads = 128 bytes. In this case I am only requesting 32 contiguous bytes, which I expected to be a single transaction, since a whole 128-byte line can be fetched at once.

I am also posting the disassembly:

```
_Z32kernel_index_global_any_simplestPyPKc:
MOV R1, c[0x0][0x20] ;
S2R R5, SR_CTAID.X ;
S2R R0, SR_TID.X ;
XMAD R2, R5.reuse, c[0x0] [0x8], RZ ;
XMAD.MRG R3, R5.reuse, c[0x0] [0x8].H1, RZ ;
XMAD.PSL.CBCC R5, R5.H1, R3.H1, R2 ;
IADD3 R2.CC, R0, c[0x0][0x148], R5 ;
IADD3.X R3, RZ, c[0x0][0x14c], RZ ;
LDG.E.S8 R17, [R2] ;
LDG.E.S8 R18, [R2+0x1] ;
LDG.E.S8 R20, [R2+0x2] ;
LDG.E.S8 R19, [R2+0x3] ;
LDG.E.S8 R21, [R2+0x4] ;
LDG.E.S8 R23, [R2+0x5] ;
LDG.E.S8 R24, [R2+0x6] ;
LDG.E.S8 R25, [R2+0x7] ;
LDG.E.S8 R26, [R2+0x8] ;
LDG.E.S8 R4, [R2+0x9] ;
LDG.E.S8 R6, [R2+0xa] ;
LDG.E.S8 R7, [R2+0xb] ;
LDG.E.S8 R8, [R2+0xc] ;
LDG.E.S8 R9, [R2+0xd] ;
LDG.E.S8 R10, [R2+0xe] ;
LDG.E.S8 R11, [R2+0xf] ;
LDG.E.S8 R12, [R2+0x10] ;
LDG.E.S8 R13, [R2+0x11] ;
LDG.E.S8 R14, [R2+0x12] ;
LDG.E.S8 R15, [R2+0x13] ;
LDG.E.S8 R16, [R2+0x14] ;
DEPBAR.LE SB5, 0x9 ;
SHR R22, R17, 0x1f ;
{ IADD3 R28.CC, R20.reuse, R18, R17 ;
LDG.E.S8 R17, [R2+0x15] }
{ SHR R27, R18, 0x1f ;
LDG.E.S8 R18, [R2+0x16] }
SHR R20, R20, 0x1f ;
{ IADD3.X R27, R20, R27, R22 ;
LDG.E.S8 R20, [R2+0x18] }
DEPBAR.LE SB5, 0x9 ;
IADD3 R29.CC, R21, R19, R28 ;
{ SHR R22, R19, 0x1f ;
LDG.E.S8 R19, [R2+0x17] }
SHR R21, R21, 0x1f ;
{ IADD3.X R30, R21, R22, R27 ;
LDG.E.S8 R21, [R2+0x19] }
DEPBAR.LE SB5, 0x9 ;
{ IADD3 R27.CC, R24, R23, R29 ;
LDG.E.S8 R22, [R2+0x1a] }
SHR R23, R23, 0x1f ;
SHR R24, R24, 0x1f ;
{ IADD3.X R28, R24, R23, R30 ;
LDG.E.S8 R23, [R2+0x1b] }
LDG.E.S8 R24, [R2+0x1c] ;
DEPBAR.LE SB5, 0xa ;
{ IADD3 R29.CC, R26, R25, R27 ;
LDG.E.S8 R27, [R2+0x1e] }
SHR R25, R25, 0x1f ;
SHR R26, R26, 0x1f ;
{ IADD3.X R28, R26, R25, R28 ;
LDG.E.S8 R26, [R2+0x1d] }
LDG.E.S8 R25, [R2+0x1f] ;
DEPBAR.LE SB5, 0xb ;
IADD3 R30.CC, R6, R4, R29 ;
SHR R29, R4, 0x1f ;
SHR R6, R6, 0x1f ;
IADD3.X R6, R6, R29, R28 ;
DEPBAR.LE SB5, 0xa ;
IADD3 R4.CC, R8.reuse, R7, R30 ;
SHR R7, R7, 0x1f ;
SHR R8, R8, 0x1f ;
IADD3.X R7, R8, R7, R6 ;
DEPBAR.LE SB5, 0x9 ;
IADD3 R6.CC, R10.reuse, R9, R4 ;
SHR R9, R9, 0x1f ;
SHR R10, R10, 0x1f ;
IADD3.X R9, R10, R9, R7 ;
DEPBAR.LE SB5, 0x8 ;
IADD3 R3.CC, R12.reuse, R11, R6 ;
SHR R11, R11, 0x1f ;
SHR R12, R12, 0x1f ;
IADD3.X R11, R12, R11, R9 ;
DEPBAR.LE SB5, 0x7 ;
IADD3 R2.CC, R14, R13, R3 ;
SHR R13, R13, 0x1f ;
SHR R14, R14, 0x1f ;
IADD3.X R13, R14, R13, R11 ;
DEPBAR.LE SB5, 0x6 ;
IADD3 R3.CC, R16, R15, R2 ;
SHR R15, R15, 0x1f ;
SHR R16, R16, 0x1f ;
IADD3.X R15, R16, R15, R13 ;
IADD R5, R5, R0 ;
SHR.U32 R0, R5, 0x1d ;
DEPBAR.LE SB5, 0x5 ;
IADD3 R2.CC, R18, R17, R3 ;
SHR R17, R17, 0x1f ;
SHR R18, R18, 0x1f ;
IADD3.X R17, R18, R17, R15 ;
DEPBAR.LE SB5, 0x4 ;
IADD3 R3.CC, R20, R19, R2 ;
SHR R19, R19, 0x1f ;
SHR R20, R20, 0x1f ;
IADD3.X R19, R20, R19, R17 ;
DEPBAR.LE SB5, 0x3 ;
IADD3 R2.CC, R22.reuse, R21, R3 ;
SHR R21, R21, 0x1f ;
SHR R22, R22, 0x1f ;
IADD3.X R21, R22, R21, R19 ;
DEPBAR.LE SB5, 0x2 ;
IADD3 R8.CC, R24.reuse, R23, R2 ;
SHR R23, R23, 0x1f ;
SHR R24, R24, 0x1f ;
IADD3.X R24, R24, R23, R21 ;
ISCADD R4.CC, R5, c[0x0][0x140], 0x3 ;
IADD.X R5, R0, c[0x0][0x144] ;
DEPBAR.LE SB5, 0x1 ;
IADD3 R2.CC, R27.reuse, R26, R8 ;
SHR R26, R26, 0x1f ;
SHR R27, R27, 0x1f ;
IADD3.X R26, R27, R26, R24 ;
IADD R2.CC, R25.reuse, R2 ;
SHR R25, R25, 0x1f ;
IADD.X R3, R25, R26 ;
STG.E.64 [R4], R2 ;
EXIT ;
.L_1:
BRA `(.L_1) ;
.L_164:
```

In the disassembly you can see that it first performs 21 of the 32 loads from global memory. I have noticed that by these 21 loads the register pressure is nearly full. Could it be that the compiler tried to keep all 32 chars in separate registers (per thread), ran out of registers before reaching 32 loads, and therefore had to split the single coalesced load into two separate batches, so that register pressure could be lowered in between by computing the sum?

If so, why does the compiler try to load as many chars as possible instead of interleaving the arithmetic operations between the loads? Wouldn't that result in a fully coalesced access?

(Additional info: I am launching with 256 threads per block.)

Thank you very much for your help