I got the assembly code with the help of nsight compute,
and no problem understanding instructions like MOV, S2R and IMAD, except for the LEA.
I tried to google the cuda docs for the LEA, but only x86 LEA was there.
See the code
// The mat is a flatten 1d array storing 2d data with row = 64 and col = 32.
__global__ void
loadmatrix(float* mat)
{
__shared__ float sharedmat[32*33];
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int idy = blockIdx.y*blockDim.y + threadIdx.y;
sharedmat[threadIdx.y * 33 + threadIdx.x] = mat[idy*32 + idx];
__syncthreads();
// two matrixes (32 x 32) do their own transpose
mat[idy*32 + idx] = sharedmat[threadIdx.x * 33 + threadIdx.y];
}
The assembly code is:
00007fbe c4f9b700 MOV R1 c[0x0][0x28]
00007fbe c4f9b710 S2R R0 SR_CTAID.X
00007fbe c4f9b720 MOV R2 0x4
00007fbe c4f9b730 S2R R7 SR_TID.X
00007fbe c4f9b740 S2R R3 SR_CTAID.Y
00007fbe c4f9b750 S2R R4 SR_TID.
00007fbe c4f9b760 IMAD R0 R0 c[0x0][0x0] R7
00007fbe c4f9b770 IMAD R3 R3 c[0x0][0x4] R4
00007fbe c4f9b780 LEA R3 R3 R0 0x5
00007fbe c4f9b790 IMAD.WIDE R2 R3 R2 c[0x0][0x160]
00007fbe c4f9b7a0 LDG.E.SYS R0 [R2]
00007fbe c4f9b7b0 IMAD R5 R4 0x21 R7
00007fbe c4f9b7c0 IMAD R4 R7 0x21 R4
00007fbe c4f9b7d0 STS [R5.X4] R0
00007fbe c4f9b7e0 BAR.SYNC 0x0
00007fbe c4f9b7f0 LDS.U R7 [R4.X4]
00007fbe c4f9b800 STG.E.SYS [R2] R7
00007fbe c4f9b810 EXIT 1
00007fbe c4f9b820 BRA 0x7fbec4f9b820
I understand that two-dimensional array c may save some blockDim and threadIdx.
The registers R0 and R3 map to the global index, while the R4 and R7 maintain the local index.
The question is, what is the point of:
LEA R3, R3, R0, 0x5
I guess this line calculates the offset in the global memory by the global x and y index.
R3 = (some calculations) * 5
The c[0x0][0x160] may be the global memory entry, updated R3 is an offset, and R3 x 0x4 means the bytes for float.
But the LEA differs from the one in x86, which copies the memory value to some registers.
Would you mind telling me how the R0, R3 and 0x5 co-work for an offset?
Thank you so much!