__global__ void simulated() {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
int local[2];
local[0] = 0;
local[1] = 1;
// print thread id and local array address
// if (tid < 2) {
// printf("tid:%d, %p\n", tid, local);
// }
}
I am trying to understand how cuda manages thread-local memory by using the above code snippet. Ideally, each thread will get its own local array, so the address of local
should be different. However, when I print out its address, they are all the same across all threads.
tid:0, 0x7f4ccafffce8
tid:1, 0x7f4ccafffce8
I also tried output the SASS code for this code snippet. It seems to me that the instruction-level code also uses the exact same address for all threads.
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /* 0x00000a00ff017624 */
/* 0x000fca00078e00ff */
/*0010*/ IADD3 R1, R1, -0x8, RZ ; /* 0xfffffff801017810 */
/* 0x000fe20007ffe0ff */
/*0020*/ IMAD.MOV.U32 R5, RZ, RZ, 0x1 ; /* 0x00000001ff057424 */
/* 0x000fe200078e00ff */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe40000000a00 */
/*0040*/ IADD3 R2, P0, R1, c[0x0][0x20], RZ ; /* 0x0000080001027a10 */
/* 0x000fc80007f1e0ff */
/*0050*/ IADD3.X R3, RZ, c[0x0][0x24], RZ, P0, !PT ; /* 0x00000900ff037a10 */
/* 0x000fca00007fe4ff */
/*0060*/ ST.E [R2.64+0x4], R5 ; /* 0x0000040502007985 */
/* 0x000fe8000c101904 */
/*0070*/ ST.E [R2.64], RZ ; /* 0x000000ff02007985 */
/* 0x000fe2000c101904 */
/*0080*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0090*/ BRA 0x90; /* 0xfffffff000007947 */
Can anyone share some insights how cuda exactly manage local memory per thread? To compile this code, you probably need to disable all compiler optimization. Otherwise, many instructions will be skipped.