Hi! I create a float4 array in register, and then I want to write their value into shared memory. Like below, if I only use four sts128 to store 16 values, fine. If I store more than that, I have 8 sts128?? Then the printed out shared memory will randomly miss some value!!! Even I add __syncthreads() after each sts128, of no use! Why???
To save time, of course I do not want that much __syncthreads()…
There are 256 threads in total…
__device__ __forceinline__
uint32_t smem_u32addr(const void* smem_ptr) {
uint32_t addr;
asm("{.reg .u64 u64addr;\n"
" cvta.to.shared.u64 u64addr, %1;\n"
" cvt.u32.u64 %0, u64addr;}\n"
: "=r"(addr)
: "l"(smem_ptr)
);
return addr;
}
int tx16 = threadIdx.x % 16;
int ty16 = threadIdx.x / 16;
float4 f4_zero = make_float4(0.f, 0.f, 0.f, 0.f);
float4 c[4][2] = { { f4_zero } };
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx// some lines to calculate c's value
uint32_t sts_c_offset = smem_u32addr(smem_a + tx16 * 4*128 + ty16 * 4);
sts128(c[0][0].x, c[1][0].x, c[2][0].x, c[3][0].x, sts_c_offset + sizeof(float) * 0*128);
sts128(c[0][0].y, c[1][0].y, c[2][0].y, c[3][0].y, sts_c_offset + sizeof(float) * 1*128);
sts128(c[0][0].z, c[1][0].z, c[2][0].z, c[3][0].z, sts_c_offset + sizeof(float) * 2*128);
sts128(c[0][0].w, c[1][0].w, c[2][0].w, c[3][0].w, sts_c_offset + sizeof(float) * 3*128);
sts128(c[4][0].x, c[5][0].x, c[6][0].x, c[7][0].x, sts_c_offset + sizeof(float) * (64+0*128));
sts128(c[4][0].y, c[5][0].y, c[6][0].y, c[7][0].y, sts_c_offset + sizeof(float) * (64+1*128));
sts128(c[4][0].z, c[5][0].z, c[6][0].z, c[7][0].z, sts_c_offset + sizeof(float) * (64+2*128));
sts128(c[4][0].w, c[5][0].w, c[6][0].w, c[7][0].w, sts_c_offset + sizeof(float) * (64+3*128));
if (threadIdx.x == 0 && blockIdx.x == 0) {
for (int ii = 0; ii < 128; ii++) {
for (int jj = 0; jj < 128; jj++) {
if (smem_a[ii * 128 + jj] != 0) {
printf("result[%d][%d]=%f ", ii, jj, smem_a[ii * 128 + jj]);
}
}
printf("\n");
}
printf("\n");
}
__syncthreads();
I really doubt this is relevant to float4…Previous experience is, using one float4 will spend less register than 4 float! But why problem here???
Thank you!!!