Float4 register write to shared has limit?

Hi! I create a float4 array in register, and then I want to write their value into shared memory. Like below, if I only use four sts128 to store 16 values, fine. If I store more than that, I have 8 sts128?? Then the printed out shared memory will randomly miss some value!!! Even I add __syncthreads() after each sts128, of no use! Why???

To save time, of course I do not want that much __syncthreads()…

There are 256 threads in total…

__device__ __forceinline__
uint32_t smem_u32addr(const void* smem_ptr) {
	uint32_t addr;
	asm("{.reg .u64 u64addr;\n"
		" cvta.to.shared.u64 u64addr, %1;\n"
		" cvt.u32.u64 %0, u64addr;}\n"
		: "=r"(addr)
		: "l"(smem_ptr)
	);

	return addr;
}

	int tx16 = threadIdx.x % 16;
	int ty16 = threadIdx.x / 16;
	float4 f4_zero = make_float4(0.f, 0.f, 0.f, 0.f);
	float4 c[4][2] = { { f4_zero } };
        
       xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx// some lines to calculate c's value


	uint32_t sts_c_offset = smem_u32addr(smem_a + tx16 * 4*128 + ty16 * 4);
    sts128(c[0][0].x, c[1][0].x, c[2][0].x, c[3][0].x, sts_c_offset + sizeof(float) * 0*128);
    sts128(c[0][0].y, c[1][0].y, c[2][0].y, c[3][0].y, sts_c_offset + sizeof(float) * 1*128);
    sts128(c[0][0].z, c[1][0].z, c[2][0].z, c[3][0].z, sts_c_offset + sizeof(float) * 2*128);
    sts128(c[0][0].w, c[1][0].w, c[2][0].w, c[3][0].w, sts_c_offset + sizeof(float) * 3*128);


    sts128(c[4][0].x, c[5][0].x, c[6][0].x, c[7][0].x, sts_c_offset + sizeof(float) * (64+0*128));
    sts128(c[4][0].y, c[5][0].y, c[6][0].y, c[7][0].y, sts_c_offset + sizeof(float) * (64+1*128));
    sts128(c[4][0].z, c[5][0].z, c[6][0].z, c[7][0].z, sts_c_offset + sizeof(float) * (64+2*128));
    sts128(c[4][0].w, c[5][0].w, c[6][0].w, c[7][0].w, sts_c_offset + sizeof(float) * (64+3*128));

	if (threadIdx.x == 0 && blockIdx.x == 0) {
	    for (int ii = 0; ii < 128; ii++) {
	        for (int jj = 0; jj < 128; jj++) {
	            if (smem_a[ii * 128 + jj] != 0) {
	                printf("result[%d][%d]=%f  ", ii, jj, smem_a[ii * 128 + jj]);
	            }
	        }
	        printf("\n");
	    }
	    printf("\n");
	}
	__syncthreads();

I really doubt this is relevant to float4…Previous experience is, using one float4 will spend less register than 4 float! But why problem here???

Thank you!!!

Oh, I realize, it seems only contiguous register can use sts128, to write into shared memory, right?

All load and store instructions involve one data item comprising one contiguous sequence of bytes. On the register side, this involves either a single 32-bit register or a contiguous pair or quad of 32-bit registers.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.