Need some help with shared memory

Hi guys, this is my first post here. I’ve tried a few approaches, but I can’t figure out the right way to use shared memory in this specific piece of code.

I’ve successfully used shared variables in other parts of the code and in other projects, but in this one the results don’t match and I couldn’t figure out why.

The code is an algorithm calculation. I noticed there was some room for parallel computing of the W variable below, so I created a shared array s_W of the same size for the threads to cooperate on (each one computes a part in parallel); after the computation, each thread’s local W variable receives the full set of computed values.

I thought it would work, because W is a thread-local variable, but I can’t figure out why the results don’t match.

The original code just uses W instead of s_W and, of course, has none of the if (thr == ...) parts.

I hope someone can enlighten me :)

Thanks,

Rod.

uint32_t W[64];
__shared__ uint32_t s_W[64];

#pragma unroll 4
for (int i = 0; i < 4; i++)
{
	__syncthreads();
	int thr = threadIdx.x;

	if (thr == 0)
	{
		uint32_t a = P[i];
		uint32_t b = P[i + 4];
		uint32_t c = h[i + 8];
		uint32_t d = P[i + 8];

		uint32_t ab = a ^ b;
		uint32_t bc = b ^ c;
		uint32_t cd = c ^ d;


		uint32_t t = (ab & 0x80808080);
		uint32_t t2 = (bc & 0x80808080);
		uint32_t t3 = (cd & 0x80808080);

		uint32_t abx = (t >> 7) * 27 ^ ((ab^t) << 1);
		uint32_t bcx = (t2 >> 7) * 27 ^ ((bc^t2) << 1);
		uint32_t cdx = (t3 >> 7) * 27 ^ ((cd^t3) << 1);

		s_W[0 + i] = abx ^ bc ^ d;
		s_W[0 + i + 4] = bcx ^ a ^ cd;
		s_W[0 + i + 8] = cdx ^ ab ^ d;
		s_W[0 + i + 12] = abx ^ bcx ^ cdx ^ ab ^ c;
	}
	if (thr == 1)
	{
		uint32_t a = P[12 + i];
		uint32_t b = h[i + 4];
		uint32_t c = P[12 + i + 4];
		uint32_t d = P[12 + i + 8];

		uint32_t ab = a ^ b;
		uint32_t bc = b ^ c;
		uint32_t cd = c ^ d;
		
		uint32_t t = (ab & 0x80808080);
		uint32_t t2 = (bc & 0x80808080);
		uint32_t t3 = (cd & 0x80808080);

		uint32_t abx = (t >> 7) * 27 ^ ((ab^t) << 1);
		uint32_t bcx = (t2 >> 7) * 27 ^ ((bc^t2) << 1);
		uint32_t cdx = (t3 >> 7) * 27 ^ ((cd^t3) << 1);

		s_W[16 + i] = abx ^ bc ^ d;
		s_W[16 + i + 4] = bcx ^ a ^ cd;
		s_W[16 + i + 8] = cdx ^ ab ^ d;
		s_W[16 + i + 12] = abx ^ bcx ^ cdx ^ ab ^ c;
	}
	if (thr == 2)
	{
		uint32_t a = h[i];
		uint32_t b = P[24 + i + 0];
		uint32_t c = P[24 + i + 4];
		uint32_t d = P[24 + i + 8];

		uint32_t ab = a ^ b;
		uint32_t bc = b ^ c;
		uint32_t cd = c ^ d;

		uint32_t t = (ab & 0x80808080);
		uint32_t t2 = (bc & 0x80808080);
		uint32_t t3 = (cd & 0x80808080);

		uint32_t abx = (t >> 7) * 27 ^ ((ab^t) << 1);
		uint32_t bcx = (t2 >> 7) * 27 ^ ((bc^t2) << 1);
		uint32_t cdx = (t3 >> 7) * 27 ^ ((cd^t3) << 1);

		s_W[32 + i] = abx ^ bc ^ d;
		s_W[32 + i + 4] = bcx ^ a ^ cd;
		s_W[32 + i + 8] = cdx ^ ab ^ d;
		s_W[32 + i + 12] = abx ^ bcx ^ cdx ^ ab ^ c;
	}
	if (thr == 3)
	{
		uint32_t a = P[36 + i];
		uint32_t b = P[36 + i + 4];
		uint32_t c = P[36 + i + 8];
		uint32_t d = h[i + 12];

		uint32_t ab = a ^ b;
		uint32_t bc = b ^ c;
		uint32_t cd = c ^ d;

		uint32_t t = (ab & 0x80808080);
		uint32_t t2 = (bc & 0x80808080);
		uint32_t t3 = (cd & 0x80808080);

		uint32_t abx = (t >> 7) * 27 ^ ((ab^t) << 1);
		uint32_t bcx = (t2 >> 7) * 27 ^ ((bc^t2) << 1);
		uint32_t cdx = (t3 >> 7) * 27 ^ ((cd^t3) << 1);

		s_W[48 + i] = abx ^ bc ^ d;
		s_W[48 + i + 4] = bcx ^ a ^ cd;
		s_W[48 + i + 8] = cdx ^ ab ^ d;
		s_W[48 + i + 12] = abx ^ bcx ^ cdx ^ ab ^ c;
	}
	__syncthreads();
	
}
__syncthreads();

#pragma unroll 64
for (int j = 0; j < 64; j++)
W[j] = s_W[j];

You may have a race condition. Try running your code with cuda-memcheck using the racecheck subtool.

CUDA-MEMCHECK :: CUDA Toolkit Documentation
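The invocation would look like this (the binary name ./app is a placeholder for your own executable):

```shell
# Run only the racecheck subtool, which reports shared-memory data hazards.
cuda-memcheck --tool racecheck ./app
```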

Thanks for the reply, txbob! I'll run it later to see what I can find and post the results.

I have one doubt though: I thought of that while writing the code, so I made sure only one thread was able to write to any given shared memory address.

After the "writing" threads (a total of 4; it could be 16 with more optimization) finish computing the values, the array is read by all the threads in the block.

Also, to make sure the shared s_W array held all the computed results before transferring it to the thread-local W array, I used __syncthreads().
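For reference, the access pattern I’m going for is roughly this (a stripped-down sketch, not the real kernel; the writer count, the slice size, and the placeholder values are simplified):

```cuda
__global__ void sketch(uint32_t *out)
{
	__shared__ uint32_t s_W[64];
	uint32_t W[64];                 // per-thread local copy
	int thr = threadIdx.x;

	// Only threads 0..3 write, each to its own 16-word slice,
	// so no two threads ever touch the same shared address.
	if (thr < 4)
		for (int j = 0; j < 16; j++)
			s_W[thr * 16 + j] = thr; // placeholder for the real computation

	// Barrier: all writers must finish before anyone reads.
	__syncthreads();

	// Every thread in the block then copies the full shared array.
	for (int j = 0; j < 64; j++)
		W[j] = s_W[j];

	out[thr] = W[0];
}
```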

But thanks for the help!

Does anyone else have other ideas?