Problem accessing array ellement based on variable

I’m practicing on a code to transpose an image (unsigned char)
I have the following strange problem. When I try to cast a pointer and then add an offset based on a variable I get junk in the output, using a constant works fine. I.E, doing
uchar4 val;
int data[4];

for (int i = 0 ; i < 4 ; i++)
val.x = *((char *)&data[0] + i);

((int )(out + (x04 + dx4 + i)*outStride) + y0 + dy) = *(int *)&val;

produces junk. If I change the setting of val to use a constant the results are fine, i.e

val.x = *((char *)&data[0] + 1);

I tried compiling in emulation mode and the image comes out fine, so I don’t know if this is a hardware bug or something that I’m doing wrong. This is under linux 64bit with nVidia Corporation Quadro NVS 140M (rev a1).

The full code for the kernel (the assignment is in the final loop) if it helps (shared data size is 64x64 bit and thread block is 16x16)

global void transposeImageKernelShared(unsigned char *in, size_t inStride, unsigned char *out, size_t outStride, int width, int height)
{
extern shared char s_data;

int dataStride = blockDim.x*4;

// -------------------
// Read Data Row Major
// -------------------

int x0 = blockIdx.x*blockDim.x;
int y0 = blockIdx.y*blockDim.y;

int dx = threadIdx.x;
int dy = threadIdx.y;

// Read 4 rows of data into the shared memory
for (int i = 0 ; i < 4 ; i++)
*((int *)(s_data + (dy*4 + i)*dataStride) + dx) = *((int *)(in + (y0*4 + dy*4 + i)*inStride) + x0 + dx);

// ----
// Sync
// ----
__syncthreads();

// -----------------------
// Write Data Column Major
// -----------------------

dx = threadIdx.y;
dy = threadIdx.x;

////////////////

char4 val;

 int data[4];

*(int *)&data[0] = *(int *)(s_data + (dy*4 + 0)*dataStride + dx*4);
*(int *)&data[1] = *(int *)(s_data + (dy*4 + 1)*dataStride + dx*4);
*(int *)&data[2] = *(int *)(s_data + (dy*4 + 2)*dataStride + dx*4);
*(int *)&data[3] = *(int *)(s_data + (dy*4 + 3)*dataStride + dx*4);

for (int i = 0 ; i < 4 ; i++)
{
val.x = *((char *)&data[0] + i);
val.y = *((char *)&data[1] + i);
val.z = *((char *)&data[2] + i);
val.w = *((char *)&data[3] + i);

*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;
 }

}

Thanks

turns out that adding a __syncthreads() to the last for solves the problem, although I’m really not clear as to why. That is the last loop should look like

for (int i = 0 ; i < 4 ; i++)

{

    __syncthreads();

val.x = *((char *)&data[0] + i);

val.y = *((char *)&data[1] + i);

val.z = *((char *)&data[2] + i);

val.w = *((char *)&data[3] + i);

*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;

 }

Any idea why the __syncthreads() is needed here? (i’m reading from local memory that was read from shared memory after a syncthread …)

doing the syncthread just before the loop doesn’t work. It doesn’t matter where it is in the loop, but once it’s there things work fine

Thanks

Have you tried looking at the PTX source listing and checked if the compiler emits reasonable code? You could try with and without the __syncthreads() to see if it affects code generation.