I’m practicing on a code to transpose an image (unsigned char)
I have the following strange problem. When I try to cast a pointer and then add an offset based on a variable I get junk in the output, using a constant works fine. I.E, doing
uchar4 val;
int data[4];
for (int i = 0 ; i < 4 ; i++)
val.x = *((char *)&data[0] + i);
((int )(out + (x04 + dx4 + i)*outStride) + y0 + dy) = *(int *)&val;
produces junk. If I change the setting of val to use a constant the results are fine, i.e
val.x = *((char *)&data[0] + 1);
I tried compiling in emulation mode and the image comes out fine, so I don’t know if this is a hardware bug or something that I’m doing wrong. This is under linux 64bit with nVidia Corporation Quadro NVS 140M (rev a1).
The full code for the kernel (the assignment is in the final loop) if it helps (shared data size is 64x64 bit and thread block is 16x16)
global void transposeImageKernelShared(unsigned char *in, size_t inStride, unsigned char *out, size_t outStride, int width, int height)
{
extern shared char s_data;
int dataStride = blockDim.x*4;
// -------------------
// Read Data Row Major
// -------------------
int x0 = blockIdx.x*blockDim.x;
int y0 = blockIdx.y*blockDim.y;
int dx = threadIdx.x;
int dy = threadIdx.y;
// Read 4 rows of data into the shared memory
for (int i = 0 ; i < 4 ; i++)
*((int *)(s_data + (dy*4 + i)*dataStride) + dx) = *((int *)(in + (y0*4 + dy*4 + i)*inStride) + x0 + dx);
// ----
// Sync
// ----
__syncthreads();
// -----------------------
// Write Data Column Major
// -----------------------
dx = threadIdx.y;
dy = threadIdx.x;
////////////////
char4 val;
int data[4];
*(int *)&data[0] = *(int *)(s_data + (dy*4 + 0)*dataStride + dx*4);
*(int *)&data[1] = *(int *)(s_data + (dy*4 + 1)*dataStride + dx*4);
*(int *)&data[2] = *(int *)(s_data + (dy*4 + 2)*dataStride + dx*4);
*(int *)&data[3] = *(int *)(s_data + (dy*4 + 3)*dataStride + dx*4);
for (int i = 0 ; i < 4 ; i++)
{
val.x = *((char *)&data[0] + i);
val.y = *((char *)&data[1] + i);
val.z = *((char *)&data[2] + i);
val.w = *((char *)&data[3] + i);
*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;
}
}
Thanks