Problem accessing array ellement based on variable

laughingrice · January 26, 2009, 12:09am

I’m practicing on a code to transpose an image (unsigned char)
I have the following strange problem. When I try to cast a pointer and then add an offset based on a variable I get junk in the output, using a constant works fine. I.E, doing
uchar4 val;
int data[4];

for (int i = 0 ; i < 4 ; i++)
val.x = *((char *)&data[0] + i);

((int )(out + (x04 + dx4 + i)*outStride) + y0 + dy) = *(int *)&val;

produces junk. If I change the setting of val to use a constant the results are fine, i.e

val.x = *((char *)&data[0] + 1);

I tried compiling in emulation mode and the image comes out fine, so I don’t know if this is a hardware bug or something that I’m doing wrong. This is under linux 64bit with nVidia Corporation Quadro NVS 140M (rev a1).

The full code for the kernel (the assignment is in the final loop) if it helps (shared data size is 64x64 bit and thread block is 16x16)

global void transposeImageKernelShared(unsigned char *in, size_t inStride, unsigned char *out, size_t outStride, int width, int height)
{
extern shared char s_data;

int dataStride = blockDim.x*4;

// -------------------
// Read Data Row Major
// -------------------

int x0 = blockIdx.x*blockDim.x;
int y0 = blockIdx.y*blockDim.y;

int dx = threadIdx.x;
int dy = threadIdx.y;

// Read 4 rows of data into the shared memory
for (int i = 0 ; i < 4 ; i++)
*((int *)(s_data + (dy*4 + i)*dataStride) + dx) = *((int *)(in + (y0*4 + dy*4 + i)*inStride) + x0 + dx);

// ----
// Sync
// ----
__syncthreads();

// -----------------------
// Write Data Column Major
// -----------------------

dx = threadIdx.y;
dy = threadIdx.x;

////////////////

char4 val;

 int data[4];

*(int *)&data[0] = *(int *)(s_data + (dy*4 + 0)*dataStride + dx*4);
*(int *)&data[1] = *(int *)(s_data + (dy*4 + 1)*dataStride + dx*4);
*(int *)&data[2] = *(int *)(s_data + (dy*4 + 2)*dataStride + dx*4);
*(int *)&data[3] = *(int *)(s_data + (dy*4 + 3)*dataStride + dx*4);

for (int i = 0 ; i < 4 ; i++)
{
val.x = *((char *)&data[0] + i);
val.y = *((char *)&data[1] + i);
val.z = *((char *)&data[2] + i);
val.w = *((char *)&data[3] + i);

*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;
 }

}

Thanks

laughingrice · January 26, 2009, 2:36am

turns out that adding a __syncthreads() to the last for solves the problem, although I’m really not clear as to why. That is the last loop should look like

for (int i = 0 ; i < 4 ; i++)

{

    __syncthreads();

val.x = *((char *)&data[0] + i);

val.y = *((char *)&data[1] + i);

val.z = *((char *)&data[2] + i);

val.w = *((char *)&data[3] + i);

*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;

 }

Any idea why the __syncthreads() is needed here? (i’m reading from local memory that was read from shared memory after a syncthread …)

doing the syncthread just before the loop doesn’t work. It doesn’t matter where it is in the loop, but once it’s there things work fine

Thanks

I’m practicing on a code to transpose an image (unsigned char)

I have the following strange problem. When I try to cast a pointer and then add an offset based on a variable I get junk in the output, using a constant works fine. I.E, doing

uchar4 val;

int data[4];

for (int i = 0 ; i < 4 ; i++)
val.x = *((char *)&data[0] + i);
((int )(out + (x04 + dx4 + i)*outStride) + y0 + dy) = *(int *)&val;

produces junk. If I change the setting of val to use a constant the results are fine, i.e
val.x = *((char *)&data[0] + 1);
I tried compiling in emulation mode and the image comes out fine, so I don’t know if this is a hardware bug or something that I’m doing wrong. This is under linux 64bit with nVidia Corporation Quadro NVS 140M (rev a1).

The full code for the kernel (the assignment is in the final loop) if it helps (shared data size is 64x64 bit and thread block is 16x16)

global void transposeImageKernelShared(unsigned char *in, size_t inStride, unsigned char *out, size_t outStride, int width, int height)

{
extern __shared__ char s_data[];
int dataStride = blockDim.x*4;

// -------------------
// Read Data Row Major

// -------------------
int x0 = blockIdx.x*blockDim.x;
int y0 = blockIdx.y*blockDim.y;
int dx = threadIdx.x;
int dy = threadIdx.y;
// Read 4 rows of data into the shared memory
for (int i = 0 ; i < 4 ; i++)

*((int *)(s_data + (dy*4 + i)*dataStride) + dx) = *((int *)(in + (y0*4 + dy*4 + i)*inStride) + x0 + dx);
// ----
// Sync

// ----

__syncthreads();
// -----------------------
// Write Data Column Major

// -----------------------
dx = threadIdx.y;
dy = threadIdx.x;
////////////////

char4 val;

int data[4];

*(int *)&data[0] = *(int )(s_data + (dy4 + 0)dataStride + dx4);
*(int *)&data[1] = *(int *)(s_data + (dy*4 + 1)*dataStride + dx*4);

*(int *)&data[2] = *(int *)(s_data + (dy*4 + 2)*dataStride + dx*4);

*(int *)&data[3] = *(int *)(s_data + (dy*4 + 3)*dataStride + dx*4);
for (int i = 0 ; i < 4 ; i++)
{

val.x = *((char *)&data[0] + i);

val.y = *((char *)&data[1] + i);

val.z = *((char *)&data[2] + i);

val.w = *((char *)&data[3] + i);

*((int *)(out + (x0*4 + dx*4 + i)*outStride) + y0 + dy) = *(int *)&val;

 }
}

Thanks

Fugl · January 26, 2009, 11:30am

Have you tried looking at the PTX source listing and checked if the compiler emits reasonable code? You could try with and without the __syncthreads() to see if it affects code generation.

Topic		Replies	Views
possible bug with nvcc - char values in registers CUDA Programming and Performance	0	2797	February 15, 2009
Problem with accessing array of pointers Array of pointers CUDA Programming and Performance	5	4119	October 18, 2010
pointer to shared memory compiler problems CUDA Programming and Performance	19	14533	June 7, 2008
Why taking so much time? CUDA Programming and Performance	22	3390	June 27, 2009
Nvc++ PSTL: scan failed to synchronize: cudaErrorIllegalAddress nvc, nvc++ and nvfortran	4	804	March 22, 2021
nvcc horribly breaking float4 read CUDA Programming and Performance	5	9645	June 25, 2008
problem with array offset CUDA Programming and Performance	28	3223	February 3, 2011
Memory access violation Reading text file, sending content to GPU CUDA Programming and Performance	5	2370	March 5, 2012
How to use all 16KB shared memory CUDA Programming and Performance	39	19309	April 1, 2010
Odd divergent branching phenomenon attributed to shared memory variables and the powf() function CUDA Programming and Performance	5	3676	January 12, 2010

Problem accessing array ellement based on variable

Related topics