Global Memory Write Problem

quirin · September 18, 2007, 5:34pm

The following kernel does nothing but read an int2 value from global memory, copy it into four registers, add a constant to the x-component of each copy (here 1, 2, 3 and 4) and write the results to another location in global memory.

#include <GL/glew.h>
#include <cutil.h>

#define TYPE int2

__device__ TYPE* d_in;
__device__ TYPE* d_out;

__global__ void zefix(TYPE* g_in, TYPE* g_out) {
  TYPE data0;
  TYPE data1;
  TYPE data2;
  TYPE data3;

  TYPE data = g_in[0];
//   __syncthreads();

  data0 = data;
  data1 = data;
  data2 = data;
  data3 = data;

  data0.x = data.x + 1;
  data1.x = data.x + 2;
  data2.x = data.x + 3;
  data3.x = data.x + 4;

  g_out[0] = data0;
  g_out[1] = data1;
  g_out[2] = data2;
  g_out[3] = data3;

  __syncthreads();
}

int main(int argc, char** argv) {
  CUDA_SAFE_CALL(cudaMalloc((void**)&d_in, sizeof(TYPE) * 4));
  CUDA_SAFE_CALL(cudaMemset((void*)d_in, 0, sizeof(TYPE) * 4));
  CUDA_SAFE_CALL(cudaMalloc((void**)&d_out, sizeof(TYPE) * 4));
  CUDA_SAFE_CALL(cudaMemset((void*)d_out, 0, sizeof(TYPE) * 4));

  CUT_DEVICE_INIT();
  cudaThreadSynchronize();

  zefix <<< 1, 1 >>> (d_in, d_out);
  cudaThreadSynchronize();

  TYPE* host = (TYPE*)malloc(sizeof(TYPE) * 4);
  CUDA_SAFE_CALL(cudaMemcpy(host, d_out, sizeof(TYPE) * 4, cudaMemcpyDeviceToHost));
  cudaThreadSynchronize();

  for (int i = 0; i < 4; i++) {
    printf("%d \n", host[i].x);
  }
}

However, the output is [1,5,3,7] instead of [1,2,3,4]. Of course, this does not occur in emulation mode. I tested it on two different 8800 GTS cards and one GTX, and in all three cases the output is not as expected.

Inserting a __syncthreads() after the read produces the correct output; however, from my understanding this should not be necessary.

Then I found out that this behaviour does not occur for scalar types such as plain int and float! Note that I have used int2 here, and it does not work with float2 either. However, it works with int3, float3, int4, and float4.

So I checked the PTX code, and indeed there is something weird when using 2-component types: the writes that produce wrong results are performed using two separate write commands (one for each component), whereas the correct stores use just one write command.

So what am I doing wrong here? How can I make the compiler do it in one write instead of two? Did anyone encounter a similar problem?

wumpus · September 18, 2007, 8:36pm

I know this problem; I even reported it to NVIDIA. They say it’s fixed in their internal development version, but of course that doesn’t help us at all.

quirin · September 19, 2007, 6:07am

Do you know any other effects that come along with that problem?

I fixed it by adding 0 using a variable located in constant memory. Certainly not nice, but I’ve got my deadlines. However, it took me about 5 hours to convince myself that I am not stupid. Is there a bug database where you can check what is not working?
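The thread does not show the code for this workaround, so the following is only a sketch of what it might look like. The names c_zero, zefix_workaround and run_zefix_workaround are hypothetical, and adding the zero to every one of the four values (rather than only the ones that came out wrong) is an assumption. The idea is that a value read from constant memory is only known at run time, so the compiler cannot fold the addition away.

```cuda
#include <cuda_runtime.h>

// Hypothetical name: a zero stored in constant memory. Its value is set
// from the host at run time, so the compiler cannot optimise "+ c_zero" out.
__constant__ int c_zero;

// Same arithmetic as the kernel in the first post, plus the run-time zero.
__global__ void zefix_workaround(const int2* g_in, int2* g_out) {
  int2 data = g_in[0];
  int2 data0 = data, data1 = data, data2 = data, data3 = data;

  data0.x = data.x + 1 + c_zero;
  data1.x = data.x + 2 + c_zero;
  data2.x = data.x + 3 + c_zero;
  data3.x = data.x + 4 + c_zero;

  g_out[0] = data0;
  g_out[1] = data1;
  g_out[2] = data2;
  g_out[3] = data3;
}

// Hypothetical host helper: runs the kernel on a zeroed input buffer and
// copies the four x-components into out_x. Returns the first CUDA error.
cudaError_t run_zefix_workaround(int out_x[4]) {
  int2* d_in = nullptr;
  int2* d_out = nullptr;
  int2 host[4];
  const int zero = 0;

  cudaError_t err = cudaMalloc((void**)&d_in, sizeof(int2) * 4);
  if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, sizeof(int2) * 4);
  if (err == cudaSuccess) err = cudaMemset(d_in, 0, sizeof(int2) * 4);
  if (err == cudaSuccess) err = cudaMemset(d_out, 0, sizeof(int2) * 4);
  if (err == cudaSuccess) err = cudaMemcpyToSymbol(c_zero, &zero, sizeof(int));
  if (err == cudaSuccess) {
    zefix_workaround<<<1, 1>>>(d_in, d_out);
    err = cudaGetLastError();  // catches launch failures
  }
  // cudaMemcpy on the default stream waits for the kernel to finish.
  if (err == cudaSuccess)
    err = cudaMemcpy(host, d_out, sizeof(host), cudaMemcpyDeviceToHost);
  if (err == cudaSuccess)
    for (int i = 0; i < 4; ++i) out_x[i] = host[i].x;

  cudaFree(d_in);   // cudaFree on a null pointer is a no-op
  cudaFree(d_out);
  return err;
}
```

With a zeroed input, the four values copied back should be 1, 2, 3 and 4 whenever the stores are compiled correctly. Whether this exact form sidesteps the faulty stores on the affected toolkit can only be confirmed by checking the generated PTX.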

wumpus · September 19, 2007, 9:54am

It seems you can only see your own reported bugs in the bug database… too bad, it would have saved them a lot of duplicate reports IMO.

I was writing a UYVY deinterleaver, and got exactly the same problem as you did. I solved it by reading uint32s instead of aligned 4-byte structures, then getting the values out with bit shifting.
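wumpus's deinterleaver is not posted, so here is a rough sketch of the approach described: load each UYVY group as one plain 32-bit word and pull the bytes out with shifts and masks. The names uyvy_byte and deinterleave_uyvy, the separate output planes, and the one-thread-per-word layout are all assumptions made for illustration. It also assumes the usual UYVY byte order in memory (U, Y0, V, Y1) and a little-endian layout, so U sits in the lowest 8 bits of the word.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: extract byte number `index` (0..3) from a 32-bit word,
// where byte 0 is the least significant byte.
__host__ __device__ inline unsigned char uyvy_byte(unsigned int word, int index) {
  return (unsigned char)((word >> (8 * index)) & 0xFFu);
}

// Hypothetical kernel: one thread handles one UYVY word (two pixels).
// The input is read as unsigned int, not as a 4-byte struct such as uchar4.
__global__ void deinterleave_uyvy(const unsigned int* g_in,
                                  unsigned char* g_y,
                                  unsigned char* g_u,
                                  unsigned char* g_v,
                                  int num_words) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_words) return;

  unsigned int word = g_in[i];          // single scalar 32-bit load

  g_u[i]         = uyvy_byte(word, 0);  // U  (shared by both pixels)
  g_y[2 * i]     = uyvy_byte(word, 1);  // Y0 (first pixel)
  g_v[i]         = uyvy_byte(word, 2);  // V  (shared by both pixels)
  g_y[2 * i + 1] = uyvy_byte(word, 3);  // Y1 (second pixel)
}
```

Because the helper is marked `__host__ __device__`, the bit extraction can be checked on the CPU without launching anything, which is handy when the GPU results themselves are in doubt.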

quirin · September 19, 2007, 11:28am

Where do I find that bug database?

I stumbled across more (really minor) bugs (in one of the sdk samples) and something urges me to report them.

MisterAnderson42 · September 19, 2007, 1:16pm

The bug database is only accessible to registered developers: http://developer.nvidia.com/page/registere...er_program.html (NVIDIA Developer Program)

You can register if you are using CUDA in industry or for academic research.
