Speed-Loss by Writing to Global Mem

Hi there,

I’ve got the following problem:

I’ve got code where I do some computational stuff.

I’m running the function with 64 blocks and 256 threads.

This is no problem as long as I don’t try to get the results.

But when I copy the results to global memory, it gets very slow.

Here is the code:

The function calculate:

__global__ void calculate(int* t)
{
    int bits[60];

    int idx = blockIdx.x*blockDim.x+threadIdx.x;

    // some computational stuff

    t[idx] = bits[1];
}


The code in main:

#define BlockSize 64

#define ThreadSize 256

int BT = BlockSize*ThreadSize;

int * t;

cudaMalloc((void**) &t, BT*sizeof(int));


As long as the line “t[idx] = bits[1];” is not in the function,

everything is real fast.

Is there any way I can make this call faster?

The array “bits” only contains 0 or 1, and I have to check whether one of the computed results (which is stored in bits[1]) in the threads is 1.

Then I’ve tried something like

if (bits[1])


in place of “t[idx] = bits[1];”, where t was a single integer, but it didn’t help.

What I didn’t understand was that when I left out the if-line, it was as fast as usual…

Can anyone explain this or give me a hint how to do this better?

Thanks a lot,

Claus Massion

If I understand correctly, this line: t[idx] = bits[1]; writes the same value 256 times (each thread copies the value to global memory). You could try:

if (!idx) 

  t[idx] = bits[1];

In this part:

if (bits[1])


each thread will write to the same memory location, which could give strange results.

I didn’t understand the last part of your post.
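For the underlying goal (finding out whether any thread ended up with bits[1] == 1), one possible approach is a single flag in global memory that only the threads with a hit touch. The following is a rough sketch, not the original poster's code: the name found, the placeholder computation, and the use of atomicOr are assumptions made for illustration. If every writer stores the same value, a plain store would probably also do; atomicOr just makes the intent explicit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: each thread computes its own bits[] and raises one shared flag
// if bits[1] is set. "found" is a hypothetical name, not from the post.
__global__ void calculate(int *found)
{
    int bits[60] = {0};
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Placeholder for the real computation (assumption): pretend that
    // exactly one thread produces a 1.
    bits[1] = (idx == 1234) ? 1 : 0;

    // Only threads with a hit touch global memory.
    if (bits[1]) {
        atomicOr(found, 1);
    }
}

int main()
{
    const int blocks = 64, threads = 256;  // launch size from the post
    int *d_found = NULL;
    int h_found = 0;

    cudaMalloc((void **)&d_found, sizeof(int));
    cudaMemset(d_found, 0, sizeof(int));   // flag starts cleared

    calculate<<<blocks, threads>>>(d_found);

    // Copy back a single int instead of one int per thread.
    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    printf("found = %d\n", h_found);

    cudaFree(d_found);
    return 0;
}
```

Because the flag depends on the computed value, the compiler cannot drop the computation, so the measured time includes the real work.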

Dear Claus,

By pasting the line above (“t[idx] = bits[1];”) into your code or leaving it out, you will never know whether your algorithm or the write-back to global memory is the bottleneck: if you don’t write the data back, you cannot be sure whether the compiler is optimizing some lines (or maybe all lines) of your code away. It makes no sense to process data that is never output. In fact, the compiler will normally not generate any code that has nothing to do with the output data.

For example, if you have a pointless loop like this:

int j = 2;

for (int i = 0; i < 999999; i++) {

    j *= 2;

}

and j is never used again, then this loop will normally not be compiled.
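The same effect can be sketched in CUDA. The kernel names loop_discarded and loop_stored and the arithmetic in the loop are made up for illustration; the only difference between the two kernels is whether the loop result reaches global memory. Comparing the generated code (for example with nvcc -ptx) should show the loop only in the second kernel, although what exactly gets removed depends on the compiler version and flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example: the loop result is never stored, so the compiler
// is free to remove the loop (and with it the whole kernel body).
__global__ void loop_discarded()
{
    unsigned int j = threadIdx.x;
    for (int i = 0; i < 999999; i++) {
        j = j * 1664525u + 1013904223u;  // arbitrary arithmetic
    }
}

// Same loop, but the result is written to global memory, so the
// compiler has to keep the computation.
__global__ void loop_stored(unsigned int *out)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int j = idx;
    for (int i = 0; i < 999999; i++) {
        j = j * 1664525u + 1013904223u;
    }
    out[idx] = j;
}

int main()
{
    const int blocks = 64, threads = 256;  // launch size from the post
    unsigned int *out = NULL;
    cudaMalloc((void **)&out, blocks * threads * sizeof(unsigned int));

    loop_discarded<<<blocks, threads>>>();
    loop_stored<<<blocks, threads>>>(out);

    // Wait for both kernels and report any launch or runtime error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(out);
    return 0;
}
```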

I would suggest doing some write-backs without any processing.

I mean just write back some zeros or any random data.

If the write-back gets faster, then the algorithm should be the bottleneck.
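A write-only baseline along these lines could look like the sketch below. The kernel name write_only and the event-based timing are assumptions for illustration; the launch size (64 blocks, 256 threads) is taken from the original post. The measured time can then be compared against the full kernel timed the same way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical baseline kernel: no computation, only the write-back.
__global__ void write_only(int *t)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    t[idx] = 0;
}

int main()
{
    const int blocks = 64, threads = 256;
    int *t = NULL;
    cudaMalloc((void **)&t, blocks * threads * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    write_only<<<blocks, threads>>>(t);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    // Time a single launch with CUDA events.
    cudaEventRecord(start);
    write_only<<<blocks, threads>>>(t);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("write-only kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(t);
    return 0;
}
```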

Sincerely, chubi

chubaca is correct. Kernels that do not produce any output are optimized away completely, so they don’t perform any computation at all.