Hi there,

I’ve got the following problem:

I’ve got code where I do some computational stuff.

I’m launching the kernel with 64 blocks of 256 threads each.

This is no problem as long as I don’t try to get the results.

But as soon as I write the data to global memory, it gets very slow.

Here is the code:

The function calculate:

```
__global__ void calculate(int* t)
{
    int bits[60];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // ---
    // some computational stuff
    // ---
    t[idx] = bits[1];  // one result per thread, written to global memory
}
```

The code in main:

```
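#define BlockSize 64    // number of blocks in the grid
#define ThreadSize 256  // threads per block
int BT = BlockSize * ThreadSize;  // total number of threads
int* t;                           // device buffer, one int per thread
cudaMalloc((void**)&t, BT * sizeof(int));
calculate<<<BlockSize, ThreadSize>>>(t);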
```
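For reference, a self-contained sketch of how the host side might look when the results are actually read back. The body of `calculate` here is a hypothetical stand-in for the elided computation, and the error checks, the `cudaMemcpy` back to the host and the host-side scan are assumptions, not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BlockSize 64    // number of blocks in the grid
#define ThreadSize 256  // threads per block

// Hypothetical stand-in for the elided computation: it only makes the
// sketch compile and gives exactly one thread a result of 1.
__global__ void calculate(int* t)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    t[idx] = (idx == 1234) ? 1 : 0;
}

int main()
{
    const int BT = BlockSize * ThreadSize;
    int* t = NULL;
    if (cudaMalloc((void**)&t, BT * sizeof(int)) != cudaSuccess) return 1;

    calculate<<<BlockSize, ThreadSize>>>(t);

    // The launch is asynchronous: first check that the launch itself was
    // accepted, then wait for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the per-thread results back and check whether any of them is 1.
    int* h = (int*)malloc(BT * sizeof(int));
    cudaMemcpy(h, t, BT * sizeof(int), cudaMemcpyDeviceToHost);

    int found = 0;
    for (int i = 0; i < BT; ++i) found |= h[i];
    printf("found = %d\n", found);

    free(h);
    cudaFree(t);
    return 0;
}
```

Because the launch returns immediately, any timing taken on the host only covers the kernel if it is measured after the synchronize call.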

As long as the line “t[idx] = bits[1];” is not in the function, everything is really fast.

Is there any way I can make this call faster?

The array “bits” only contains 0 or 1, and I have to check whether the computed result (which is stored in bits[1]) is 1 in any of the threads.

Then I tried something like

```
if (bits[1])
    t[0]++;
```

in place of “t[idx] = bits[1];”, where t was a single integer, but it didn’t help.
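A race-free variant of that single-flag idea could be sketched as follows. This is only an illustration under stated assumptions: `compute_bit` is a hypothetical stand-in for the elided computation, `flag` is a hypothetical name for the single integer, and `atomicOr` on global memory assumes a device that supports global atomics (compute capability 1.1 or later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided computation; returns 0 or 1.
__device__ int compute_bit(int idx)
{
    return (idx % 5000 == 4999) ? 1 : 0;
}

__global__ void calculate(int* flag)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int bit = compute_bit(idx);
    // Only threads that found a 1 touch global memory, and atomicOr
    // avoids the read-modify-write race of a plain flag[0]++.
    if (bit)
        atomicOr(flag, 1);
}

int main()
{
    int* flag = NULL;
    cudaMalloc((void**)&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));  // start with "no thread found a 1"

    calculate<<<64, 256>>>(flag);

    int h_flag = 0;
    // This cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_flag, flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("any result is 1: %d\n", h_flag);

    cudaFree(flag);
    return 0;
}
```

With this layout only a single int has to be copied back to the host instead of one value per thread.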

What I didn’t understand there was that when I left out the if line, it was as fast as usual…

Can anyone explain this or give me a hint how to do this better?

Thanks a lot,

Claus Massion