A simple problem

I have a large array, roughly 20 million elements.
Now I want to multiply each element by 2. The block size is 512, and I coded it like this:

int bx = blockIdx.x;
int tx = threadIdx.x;

binary[bx<<9+tx]= binary[bx<<9+tx]<<1;
__syncthreads();

Will this give me good performance? I think maybe I should use shared memory or other tricks?

Thanks
:)

The code you posted won’t work, since + has higher precedence than << in C: bx<<9+tx is parsed as bx << (9 + tx).

int bx = blockIdx.x;
int tx = threadIdx.x;

binary[bx*512+tx] = binary[bx*512+tx]*2;

Would get you reasonably good performance.
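
For reference, a minimal complete kernel along those lines might look like this (just a sketch; it assumes n is an exact multiple of the 512-thread block size, as the original indexing does):

__global__ void compute(int *binary)
{
    int i = blockIdx.x * 512 + threadIdx.x;   // 512 threads per block, one int per thread
    binary[i] = binary[i] * 2;                // no __syncthreads() needed: threads don't touch each other's data
}

// launch: compute<<< n / 512, 512 >>>(d_Binary);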

According to Eric, it would be better to use 256 threads and read/write int2 instead of int.

Shared memory won’t accelerate such a well-coalesced operation.
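
For what it’s worth, here is a rough sketch of the int2 variant attributed to Eric (the kernel name and launch shape are my assumptions, not taken from his post): 256 threads per block, each thread loading and storing one int2, so one block still covers 512 ints. It assumes the array is 8-byte aligned and n is a multiple of 512.

__global__ void compute_int2(int2 *binary2)
{
    int i = blockIdx.x * 256 + threadIdx.x;   // one int2 (two ints) per thread
    int2 v = binary2[i];
    v.x *= 2;
    v.y *= 2;
    binary2[i] = v;
}

// launch: compute_int2<<< n / 512, 256 >>>( (int2 *)d_Binary );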

So reading/writing an int2 per thread is faster than an int, according to you? Have you done benchmarks confirming this? In both cases, coalescing will happen.

I said “according to Eric”; he did a benchmark in another post…

If you’d give a reference, that would help.

I’m interested in the link too; searching the forum for the author “Eric” didn’t turn up much.

I guess Eric is osiris, but that doesn’t help much.

Seems this is the topic:

http://forums.nvidia.com/index.php?showtopic=40211

But reading the attached text file, it seems that the highest performance is achieved for reading/writing int1; reading int2 only reaches about 90% of that at best.

Thanks guys :)
I changed my code, but I can’t get the execution time. Can someone tell me why?
The code is this:

 dim3 threads(BLOCK_SIZE,1,1);
 dim3 grid( (int)ceil((float)n/BLOCK_SIZE),1,1);

 t1 =clock();
 compute<<< grid, threads >>>(d_Binary);
 t2 =clock();
 total = t2-t1;

In compute, it is
int bx = blockIdx.x;
int tx = threadIdx.x;

binary[(bx<<9)+tx]= binary[(bx<<9)+tx]<<1;
__syncthreads();

And I use total/CLOCKS_PER_SEC to get the time, but I only get 0 :(

You can’t measure extremely small times with clock(), and the kernel launch returns asynchronously anyway, so t2-t1 only captures the launch overhead. Run the kernel 1000 times (or more) in a loop, then call cudaThreadSynchronize(), then measure the final time.
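
Something like this, reusing the names from your post (just a sketch):

t1 = clock();
for (int i = 0; i < 1000; ++i)              // amortize launch overhead and clock() resolution
    compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize();                    // launches are asynchronous; wait until they have finished
t2 = clock();
total = t2 - t1;                            // per-launch time is total / 1000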

Use QueryPerformanceCounter() in Windows or the timer in the CUDA SDK.
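
On Windows that could look roughly like this (a hypothetical sketch; note it still synchronizes before reading the counter):

#include <windows.h>

LARGE_INTEGER freq, start, stop;
QueryPerformanceFrequency(&freq);           // counter ticks per second

QueryPerformanceCounter(&start);
compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize();                    // make sure the kernel has actually finished
QueryPerformanceCounter(&stop);

double seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;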

Also, you need to call cudaThreadSynchronize() or cuCtxSynchronize() before the second clock() or you get wrong timings.