I have a large array, say about 20 million elements.
Now I want to multiply each element by 2. The block size is 512, and I coded it like this:
int bx = blockIdx.x;
int tx = threadIdx.x;
binary[bx<<9+tx]= binary[bx<<9+tx]<<1;
__syncthreads();
Will this give me good performance? Or should I be using shared memory or some other trick?
Thanks
:)
The code you posted won’t work as intended, since + has higher precedence than << in C.
int bx = blockIdx.x;
int tx = threadIdx.x;
binary[bx*512+tx]= binary[bx*512+tx]*2;
That would get you reasonably good performance.
According to Eric, it would be better to use 256 threads per block and read/write int2 instead of int.
Shared memory won’t accelerate such a well-coalesced operation.
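As a rough sketch, the int2 variant could look something like this (the name compute2 is just a placeholder, and it assumes the number of elements is a multiple of 512, so each block of 256 threads covers exactly 512 ints):
__global__ void compute2(int2 *binary)
{
    // each thread loads and stores one int2 (two ints) as a single 8-byte access
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int2 v = binary[i];
    v.x <<= 1; // multiply both halves by 2
    v.y <<= 1;
    binary[i] = v;
}
// launch with 256 threads per block; each block still covers 512 ints:
// compute2<<< n/512, 256 >>>((int2 *)d_Binary);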
wumpus
October 8, 2007, 8:48am
3
So reading/writing an int2 per thread is faster than an int, according to you? Have you done benchmarks confirming this? In both cases, coalescing will happen.
I said “according to Eric”; he did a benchmark in another post…
If you’d give a reference, that would help.
wumpus
October 9, 2007, 11:42am
6
I’m interested in the link too; searching the forum for author “Eric” didn’t turn up much.
I guess Eric is osiris, but that doesn’t help much.
wumpus
October 9, 2007, 12:35pm
8
Seems this is the topic:
http://forums.nvidia.com/index.php?showtopic=40211
But reading the text file attached there, it seems the highest performance is achieved when reading/writing int1; reading int2 only reaches about 90% of that at best.
Thanks guys :)
I changed my code, but I can’t get the execution time. Can someone tell me why?
The code is this:
dim3 threads(BLOCK_SIZE,1,1);
dim3 grid( (int)ceil((float)(n/BLOCK_SIZE)),1,1);
t1 =clock();
compute<<< grid, threads >>>(d_Binary);
t2 =clock();
total = t2-t1;
Inside compute, it is:
int bx = blockIdx.x;
int tx = threadIdx.x;
binary[(bx<<9)+tx]= binary[(bx<<9)+tx]<<1;
__syncthreads();
And I use total/CLOCKS_PER_SEC to get the time, but I only get 0 :(
You can’t measure extremely small times. Run the kernel 1000 times (or more) in a loop, then call cudaThreadSynchronize(), then measure the final time.
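For example, something along these lines, based on your snippet (the 1000 is arbitrary):
t1 = clock();
for (int i = 0; i < 1000; i++)
    compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize(); // wait until all the launched kernels have finished
t2 = clock();
// average time per launch, in seconds
double perLaunch = (double)(t2 - t1) / CLOCKS_PER_SEC / 1000.0;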
vvolkov
October 11, 2007, 1:19am
11
Use QueryPerformanceCounter() on Windows, or the timer in the CUDA SDK.
Also, you need to call cudaThreadSynchronize() or cuCtxSynchronize() before the second clock(), or you get wrong timings.
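For the single launch from your snippet, the minimal change would be something like this (kernel launches are asynchronous, so without the synchronize you only time the launch overhead):
t1 = clock();
compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize(); // block until the kernel has actually finished
t2 = clock();
total = t2 - t1;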