A simple problem

I have a large array, roughly 20 million elements.
Now I want to multiply each element by 2. The block size is 512, and I coded it like this:

int bx = blockIdx.x;
int tx = threadIdx.x;

binary[bx<<9+tx]= binary[bx<<9+tx]<<1;
__syncthreads();

Will this give me good performance? I think maybe I should use shared memory or other tricks?

Thanks
:)

The code you posted won’t work, since + has higher precedence than << in C: bx<<9+tx is parsed as bx << (9 + tx).

int bx = blockIdx.x;
int tx = threadIdx.x;

binary[bx*512+tx] = binary[bx*512+tx]*2;

Would get you reasonably good performance.
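
For reference, a minimal complete kernel along those lines might look like this (just a sketch; it assumes n is an exact multiple of the 512-thread block size, as the original indexing does):

__global__ void compute(int *binary)
{
    int i = blockIdx.x * 512 + threadIdx.x;   // 512 threads per block, one int per thread
    binary[i] = binary[i] * 2;                // no __syncthreads() needed: threads don't touch each other's data
}

// launch: compute<<< n / 512, 512 >>>(d_Binary);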

According to Eric, it would be better to use 256 threads and read/write int2 instead of int.

Shared memory won’t accelerate such a well-coalesced operation.
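
For what it’s worth, here is a rough sketch of the int2 variant attributed to Eric (the kernel name and launch shape are my assumptions, not taken from his post): 256 threads per block, each thread loading and storing one int2, so one block still covers 512 ints. It assumes the array is 8-byte aligned and n is a multiple of 512.

__global__ void compute_int2(int2 *binary2)
{
    int i = blockIdx.x * 256 + threadIdx.x;   // one int2 (two ints) per thread
    int2 v = binary2[i];
    v.x *= 2;
    v.y *= 2;
    binary2[i] = v;
}

// launch: compute_int2<<< n / 512, 256 >>>( (int2 *)d_Binary );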

So reading/writing an int2 per thread is faster than an int, according to you? Have you done benchmarks confirming this? In both cases, coalescing will happen.

I said “according to Eric”; he did a benchmark in another post…

If you’d give a reference, that would help.

I’m interested in the link too; searching the forum for the author “Eric” didn’t turn up much.

I guess Eric is osiris, but that doesn’t help much.

Seems this is the topic:

http://forums.nvidia.com/index.php?showtopic=40211

But reading the attached text file, it seems that the highest performance is achieved for reading/writing int1; reading int2 only reaches about 90% of that at best.

Thanks guys :)
I changed my code, but I can’t get the execution time. Can someone tell me why?
The code is this:

 dim3 threads(BLOCK_SIZE,1,1);
 dim3 grid( (int)ceil((float)n/BLOCK_SIZE),1,1);

 t1 =clock();
 compute<<< grid, threads >>>(d_Binary);
 t2 =clock();
 total = t2-t1;

In compute, it is
int bx = blockIdx.x;
int tx = threadIdx.x;

binary[(bx<<9)+tx]= binary[(bx<<9)+tx]<<1;
__syncthreads();

And I use total/CLOCKS_PER_SEC to get the time, but I only get 0 :(

You can’t measure extremely small times with clock(), and the kernel launch returns asynchronously anyway, so t2-t1 only captures the launch overhead. Run the kernel 1000 times (or more) in a loop, then call cudaThreadSynchronize(), then measure the final time.
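
Something like this, reusing the names from your post (just a sketch):

t1 = clock();
for (int i = 0; i < 1000; ++i)              // amortize launch overhead and clock() resolution
    compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize();                    // launches are asynchronous; wait until they have finished
t2 = clock();
total = t2 - t1;                            // per-launch time is total / 1000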

Use QueryPerformanceCounter() in Windows or the timer in the CUDA SDK.
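
On Windows that could look roughly like this (a hypothetical sketch; note it still synchronizes before reading the counter):

#include <windows.h>

LARGE_INTEGER freq, start, stop;
QueryPerformanceFrequency(&freq);           // counter ticks per second

QueryPerformanceCounter(&start);
compute<<< grid, threads >>>(d_Binary);
cudaThreadSynchronize();                    // make sure the kernel has actually finished
QueryPerformanceCounter(&stop);

double seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;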

Also, you need to call cudaThreadSynchronize() or cuCtxSynchronize() before the second clock() or you get wrong timings.