Newbie questions :)

lightenix · November 15, 2007, 9:24pm

Hello

I finally managed to write my first program in CUDA :) (it finds the max element in an array).

I ran into a few questions while writing this example:

Whenever I have a multiplication or division by 2, should I use a shift instead of mul or div? Does this also apply to for statements?
e.g.: for(…; …; currentDepth = currentDepth / 2)

If a graphics card has 256MB of RAM, how is this memory split between texture memory and global memory (which is usually called device memory, if I am right)?
What would be the fastest way to find the max element in a 256MB array? Would it be better to pass the array as a texture, as a cudaArray, or in some other way (and where would the array then be located: global, texture, device… memory)?

Is measuring time like this OK for GPU and CPU, or not:
gettimeofday(&startTime, NULL);
findMaxKernel<<<dimGrid, dimBlock>>>(Ad, BLOCK_SIZE, BLOCK_SIZE, Cd);
gettimeofday(&stopTime, NULL);
printf("GPU: %ld\n", stopTime.tv_usec-startTime.tv_usec);
(in the kernel I used __syncthreads(); as (almost) the last command)

When I tried this example in emurelease mode I got wrong results. Is this probably due to the serial execution of threads? (But should that still be the case when I use __syncthreads()?)

When exactly do bank conflicts occur: only when writing, or also when reading?

Is there anything else I should pay special attention to?

Thanks for answers :)

AndreiB · November 16, 2007, 5:51am

Division is VERY costly on the GPU, so do your best to avoid it. However, the compiler is smart enough to replace division by 2^n with an n-bit shift and % 2^n with an AND against (2^n - 1).
Not sure, but my guess is that texture memory is allocated from global memory.
No, it’s not okay. Your findMaxKernel call will return instantly because kernel launches are asynchronous. You need to call cudaThreadSynchronize() after launching the kernel to make sure it has finished.
Have you tried to debug your code in emu* mode? I think there’s some problem with your code, not with emulation.
Read the manual. Reading may also cause bank conflicts.
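
A minimal sketch of the shift and mask rewrites mentioned in the first answer above. The helper names `halve`, `mod8` and `countLevels` are made up for illustration and do not come from the thread. The identities are exact for unsigned integers. For negative signed values `/` rounds toward zero while `>>` typically rounds down, so the compiler has to emit extra fix-up instructions there, which is one reason to prefer unsigned counters in loops like the one in the question.

```cuda
// Hypothetical helpers (not from the thread). They are marked
// __host__ __device__ so the same code can be checked on the CPU
// and called from a kernel.
__host__ __device__ inline unsigned int halve(unsigned int x)
{
    return x >> 1;            // same value as x / 2 for unsigned x
}

__host__ __device__ inline unsigned int mod8(unsigned int x)
{
    return x & (8u - 1u);     // same value as x % 8 for unsigned x
}

// A loop in the style of the question, with the halving step written
// as a shift. It counts how many times start can be halved before
// reaching zero.
__host__ __device__ inline unsigned int countLevels(unsigned int start)
{
    unsigned int levels = 0;
    for (unsigned int depth = start; depth > 0; depth >>= 1)
        ++levels;
    return levels;
}
```

In practice, writing `depth / 2` with an unsigned `depth` should compile to the same shift, so the explicit form is mostly a readability choice.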

wumpus · November 16, 2007, 3:08pm

Yes. This is good practice both for GPU and CPU programming.

Global, local, texture and constant memory are allocated from the same memory pool. (Constant memory is handled specially though, as the cache has to be managed.)

You would need some kind of reduction; there is a reduction example in the SDK. Just storing the array in global memory would be best IMO (use coalesced reads!)

No, that will give you a pointless number. You need to call cudaThreadSynchronize() before starting and before stopping the timer.

You should get the same results; this probably means there is a bug in your program.

Both.
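
The timing advice above can be sketched as follows. This is only an illustration under some assumptions: `dummyKernel`, `elapsedUsec` and `timeKernelUsec` are hypothetical names, the kernel is a stand-in for findMaxKernel, and `cudaDeviceSynchronize()` is used because it is the current runtime name for what this thread calls cudaThreadSynchronize() (the older name is deprecated in later CUDA releases). The sketch also uses both `tv_sec` and `tv_usec`, since subtracting only `tv_usec` goes wrong whenever the measurement crosses a second boundary.

```cuda
#include <sys/time.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for findMaxKernel from the question.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f;
}

// Elapsed microseconds between two samples, using both timeval fields.
static long elapsedUsec(const timeval &start, const timeval &stop)
{
    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}

// Times one launch of the placeholder kernel on n floats already on the
// device. Returns microseconds, or -1 if the runtime reports an error.
long timeKernelUsec(float *d_data, int n)
{
    timeval start, stop;

    cudaDeviceSynchronize();                    // drain any earlier GPU work
    gettimeofday(&start, NULL);

    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // returns immediately

    cudaError_t err = cudaDeviceSynchronize();  // wait until the kernel is done
    gettimeofday(&stop, NULL);

    if (err != cudaSuccess)
        return -1;
    return elapsedUsec(start, stop);
}
```

Without the second synchronize, the stop sample is taken while the kernel may still be running, so the result mostly reflects launch overhead.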

lightenix · November 17, 2007, 9:50pm

Thanks for answers.

Well, should I still get the same result if I overwrote existing data?

My findMax example works the following way:

At the start, each thread copies 1 element to shared memory.
Step 1:
Turn off every odd thread (return).
Every active thread then takes 2 elements at some position X,Y (pos = x*width+y),
finds max(pos, pos+1) and stores it at pos/2.
Calculate the thread’s new index (to determine whether this thread is now odd or even).
Go to step 1 (this repeats as long as there are more than 2 threads).
When only 1 thread remains, it finds the max of the 2 elements (left and right subtree)
and stores it in the return value.

PS: If I sometimes get a different result between runs of the program, does this probably mean I forgot to sync threads somewhere (in release mode)?

AndreiB · November 18, 2007, 7:39am

It is an absolute must to have __syncthreads() after copying to shared memory.

BTW, your idea of ‘switching off’ odd threads will have a big impact on performance because the number of active threads in each warp keeps shrinking.

wumpus · November 18, 2007, 11:06am

Indeed, you really shouldn’t turn off every odd thread. If you really cannot find something to do for part of your threads, use a < or > on threadIdx (preferably aligned to the warp size, 32).
Otherwise you will create thread divergence in every warp.
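
To see why the form of the predicate matters, here is a small sketch that counts how many warps in a block contain a mix of active and idle threads. Everything in it is an assumption made for illustration: `OddEven`, `Below`, `countDivergentWarps` and `scaleFirst` are hypothetical names, and the counting is a host-side model of the two predicates, not a measurement of the hardware.

```cuda
const int kWarpSize = 32;   // warp size quoted in the post above

// Predicate 1: keep even threads, idle odd ones (the approach in question).
struct OddEven {
    __host__ __device__ bool operator()(int tid) const { return (tid % 2) == 0; }
};

// Predicate 2: keep threads 0..limit-1, idle the rest.
struct Below {
    int limit;
    __host__ __device__ explicit Below(int l) : limit(l) {}
    __host__ __device__ bool operator()(int tid) const { return tid < limit; }
};

// Host-side model: a warp counts as divergent when some, but not all,
// of its threads satisfy the predicate.
template <typename Pred>
int countDivergentWarps(int blockSize, Pred isActive)
{
    int divergent = 0;
    for (int base = 0; base < blockSize; base += kWarpSize) {
        int active = 0, total = 0;
        for (int tid = base; tid < base + kWarpSize && tid < blockSize; ++tid) {
            ++total;
            if (isActive(tid))
                ++active;
        }
        if (active != 0 && active != total)
            ++divergent;
    }
    return divergent;
}

// Kernel sketch using the contiguous form inside a single block:
// whole warps are either fully active or fully idle when limit is a
// multiple of kWarpSize.
__global__ void scaleFirst(float *data, int limit)
{
    if (threadIdx.x < limit)
        data[threadIdx.x] *= 2.0f;
}
```

With a block of 256 threads, the odd/even predicate splits all 8 warps, a limit of 128 splits none, and a limit that is not a multiple of 32 splits only the one warp that straddles it.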

lightenix · November 18, 2007, 6:38pm

Oooo

Ok, thanks :)

MisterAnderson42 · November 18, 2007, 9:23pm

Your maximum reduction can be done in the same way as a sum reduction. This post has a short sample code showing how to do it: http://forums.nvidia.com/index.php?showtopic=29856&hl=sum+reduction

The trick to avoiding divergent warps is not to do the odd/even split, but to have successively smaller groups of threads work on the data, always threads 0 to M.
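
A minimal sketch of a max reduction along these lines, assuming a power-of-two block size. It is not the code from the linked post: `blockMaxKernel`, `findMax` and `kBlockSize` are hypothetical names, error checking is left out for brevity, and the final pass over the per-block results is done on the CPU to keep the example short.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

const int kBlockSize = 256;   // assumed to be a power of two

// Each block reduces up to kBlockSize input elements to one maximum.
__global__ void blockMaxKernel(const float *in, float *blockMax, int n)
{
    __shared__ float s[kBlockSize];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Neighbouring threads read neighbouring addresses (coalesced).
    // Threads past the end contribute the identity for max.
    s[tid] = (i < n) ? in[i] : -FLT_MAX;
    __syncthreads();          // all loads must land before anyone reads s[]

    // Threads 0..stride-1 stay active; stride halves every step.
    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();      // outside the branch so every thread reaches it
    }

    if (tid == 0)
        blockMax[blockIdx.x] = s[0];
}

// Host wrapper: one maximum per block on the GPU, final pass on the CPU.
float findMax(const std::vector<float> &h)
{
    if (h.empty())
        return -FLT_MAX;

    int n = (int)h.size();
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    float *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, n * sizeof(float));
    cudaMalloc((void **)&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockMaxKernel<<<blocks, kBlockSize>>>(dIn, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float m = -FLT_MAX;
    for (int b = 0; b < blocks; ++b)
        m = (partial[b] > m) ? partial[b] : m;
    return m;
}
```

Because the active threads are always 0 to stride-1, whole warps drop out together until stride falls below the warp size, and no thread ever returns before a __syncthreads().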
