Newbie questions :)


I finally managed to write my first program in CUDA :) (it finds the max element in an array).

I ran into a few questions while writing this example:

  1. Whenever I have a multiplication or division by 2, should I use a shift instead of mul or div? Does this include for statements?
    e.g.: for(…; …; currentDepth = currentDepth / 2)

  2. If a graphics card has 256MB of RAM, how is this memory split between texture memory and global memory (which is usually called device memory, if I am right)?
    What would be the fastest way to find the max element in a 256MB array? Would it be better to pass the array as a texture, as a cudaArray, or in some other way (and where would the array then be located: global, texture, device… memory)?

  3. Is measuring time like this OK for GPU and CPU or not:
    gettimeofday(&startTime, NULL);
    findMaxKernel<<<dimGrid, dimBlock>>>(Ad, BLOCK_SIZE, BLOCK_SIZE, Cd);
    gettimeofday(&stopTime, NULL);
    printf("GPU: %ld\n", stopTime.tv_usec-startTime.tv_usec);
    (in the kernel I used __syncthreads(); as (almost) the last command)

  4. When I tried this example in emurelease mode I got wrong results. Is this probably due to serial execution of threads? (But should that still be the case when I used __syncthreads()?)

  5. When exactly do bank conflicts occur: only when writing, or also when reading?

  6. Anything else I should be paying special attention to?

Thanks for answers :)

  1. Division is VERY costly on the GPU, so do your best to avoid it. However, the compiler is smart enough to replace division by 2^n with an n-bit shift and % 2^n with an AND against (2^n - 1).

  2. Not sure, but my guess is that texture memory is allocated from global memory.

  3. No, it’s not okay. Your findMaxKernel launch will return instantly because kernel launches are asynchronous. You need to call cudaThreadSynchronize() after launching the kernel to make sure it has finished.

  4. Have you tried to debug your code in emu* mode? I think there’s some problem with your code, not emulation.

  5. Read the manual. Reads can also cause bank conflicts.
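To make the point about reads concrete, here is a minimal sketch of a shared-memory read whose bank behaviour depends only on the stride. It assumes 32 banks of 4-byte words (the earliest CUDA hardware had 16); the names stridedRead, runStridedRead, NUM_BANKS and THREADS are hypothetical and not from the posts above.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BANKS 32   // assumption: 32 four-byte shared-memory banks
#define THREADS   32   // one warp

// Each thread reads one float from shared memory at index threadIdx.x * stride.
// stride == 1         : every thread of the warp hits a different bank.
// stride == NUM_BANKS : every thread hits the same bank, so the reads serialize.
// stride must be in the range 1..NUM_BANKS to stay inside the tile.
__global__ void stridedRead(const float *in, float *out, int stride)
{
    __shared__ float tile[THREADS * NUM_BANKS];

    // Fill the tile from global memory, then wait until all writes are visible.
    for (int i = threadIdx.x; i < THREADS * NUM_BANKS; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();

    out[threadIdx.x] = tile[threadIdx.x * stride];   // the read under test
}

// Hypothetical host helper: runs one warp and returns the value read by `thread`.
float runStridedRead(int stride, int thread)
{
    const int n = THREADS * NUM_BANKS;
    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float h_out[THREADS];
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, THREADS * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    stridedRead<<<1, THREADS>>>(d_in, d_out, stride);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[thread];
}
```

Both strides return correct values; only the speed differs, which is why bank conflicts tend to show up in a profiler rather than as wrong results.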

Yes. This is good practice both for GPU and CPU programming.
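As a small illustration of the shift and mask forms, the sketch below spells them out for unsigned values. The helper names divPow2, modPow2 and countHalvings are made up for this example; the loop mirrors the currentDepth / 2 update from the question.

```cuda
#include <cuda_runtime.h>

// For unsigned x:  x / 2^n == x >> n   and   x % 2^n == x & (2^n - 1).
__host__ __device__ inline unsigned divPow2(unsigned x, unsigned n)
{
    return x >> n;
}

__host__ __device__ inline unsigned modPow2(unsigned x, unsigned n)
{
    return x & ((1u << n) - 1u);
}

// Counts how many times depth can be halved before it reaches zero,
// using a shift in the for statement instead of "d = d / 2".
__host__ __device__ inline int countHalvings(unsigned depth)
{
    int steps = 0;
    for (unsigned d = depth; d > 0; d >>= 1)
        ++steps;
    return steps;
}
```

For signed values that may be negative, a plain shift is not equivalent to division (rounding differs), which is one reason to write the division and let the compiler pick the right sequence when in doubt.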

global, local, texture and constant memory are allocated from the same memory pool. (constant memory is handled specially though, as the cache has to be managed)

You would need some kind of reduction, there is a reduction example in the SDK. Just storing in global memory would be best IMO (use coalesced reads!)
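A rough sketch of what coalesced reads from global memory can look like for a max search: on every pass of the loop, neighbouring threads read neighbouring elements. To keep it short, the per-thread results are combined with atomicMax rather than with the SDK-style reduction, so this is a simplification and not the SDK example. The names maxCoalesced and findMaxCoalesced are hypothetical, and the launch configuration (64 blocks of 256 threads) is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

// Grid-stride loop: thread k reads elements k, k + stride, k + 2*stride, ...
// so consecutive threads always touch consecutive addresses (coalesced).
__global__ void maxCoalesced(const int *data, int n, int *result)
{
    int localMax = INT_MIN;
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        localMax = max(localMax, data[i]);

    // Simplification: one atomic per thread instead of a tree reduction.
    atomicMax(result, localMax);
}

// Hypothetical host wrapper; returns INT_MIN for an empty input.
int findMaxCoalesced(const std::vector<int> &h)
{
    int n = (int)h.size();
    int init = INT_MIN;
    if (n == 0) return init;

    int *d_data, *d_result;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &init, sizeof(int), cudaMemcpyHostToDevice);

    maxCoalesced<<<64, 256>>>(d_data, n, d_result);

    int out = INT_MIN;
    cudaMemcpy(&out, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return out;
}
```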

No, that will give you a pointless number. You need to call cudaThreadSynchronize() before starting and before stopping the timers.
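A sketch of the timing pattern, keeping gettimeofday from the question. It uses cudaDeviceSynchronize(), the newer name for the synchronization call (cudaThreadSynchronize() is deprecated in later toolkits). The kernel busyKernel and the helpers elapsedUs and timeKernelUs are placeholders invented for this example. Note that the elapsed time has to combine tv_sec and tv_usec; subtracting tv_usec alone goes wrong whenever the measurement crosses a second boundary.

```cuda
#include <cuda_runtime.h>
#include <sys/time.h>

// Placeholder workload (hypothetical): touch every element once.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Elapsed microseconds between two timestamps, using both fields.
long elapsedUs(const timeval &start, const timeval &stop)
{
    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}

// Times one launch of busyKernel over n floats and returns microseconds.
long timeKernelUs(int n)
{
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    timeval start, stop;
    cudaDeviceSynchronize();            // make sure earlier GPU work is done
    gettimeofday(&start, NULL);

    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // returns immediately
    cudaDeviceSynchronize();            // wait until the kernel has finished

    gettimeofday(&stop, NULL);
    cudaFree(d_data);
    return elapsedUs(start, stop);
}
```

CUDA events (cudaEventRecord and cudaEventElapsedTime) are an alternative that measures on the GPU side instead of with a host clock.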

You should get the same results; this probably means there is a bug in your program.


Thanks for answers.

Should I still get the same result even if I overwrote existing data?

My findMax example works the following way:

at the start, each thread copies 1 element to shared memory
step 1:
turn off every odd thread (return)
every active thread then takes 2 elements at some position X,Y (pos = x*width+y)
then it finds max(pos,pos+1) and stores it at pos/2
calculate the thread’s new index (to determine whether this thread is then odd or even)
go to step 1 (this repeats as long as there are more than 2 threads)
when only 1 thread remains, it finds the max of 2 elements (the left and right subtrees)
and stores it in the return value

PS: if I sometimes get a different result between runs of the program, does this probably mean I forgot to sync threads somewhere (in release mode)?

It is an absolute must to have __syncthreads() after copying to shared memory.

BTW, your idea of ‘switching off’ odd threads will have a big impact on performance, because the number of active threads in each warp keeps shrinking.

Indeed, you really shouldn’t turn off every odd thread. If you really cannot find something to do for some of your threads, use a < or > comparison on threadIdx (preferably aligned to the warp size, 32).
Otherwise you will create thread divergence in every warp.


Ok, thanks :)

Your maximum reduction can be done in the same way as a sum reduction. This post has a short sample code showing how to do it:…l=sum+reduction

The trick to avoiding divergent warps is not to do the odd/even split but to have successively smaller groups of threads work on the data, always threads 0 to M.
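A minimal sketch of that idea for a maximum, assuming a power-of-two block size: at every step only threads 0 to active-1 do work, so whole warps drop out together instead of every warp losing its odd threads. The names blockMax, findMaxTree and BLOCK are hypothetical, and finishing the per-block results on the host is just one simple option, not the method from the linked post.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

#define BLOCK 256   // assumption: threads per block, must be a power of two

// Each block writes the maximum of its BLOCK input elements to blockResults.
__global__ void blockMax(const int *in, int n, int *blockResults)
{
    __shared__ int s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : INT_MIN;   // pad the tail with the identity
    __syncthreads();                      // required after the copy to shared memory

    // Active threads are always 0..active-1; the group halves each step.
    for (int active = blockDim.x >> 1; active > 0; active >>= 1) {
        if (tid < active)
            s[tid] = max(s[tid], s[tid + active]);
        __syncthreads();                  // outside the if: every thread reaches it
    }

    if (tid == 0)
        blockResults[blockIdx.x] = s[0];
}

// Hypothetical host wrapper: reduces per block on the GPU, then
// combines the per-block maxima on the CPU.
int findMaxTree(const std::vector<int> &h)
{
    int n = (int)h.size();
    if (n == 0) return INT_MIN;
    int blocks = (n + BLOCK - 1) / BLOCK;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockMax<<<blocks, BLOCK>>>(d_in, n, d_out);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int result = INT_MIN;
    for (int b = 0; b < blocks; ++b)
        if (partial[b] > result) result = partial[b];
    return result;
}
```

Keeping __syncthreads() outside the if matters: if only some threads of a block reach the barrier, the behaviour is undefined, which is one way to end up with results that change from run to run.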