a) If I have only one warp in a block (32 threads), is there any need to use __syncthreads() anywhere?
I tested without it, and it sometimes works and sometimes doesn’t.
b) The atomic lock from the book “CUDA by Example” works well for block-wise synchronization. What modifications
are needed to make it work for thread-wise synchronization within a warp? For example, if two threads within the same warp
access a critical section in global memory, how is that handled? I hope this is clear enough.
a) Potentially yes. [font=“Courier New”]__syncthreads()[/font] not only synchronizes the threads, it also tells the compiler to discard and reload any copies of shared memory contents it holds in registers. If you don’t use [font=“Courier New”]__syncthreads()[/font], you need to declare your shared memory variables as [font=“Courier New”]volatile[/font].
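As a minimal sketch of the first option, here is a one-warp block that keeps the barrier in place while reducing through shared memory. The kernel name warpBlockSum and the host helper sumOneWarp are made up for illustration and are not from the original poster's code. It does not show the [font=“Courier New”]volatile[/font] alternative, only the variant that relies on [font=“Courier New”]__syncthreads()[/font].

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every block is exactly one warp (32 threads)
// and sums 32 consecutive floats through shared memory.
__global__ void warpBlockSum(const float *in, float *out)
{
    __shared__ float buf[32];
    const unsigned tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * 32 + tid];
    __syncthreads();                  // every element is written before anyone reads

    for (unsigned stride = 16; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();              // outside the branch, so all 32 threads reach it
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}

// Hypothetical host helper: sums 32 floats with a single one-warp block.
// Returns -1.0f if any CUDA call fails.
float sumOneWarp(const float *hostIn)
{
    float *dIn = nullptr, *dOut = nullptr;
    float result = -1.0f;
    if (cudaMalloc(&dIn, 32 * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&dOut, sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dIn, hostIn, 32 * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess) {
        warpBlockSum<<<1, 32>>>(dIn, dOut);
        if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0f;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With a single warp per block the barrier is cheap, so keeping it is usually the simpler choice.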
b) Threadwise locks in CUDA are usually not a good idea (they are inefficient and usually there are better ways either via reduction schemes or using the implicit synchronicity of the warp). If you still want threadwise locks, I don’t currently have time to describe them, but you’ll find examples in the forum.
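For reference, a pattern often posted for per-thread locks keeps the critical section inside the loop body, so that the thread holding the lock can release it before the rest of its warp retries. The sketch below is one such variant, not the lock from “CUDA by Example”. The names lockedIncrement and runLockedIncrement are hypothetical. Whether it behaves on older hardware depends on how the compiler lays out the branch, so treat that as an assumption and test on your own device.

```cuda
#include <cuda_runtime.h>

// Sketch of a per-thread spin lock. *lock is 0 when free and 1 when held.
// The critical section is a plain (non-atomic) increment of *counter.
__global__ void lockedIncrement(int *lock, volatile int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to take the lock
            __threadfence();                // order the lock acquisition before the reads below
            *counter = *counter + 1;        // critical section
            __threadfence();                // publish the update before releasing
            atomicExch(lock, 0);            // release the lock
            done = true;
        }
    }
}

// Hypothetical host helper: runs the kernel and returns the final counter,
// or -1 if any CUDA call fails.
int runLockedIncrement(int blocks, int threads)
{
    int *dState = nullptr;                  // dState[0] = lock, dState[1] = counter
    int hostCounter = -1;
    if (cudaMalloc(&dState, 2 * sizeof(int)) == cudaSuccess &&
        cudaMemset(dState, 0, 2 * sizeof(int)) == cudaSuccess) {
        lockedIncrement<<<blocks, threads>>>(dState, dState + 1);
        if (cudaMemcpy(&hostCounter, dState + 1, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            hostCounter = -1;
    }
    cudaFree(dState);
    return hostCounter;
}
```

Every thread is serialized through the lock, which is exactly the inefficiency mentioned above. When the critical section is only an update of one value, a single atomic operation such as atomicAdd avoids the lock altogether.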
Why would you have only a single warp in your block? You won’t get more than 8 blocks on a single SM, so you end up with at most 8 warps on an SM that can hold up to 48 warps on Fermi cards. I have always found that kernels launched with small block sizes take twice as long or more than with an “optimal” block size, which usually lies between 256 and 512 threads.
If this single-warp block uses the entire shared memory, and that is why you can’t have more than one warp in your block, then you get only 1 warp per SM, because shared memory is a per-SM resource. That is 1 warp out of 48, which looks like it would really hurt even if all your work is in shared memory.
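The arithmetic behind those numbers can be sketched as a small host-side function. It considers only the block limit and the warp limit per SM (registers and shared memory are ignored), and the function name residentWarpsPerSM is made up for illustration. The limits are passed in as parameters, using the 8-block and 48-warp figures quoted above as example inputs.

```cuda
#include <algorithm>

// Upper bound on resident warps per SM, looking only at the per-SM
// block limit and warp limit. Register and shared memory limits are
// deliberately left out of this sketch.
int residentWarpsPerSM(int threadsPerBlock, int maxBlocksPerSM,
                       int maxWarpsPerSM, int warpSize = 32)
{
    if (threadsPerBlock <= 0 || warpSize <= 0)
        return 0;
    // A partially filled warp still occupies a whole warp slot.
    const int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    // Blocks that fit before the warp limit is reached.
    const int blocksByWarpLimit = maxWarpsPerSM / warpsPerBlock;
    const int residentBlocks = std::min(maxBlocksPerSM, blocksByWarpLimit);
    return residentBlocks * warpsPerBlock;
}
```

With these inputs, residentWarpsPerSM(32, 8, 48) yields 8 warps, while residentWarpsPerSM(256, 8, 48) yields 48.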
I would like to hear about your experience with performance, especially as a function of block size (that is, the number of warps per block).
To tera: your signature is indeed helpful. Could you please tell me how to run a program under cuda-memcheck? Is that a compiler flag?
The GPU I am using is a very old one with compute capability 1.1. The register usage of my kernel limits the number of threads per block to 256, so I run them as 8 blocks of 1 warp each on a single MP. With one active block per MP (i.e. 256 threads in one block), the performance is similar, but the former gives me fine-grained control because I am using block-wise locks. For Fermi cards I am still going to use 8 blocks per MP, but the number of warps per block should increase to harness the full power of the device.
I have generally found that for my application (which involves generating and probing a tree in global memory) 8 warps per MP hide the latency really well. I also have the freedom to do more computation before probing, so optimizing that can give me better results. It is essentially a tree on top of Monte Carlo simulations at the leaves. A block gets a node from global memory and then its warps run Monte Carlo simulations on it. Reducing the number of threads per block therefore lets more nodes be worked on at the same time. I am not experienced enough, but I hope it is clear.
I see what you mean. However, because at most 8 blocks can be mapped to a single SM (or MP if you like), even on newer machines, you should make your block granularity larger, say 64 or 128 threads at least, especially since you’re not losing performance. With larger blocks you’ll take advantage of the resources on the newer machine and your code will scale with them.
I did a small computation based on what you said, and it looks like you’re using 32 registers per thread. That means on Fermi cards you’ll lose half the resources of the machine by making your blocks only 32 threads. 64 is the number to use there, but with newer cards the number of registers is almost sure to grow. So perhaps you could write a function that chooses the right block size for your kernel: by querying the device for the register file size and the maximum number of resident blocks per SM, and knowing how many registers you need, you can maximize the use of resources.
Keep in mind that these are not the only constraints. There are many more to consider, like shared memory (if you use it), maximum block size, maximum warps per SM…
I never worried about putting exactly 8 blocks on the SM. What I found more important is the number of warps that fit on the SM, minding of course the register and shared memory usage. That’s where the magic number of 256 threads per block comes into play: it is large enough and small enough to make good use of the resources.
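A function of the kind suggested above can be sketched with calls that ship in more recent CUDA toolkits (they were added after this discussion, so this is an assumption about your toolkit version). cudaOccupancyMaxPotentialBlockSize asks the runtime for a block size that maximizes occupancy for a given kernel, and cudaFuncGetAttributes reports the kernel's register usage. The kernel simulateKernel and the two wrapper functions are hypothetical placeholders.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own.
__global__ void simulateKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Asks the runtime for a block size that maximizes occupancy of
// simulateKernel on the current device. Returns 0 on failure.
int suggestBlockSize(size_t dynamicSharedBytes = 0)
{
    int minGridSize = 0;   // smallest grid that can fill the device (unused here)
    int blockSize = 0;
    if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                           simulateKernel,
                                           dynamicSharedBytes, 0) != cudaSuccess)
        return 0;
    return blockSize;
}

// Registers per thread used by simulateKernel, or -1 on failure.
int registersPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, simulateKernel) != cudaSuccess)
        return -1;
    return attr.numRegs;
}
```

The suggested size only maximizes occupancy. As the timings in this thread show, the best block size for a particular algorithm still has to be confirmed by measurement.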
I should have been more careful with the performance statement I made. For a fixed number of requested simulations, the time required to finish the job is almost equal, but the behavior of the algorithm is significantly altered. When I use more blocks, the tree grows rapidly (which is what I want). Here is a quick test I did by changing the block size for a fixed number of simulations.
nBlocks x nThreads    Time (ms)    Total nodes in tree
112 x 32              6609         17487
56 x 64               5782         8685
28 x 128              5859         4159
14 x 256              5907         4097
The device has 14 multiprocessors, so the number of blocks is a multiple of that. As you can see, with more blocks the tree grows quite rapidly, but the time required to finish the job is slightly larger. I can’t say whether that additional time is due to the small block size of 32. The tree is grown in global memory, so it may well be due to latency. A block size of 64 gives the smallest time and an average growth rate, so maybe I will use that, but using the minimum computational unit (i.e. a warp) seems suitable for the algorithm.
[i]Edit: There is yet another parameter that controls tree growth: how many simulations are run before the tree is consulted. The above results were obtained with 64 simulations (nLoops) before each consultation. I took the last block size and changed this new parameter.
You are right after all. If I lower nLoops from 64 to 16 (or even 8) and use a block size of 256, I get a lower time and a larger tree, an all-round win! [/i]
Yes, that is about right. I use 28 registers, but due to another, avoidable constraint (power-of-two block size) I chose a block size of 256.
For the Fermi card, I will have to run a test like the one above to choose a suitable configuration.
I recently learned about the CUDA Occupancy Calculator, which takes the things you mentioned into consideration. I got about 33% but I think maximizing
To lift the runtime restriction, you need a dedicated GPU for CUDA that is not running the user interface. Under Linux just configure X11 to not run on the card. Under Windows you need the TCC driver which is limited to Tesla cards.
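To check whether a particular card is subject to this restriction, the runtime exposes a flag in the device properties. A minimal sketch follows; the wrapper name kernelTimeoutEnabled is made up, while kernelExecTimeoutEnabled is the field of cudaDeviceProp that reports it.

```cuda
#include <cuda_runtime.h>

// Returns 1 if the run time limit for kernels is enabled on the device,
// 0 if it is not, and -1 if the device properties cannot be queried.
int kernelTimeoutEnabled(int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}
```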
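Just prepend [font=“Courier New”]cuda-memcheck[/font] to the command that starts your program, for example [font=“Courier New”]cuda-memcheck ./your_program[/font] (the program name is a placeholder). It is a separate tool, not a compiler flag.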
EDIT: Check my updated signature for more info.