Since the GTX280 has more multiprocessors, your blocks may be spread out across them, leaving fewer active blocks per MP. That can expose register latencies, which would explain the drop in performance…
Spawn more blocks on the GTX280 (twice as many as on the 8800GTX) and see how the results change.
Hopefully you have another cudaThreadSynchronize after the kernel calls, or else your timing data is bunk. Also, are you starting the timer before any cuda* calls? If so, you are timing the driver initialization too, which takes a long time.
You do realize that you are completely under-utilizing the hardware, right? At least 100 blocks are typically needed before all the MPs are fully warmed up on the 8800 GTX, and probably even more on the GTX 280. Much of the speed increase of the GTX 280 comes from its 30 MPs, as opposed to the 16 on the 8800 GTX.
You should really be using 32 or 64 threads per block at minimum. As far as the occupancy calculator tells us, registers are allocated for at least 64 threads (2 warps) at a time, so that is the minimum register allocation on an MP.
A block should have at least 32 threads, since this is the size of a warp. A warp is a group of threads that the card executes together, all running the same instruction.
By using only 1 thread per block, I'm guessing only one scalar processor per multiprocessor is doing any useful work.
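To put a rough number on that, here is a small Python sketch of how much of each warp a block actually fills. It only assumes the warp size of 32 mentioned above; the helper name `lane_utilization` is made up for illustration, and it ignores everything else that affects performance (memory latency, occupancy, and so on).

```python
import math

WARP_SIZE = 32  # threads per warp, as stated above


def lane_utilization(threads_per_block):
    """Fraction of warp lanes doing useful work in one block (hypothetical helper)."""
    # A block is split into whole warps, so a partial warp still occupies 32 lanes.
    warps = math.ceil(threads_per_block / WARP_SIZE)
    return threads_per_block / (warps * WARP_SIZE)


for n in (1, 32, 48, 64):
    print(n, "threads per block ->", lane_utilization(n))
```

With 1 thread per block, 31 of the 32 lanes in the warp are idle, which is the situation described above.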
Frankly, this is all in the programming guide, and you should read and understand it before tackling anything on your own that you want to draw conclusions from. If you just want to see some quick code, look at the examples in the SDK.
A multiprocessor executes one warp ‘at a time’. A warp = 32 threads. They all execute the same instruction.
A multiprocessor can have 24 warps in flight (32 warps on GT200). That means all those warps can be active, although only 1 warp is doing calculations at a time; the other warps are, e.g., waiting for data to come in from global memory.
A GPU has a number of multiprocessors (e.g. 30 on a GTX280).
You define a grid (1D or 2D) of blocks (maximum dimensions 65535x65535).
Each block contains the same number of threads (indexed in a 1D, 2D or 3D fashion). The maximum number of threads per block is 512.
Threads within a block can communicate through shared memory and synchronize with each other (__syncthreads).
A kernel is a program that is executed for all threads in a grid.
Mapping the software on the hardware.
A block runs on 1 MP.
An MP can run more than 1 block concurrently (how many depends on the register and shared memory usage of the kernel; the maximum number of blocks per MP is 8).
When more blocks are requested than can run concurrently on the MPs available on the GPU, the excess blocks are scheduled as soon as other blocks have finished (that is why there is only synchronization within a block).
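The mapping above can be sketched with a bit of arithmetic in Python. This is only a back-of-the-envelope model using the limits quoted in this post (32 threads per warp, 24 or 32 warps in flight per MP, at most 8 blocks per MP); the function names are hypothetical, and the real number of resident blocks per MP depends on the kernel's register and shared memory usage.

```python
import math


def threads_in_flight_per_mp(warps_per_mp, warp_size=32):
    """Upper bound on resident threads per MP (24 warps on G80, 32 on GT200)."""
    return warps_per_mp * warp_size


def resident_blocks(num_mps, blocks_per_mp):
    """Blocks that can be resident on the whole GPU at once."""
    return num_mps * blocks_per_mp


def waves(total_blocks, num_mps, blocks_per_mp):
    """Rough count of 'rounds' needed if excess blocks wait for free slots."""
    return math.ceil(total_blocks / resident_blocks(num_mps, blocks_per_mp))


# Assumption for illustration: the kernel is light enough for 8 blocks per MP.
print(threads_in_flight_per_mp(24), threads_in_flight_per_mp(32))
print(waves(1024, 16, 8), waves(1024, 30, 8))
```

In practice blocks are replaced one by one as they finish rather than in strict rounds, so treat the wave count as an estimate, not a schedule.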
I thought a MP needs to have a minimum of 192 threads to avoid register stalling…
Where is that 64 coming from? I know that even if your block size is 32, the CUDA occupancy calculator will still count registers for 64 threads per block… Which means you are under-utilizing the registers. That's as much as I know on this topic.
I still think you are under-utilizing the power of the GPU, because you can't get by with just 32 threads per MP. You need at least 192 threads per MP to saturate the GPU…
Assuming you can accommodate 6 active blocks (6*32 = 192) on an MP, you need at least 96 blocks (6*16) on the 8800GTX and 180 blocks (6*30) on the GTX280…
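For anyone who wants to redo that arithmetic with other block sizes, here is a minimal Python sketch. The 192 threads per MP target is the rule of thumb from this post, not a hard limit, and `blocks_to_saturate` is a name I made up. It also does not check the limit of 8 blocks per MP mentioned earlier in the thread.

```python
import math


def blocks_to_saturate(num_mps, threads_per_block, target_threads_per_mp=192):
    """Blocks needed so every MP holds at least the target number of threads."""
    # Assumption: enough registers and shared memory for this many blocks per MP.
    blocks_per_mp = math.ceil(target_threads_per_mp / threads_per_block)
    return num_mps * blocks_per_mp


print("8800GTX (16 MPs):", blocks_to_saturate(16, 32))
print("GTX280  (30 MPs):", blocks_to_saturate(30, 32))
```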
But actually, the algorithm I'm using executes the kernel function level by level (of a tree).
For example, assume there is a binary tree with 1024 leaf nodes.
First, the kernel function is executed for the 1024 leaf nodes, so I need 1024 threads at this point.
Next, the kernel function is executed for the 512 parent nodes of the leaf nodes.
The whole program repeats this until the kernel function is executed for the root node. (For the root node the utilization is the worst, since just one thread is needed.)
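To make the shrinking workload concrete, here is a small Python sketch that lists how many threads each kernel launch would need. It assumes a complete binary tree whose leaf count is a power of two and one thread per node, as in the 1024-leaf example; `threads_per_level` is just an illustrative name.

```python
def threads_per_level(leaf_count):
    """Threads needed per kernel launch, from the leaves up to the root."""
    # Assumption: complete binary tree, leaf_count is a power of two.
    levels = [leaf_count]
    while levels[-1] > 1:
        levels.append(levels[-1] // 2)  # each parent level has half as many nodes
    return levels


levels = threads_per_level(1024)
print(len(levels), "kernel launches")
print(levels)
print(sum(levels), "nodes processed in total")
```

Only the first few launches have enough threads to fill more than a handful of warps; the launches near the root are almost entirely overhead.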
I think this algorithm is not well suited to the GPU.
Moreover, if I build a tree with 4096 leaf nodes (so the tree has 8191 nodes in total), I can't copy the data from the CPU to the GPU because there isn't enough memory. (Each node holds a lot of data, and I can't reduce it…)
I'm trying to find some other part of my code that can be parallelized. (Gosh, it may be quite a confusing task.)
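A quick way to check the memory side is to work out the node count and the per-node budget. The sketch below is Python and purely illustrative: the actual size of a node is not given in this thread, the function names are mine, and the device sizes are the nominal 768 MB (8800 GTX) and 1 GB (GTX 280), of which less is actually available to an application.

```python
def total_nodes(leaf_count):
    """Nodes in a complete binary tree with the given number of leaves."""
    return 2 * leaf_count - 1


def tree_bytes(leaf_count, bytes_per_node):
    """Total footprint if every node takes bytes_per_node (an assumed figure)."""
    return total_nodes(leaf_count) * bytes_per_node


def max_bytes_per_node(leaf_count, device_bytes):
    """Largest node size that would still fit, ignoring all other allocations."""
    return device_bytes // total_nodes(leaf_count)


MB = 1024 * 1024
print(total_nodes(4096), "nodes")
print("8800 GTX budget per node:", max_bytes_per_node(4096, 768 * MB), "bytes")
print("GTX 280 budget per node:", max_bytes_per_node(4096, 1024 * MB), "bytes")
```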
If I can’t, it’s hard to increase the number of blocks.