GTX280 is slower than 8800GTX ?!

Hi,

I tested my CUDA program in two different environments.

First I tested it on 8800GTX (cpu: intel core2duo e6550, os: windows xp)

and then, I tested it on GTX280 (cpu: intel core2quad q9450, os: windows vista)

The computation time on 8800GTX is about 37(ms).

But the computation time on GTX280 is about 113(ms)!!

I ran exactly the same code on both of them.

The code to measure computation time is like below

I don’t understand why the result on the GTX280 is worse than on the 8800GTX…

Any advice would be helpful.

Thank you.

Since the GTX280 has more multiprocessors, it is possible that your blocks are spread out, resulting in fewer active blocks per MP. This can expose register latencies, which would decrease your performance…

Spawn more blocks on the GTX280 (twice as many as on the 8800GTX) and then see how the results change.

You mean, spread the threads of one block across other blocks?

Actually my program has the dimension of grid and block like below.

dim3 grid(1, 1, 1)

dim3 block(128, 1, 1)

Then, would it help to change the dimensions to

dim3 grid(2, 1, 1)

dim3 block(64, 1, 1)

or

dim3 grid(4, 1, 1)

dim3 block(32, 1, 1)

, etc.?

I’ll try it as soon as possible (because GTX280 is not available right now ;( )

Thanks!

Hopefully you have another cudaThreadSynchronize after the kernel calls, or else your timing data is bunk: kernel launches return immediately, so without it you only time the launch. Also, are you starting the timer before any cuda* calls? If so, then you are timing the driver initialization too, which takes a long time.

You do realize that you are completely under-utilizing the hardware, right? At least 100 blocks are typically needed before all the MPs are fully warmed up on the 8800 GTX, and probably even more on the GTX 280. Much of the speed increase of the GTX 280 comes from its 30 MPs, as opposed to the 16 on the 8800 GTX.

But there are only 128 data items to compute in total.

So it’s hard to increase the number of blocks.

Does that mean this case is not a good fit for CUDA?

A thread block gets mapped to a multiprocessor. If you are using 1 thread block, then you have 15 (8800gtx) or 29 (gtx280) multiprocessors doing absolutely nothing.

Are there any partial calculations that can be parallelized?
128 is a pretty small number of items to calculate under the CUDA model.

Only you can answer whether it’s worth it, by running your algorithm on the CPU and comparing results.

I see…

With 128 items, it’s just 2 times faster than the CPU version.

Actually, I’m implementing Featherstone’s divide-and-conquer algorithm for articulated bodies using CUDA.

But I think this algorithm is not very efficient to implement in CUDA.

Anyway,

Thank you for all your answers!! :)

p.s.

Oh, I’ve just changed the dimensions of the grid and block

from grid(1,1,1) and block(128,1,1) to grid(128, 1, 1) and block(1, 1, 1)

The computation time is similar.

But now I can test more than 128 items!

Before the change, I couldn’t use more than 128 items because of the lack of local memory and registers.

You should really be using 32 or 64 threads per block at minimum. As far as the occupancy calculator tells us, the registers for 64 threads (2 warps) are the minimum number of registers required on an MP.

A block needs to be at least 32 threads, since this is the size of a warp. A warp is a group of threads that is physically executed in parallel on the card.
By using only 1 thread per block, I’m guessing only one scalar processor per multiprocessor is doing any valuable work.
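
To put a number on that guess, here is a small host-side sketch in plain C++ (it also compiles inside a .cu file). The helper names warpsPerBlock and laneUtilization are made up for illustration. It only models how many of a warp's 32 lanes hold a real thread, not actual execution speed.

```cpp
// Warp size on the GPUs discussed in this thread (8800 GTX, GTX 280).
constexpr int kWarpSize = 32;

// Number of warps needed to hold one block's threads (rounded up).
constexpr int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Fraction of warp lanes that hold a real thread (1.0 = no wasted lanes).
// Hypothetical helper: a rough utilization measure, not a timing model.
constexpr double laneUtilization(int threadsPerBlock) {
    return static_cast<double>(threadsPerBlock) /
           (warpsPerBlock(threadsPerBlock) * kWarpSize);
}
```

With block(1,1,1), laneUtilization(1) is 1/32, so about 3% of each warp is doing useful work. With block(32,1,1) or block(64,1,1) it is 1.0.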

Frankly, this is all in the programming guide, and you should read and understand it before trying to tackle anything on your own from which you wish to draw conclusions. If you just want to see some quick code, look at the examples in the SDK.

Sorry, although I’ve read the programming guide, I still haven’t understood the relationship between a warp and a block, the concept of a half-warp, etc.

Then, is a block the same as a warp? I’m always confused about it.

I know that the warp size is 32 and one MP can run 24 warps concurrently (so 32*24=768 threads).

But if a warp is a group of threads that is executed in parallel on the card, then why does the programming guide say that 24 warps are executed concurrently?

Thanks!

I recommend you read this:
http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt

Thank you for your kindness :)

It will surely help me understand CUDA better.

And as you guys advised, I’ve changed my program to have a minimum of 32 threads per block.

Now I get better performance!

Of course there is no lack of memory, since the number of threads per block is smaller than 128, so I can increase the number of items.

With more items, the CUDA version performs better than the CPU version.

Now I understand something of how CUDA works physically.

Thanks a lot again.

CUDA in a nutshell:

Hardware:
a multiprocessor executes one warp ‘at a time’. A warp = 32 threads. They all execute the same instruction.
a multiprocessor can have 24 warps in flight (32 warps on GT200). That means all those warps can be active, although only 1 warp is doing calculation at a time, the other warps are e.g. waiting for data to come in from global memory.
a GPU has a number of multiprocessors (e.g. 30 on a GTX280)

Software:
you define a grid (1D or 2D) of blocks (maximum dimensions 65535x65535)
Each block contains the same number of threads (that are indexed in a 1D, 2D or 3D fashion). Maximum number of threads per block = 512.
Threads within a block can communicate through shared memory and synchronize with each other (__syncthreads())
A kernel is a program that is executed for all threads in a grid.

Mapping the software on the hardware.
A block runs on 1 MP.
A MP can run more than 1 block concurrently (how many depends on the register and shared memory usage of the kernel; the maximum number of blocks per MP is 8)
When more blocks are requested than can be run concurrently on the MP’s available on the GPU, the excess blocks are scheduled as soon as other blocks have finished (that is why there is only synchronization within a block)
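
As a rough illustration of that mapping, the sketch below is plain host-side C++ that counts how many blocks fit on the GPU at once and how many "waves" a grid would need. The function names and the blocksPerMp input are hypothetical. Treating execution as lockstep waves is a simplification, since the hardware really schedules a new block as soon as any block finishes.

```cpp
#include <algorithm>

// Hardware limit quoted above: at most 8 resident blocks per MP.
constexpr int kMaxBlocksPerMp = 8;

// Blocks that can be resident on the whole GPU at once. blocksPerMp is what
// the kernel's register and shared memory usage allows (hypothetical input).
int residentBlocks(int multiprocessors, int blocksPerMp) {
    return multiprocessors * std::min(blocksPerMp, kMaxBlocksPerMp);
}

// Idealized number of waves needed to run gridBlocks blocks (rounded up).
int waves(int gridBlocks, int multiprocessors, int blocksPerMp) {
    int resident = residentBlocks(multiprocessors, blocksPerMp);
    return (gridBlocks + resident - 1) / resident;
}
```

For a GTX280 with 30 MPs and a kernel light enough to hit the 8-block limit, residentBlocks(30, 8) is 240, so any grid of up to 240 blocks fits in a single wave under this model.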

A block should be a multiple of 64 threads, or else you get conflicts at the register file (or so we are told).

I thought a MP needs to have a minimum of 192 threads to avoid register stalling…

Where is that 64 thing coming from? I know that even if your block size is 32, the CUDA occupancy calculator will count 64 threads’ worth of registers per block… which means you are under-utilizing the registers. That’s as much as I know on this topic.

Yeony,

I still think you are under-utilizing the power of the GPU, because you can’t just have 32 threads per MP. You need at least 192 threads per MP to saturate the GPU…

Assuming you can accommodate 6 active blocks (6*32 = 192) in a MP, you need at least 96 blocks on the 8800GTX and 180 blocks on the GTX280…
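
The arithmetic behind those two numbers can be checked with a small host-side C++ sketch. The helper names are made up for illustration, and the 192-threads-per-MP target is the rule of thumb from this post, taken here as an assumption.

```cpp
// Blocks per MP needed to reach a target thread count (rounded up).
constexpr int blocksPerMpNeeded(int targetThreadsPerMp, int threadsPerBlock) {
    return (targetThreadsPerMp + threadsPerBlock - 1) / threadsPerBlock;
}

// Total blocks in the grid so that every MP can reach the target.
constexpr int gridBlocksNeeded(int multiprocessors, int targetThreadsPerMp,
                               int threadsPerBlock) {
    return multiprocessors *
           blocksPerMpNeeded(targetThreadsPerMp, threadsPerBlock);
}
```

gridBlocksNeeded(16, 192, 32) gives 96 for the 8800GTX and gridBlocksNeeded(30, 192, 32) gives 180 for the GTX280. With 64-thread blocks the counts halve to 48 and 90.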

Right you are, but with only 192 threads total (if we are still on that), using 32-thread blocks could mean twice as many MPs are actually used. Anyway, that was the thought behind my suggestion.

Thank you for your answers.

But actually the algorithm I’m using executes the kernel function level by level (of a tree).

For example, assume that there is a binary tree which has 1024 leaf nodes.

First, the kernel function is executed for the 1024 leaf nodes, so I need 1024 threads at this point.

Next, the kernel function is executed for the 512 parent nodes of the leaf nodes.

The whole program repeats this until the kernel function is executed for the root node. (For the root node the utilization is the worst, since just one thread is needed.)

I think this algorithm is not well suited to the GPU.

Moreover, if I make a tree which has 4096 leaf nodes (so the total number of nodes in the tree is 8191), I can’t copy the data from the CPU to the GPU because of lack of memory. (A node of the tree holds a lot of data, and I can’t reduce it…)

I’m trying to find some other part of my code that can be parallelized. (Gosh, it may be quite a confusing task.)

If I can’t, it’s hard to increase the number of blocks.
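
To make the level-by-level launch sizes concrete, here is a host-side C++ sketch that lists the node count per level and the blocks each level would need. It assumes a complete binary tree with a power-of-two leaf count and one thread per node. The function names and the 64-thread block size used in the note are illustrative choices, not taken from the actual program.

```cpp
#include <vector>

// Node counts per level of a complete binary tree, leaves first.
// Assumes leafCount is a power of two.
std::vector<int> nodesPerLevel(int leafCount) {
    std::vector<int> levels;
    for (int n = leafCount; n >= 1; n /= 2) {
        levels.push_back(n);
    }
    return levels;
}

// Blocks needed to give every node of one level its own thread (rounded up).
int blocksForLevel(int nodes, int threadsPerBlock) {
    return (nodes + threadsPerBlock - 1) / threadsPerBlock;
}

// Total node count of the tree; equals 2 * leafCount - 1.
int totalNodes(int leafCount) {
    int total = 0;
    for (int n : nodesPerLevel(leafCount)) {
        total += n;
    }
    return total;
}
```

For 1024 leaves this gives 11 kernel launches and 2047 nodes in total. With 64 threads per block, the leaf level needs 16 blocks, and every level from 64 nodes down to the root needs only 1 block, which is where the utilization drops off.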

You don’t need 1024 threads to operate on 1024 data items (or nodes).

You can still have 64 threads and run a “for” loop.

For example, to clear 1024 elements in a for loop with blockDim.x threads, I would do this:

for(int i=threadIdx.x; i<1024; i+=blockDim.x)

  globalMemory[i] = 0;

If I want multiple blocks to do this, I would do this:

int dataPerBlock = 1024/gridDim.x;  // assumes gridDim.x divides 1024 evenly

int *globalMemoryPointer = &globalMemory[blockIdx.x*dataPerBlock];  // element type assumed to be int

for(int i=threadIdx.x; i<dataPerBlock; i+=blockDim.x)

{

 globalMemoryPointer[i] = 0;

}

OR

in a more complete and beautiful way like this:

for(int i= blockIdx.x*blockDim.x + threadIdx.x; i<1024; i+=blockDim.x*gridDim.x)

{

  globalMemory[i] = 0;

}
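
If you want to convince yourself that this last loop touches every element exactly once for any grid and block size, a CPU-side emulation is enough. The sketch below is plain C++, not a kernel: the two outer loops stand in for blockIdx.x and threadIdx.x, which the GPU would run in parallel, and the names gridStrideVisits and coversExactlyOnce are made up for illustration.

```cpp
#include <vector>

// Emulates the grid-stride loop on the CPU and returns how many times each
// of the n elements was written.
std::vector<int> gridStrideVisits(int n, int gridDimX, int blockDimX) {
    std::vector<int> visits(n, 0);
    for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX) {
        for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
            // Same index arithmetic as the kernel loop above.
            for (int i = blockIdxX * blockDimX + threadIdxX; i < n;
                 i += blockDimX * gridDimX) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True if every element was written exactly once.
bool coversExactlyOnce(int n, int gridDimX, int blockDimX) {
    for (int count : gridStrideVisits(n, gridDimX, blockDimX)) {
        if (count != 1) return false;
    }
    return true;
}
```

This holds even when the grid size does not divide the element count, for example 30 blocks of 64 threads over 1024 elements. The dataPerBlock variant would skip the last 4 elements in that case, because 1024/30 truncates to 34.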

I have dealt with a binomial tree like the one you are talking about (in finance), and I was able to get a 100x to 220x speedup on an 8800 GTX.

Initially, it may appear that things won’t work well on the GPU. But if you understand the overheads and good GPU programming practices, you can get wonderful results!

So don’t lose hope! It should work out well for you! Good luck!

If you could share more details on this algorithm, it would be helpful! (Anyway, I’ll re-read this thread to see if you have mentioned it before.)