Need help from anyone who knows the compiler (especially the CUDA compiler) well

Hi, everyone.

I ran into a performance problem while programming with CUDA, so I did some tests on both the GPU and the CPU. What I got suggests that the more an automatic variable in a kernel is changed, the more expensive it is to access. Someone said that some code may be discarded in the optimization stage, and my tests seem consistent with that, but the result looks abnormal compared with a CPU version. (All the programs do is increment an automatic variable in a doubly nested for loop.) I have two questions, and I hope someone who knows compilers well, especially the CUDA compiler, can give me a hand. I would really appreciate it.

1. Does the CUDA compiler really discard code that does not contribute to the output? My test on the CPU shows that gcc does not do that, even when an automatic variable is never accessed after its declaration.

2. Although the clock rates are similar (GPU: 1.5 GHz, CPU: 1.66 GHz), my test shows that doing the same thing on the GPU is twenty times slower than on the CPU. That seems abnormal! Why does this happen?

CUDA code and times recorded by the CUDA event API:

[codebox]__global__ void count_test(int *d_test_count)
{
    unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int count = 0, address = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    d_test_count[item_idx] = count; //i; //j; //item_idx; //address; //0;
    // takes 6.296416 ms when assigning count to d_test_count[item_idx]
    // takes 6.292064 ms when assigning j to d_test_count[item_idx]
    // takes 0.303104 ms when assigning i to d_test_count[item_idx]
    // takes 0.291360 ms when assigning item_idx to d_test_count[item_idx]
    // takes 0.288374 ms when assigning zero to d_test_count[item_idx]
    // takes 0.289696 ms when assigning address to d_test_count[item_idx]
}

int main(int argc, char **argv)
{
    int *test_count;
    CUDA_SAFE_CALL( cudaMalloc((void **)&test_count, 1024 * sizeof(int)) );

    count_test<<<1, 1>>>(test_count);
    // ... timing with the CUDA event API elided ...
}[/codebox]

CPU code and times recorded by gettimeofday():

[codebox]#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define seconds(tm) gettimeofday(&tp, (struct timezone *)0); \
                    tm = tp.tv_sec + tp.tv_usec / 1000000.0

struct timeval tp;

void h_count_test(int *h_test_count)
{
    unsigned int count = 0, address = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    h_test_count[0] = 0; //count; //address; //j; //i;
    // takes 0.000310 s when assigning count to h_test_count[0]
    // takes 0.000310 s when assigning j to h_test_count[0]
    // takes 0.000310 s when assigning i to h_test_count[0]
    // takes 0.000330 s when assigning zero to h_test_count[0]
    // takes 0.000310 s when assigning address to h_test_count[0]
}

int main(int argc, char **argv)
{
    int *h_test_count = (int *) malloc(1024 * sizeof(int));
    double t1, t2;

    seconds(t1);
    h_count_test(h_test_count);
    seconds(t2);

    printf("start time of h_count_test() is %f\n", t1);
    printf("end time of h_count_test() is %f\n", t2);
    printf("time cost of h_count_test() is %f\n", t2 - t1);
}[/codebox]
device: GeForce 8800 GT

CPU: dual core, 1.66 GHz

OS: Fedora 7

Thanks in advance.

Your question has already been answered in the two threads you have already made asking this exact same question, so let's hope three is enough.
Yes, it gets optimized away.

If you're using only one thread on the GPU, then the test is completely worthless and not worth investigating.

Why is it meaningless to run one thread on the GPU? Can CUDA really achieve a big performance improvement if the computational capability of a single GPU thread is much weaker than a CPU's?

A GPU is designed to run multiple threads. The only reason a GPU can process data faster than a CPU is that it can process many more of those threads in parallel (240 for a GTX 280) compared to a CPU (4 in today's average desktop). So even though the individual execution units in a GPU are clocked much slower than today's CPUs, the massive parallelism in a GPU allows more computation to be done in the same amount of time. Running one thread on a GPU isn't going to give you any performance indications worth a second glance, if that's what you're after.

I see your point, and I agree that running one thread on the GPU is meaningless. What I am trying to do is get a concrete idea of how a CUDA program performs on the GPU, so that I can pre-estimate its performance improvement over the CPU counterpart through some calculation, because I will only write a CUDA program once I can be sure it will be faster than the CPU version. Maybe that is unreasonable…

Anyway, thank you.

IMO that is a poor approach to determining whether CUDA can give a speedup or not. CUDA programs get a speedup because of 1) many cores and 2) increased memory bandwidth (though there are some constraints). By running only 1 thread you are using neither feature, so you will never be able to determine whether you will get a speedup. You will never be able to take advantage of the memory bandwidth with a single thread, as the memory latency will kill the performance. Also, for some applications you will notice a sizeable speedup only if the data size is large enough (dense matrix-matrix multiplication becomes faster only when the matrices are bigger than a certain size, 500x500 I think). You will have to write a program that uses a reasonable number of threads (a warp on every multiprocessor is a decent start) operating on a reasonable data size to determine whether the application is worth porting or not.

I find that in general it is very difficult to pre-estimate performance enhancements. It really requires writing a real CUDA program and trying. There are simple cases where you can pre-estimate it clearly (like matrix addition, where you can coalesce all accesses and just count the amount of memory read and written and divide it by the device bandwidth to get an approximation of the kernel's running time). In general you need to try, especially when you are going to use shared memory or texture memory, for example.

In my opinion, running one thread on the GPU is not meaningless.

When I want to convert a C/C++ function to CUDA code, first of all I run it with only one thread, check that no timeout occurs, and compare the result with the C/C++ code.

This confirms that the allocation and the copy from host memory to device memory are correct, and catches other problems like the accuracy of operations, etc.

If the result of running one thread on the GPU is OK, I break up the for loop inside the code by using many threads.

Finally, I optimize the CUDA code to get the best performance on the GPU.

Because a CPU core is out-of-order with low instruction latencies, and a GPU is in-order with high instruction latencies.

A CPU core is designed to process only 1 thread at a time (or 2, with hyperthreading). A GPU is designed very differently. Even a single core (out of 240) needs 24 threads executing on it to get peak performance (i.e., hyperthreading is the default). The core starts executing an instruction for thread 1, then starts an instruction for thread 2, then thread 3, etc., and after it launches all 24 instructions, the results for thread 1 finally come in. If there is only one thread, an SP spends 96% of its clock cycles waiting for its results.

Yes :yes: I agree with you, alex_dubinsky.

This is the reason why running one thread on the GPU is very, very slow, and sometimes a timeout occurs.

I mean that if the CPU and GPU have the same frequency, for instance 1.5 GHz, a program running on the CPU might take 60 ms, but running on the GPU (one thread) it takes more than 6 s???

I have verified this experimentally.

There are 232*512 additions and one write-back in my test code; computational operations dominate and memory access can be ignored. What I am doing in this simple program is just getting a concrete idea of the time cost of computational operations on both the GPU and the CPU. I don't mean that I expect CUDA to get a speedup only because a single operation takes less time on the GPU than in a traditional program on the CPU. The result surprised me because it seems abnormal, given that my GPU and CPU have similar clock rates…

By the way, I agree with what you said about many cores and memory bandwidth, and I always launch enough blocks to utilize the hardware.

This is what I expected, and it correctly explains what I got through the test.

Thanks a lot :)

Because the effective clock rate is 1/4 on the GPU (a warp is processed in 4 clocks).

You already need 192 threads per multiprocessor (block) to hide read-after-write dependencies. So when running 1 thread per MP, you are effectively dividing your clock frequency by another 24!

That makes for a grand total of 192 times slower.

Why can you not accept it when people explain, time after time, that benchmarking with only 1 thread is meaningless?

Maybe you did not clearly understand my explanation above.

In my approach, when running one thread on the GPU I don't care about the speed of the program.

What I care about is the correctness of the copied data and the accuracy of the result after running on the GPU (compared with the result on the CPU).

This is the first step of testing.

After that, I optimize my CUDA code by using many threads, etc.

From a design point of view, this is not the best way to handle it. In my experience, it is better to find a solution to your problem using parallel programming instead of reimplementing a for/while loop across multiple threads. You need to take memory, latency, block/grid sizes, etc. into account, which will have a huge impact on overall performance.

Yes, I know. :)

When I start a new project, I always think about how to get the best performance out of CUDA:

certainly what kind of memory to use, how to process the data, latency, etc.

But sometimes when converting a C/C++ function to CUDA, I don't have enough time, or the conditions, to clearly understand the function I want to convert.

In that case I use the method above. :)

I never said it was the best method, right?

Then why did you complain that it was 100x slower with 1 thread on the GPU vs. the CPU?

And as S. Warris said, for non-embarrassingly parallel algorithms, you need to design your program in a parallel fashion (I agree, the problems where you can just remove the for loops are the nicest to port to CUDA ;))

No, I didn't complain. :no:

I had guessed that this problem would occur.

When I read alex_dubinsky's post, I thought my guess was right. :)

I said that using one thread is just a method, so I tried to explain that running with one thread is not meaningless in some cases.

Yes, you did:

From this question in your original post (and the two previous threads you created), everyone correctly assumed that you were asking about horrible performance.

If your true question is about why some simple code doesn’t work, I’d suggest opening a new thread with that question to avoid confusion. But I’ll tell you right now, if it is running for 6 seconds you are most likely hitting the launch timeout. Read the FAQ or any of the 1000 posts on this forum about it. I’d also strongly suggest that you check for error conditions after every single CUDA call (make sure to compile in debug mode if you use CUDA_SAFE_CALL and CUT_CHECK_ERROR) before making a new post.

Not the same person :whistling: