Hi, everyone.
I ran into a performance problem while programming with CUDA, so I did some tests on both the GPU and the CPU. What I got shows that the more an automatic variable in a kernel is modified, the more expensive access to it appears to be. Someone said that some code may be discarded in the optimization stage; my test seems consistent with that, but the behavior is abnormal compared with a CPU version. (All the programs do is increment an automatic variable in a doubly nested for loop.) I have two questions, and I would really appreciate help from anyone who knows compilers well, especially the CUDA compiler.
1. Does the CUDA compiler really discard code that is useless for the output result? My test on the CPU shows that gcc does not do this, even if an automatic variable is never accessed after its declaration. (One way to check is sketched right after this list.)
2. Although the clock rates are similar (GPU: 1.5 GHz, CPU: 1.66 GHz), my test shows that doing the same thing on the GPU is twenty times slower than on the CPU, which is abnormal! Why does this happen?
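For question 1, here is a minimal sketch of how the loop could be kept alive no matter what gets stored: marking the counter volatile should force the compiler to perform every increment. This is untested, and count_test_volatile is just a name made up for the sketch:
[codebox]// Sketch (untested): volatile forces every increment of count to actually
// happen, so the loop cannot be discarded as dead code even when the
// final store does not use count.
__global__ void count_test_volatile(int *d_test_count)
{
    unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
    volatile unsigned int count = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    d_test_count[item_idx] = item_idx; // store something unrelated to count
}[/codebox]
If this version takes roughly the same ~6 ms no matter what is assigned, that would support the dead-code-elimination explanation. The generated code can also be inspected directly (I believe nvcc can emit PTX with its -ptx option) to see whether the loop survives compilation.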
CUDA code, with times recorded by the CUDA event API:
[codebox]__global__ void count_test(int *d_test_count)
{
    unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int count = 0, address = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    d_test_count[item_idx] = count; // also tried assigning j, i, item_idx, 0, and address
    // takes 6.296416 ms when assigning count to d_test_count[item_idx]
    // takes 6.292064 ms when assigning j
    // takes 0.303104 ms when assigning i
    // takes 0.291360 ms when assigning item_idx
    // takes 0.288374 ms when assigning zero
    // takes 0.289696 ms when assigning address
}
int main(int argc, char **argv)
{
    int *test_count;

    CUDA_SAFE_CALL( cudaMalloc((void**)&test_count, 1024*sizeof(int)) );
    ...
    count_test<<<1, 1>>>(test_count);
    ...
}
[/codebox]
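A side note on the launch configuration: <<<1, 1>>> runs a single thread, even though the buffer holds 1024 ints, so only element 0 is ever written. A sketch of a launch that covers the whole buffer (my assumption of what a fuller comparison would look like; I have not timed it):
[codebox]// Sketch: 4 blocks x 256 threads = 1024 threads, one per output element.
count_test<<<4, 256>>>(test_count);
cudaThreadSynchronize(); // make sure the kernel has finished before timing ends[/codebox]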
CPU code, with times recorded by gettimeofday():
[codebox]#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#define seconds(tm) gettimeofday(&tp, (struct timezone *)0); \
    tm = tp.tv_sec + tp.tv_usec/1000000.0
struct timeval tp;
void h_count_test(int *h_test_count)
{
    unsigned int count = 0, address = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    h_test_count[0] = 0; // also tried assigning address, j, i, and count
    // takes 0.000310 s when assigning count to h_test_count[0]
    // takes 0.000310 s when assigning j
    // takes 0.000310 s when assigning i
    // takes 0.000330 s when assigning zero
    // takes 0.000310 s when assigning address
}
int main(int argc, char **argv)
{
    int *h_test_count = (int *) malloc(1024*sizeof(int));
    double t1, t2;

    seconds(t1);
    h_count_test(h_test_count);
    seconds(t2);

    printf("start time: %f\n", t1);
    printf("end time: %f\n", t2);
    printf("time cost of h_count_test() is %f\n", t2 - t1);

    free(h_test_count);
    return 0;
}[/codebox]
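On the CPU side, the same volatile trick would matter once optimization is turned on: with gcc at -O2 I would expect the dead loop to be removed as well, which seems worth ruling out before comparing against the GPU numbers. An untested sketch of the host version:
[codebox]/* Sketch (untested): volatile should keep the loop alive even at -O2;
   without it, gcc may delete the loop entirely, since count is never
   read after the loop. */
void h_count_test_volatile(int *h_test_count)
{
    volatile unsigned int count = 0;
    unsigned int i, j;

    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;

    h_test_count[0] = 0; /* store something unrelated to count */
}[/codebox]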
Device: GeForce 8800 GT
CPU: dual core, 1.66 GHz
OS: Fedora 7

Thanks in advance.