In my application I would like to compare the execution time of the same code on the CPU and on the GPU.
The CPU is an Intel® Core™ 2 Quad Q6600 @ 2.4 GHz.
The GPU is a GeForce GTS 250 with 128 cores.
When I run the code I use the gettimeofday() function to measure the time,
but the results show that execution on the CPU is faster than on the GPU!
Even if I increase the size of the data, the CPU time is usually lower than the GPU time,
but I expected the opposite: the execution time on the GPU should be less than on the CPU.
My code is just a subtraction between two arrays.
Code on the GPU:
…
__global__ void incrementArrayOnDevice(float *a, float *c, float *res, int N)
{
    // one thread per element; guard against threads past the end of the arrays
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) res[idx] = a[idx] - c[idx];
}
…
Code on the CPU:
…
for (int i = 0; i < 600; i++)
    res[i] = a[i] - b[i];
…
So I don't know what the problem is. Does anybody have an idea about this result?
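For reference, here is a minimal host-side harness showing how such a kernel might be launched and timed with gettimeofday(), including the host-to-device copies. This is only a sketch: the original post does not show the host code, so everything outside the kernel (array initialization, the block size of 256, variable names) is an assumption, and error checking is omitted.
…
// Hypothetical host-side harness (the original post does not show this part):
// allocate, copy to the device, launch, copy back, and time the whole GPU
// path with gettimeofday() as described. Error checking is omitted.
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void incrementArrayOnDevice(float *a, float *c, float *res, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) res[idx] = a[idx] - c[idx];
}

int main()
{
    const int N = 600;                        // problem size from the post
    const size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    float *h_res = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_c[i] = 0.5f * i; }

    float *d_a, *d_c, *d_res;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMalloc(&d_res, bytes);

    timeval t0, t1;
    gettimeofday(&t0, NULL);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);

    const int block = 256;                    // arbitrary block size choice
    incrementArrayOnDevice<<<(N + block - 1) / block, block>>>(d_a, d_c, d_res, N);
    cudaDeviceSynchronize();                  // wait for the kernel to finish

    cudaMemcpy(h_res, d_res, bytes, cudaMemcpyDeviceToHost);

    gettimeofday(&t1, NULL);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("GPU path (copies + kernel): %.3f ms\n", ms);

    cudaFree(d_a); cudaFree(d_c); cudaFree(d_res);
    free(h_a); free(h_c); free(h_res);
    return 0;
}
…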
Come back when you have a problem that works on millions of elements. ;) Six hundred? Of course the CPU is faster.
Note that PCI Express bandwidth is often lower than the CPU's DRAM bandwidth, meaning that when you have to copy the data to the GPU and back, you are definitely going to be slower whenever the GPU has little work to do.
Give the GPU some real work, something FLOP-heavy, like an FFT or some other heavy processing. Per data element read from memory you'll have to do significantly more than one FLOP, ideally with some transcendental functions, because that is where the GPU shines. Then you will see some speedup.
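One way to confirm this on the poster's example (just a sketch, not code from this thread) is to time the transfers and the kernel separately with CUDA events; for 600 floats the two copies across PCI Express should dwarf the kernel itself. The kernel and helper names below are illustrative and error checking is omitted:
…
// Sketch: time the host<->device copies separately from the kernel using
// CUDA events, to show that for tiny arrays the PCI Express transfers
// dominate. All names are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void subtract(const float *a, const float *c, float *res, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) res[idx] = a[idx] - c[idx];
}

static float ms_between(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main()
{
    const int N = 600;
    const size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    float *h_res = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_c[i] = 1.0f; }

    float *d_a, *d_c, *d_res;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_c, bytes); cudaMalloc(&d_res, bytes);

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);                      // start of input copies
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e1);                      // copies done, kernel starts

    subtract<<<(N + 255) / 256, 256>>>(d_a, d_c, d_res, N);
    cudaEventRecord(e2);                      // kernel done, copy-back starts

    cudaMemcpy(h_res, d_res, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);                 // wait until everything finished

    printf("copy in : %.3f ms\n", ms_between(e0, e1));
    printf("kernel  : %.3f ms\n", ms_between(e1, e2));
    printf("copy out: %.3f ms\n", ms_between(e2, e3));

    cudaEventDestroy(e0); cudaEventDestroy(e1);
    cudaEventDestroy(e2); cudaEventDestroy(e3);
    cudaFree(d_a); cudaFree(d_c); cudaFree(d_res);
    free(h_a); free(h_c); free(h_res);
    return 0;
}
…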
As Chris said, the problem size is too small for you to benefit.
For a quick example, try the ‘nbody’ app from the SDK:
CPU
$ ./nbody -benchmark -n=30720 -cpu
30720 bodies, total time for 10 iterations: 32679.305 ms
= 0.289 billion interactions per second
= 5.776 single-precision GFLOP/s at 20 flops per interaction
GPU
$ ./nbody -benchmark
30720 bodies, total time for 10 iterations: 407.776 ms
= 23.143 billion interactions per second
= 462.861 single-precision GFLOP/s at 20 flops per interaction