I have an algorithm that is highly scalable.
Given two images, it takes areas of those images and computes some distances for each area.
It does this recursively for each area.
I thought this could be done efficiently in CUDA, but after a small experiment I conducted I'm not so convinced anymore.
The experiment was simply a speed test of CUDA versus CPU computation.
I started by summing arrays of 10,000 int values.
CPU 1 - GPU 0
The GPU is overwhelmed by the CPU; it's much slower.
I thought that float computation would probably expose the CPU's limitations, so I tried again with 10,000 float values.
CPU 1 - GPU 0
The GPU is overwhelmed by the CPU; slower again.
So I checked how much time was spent in the copy-to-device and copy-from-device steps and, to my surprise,
the time was split evenly among the three actions (copy to device / sum / copy back).
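For reference, this is roughly how I measured the three stages (a minimal sketch of my experiment, not my exact code; the kernel and variable names are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: each thread sums one pair of elements.
__global__ void sumKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 10000;
    const size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes),
          *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   // copy to device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    sumKernel<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);  // sum
    cudaEventRecord(t2);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);   // copy back
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float msIn, msSum, msOut;
    cudaEventElapsedTime(&msIn,  t0, t1);
    cudaEventElapsedTime(&msSum, t1, t2);
    cudaEventElapsedTime(&msOut, t2, t3);
    printf("copy in: %.3f ms, sum: %.3f ms, copy out: %.3f ms\n",
           msIn, msSum, msOut);
    return 0;
}
```

On my card each of the three printed times comes out in the same ballpark.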
So to check I ran the bandwidth test:
Running on...
 Device 0: GeForce 8400 GS
 Quick Mode

 Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2051.8

 Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1398.5

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     4119.0

[bandwidthTest] - Test results: PASSED
So basically I can copy from host to device at about 2 GB/s, yet the time to copy just a few thousand floats/ints is really big compared to the CPU's time for the whole computation.
How can it be?
Do you have any clues?