How much faster do you think it can go?

I have an algorithm that is highly scalable.

Given two images, it takes areas of those images and computes some distances for each area.

It does this recursively for each area.

I thought this could be done efficiently in CUDA, but after a small experiment I conducted, I'm not so convinced anymore.

The experiment simply tests the speed of CUDA versus CPU computation.

I started with a summation of arrays of 10,000 int values.
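For context, the GPU side of the test was essentially this kind of element-wise sum (a simplified sketch; the variable names and launch configuration here are illustrative, not my exact code):

```cuda
// Sketch of the benchmark: sum two arrays of N ints on the GPU,
// then copy the result back to the host.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 10000;
    const size_t bytes = N * sizeof(int);

    int *ha = new int[N], *hb = new int[N], *hc = new int[N];
    for (int i = 0; i < N; ++i) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);  // copy in
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    add<<<(N + 255) / 256, 256>>>(da, db, dc, N);       // sum

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // copy out

    printf("hc[9999] = %d\n", hc[9999]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```

The CPU version is just the equivalent loop over the two host arrays, timed the same way.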

Result:

CPU 1 - GPU 0

The GPU is overwhelmed by the CPU; it is much slower.

I thought float computation would probably expose the CPU's limitations, so I tried again with 10,000 float values.

Result:

CPU 1 - GPU 0

The GPU is overwhelmed by the CPU; slow again.

So I checked how much time was spent in the copy-to-device and copy-from-device steps, and, to my surprise, the time was split roughly evenly across the three steps (copy in / sum / copy out).
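In case it matters, this is roughly how I took those per-step timings (a sketch using cudaEvent timers; not my exact harness, and the kernel/buffer names are illustrative):

```cuda
// Time each of the three steps (copy in / sum / copy out) separately
// with cudaEvent markers recorded between them.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 10000;
    const size_t bytes = N * sizeof(int);

    int *h = new int[N];
    for (int i = 0; i < N; ++i) h[i] = i;
    int *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    cudaEvent_t e[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&e[i]);

    cudaEventRecord(e[0]);
    cudaMemcpy(da, h, bytes, cudaMemcpyHostToDevice);   // copy in
    cudaMemcpy(db, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e[1]);
    add<<<(N + 255) / 256, 256>>>(da, db, dc, N);       // sum
    cudaEventRecord(e[2]);
    cudaMemcpy(h, dc, bytes, cudaMemcpyDeviceToHost);   // copy out
    cudaEventRecord(e[3]);
    cudaEventSynchronize(e[3]);

    const char *step[3] = {"copy in", "sum", "copy out"};
    for (int i = 0; i < 3; ++i) {
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, e[i], e[i + 1]);      // ms between events
        printf("%-9s %.3f ms\n", step[i], ms);
    }

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] h;
    return 0;
}
```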

To double-check, I ran the bandwidth test:

Running on...

Device 0: GeForce 8400 GS

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     2051.8

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     1398.5

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     4119.0

[bandwidthTest] - Test results:

PASSED

So basically I can copy from host to device at about 2 GB/s, yet the time to move just a few thousand floats/ints is huge compared to the CPU's time for the whole computation.
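To put a number on it: 10,000 floats is only 40,000 bytes, which at the measured host-to-device rate should take on the order of 20 microseconds of raw transfer (back-of-envelope arithmetic only, ignoring any fixed per-call cost):

```cuda
// Back-of-envelope: expected transfer time for 10,000 floats at the
// host-to-device rate reported by bandwidthTest (2051.8 MB/s).
#include <cstdio>

int main() {
    const double bytes = 10000.0 * sizeof(float);    // 40,000 bytes
    const double mb_per_s = 2051.8;                  // measured H2D rate
    const double seconds = bytes / (mb_per_s * 1e6); // taking MB = 10^6 bytes
    printf("expected H2D transfer: %.1f microseconds\n", seconds * 1e6);
    return 0;
}
```

That is nowhere near what I measure for the copy step.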

How can it be?

Do you have any clues?

Thanks.
