[Thrust] Performance Array size

Hi all!

In my code I have to perform two “basic” operations: a minimum reduction and an exclusive scan. At this stage I have 3 options.

  1. Write my own kernels
  2. Use Thrust
  3. Transfer the data to the CPU, do the computation there, and transfer the result back to the GPU.

The second option will likely be faster and easier than the first one. More interestingly, I found that option 3 is almost always faster than both option 1 (expected) and option 2!

I’m dealing with arrays of almost 2’000’000 elements. Is that still too small for the GPU to outperform the CPU? What would be the “critical” array size?

I’m running my code on a C2050, and my CPU is an overclocked (3.8 GHz) Core i7 965.

Any advice would be greatly appreciated.

Strange: that would mean pushing the data over the PCI Express bus is faster than performing the operation in the card’s global memory, which has far higher bandwidth.

You’d expect a scan or a reduction to operate at or near the theoretical bandwidth limit of the card.

If the data is already in GPU memory, it shouldn’t be beneficial to send it to the CPU, compute there, and send the answer back.

A few months ago I wrote a reduction kernel here (The Official NVIDIA Forums | NVIDIA) that processed 2 million elements in 155.5 usec. A min reduction should run at the same speed, which should be way faster than option 3.

I also rewrote (but haven’t tested much yet) this reduction code to find the max and min, and later posted it on this forum for vivek80 (who never gave me any feedback :( )…

Hi Magorath,

Option 2 ought to be much faster than option 3. As cbuchner1 says, these kernels are all memory bound, so copying over the bus twice should be considerably slower.

I wrote a small benchmark (attached) and ran it on a C2050 with a Core i7 950 CPU (3.07 GHz). I got the following results:

[codebox]$ nvcc -O2 bench.cu -o bench -I ~/scratch/thrust/
$ ./bench
(method 2) compute on device 0.597184 ms
(method 3) copy + compute on host 12.2763 ms[/codebox]

If you run the attached code, do you get similar results? If so, how does your benchmark differ?
bench.cu (1.63 KB)

That’s nice, neat and real cool code… :)

Is it possible to call Thrust code/structures from within kernels, or only from host code?

eyal

Thanks!

Algorithms like sort, reduce, scan, etc. must be called from the host, though you can extend them with your own “functors” which will execute on the device. In the future we hope to have device-side algorithms as well, but that’s a ways off.

Anyway, the Thrust quick-start guide has a good summary of the library’s features. Feel free to post questions on our mailing list.

Yes, tempting to use Thrust for such applications! Nice.

Thanks for writing this small benchmark. Here are the results:

[codebox]
(method 2) compute on device 1.96662 ms
(method 3) copy + compute on host 9.76102 ms
[/codebox]

Since the first CUDA call carries a small overhead from driver initialization and so on, I added an extra call outside the timed region. Here are the numbers:

[codebox]
(method 2) compute on device 1.7433 ms
(method 3) copy + compute on host 9.68877 ms
[/codebox]

OK. So moving to the host to perform the calculation is not a reasonable alternative. Even with 200’000 elements the device still comes out ahead.

My question now is: why is my card so much slower than yours? A factor of more than 3 is hardly negligible! I do not have ECC activated.

Here are more details:

I’m running Fedora 13 with driver 256.4 and CUDA 3.1. No display is plugged into the card.

OK, I got it: the profiler was enabled. Here are the numbers without profiling:

[codebox]
(method 2) compute on device 0.557664 ms
(method 3) copy + compute on host 9.1033 ms
[/codebox]

I thought the profiler didn’t use many resources. I was wrong…
