[Thrust] Performance Array size

Magorath · October 7, 2010, 7:35am

Hi all !

In my code I have to perform two “basic” operations: a minimum reduction and an exclusive scan. At this stage I have three options:

1. Write my own kernels.
2. Use Thrust.
3. Transfer the data to the CPU, do the computation there, and transfer the result back to the GPU.

The second option will likely be faster and easier than the first. More interestingly, I found that option 3 is almost always faster than both option 1 (expected) and option 2!

I’m dealing with arrays of almost 2’000’000 elements. Is that still too small for the GPU to outperform the CPU? What would be the “critical” array size?

I’m running my code on a C2050, and my CPU is an overclocked (3.8 GHz) Core i7 965.

Any advice would be greatly appreciated.
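
For reference, option 2 could look roughly like the sketch below. It is only an illustration: the element types (`float` for the reduction, `int` for the scan) and the helper names `device_min` and `device_exclusive_scan` are assumptions, not taken from the original code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <limits>

// Hypothetical helper: minimum of all elements, computed on the device.
// The initial value must be an identity for min, hence the largest float.
float device_min(const thrust::device_vector<float>& d)
{
    return thrust::reduce(d.begin(), d.end(),
                          std::numeric_limits<float>::max(),
                          thrust::minimum<float>());
}

// Hypothetical helper: exclusive prefix sum on the device.
// out[0] = 0 and out[i] = in[0] + ... + in[i-1].
thrust::device_vector<int> device_exclusive_scan(const thrust::device_vector<int>& in)
{
    thrust::device_vector<int> out(in.size());
    thrust::exclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}
```

Both helpers take data that already lives in a `thrust::device_vector`, so nothing crosses the PCIe bus except the single scalar returned by the reduction.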

cbuchner1 · October 7, 2010, 12:04pm

Strange: that would mean pushing the data over the PCI Express bus is faster than performing the operation in the card’s global memory, which has much higher bandwidth.

You’d think that a scan or a reduction would operate at or near the theoretical bandwidth limit of the card.
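
A rough back-of-the-envelope estimate makes the gap concrete. It assumes 4-byte elements (the thread does not state the element type), the C2050’s peak memory bandwidth of 144 GB/s, and the theoretical 8 GB/s per direction of a PCIe 2.0 x16 link; sustained rates are lower on both sides, so these are lower bounds on the time, not predictions.

```latex
\begin{align*}
B &= N \cdot s = 2\times10^{6} \times 4\,\text{B} = 8\times10^{6}\,\text{B} \\
t_{\text{device}} &\geq \frac{B}{144\,\text{GB/s}} \approx 56\,\mu\text{s} \quad \text{(one pass over the array)} \\
t_{\text{PCIe}} &\geq \frac{B}{8\,\text{GB/s}} = 1\,\text{ms} \quad \text{(one direction)}
\end{align*}
```

Under these assumptions a single bus transfer already costs about 18 times as much as one pass over the data in global memory.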

Jimmy_Pettersson · October 7, 2010, 1:53pm

If the data is already in GPU memory, it shouldn’t be beneficial to send it to the CPU, compute, and send the answer back.

A few months ago I wrote a reduction kernel here (The Official NVIDIA Forums | NVIDIA) that processed 2 million elements in 155.5 usec. A min reduction should run at the same speed, which should be far faster than option 3.

I also rewrote this reduction code to find the max and min (though I haven’t tested it much yet) and later posted it on this forum for vivek80 (who never gave me any feedback :( )…

nbell · October 7, 2010, 4:24pm

Hi Magorath,

Option 2 ought to be much faster than option 3. As cbuchner1 says, these kernels are all memory bound, so copying over the bus twice should be considerably slower.

I wrote a small benchmark (attached) and ran it on a C2050 with a Core i7 950 CPU (3.07 GHz). I got the following results:

[codebox]$ nvcc -O2 bench.cu -o bench -I ~/scratch/thrust/
$ ./bench
(method 2) compute on device 0.597184 ms
(method 3) copy + compute on host 12.2763 ms[/codebox]

If you run the attached code, do you get similar results? If so, how does your benchmark differ?
bench.cu (1.63 KB)
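
The attachment itself is not reproduced in this thread. As a rough idea of what the two measured methods amount to, here is a minimal sketch for the scan. It is not the contents of bench.cu: the function names `scan_on_device` and `scan_via_host` and the `int` element type are assumptions made for illustration.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>

// Method 2 (hypothetical name): scan in place on the device.
// No array data crosses the PCIe bus.
void scan_on_device(thrust::device_vector<int>& d)
{
    thrust::exclusive_scan(d.begin(), d.end(), d.begin());
}

// Method 3 (hypothetical name): copy to the host, scan there, copy back.
void scan_via_host(thrust::device_vector<int>& d)
{
    thrust::host_vector<int> h = d;                         // device -> host
    thrust::exclusive_scan(h.begin(), h.end(), h.begin());  // runs on the CPU
    thrust::copy(h.begin(), h.end(), d.begin());            // host -> device
}
```

Both functions leave the same result in `d`; the second one simply pays for two full-array transfers on top of the CPU computation.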

eyalhir74 · October 7, 2010, 7:40pm

That’s nice, neat and real cool code… :)

Is it possible to call Thrust code/structures from within kernels, or only from host code?

eyal

nbell · October 8, 2010, 2:06am

Thanks!

Algorithms like sort, reduce, scan, etc. must be called from the host, though you can extend them with your own “functors” which will execute on the device. In the future we hope to have device-side algorithms as well, but that’s a ways off.
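
As an illustration of the functor mechanism, the sketch below finds the smallest absolute value in an array. The functor `absolute_value` and the helper `min_abs` are made-up names for this example; the algorithm call is issued from the host, while the functor body runs on the device for each element.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <limits>

// User-defined functor (hypothetical). The __host__ __device__ qualifiers
// let Thrust call it from inside the kernels it launches.
struct absolute_value
{
    __host__ __device__
    float operator()(float x) const { return x < 0.0f ? -x : x; }
};

// Hypothetical helper: smallest absolute value in d.
// transform_reduce applies absolute_value to each element, then reduces with min.
float min_abs(const thrust::device_vector<float>& d)
{
    return thrust::transform_reduce(d.begin(), d.end(),
                                    absolute_value(),
                                    std::numeric_limits<float>::max(),
                                    thrust::minimum<float>());
}
```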

Anyway, the Thrust quick-start guide has a good summary of the library’s features. Feel free to post questions on our mailing list.

Sarnath · October 8, 2010, 6:43am

Yes, it’s tempting to use Thrust for such applications! Nice.

Magorath · October 8, 2010, 6:49am

Thanks for writing this small benchmark. Here are the results:

[codebox]
(method 2) compute on device 1.96662 ms
(method 3) copy + compute on host 9.76102 ms
[/codebox]

Since the first CUDA call incurs a small overhead from driver initialization and so on, I added an extra call outside the timed region. Here are the numbers:

[codebox]
(method 2) compute on device 1.7433 ms
(method 3) copy + compute on host 9.68877 ms
[/codebox]
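
The warm-up idea can be sketched as follows. This is not the code that was actually run: the helper name `time_device_scan_ms` and the `int` element type are assumptions, and it uses `cudaDeviceSynchronize`, which belongs to CUDA releases newer than the 3.1 toolkit used in this thread.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical helper: time one exclusive scan on the device, in milliseconds.
// An untimed warm-up run absorbs one-time initialization costs first.
float time_device_scan_ms(const thrust::device_vector<int>& in)
{
    thrust::device_vector<int> out(in.size());

    // Warm-up run, outside the timed region.
    thrust::exclusive_scan(in.begin(), in.end(), out.begin());
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::exclusive_scan(in.begin(), in.end(), out.begin());  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The output vector is allocated before timing starts, so the measurement covers only the scan itself.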

OK, so moving to the host to perform the calculation is not a reasonable alternative. Even with 200’000 elements, the device remains faster.

My question now is: why is my card so much slower than yours? A factor of more than 3 is not negligible! I do not have ECC activated.

Magorath · October 8, 2010, 7:31am

Here are more details:

I’m running Fedora 13 with driver 256.4 and CUDA 3.1. No display is connected to the card.

Magorath · October 8, 2010, 8:46am

OK, I got it: the profiler was activated. Here are the numbers without profiling:

[codebox]
(method 2) compute on device 0.557664 ms
(method 3) copy + compute on host 9.1033 ms
[/codebox]

I thought that the profiler didn’t use many resources. I was wrong…
