In my code I have to perform two “basic” operations: a minimum reduction and an exclusive scan. At this stage I have three options:
1. Write my own kernels.
2. Use Thrust.
3. Transfer the data to the CPU, do the computation there, and transfer the results back to the GPU.
The second option will likely be faster and easier than the first. More interestingly, though, I found that option 3 is almost always faster than option 1 (expected) and option 2!
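For concreteness, option 2 in my code boils down to something like this (a minimal sketch; the names are illustrative):
[codebox]
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <limits>

void minAndScan(const thrust::device_vector<float>& d_in,
                thrust::device_vector<float>& d_scan,
                float& mn)
{
    // minimum reduction over the whole array
    mn = thrust::reduce(d_in.begin(), d_in.end(),
                        std::numeric_limits<float>::max(),
                        thrust::minimum<float>());

    // exclusive prefix sum into a second device array
    thrust::exclusive_scan(d_in.begin(), d_in.end(), d_scan.begin());
}
[/codebox]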
I’m dealing with arrays of almost 2’000’000 elements. Is that still too small for the GPU to outperform the CPU? What would be the “critical” array size?
I’m running my code on a C2050, and my CPU is an overclocked (3.8 GHz) Core i7 965.
Strange, considering that pushing the data over the PCI Express bus appears to be faster than performing the operation in the card’s (much faster, bandwidth-wise) global memory.
You’d think that a scan or a reduction would operate at or near the theoretical bandwidth limit of the card.
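A quick back-of-the-envelope check (assuming roughly 5 GB/s effective over PCIe 2.0 x16 and the C2050’s ~144 GB/s theoretical memory bandwidth with ECC off): 2 million floats are 8 MB, so the round trip over the bus alone costs about 16 MB / 5 GB/s ≈ 3.2 ms, while streaming those 8 MB once through global memory takes about 8 MB / 144 GB/s ≈ 0.06 ms.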
If the data is already in GPU memory, it shouldn’t be beneficial to send it to the CPU, compute, and send the answer back.
A few months ago I wrote a reduction kernel here (The Official NVIDIA Forums | NVIDIA) that did 2 million elements in 155.5 µs. A min reduction should run at the same speed, which would be way faster than option 3.
I also rewrote (but haven’t tested much yet) this reduction code to find the max and min, and later posted it for vivek80 (who never gave me any feedback :( ) on this forum…
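For reference, the usual shared-memory pattern for a block-wise min reduction looks something like this (a generic sketch assuming a power-of-two block size, not my actual code):
[codebox]
#include <cfloat>

// First pass: each block reduces a chunk of the input to one partial minimum.
// Launch with shared memory size = blockDim.x * sizeof(float); repeat on the
// partial results (or finish on the host) until a single value remains.
__global__ void minReduce(const float* in, float* partial, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;

    // each thread loads and pre-reduces two elements
    float v = FLT_MAX;
    if (i < n)              v = in[i];
    if (i + blockDim.x < n) v = fminf(v, in[i + blockDim.x]);
    sdata[tid] = v;
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
[/codebox]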
Option 2 ought to be much faster than option 3. As cbuchner1 says, these kernels are all memory bound, so copying over the bus twice should be considerably slower.
I wrote a small benchmark (attached) and ran it on a C2050 with a Core i7 950 CPU (3.07 GHz) and got the following results
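In outline, the comparison looks something like this (a sketch of the approach, not the attached file itself; names are illustrative):
[codebox]
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <limits>
#include <algorithm>
#include <numeric>
#include <cstdio>

int main()
{
    const int N = 2000000;
    thrust::device_vector<float> d(N, 1.0f);
    thrust::device_vector<float> d_out(N);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // method 2: min reduction + exclusive scan on the device
    cudaEventRecord(start, 0);
    float mn = thrust::reduce(d.begin(), d.end(),
                              std::numeric_limits<float>::max(),
                              thrust::minimum<float>());
    thrust::exclusive_scan(d.begin(), d.end(), d_out.begin());
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms_dev;
    cudaEventElapsedTime(&ms_dev, start, stop);

    // method 3: copy to the host, compute there, copy the result back
    thrust::host_vector<float> h(N), h_scan(N);
    cudaEventRecord(start, 0);
    thrust::copy(d.begin(), d.end(), h.begin());               // D2H
    float mn_h = *std::min_element(h.begin(), h.end());
    h_scan[0] = 0.0f;                                          // exclusive scan
    std::partial_sum(h.begin(), h.end() - 1, h_scan.begin() + 1);
    thrust::copy(h_scan.begin(), h_scan.end(), d_out.begin()); // H2D
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms_host;
    cudaEventElapsedTime(&ms_host, start, stop);

    printf("min: device %g, host %g\n", mn, mn_h);
    printf("(method 2) compute on device      %g ms\n", ms_dev);
    printf("(method 3) copy + compute on host %g ms\n", ms_host);
    return 0;
}
[/codebox]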
Algorithms like sort, reduce, scan, etc. must be called from the host, though you can extend them with your own “functors” which will execute on the device. In the future we hope to have device-side algorithms as well, but that’s a ways off.
Anyway, the Thrust quick-start guide has a good summary of the library’s features. Feel free to post questions on our mailing list.
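For example, a functor you write yourself executes on the device even though the algorithm call itself is made on the host; the saxpy example from the quick-start guide has this shape:
[codebox]
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// y <- a * x + y, with the functor body running on the device
struct saxpy_functor
{
    const float a;
    saxpy_functor(float a_) : a(a_) {}
    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

void saxpy(float a, thrust::device_vector<float>& x,
                    thrust::device_vector<float>& y)
{
    // the call happens on the host; the work happens on the GPU
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      saxpy_functor(a));
}
[/codebox]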
Thanks for writing this small benchmark. Here are the results:
[codebox]
(method 2) compute on device 1.96662 ms
(method 3) copy + compute on host 9.76102 ms
[/codebox]
As the first CUDA call incurs a small overhead due to driver initialization and so on, I added another call outside the timed region. Here are the new numbers:
[codebox]
(method 2) compute on device 1.7433 ms
(method 3) copy + compute on host 9.68877 ms
[/codebox]
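Concretely, the warm-up is just one extra, untimed call issued before the measured region, along these lines:
[codebox]
// untimed warm-up call: absorbs driver/context initialization
// so it doesn't pollute the measurement (sketch)
thrust::reduce(d.begin(), d.end(),
               std::numeric_limits<float>::max(),
               thrust::minimum<float>());
cudaDeviceSynchronize();
// ...start the timed region only after this point...
[/codebox]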
OK, so moving the data to the host to do the calculation is not a reasonable alternative. Even with 200’000 elements the device still comes out ahead.
My question now is: why is my card so much slower than yours? A factor of more than 3 is hardly negligible! I do not have ECC activated.