Problem with scanLargeArray: I get different results :(

Hi, when I change the number of elements to 20 million and run the scanLargeArray sample, it reports scan times of 128 ms for the CPU and 25 ms for the GPU. But when I write a simple prefix sum in plain C with the same number of elements and compile it with gcc -O2, it runs in about 40 ms. What happened?
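For reference, the CPU baseline being described is presumably something like the following. This is only a sketch of a sequential exclusive prefix sum; the poster's actual code was not shown, and the function name is ours:

```c
#include <stddef.h>

/* Sequential exclusive prefix sum: out[i] = in[0] + ... + in[i-1].
   A sketch of the kind of plain-C baseline described above. */
void scan_cpu(int *out, const int *in, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = sum;   /* element i gets the sum of everything before it */
        sum += in[i];
    }
}
```

A single streaming pass like this is memory-friendly even though the loop carries a dependence, which is why 20 million elements can plausibly finish in tens of milliseconds on a CPU.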

Thanks :)

Don’t you mean “what will happen” to you for revealing this dark secret?

Can someone give an answer?
Or is there some dark side to CUDA performance? :)

I think you’re talking about the CPU code performance here. Look at the code in the scan sample that does the scan on the CPU and compare it to your own implementation. Also, make sure that you compile both (the sample and your code) with the same optimization flags.


Thanks. I think the only difference is that I use gcc -O2 when compiling the CPU code. But it doesn’t make sense to claim the GPU is faster than the CPU if you don’t use the best CPU build, does it?

I don’t think we made the claim that the sample has the best CPU version. It’s there to simply illustrate how to compare GPU and CPU performance for users’ own apps. One could claim that your compile isn’t best because you didn’t use the best optimizing compiler on your CPU.

The SDK samples are for instructional purposes. You can definitely see that by comparing the performance of the matrix multiply sample with the CUBLAS sgemm (the latter is optimized). I wouldn’t consider scan a good candidate for breakthrough speedups vs. the CPU (one reason is that the parallelization is at most 50% work-efficient). The big benefit comes from running scan on data that’s already on the GPU, produced by some other computation and destined for yet others.


I have a quick question. I will really appreciate your answers:

From NVIDIA’s Programming Guide, I cannot figure out which configuration of the number of blocks and the number of warps per block gives the best performance per multiprocessor or per processor.

For example, the scanLargeArray code uses 256 threads per block. Why choose 256 instead of some other number? What is the relationship between the number of threads per block, the number of multiprocessors, and the number of processors per multiprocessor?

Suppose I want to compute a prefix sum over 10 million integers. What configuration would you expect to give the best performance?

Thank you very much! You are my god! Anyone is very welcome to answer my above questions! Thank you all!

This really deserves its own thread. Here is my take on the subject: There are WAY too many variables to predict what block size will be the fastest. Just benchmark a representative test case for all possible block sizes (in multiples of 32 of course) and take the fastest.
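The brute-force search described above can be sketched as a small host-side harness. Here `run_ms` stands in for “launch and time the real kernel at this block size,” and `model_ms` is a made-up cost curve included only so the sketch is self-contained:

```c
/* Try every block size that is a multiple of 32 and keep the fastest.
   run_ms is a stand-in for launching the kernel at a given block size
   and returning the elapsed milliseconds. */
int best_block_size(double (*run_ms)(int), int max_threads)
{
    int best = 32;
    double best_t = run_ms(32);
    for (int b = 64; b <= max_threads; b += 32) {
        double t = run_ms(b);
        if (t < best_t) {
            best_t = t;
            best = b;
        }
    }
    return best;
}

/* Hypothetical cost model for demonstration only: fastest at 256. */
double model_ms(int b)
{
    double d = (double)(b - 256);
    return 1.0 + d * d / 1000.0;
}
```

In practice you would replace `model_ms` with a function that times a representative workload, exactly as suggested above.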

Of course, the choice of kernel implementation may necessitate a particular block size. For instance, I believe the prefix-sum kernels require a power-of-2 block size.
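That power-of-2 requirement comes from the tree-based up-sweep/down-sweep (Blelloch) scan that such kernels run within each block. A CPU illustration of that structure (our own sketch, not the SDK code):

```c
#include <stddef.h>

/* In-place exclusive Blelloch scan; n must be a power of 2.
   This is the tree structure a per-block scan kernel uses, which is
   why it wants a power-of-2 block size. */
void blelloch_scan(int *a, size_t n)
{
    /* up-sweep (reduce): build partial sums up the tree */
    for (size_t d = 1; d < n; d *= 2)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];

    a[n - 1] = 0;  /* clear the root for an exclusive scan */

    /* down-sweep: push prefixes back down the tree */
    for (size_t d = n / 2; d >= 1; d /= 2)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}
```

Both sweeps index by strides that halve or double cleanly only when n is a power of 2, hence the constraint.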

It appears to me that 256 or 192 threads per block usually gives the best results, but yeah, best is to try it out for specific algorithms.

For scanLargeArray, I get the best result using 48 blocks of 224 threads with a different scheme: nearly 75% faster than the SDK version for large data, and it only uses O(1) additional storage. Ironically, the idea comes from the SDK’s histogram sample.

224 threads is the maximum I can get, since I use 72 bytes of shared memory per thread and my algorithm requires the thread count to be a multiple of 16. 48 blocks was found by experiment on an 8800 GTX; it may be different for other cards.

The code compares the SDK version (tscan), my version (ttest), and the CPU version (tcpub). It prints the time for 10 scans, using the Windows API for timing.

thanks, man, it is good!

it runs 50% faster on my machine. Good job man!

How does it compare to these results from UIUC:…HallOfFame.html?

Also, I think there’s an implementation of scan floating around that does 16,000,000 elements in about 5.5ms.


Oh, Paulius, I cannot open the link. Can you give it again?

Replace _ with /
My scan:
n=16000001 Time: 6.44551 ms
no error

About 0.1% faster than the best’s 6.452000 ms.
However, David’s version is recursive and requires temporary storage, which makes my O(1) additional memory usage quite an advantage :)

By the way, I tuned my parameters for ~100k arrays (which I use most often); for a 16M array, it may be better to increase the number of blocks and do the block-level scan on the CPU.

David’s loading method may help me. I’ll try to improve my version when I have time.

To Paulius:

Can you point out where the floating point scan is?
Admittedly, I don’t think my scan would work well for a 16M-element float scan. The summation order would likely introduce too much error.

I didn’t mean that the scan code is for floating point, I meant that the code is “floating around.”

I think the main bottleneck for scan is global memory reads/writes: you have to read and write each element twice (since it has to be processed twice, hence the 50% efficiency). So one way to judge how close you’re getting to the limit is to measure the memory throughput your implementation achieves.

I think the time you’re getting is really good and probably getting close to the limit.


David’s loading method is indeed faster! I’ve now reached 5.43 ms. Thanks for the URL, Paulius!!!

n=8000001 Time: 2.37097 ms
no error
n=16000001 Time: 5.43002 ms
no error

This is a DLL that performs the integer scan.
Usage: scan(target, source, # of ints)

Also, the second kernel of the scan happens to be a repro case where 36 registers / 224 threads fails with no error. Is this a known issue?
Occupancy is critical for that kernel, and I really want to get back to 224 threads… (currently 192)

Indeed, my scan gets around this by reading twice and writing once. But it seems I’m still only getting 33 GB/s throughput :(

On the other hand, that means maybe I could still improve that!:)

That’s the best time I’ve seen or heard of so far. Nice job.