When I change the number of elements in the scanLargeArray source code, the test fails when numElements is more than about 1 million. Can anyone tell me why? Thanks very much!!
It works fine in my tests for reasonable values of numElements. What are you changing?
Since the maximum number of thread blocks in one dimension of a grid in CUDA is 65535, and each thread block processes 512 elements, the maximum numElements is 65535 × 512 = 33553920. If you run more than this you will get a kernel launch failure.
However, you can get TEST FAILED even before that, because the code uses single-precision floats and you are limited by the precision of the floating-point mantissa. The mantissa is 23 bits, but you get one implicit bit for free, for an effective 24 bits of mantissa: 2^24 = 16777216. Since we initialize the input array to all 1s, if you add more than 16777216 values you will exceed the precision.
If you run in debug mode, you will see the following message printed out by the CPU “computeGold” function when it detects that single precision is exceeded.
“Warning: exceeding single-precision accuracy. Scan will be inaccurate.”
Similarly, if you change the input values so that each one uses more than one bit, you will exceed single precision sooner. For example, if you just use rand() and keep all 32 bits, then you can add only about 16777216 / 32 = 524288 values before you run out of precision and start accumulating errors. If you also scale the random values so they span a large range, you may introduce errors even sooner.
If you want to scan integers, you can convert the code to use int or unsigned int, which gives you a full 32 bits of precision; that means you should be able to scan about 128M 32-bit integers before getting errors.
Hope this helps, and let me know if there is some other error you are seeing.
Thanks for your reply :)
I have changed float to int, and I have another question now…
On my computer, the sample code reports a CPU time of 128 ms to scan a 20-million-element array, but a simple scan I wrote myself takes only 40 ms when compiled with gcc -O2. Why is that?
Can anyone tell me why? :(
The CPU implementation included with scanLargeArray is there only to check the correctness of the GPU implementation. It is by no means the most efficient implementation of scan – you can probably do much better yourself (as you have). It is not intended as a CPU vs. GPU benchmark.
Thank you very much for answering my previous questions.
I have two more quick questions. I will really appreciate your answers:
- From NVIDIA's Programming Guide, I cannot figure out which configuration of the number of blocks and the number of warps per block gives the best performance per multiprocessor or per processor.
For example, your scanLargeArray code uses 256 threads per block. Why 256 instead of some other number? What is the relationship between the number of threads per block, the number of multiprocessors, and the number of processors per multiprocessor?
Suppose I want to compute the prefix sum of 10 million integers. What do you imagine would be one of the best configurations for performance?
- When you wrote your prefix sum, did you come across any more efficient algorithms? Could you please give me some hints about those?
Thank you very much! You are my god! Anyone is very welcome to answer the questions above. Thank you all!