Just be careful that threadIdx.x and blockIdx.x have to be interpreted slightly differently. I tried to show it in the parameters of the device functions.
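Roughly what I mean, as a minimal sketch (the kernel body and the PointXY layout are only placeholders, not your actual code): with a grid-stride loop, blockIdx.x/threadIdx.x no longer identify one fixed element each; every thread walks over the chunk in steps of gridDim.x * blockDim.x.

```cpp
// Minimal sketch (illustrative only): a grid-stride copy kernel.
struct PointXY { float x, y; };

__global__ void cpyToDeviceKernel(const PointXY* __restrict__ src,
                                  PointXY* __restrict__ dst,
                                  size_t count)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // starting element of this thread
    size_t stride = (size_t)gridDim.x * blockDim.x;                 // total number of threads in the grid
    for (size_t i = idx; i < count; i += stride)
        dst[i] = src[i];
}
```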
PS: The reason the kernels do not overlap is probably the way Windows uses internal work pipelines to communicate with the GPU and forward tasks, which sometimes leads to delays. Perhaps it could be solved by changing some Windows settings, by using CUDA graphs, or by inserting a delay. It is only the first kernel invocation, after all, so perhaps you can also just accept it.
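If you want to try the CUDA graphs idea, a rough sketch using stream capture could look like this (stream, pointers, and launch parameters are placeholders):

```cpp
// Rough sketch: capture one chunk copy into a CUDA graph, then replay it.
// Replaying an instantiated graph can reduce per-launch overhead on Windows.
// Note: the captured launch keeps the captured pointers; for different chunks
// you would update the kernel node parameters before each replay.
cudaGraph_t     graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cpyToDeviceKernel<<<numBlocks, blockSize, 0, stream>>>(src, dst, chunkSize);
cudaStreamEndCapture(stream, &graph);

// CUDA 12 signature; older toolkits use a variant with more parameters.
cudaGraphInstantiate(&graphExec, graph, 0);

cudaGraphLaunch(graphExec, stream);   // replay the captured work
cudaStreamSynchronize(stream);
```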
That looks neat! Thanks for the hint, I will try that.
Yeah, that crossed my mind, too. Apart from the “initial” delay, the rest of the profiling session looks pretty good. As a next step I will implement a better time-measuring mechanism I saw in an NVIDIA example, so I can better compare my overlapping Linux version with the custom-copy-kernel version under Windows. If the custom version comes anywhere close to the original overlapping version, I suppose I’ll accept it.
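Probably something along the lines of the usual cudaEvent-based timing, a sketch of which would be (names are placeholders):

```cpp
// Sketch: event-based timing of one chunk copy in a stream.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
cpyToDeviceKernel<<<numBlocks, blockSize, 0, stream>>>(src, dst, chunkSize);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);                 // wait until the stop event has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```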
I conducted a series of performance tests comparing various configurations of my copy kernel, and I would like to share the results.
General setup:
no. of test points: 50,000,000 (381.47 MiB)
chunksize: 5,000,000 (38.15 MiB)
streams: 1
For each grid configuration I launched 2 test runs, following each other as closely as Nsight Systems would allow. Each run contained 10 launches of cpyToDeviceKernel, each copying 1 chunk of data. The durations below (min/max/avg) each refer to 1 transferred chunk of 5,000,000 PointXY objects.
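For context, each run is essentially a loop like this (simplified sketch; buffer names are placeholders):

```cpp
// Simplified sketch of one test run: 10 kernel launches, one chunk each.
// hostBuffer is pinned or mapped host memory, deviceBuffer lives on the GPU.
const size_t chunkSize = 5'000'000;   // PointXY elements per chunk
const int    numChunks = 10;

for (int c = 0; c < numChunks; ++c) {
    const PointXY* src = hostBuffer   + c * chunkSize;
    PointXY*       dst = deviceBuffer + c * chunkSize;
    cpyToDeviceKernel<<<numBlocks, blockSize, 0, stream>>>(src, dst, chunkSize);
}
cudaStreamSynchronize(stream);
```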
Tests utilizing mapped memory:
numBlocks = 40
40 x 1024 threads, run 1: min 12.753 ms, max 13.738 ms, avg 13.285 ms
40 x 1024 threads, run 2: min 12.571 ms, max 13.977 ms, avg 13.209 ms
40 x 768 threads, run 1: min 12.72 ms, max 18.798 ms, avg 13.96 ms
40 x 768 threads, run 2: min 12.558 ms, max 14.173 ms, avg 13.165 ms
40 x 512 threads, run 1: min 12.005 ms, max 18.813 ms, avg 16.538 ms
40 x 512 threads, run 2: min 13.684 ms, max 19.375 ms, avg 17.213 ms
40 x 256 threads, run 1: min 11.006 ms, max 18.914 ms, avg 16.664 ms
40 x 256 threads, run 2: min 12.25 ms, max 18.7 ms, avg 16.147 ms
numBlocks = 60
60 x 1024 threads, run 1: min 12.162 ms, max 18.889 ms, avg 16.354 ms
60 x 1024 threads, run 2: min 17.766 ms, max 20.974 ms, avg 18.638 ms
60 x 768 threads, run 1: min 16.505 ms, max 20.962 ms, avg 17.678 ms
60 x 768 threads, run 2: min 17.074 ms, max 20.137 ms, avg 17.94 ms
60 x 512 threads, run 1: min 16.7 ms, max 20.031 ms, avg 17.854 ms
60 x 512 threads, run 2: min 17.144 ms, max 21.131 ms, avg 18.632 ms
60 x 256 threads, run 1: min 17.184 ms, max 19.288 ms, avg 17.875 ms
60 x 256 threads, run 2: min 14.125 ms, max 19.766 ms, avg 17.982 ms
Tests utilizing pinned memory:
numBlocks = 40
40 x 1024 threads, run 1: min 8.172 ms, max 9.339 ms, avg 8.523 ms
40 x 1024 threads, run 2: min 8.208 ms, max 9.193 ms, avg 8.501 ms
40 x 768 threads, run 1: min 8.194 ms, max 9.233 ms, avg 8.42 ms
40 x 768 threads, run 2: min 8.169 ms, max 8.739 ms, avg 8.398 ms
40 x 512 threads, run 1: min 7.946 ms, max 9.618 ms, avg 9.005 ms
40 x 512 threads, run 2: min 7.935 ms, max 10.512 ms, avg 9.951 ms
40 x 256 threads, run 1: min 5.013 ms, max 7.819 ms, avg 5.939 ms
40 x 256 threads, run 2: min 4.344 ms, max 7.478 ms, avg 5.853 ms
numBlocks = 60
60 x 1024 threads, run 1: min 6.439 ms, max 9.17 ms, avg 7.857 ms
60 x 1024 threads, run 2: min 6.679 ms, max 11.305 ms, avg 8.461 ms
60 x 768 threads, run 1: min 6.403 ms, max 11.724 ms, avg 8.698 ms
60 x 768 threads, run 2: min 7.755 ms, max 9.338 ms, avg 8.822 ms
60 x 512 threads, run 1: min 6.521 ms, max 9.287 ms, avg 8.096 ms
60 x 512 threads, run 2: min 4.909 ms, max 9.07 ms, avg 6.963 ms
60 x 256 threads, run 1: min 4.084 ms, max 8.099 ms, avg 6.459 ms
60 x 256 threads, run 2: min 6.143 ms, max 8.376 ms, avg 7.622 ms
For reference, cudaMemcpyAsync() with the same chunksize:
cudaMemcpyAsync(), run 1: min 5.306 ms, max 6.142 ms, avg 5.492 ms
cudaMemcpyAsync(), run 2: min 5.305 ms, max 6.047 ms, avg 5.667 ms
Conclusion:
Obviously, pinned memory holds a significant advantage over mapped memory in terms of copy speed. We had already gathered that.
The results for the same configuration sometimes differ a lot between run 1 and run 2. That is why I did 2 runs per configuration in the first place. Still, it surprises me that especially the best-case and worst-case results can diverge that much.
Regarding the best-case and average-case results, the smaller block sizes (512 and 256 threads) seem to give the best results. I will see if I can get close to these when using a more production-like configuration with 2 or 3 streams.
The best results using cpyToDeviceKernel are pretty close to what I got out of cudaMemcpyAsync, which I find promising.
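For completeness, the two host allocations in these tests only differ in how they are set up; roughly like this (sketch, error checking omitted, names are placeholders):

```cpp
const size_t numPoints = 50'000'000;
const size_t bytes     = numPoints * sizeof(PointXY);

// Mapped (zero-copy) variant: the kernel reads the host buffer through a
// device pointer obtained from cudaHostGetDevicePointer().
PointXY* hostMapped;
cudaHostAlloc((void**)&hostMapped, bytes, cudaHostAllocMapped);
PointXY* mappedDevView;
cudaHostGetDevicePointer((void**)&mappedDevView, hostMapped, 0);

// Pinned variant: page-locked host memory used as the copy source.
PointXY* hostPinned;
cudaHostAlloc((void**)&hostPinned, bytes, cudaHostAllocDefault);

// Device-side destination buffer for the copy kernel / cudaMemcpyAsync.
PointXY* deviceBuffer;
cudaMalloc((void**)&deviceBuffer, bytes);
```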
Wow, nice overview. And it is a very good result that you matched the cudaMemcpyAsync speed!
I'm not sure how much of the min/max outliers comes from the measurement method.
You can measure times on the GPU with the global timer or in the stream with a host callback.
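A rough sketch of the host-callback variant (it takes CPU timestamps at the moment the stream reaches that point; your PointXY type and cpyToDeviceKernel are assumed to exist elsewhere):

```cpp
#include <chrono>
#include <cuda_runtime.h>

// Host callback: the runtime calls this on a CPU thread once the stream
// reaches this point. Do not call CUDA API functions from inside it.
static void CUDART_CB recordTimestamp(void* userData)
{
    auto* t = static_cast<std::chrono::steady_clock::time_point*>(userData);
    *t = std::chrono::steady_clock::now();
}

// Times one chunk copy in the given stream via two host callbacks; returns ms.
double timeChunkCopy(cudaStream_t stream, const PointXY* src, PointXY* dst,
                     size_t count, int numBlocks, int blockSize)
{
    std::chrono::steady_clock::time_point t0, t1;
    cudaLaunchHostFunc(stream, recordTimestamp, &t0);
    cpyToDeviceKernel<<<numBlocks, blockSize, 0, stream>>>(src, dst, count);
    cudaLaunchHostFunc(stream, recordTimestamp, &t1);
    cudaStreamSynchronize(stream);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```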
Perhaps you want to add test runs for even fewer threads than 256?
You will need the threads as soon as you do parallel computation, although 256 is low enough.
I’m currently using std::chrono::high_resolution_clock because I want to be able to measure CPU and GPU calculations with the same method. Ultimately, I want to compare the time the CPU and the GPU need to perform the same task.
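In essence it looks like this (sketch; runGpuPipeline stands for the queued H2D, compute, and D2H work):

```cpp
#include <chrono>

// Sketch: timing a complete GPU run with std::chrono.
// The synchronize before stopping the clock is essential; otherwise only the
// (asynchronous) launch overhead is measured, not the GPU work itself.
auto t0 = std::chrono::high_resolution_clock::now();

runGpuPipeline(stream);              // placeholder for H2D copies, kernels, D2H copies
cudaStreamSynchronize(stream);       // wait for all queued GPU work to finish

auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
```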
Sure, I can try that.
By the way, I’m currently testing with 2/3 streams and different grid configurations again. You said blockSize should always be a multiple of 128. Where does this constraint come from? Wouldn’t using a multiple of the warpSize (= 32) also be possible?
Sounds good. So this is actual runtime variation (probably still due to Windows), not some Nsight Systems artifact.
Yes. But I fear that kernels in other streams will not be able to use the freed threads (I am not 100% sure). You could try it out.
So even if the copy kernel is smaller, perhaps the computation kernel cannot be made larger and still run at the same time.
There are some hints that the reservation of resources is not independent between the partitions.
We have 4 SM Partitions per SM. Apart from whether it is possible, should we fill the partitions evenly?
For compute, one tries to divide the work evenly across the partitions, as each has its own computational units.
For (slow) memory accesses, it probably would not matter whether they are distributed evenly?
However, it could be that the number of transactions in flight (needed to cover the high latency) fills pipelines within the partitions.
I have to correct myself here: I am indeed using the high_resolution_clock, but only for measuring the total time needed for a complete run of H2D, computation, and D2H. That is what I’m ultimately going to compare across different CPUs and GPUs. The measurements of the performance tests above were actually taken from Nsight Systems. I got that mixed up, sorry for the confusion.
Hm, okay. Maybe I will try this out. It would give me more flexibility in finding the optimal grid configuration while preventing copy operations from completely occupying the GPU. On the other hand, that means loads of additional test cases :-D