Why doesn't overlapping data transfers and kernel execution work here?

Okay, let’s do another trick on top to get parallelism:

Instead of two streams, just use one stream. Instead of 3 kernels, use 1 kernel and assign each thread its function according to threadIdx.x.

H2D() still copies the next block, and D2H() copies back the previous block.

int numBlocks = 20;
int blockSize = 1024;

(alternatively 40 blocks with 768 threads each)

__device__ void copy()
{
    // Threads 768..895 stage the next chunk (H2D), threads 896..1023 copy back
    // the previous chunk (D2H); each copy group is 128 threads wide.
    if (threadIdx.x < 896)
        H2D(128, threadIdx.x - 768);
    else
        D2H(128, threadIdx.x - 896);
}

__global__ void allInOneKernel()
{
    // Threads 0..767 do the computation, the remaining 256 threads do the copies.
    if (threadIdx.x < 768)
        compute(768, threadIdx.x);
    else
        copy();
}

Just be careful that threadIdx.x and blockIdx.x have to be interpreted slightly differently for each group; I tried to indicate this in the parameters of the device functions.
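
To make the index bookkeeping concrete, here is a minimal sketch (function and parameter names are only illustrative) of how each group derives its own logical global index from the partitioned threadIdx.x:

// Sketch only: each device function receives its group size and its local index
// within the group, so the usual global-index formula still applies per group.
__device__ void computeExample(int groupSize, int localIdx)
{
    // For the compute group: groupSize = 768, localIdx runs 0..767 (not 0..1023).
    int globalIdx = blockIdx.x * groupSize + localIdx;   // index into the data
    int stride    = gridDim.x  * groupSize;              // grid-stride step
    // for (int i = globalIdx; i < n; i += stride) { ... process element i ... }
    (void)globalIdx; (void)stride;
}
// The same formula holds for the two 128-thread copy groups (H2D and D2H).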

PS: The reason the kernels do not overlap is probably the way Windows internally pipelines the work it forwards to the GPU, which sometimes leads to delays. Perhaps it could be solved by changing some Windows settings, by using CUDA graphs, or by inserting a delay. It is only the first kernel invocation, after all, so perhaps you can also just accept it.
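
Regarding the CUDA graphs idea, here is a minimal sketch of capturing the stream work once and replaying it (whether this actually removes the first-launch delay on Windows is untested; the kernel launch and its parameters are placeholders):

#include <cuda_runtime.h>

void runViaGraph(cudaStream_t stream)
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Record the usual stream work into a graph; nothing executes during capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // allInOneKernel<<<numBlocks, blockSize, 0, stream>>>(/* args */);
    cudaStreamEndCapture(stream, &graph);

    // CUDA 12 signature; older toolkits use (&graphExec, graph, nullptr, nullptr, 0).
    cudaGraphInstantiate(&graphExec, graph, 0);

    // Replay the captured work with much lower per-launch CPU overhead.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}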


That looks neat! Thanks for the hint, I will try that.

Yeah, that crossed my mind, too. Apart from the “initial” delay, the rest of the profiling session looks pretty good. In the next step I will implement a better time measuring mechanism I saw in some Nvidia example so I can better compare my overlapping Linux version with the custom-copy-kernel version under Windows. If the custom version comes anywhere close to the original overlapping version, I suppose I’ll accept it.

I conducted a series of performance tests comparing various configurations of my copy kernel, which I would like to share.

General setup:

  • no. of test points: 50,000,000 (381.47 MiB)
  • chunksize: 5,000,000 (38.15 MiB)
  • streams: 1

For each grid configuration, I launched 2 test runs, following each other as closely as Nsight Systems would allow. Each run contained 10 launches of cpyToDeviceKernel, each copying 1 chunk of data. The durations given below (min/max/avg) each refer to 1 transferred chunk of 5,000,000 PointXY objects.
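
For context, the copy kernel under test follows roughly this pattern (a sketch under assumptions: the actual cpyToDeviceKernel signature and the PointXY layout may differ):

struct PointXY { float x, y; };   // assumed layout

// Device-side H2D copy: src must be a device-accessible host pointer (pinned
// memory from cudaMallocHost, or a mapped pointer from cudaHostGetDevicePointer);
// dst is a regular device buffer.
__global__ void cpyToDeviceKernel(const PointXY* __restrict__ src,
                                  PointXY* __restrict__ dst,
                                  size_t count)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread copies every stride-th point of the chunk.
    for (size_t i = idx; i < count; i += stride)
        dst[i] = src[i];
}

// One 5,000,000-point chunk with e.g. 40 blocks x 256 threads:
//   cpyToDeviceKernel<<<40, 256, 0, stream>>>(hostChunk, devChunk, 5000000);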

Tests utilizing mapped memory:

numBlocks = 40

40 x 1024 threads, run 1:

  • min: 12.753 ms
  • max: 13.738 ms
  • avg: 13.285 ms

40 x 1024 threads, run 2:

  • min: 12.571 ms
  • max: 13.977 ms
  • avg: 13.209 ms

40 x 768 threads, run 1:

  • min: 12.72 ms
  • max: 18.798 ms
  • avg: 13.96 ms

40 x 768 threads, run 2:

  • min: 12.558 ms
  • max: 14.173 ms
  • avg: 13.165 ms

40 x 512 threads, run 1:

  • min: 12.005 ms
  • max: 18.813 ms
  • avg: 16.538 ms

40 x 512 threads, run 2:

  • min: 13.684 ms
  • max: 19.375 ms
  • avg: 17.213 ms

40 x 256 threads, run 1:

  • min: 11.006 ms
  • max: 18.914 ms
  • avg: 16.664 ms

40 x 256 threads, run 2:

  • min: 12.25 ms
  • max: 18.7 ms
  • avg: 16.147 ms

numBlocks = 60

60 x 1024 threads, run 1:

  • min: 12.162 ms
  • max: 18.889 ms
  • avg: 16.354 ms

60 x 1024 threads, run 2:

  • min: 17.766 ms
  • max: 20.974 ms
  • avg: 18.638 ms

60 x 768 threads, run 1:

  • min: 16.505 ms
  • max: 20.962 ms
  • avg: 17.678 ms

60 x 768 threads, run 2:

  • min: 17.074 ms
  • max: 20.137 ms
  • avg: 17.94 ms

60 x 512 threads, run 1:

  • min: 16.7 ms
  • max: 20.031 ms
  • avg: 17.854 ms

60 x 512 threads, run 2:

  • min: 17.144 ms
  • max: 21.131 ms
  • avg: 18.632 ms

60 x 256 threads, run 1:

  • min: 17.184 ms
  • max: 19.288 ms
  • avg: 17.875 ms

60 x 256 threads, run 2:

  • min: 14.125 ms
  • max: 19.766 ms
  • avg: 17.982 ms

Tests utilizing pinned memory:

numBlocks = 40

40 x 1024 threads, run 1:

  • min: 8.172 ms
  • max: 9.339 ms
  • avg: 8.523 ms

40 x 1024 threads, run 2:

  • min: 8.208 ms
  • max: 9.193 ms
  • avg: 8.501 ms

40 x 768 threads, run 1:

  • min: 8.194 ms
  • max: 9.233 ms
  • avg: 8.42 ms

40 x 768 threads, run 2:

  • min: 8.169 ms
  • max: 8.739 ms
  • avg: 8.398 ms

40 x 512 threads, run 1:

  • min: 7.946 ms
  • max: 9.618 ms
  • avg: 9.005 ms

40 x 512 threads, run 2:

  • min: 7.935 ms
  • max: 10.512 ms
  • avg: 9.951 ms

40 x 256 threads, run 1:

  • min: 5.013 ms
  • max: 7.819 ms
  • avg: 5.939 ms

40 x 256 threads, run 2:

  • min: 4.344 ms
  • max: 7.478 ms
  • avg: 5.853 ms

numBlocks = 60

60 x 1024 threads, run 1:

  • min: 6.439 ms
  • max: 9.17 ms
  • avg: 7.857 ms

60 x 1024 threads, run 2:

  • min: 6.679 ms
  • max: 11.305 ms
  • avg: 8.461 ms

60 x 768 threads, run 1:

  • min: 6.403 ms
  • max: 11.724 ms
  • avg: 8.698 ms

60 x 768 threads, run 2:

  • min: 7.755 ms
  • max: 9.338 ms
  • avg: 8.822 ms

60 x 512 threads, run 1:

  • min: 6.521 ms
  • max: 9.287 ms
  • avg: 8.096 ms

60 x 512 threads, run 2:

  • min: 4.909 ms
  • max: 9.07 ms
  • avg: 6.963 ms

60 x 256 threads, run 1:

  • min: 4.084 ms
  • max: 8.099 ms
  • avg: 6.459 ms

60 x 256 threads, run 2:

  • min: 6.143 ms
  • max: 8.376 ms
  • avg: 7.622 ms

For reference, cudaMemcpyAsync() with the same chunksize:

cudaMemcpyAsync()

run 1:

  • min: 5.306 ms
  • max: 6.142 ms
  • avg: 5.492 ms

run 2:

  • min: 5.305 ms
  • max: 6.047 ms
  • avg: 5.667 ms
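
For completeness, the baseline above is just the standard asynchronous copy of one chunk on the stream (a sketch; buffer and stream names are placeholders, PointXY as in the earlier sketch, and hostChunk must be pinned for the copy to actually run asynchronously):

#include <cuda_runtime.h>

void copyChunkBaseline(PointXY* devChunk, const PointXY* hostChunk,
                       size_t count, cudaStream_t stream)
{
    cudaMemcpyAsync(devChunk, hostChunk, count * sizeof(PointXY),
                    cudaMemcpyHostToDevice, stream);
}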

Conclusion:

  • Obviously, pinned memory holds a significant advantage over mapped memory in terms of copy speed. We had already gathered that.
  • The results for the same configuration sometimes differ a lot from run 1 to run 2. That is why I did 2 runs per configuration in the first place. Still, it surprises me how much the best-case and worst-case results in particular can diverge.
  • Looking at the best-case and average-case results, the smaller block sizes (512 and 256 threads) seem to give the best results. I will see whether I can get close to these when using a more production-like configuration with 2 or 3 streams.
  • The best results using cpyToDeviceKernel come pretty close to what I got from cudaMemcpyAsync, which I find promising.

Wow, nice overview. And very good results: you matched the cudaMemcpyAsync speed!

I'm not sure how much of the min/max outliers comes from the measurement method.
You can measure times on the GPU with the global timer, or in the stream with a host callback.
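
Two minimal sketches of those options (names are placeholders; the global timer is read via inline PTX):

#include <chrono>
#include <cuda_runtime.h>

// (a) On the GPU: read the nanosecond-resolution global timer inside the kernel.
__device__ unsigned long long globalTimerNs()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void timedKernel(unsigned long long* elapsedNs)
{
    unsigned long long start = globalTimerNs();
    // ... the actual work ...
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *elapsedNs = globalTimerNs() - start;   // rough: one thread's own duration
}

// (b) In the stream: a host callback that fires once all prior work in the
//     stream has finished, recording a CPU timestamp.
void CUDART_CB recordTimestamp(void* userData)
{
    auto* tp = static_cast<std::chrono::steady_clock::time_point*>(userData);
    *tp = std::chrono::steady_clock::now();
}

// Usage sketch:
//   cudaLaunchHostFunc(stream, recordTimestamp, &t0);
//   cpyToDeviceKernel<<<blocks, threads, 0, stream>>>(src, dst, count);
//   cudaLaunchHostFunc(stream, recordTimestamp, &t1);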

Perhaps you want to add test runs with even fewer threads than 256?

You will need the threads as soon as you do parallel computation, although 256 is already low enough.

I'm currently using std::chrono::high_resolution_clock because I want to be able to measure CPU and GPU calculations with the same method. Ultimately, I want to compare the time the CPU and the GPU need to perform the same task.
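
A sketch of that kind of host-side timing (runOnCpu/runOnGpu are placeholders; the important detail is synchronizing the stream before stopping the clock on the GPU side):

#include <chrono>
#include <cuda_runtime.h>

// Time any task with the same clock, whether it runs on the CPU or the GPU.
template <typename F>
double elapsedMs(F&& task)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    task();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage sketch:
//   double cpuMs = elapsedMs([&] { runOnCpu(points); });
//   double gpuMs = elapsedMs([&] {
//       runOnGpu(points, stream);          // H2D, compute, D2H on the stream
//       cudaStreamSynchronize(stream);     // otherwise we only time the launches
//   });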

Sure, I can try that.
By the way, I’m just testing with 2/3 streams and different grid configurations again. You said blockSize should always be a multiple of 128. Where did you get this constraint? Wouldn’t using a multiple of the warpSize (=32) be possible?

Sounds good. So this is actual runtime variation (probably still due to Windows), not some Nsight Systems artifact.

Yes. But I fear that kernels in other streams will not be able to use the freed threads (I am not 100% sure); you could try it out.
So even if the copy kernel is smaller, perhaps the computation kernel cannot be made larger and still run at the same time.
There are some hints that the reservation of resources is not independent between the partitions.

We have 4 SM Partitions per SM. Apart from whether it is possible, should we fill the partitions evenly?

  • For compute, one tries to divide the work evenly over the partitions, as each has its own computational units.

  • For (slow) memory accesses, it probably would not matter whether they are distributed evenly?
    However, it could be that the number of transactions in flight (needed to cover the long latency) fills up pipelines within the partitions.

I have to correct myself here: I am indeed using the high_resolution_clock, but only for measuring the total time needed for a complete run of H2D, computation, and D2H. That is what I'm finally going to compare amongst different CPUs and GPUs. The measurements of the performance tests above are actually taken from Nsight Systems. I got that mixed up, sorry for the confusion.

Hm, okay. Maybe I will try this out. It would give me more flexibility in finding the optimal grid configuration while preventing the copy operations from completely occupying the GPU. On the other hand, that makes for loads of additional test cases :-D