Trying to run cudaMemsetAsync in a more timely manner

The problem I’m seeing is that a DtoH cudaMemcpyAsync is blocking a cudaMemsetAsync from completing in a timely manner. That is, I issue a cudaMemcpyAsync in one stream, then issue a cudaMemsetAsync in a different stream; the cudaMemcpyAsync takes a long time to complete, and only after it’s done does the cudaMemsetAsync run and finish. This occurs despite issuing the cudaMemcpyAsync and cudaMemsetAsync in different streams (and using pinned memory). Is that expected behavior? Normally I would suspect a CPU lock or some other higher-level error, but I’ve spent a lot of time looking at the CPU code and I’m currently out of ideas… which usually means my assumptions are wrong.
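For reference, a minimal sketch of the issuing pattern I’m describing (names and sizes are illustrative, not the proprietary code):

```cuda
// dst_host is pinned host memory (cudaMallocHost), src_dev and buf_dev are
// device allocations; stream_a and stream_b are distinct non-default streams.
cudaMemcpyAsync(dst_host, src_dev, copy_bytes,
                cudaMemcpyDeviceToHost, stream_a);    // long-running DtoH copy
cudaMemsetAsync(buf_dev, 0, memset_bytes, stream_b);  // expected to overlap, but
                                                      // only runs after the copy
cudaStreamSynchronize(stream_a);
```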

This is on a V100 with 6 copy engines and CUDA 10.1.

I was going to attach two screenshots of the nvvp timeline, but there’s not an easy way to attach them to this thread.

In the first shot, you would see that the DtoH copy is initiated in the second thread, in the small sliver that I’ve highlighted with a red circle. There’s also a cudaStreamSynchronize call that syncs on the stream after the cudaMemcpyAsync[s] are done.

In the second screenshot, you would see that the cudaMemsetAsync call takes a long time to be eligible to be run, but once it’s run, it is very quick.

My apologies for this being a horrible question w.r.t. the amount of concrete information that I can divulge (i.e., code) – I can’t copy+paste the code as it’s proprietary and not mine, and I’m struggling to make a minimal reproducible example that demonstrates my issue at the moment. That being said, there’s a good chance one of my assumptions is wrong, and maybe someone here can help me understand what it is. Thanks for your time.

From a practical perspective: Examine closely whether the call to cudaMemsetAsync() is actually needed. More often than not, that is not the case, because the initialization can be rolled into whatever kernel uses the data next.
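For example (a hypothetical consuming kernel; none of the poster’s real code is shown here):

```cuda
// Hypothetical sketch: instead of a separate cudaMemsetAsync pass, have the
// consuming kernel zero each scratch element just before it uses it.
__global__ void consume(float *scratch, const float *input, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[i] = 0.0f;                 // replaces the separate memset pass
        for (int k = 0; k < 4; ++k)        // ... then accumulate into it
            scratch[i] += input[k * n + i];
    }
}
```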

Write your own kernel to do the work of cudaMemsetAsync. That can run concurrently with cudaMemcpyAsync.

Bingo – good ideas @njuffa and @Robert_Crovella. I was avoiding issuing the clear in a kernel, because I was thinking that a driver-level call from the CPU that clears 6 GB of device memory could be quicker (I’m not sure how much of it really needs to be cleared, but it’s a non-trivial fraction). It sounds like clearing it from within a kernel is a viable way to go, though.

Another thought I had overnight is that I might be able to issue the memset as a cleanup call (as opposed to a preparation call, as it is now). The locking I do in CPU-land would ensure an appropriate ordering, and so I wouldn’t need multiple CPU-threads to refer to any particular stream (which makes it a little cleaner to implement).

I’ll give these ideas a try and report back. Thanks!

Moving the memset to “immediately” after the kernel returns (or even putting it in the same stream) has the same performance problem, because other memcpys can sneak in between the kernel finishing and the memset. There’s nothing surprising about this – it’s a classic ordering problem. Just reporting back w.r.t. my second paragraph in post #4.

Hmmm, well, the below implementation took a while (150 ms). Oddly enough, the duration of executing clear_scratch_space_kernel was about the same as the earlier cudaStreamSynchronize call. I think my real problem may be as @njuffa pointed out – it takes a while to reset 6 GB.

__global__ void clear_scratch_space_kernel(UserCampaignEvalScratchSpace * scratch_space) {
    const int BLOCKS = GPU_BLOCK_COUNT_FOR_SCRATCH_SPACE_INIT; // 80
    const int THREADS = GPU_THREAD_COUNT_FOR_SCRATCH_SPACE_INIT; // 256
    const int idx = blockIdx.x * THREADS + threadIdx.x;
    size_t size = sizeof(UserCampaignEvalScratchSpace);
    size_t size_of_typical_chunk = round_up(size / (BLOCKS * THREADS), GPU_CACHE_LINE_SIZE_IN_BYTES);

    // Due to truncation, the threads at the end won't have anything to do.  This is a little sloppy but costs us
    // hardly anything in performance, so we do the simpler thing.

    size_t this_threads_offset = idx * size_of_typical_chunk;
    if (this_threads_offset > sizeof(UserCampaignEvalScratchSpace)) {
        return;
    }

    size_t size_of_this_threads_chunk;
    if (this_threads_offset + size_of_typical_chunk >= sizeof(UserCampaignEvalScratchSpace)) {
        // We are the last thread, so we do a partial write
        size_of_this_threads_chunk = sizeof(UserCampaignEvalScratchSpace) - this_threads_offset; // TODO: check for an off-by-one error here
    } else {
        size_of_this_threads_chunk = size_of_typical_chunk;
    }
    void * starting_address = reinterpret_cast<char *>(scratch_space) + this_threads_offset;
    memset(starting_address, 0, size_of_this_threads_chunk);
}

No, you don’t want to use memset in (this kind of) kernel code.

You should be able to predict the performance of a good-quality memset kernel very easily: its duration should be the total bytes to be cleared/set divided by the device-to-device memory bandwidth reported by bandwidthTest (for 6 GB on a V100, that’s on the order of 10 ms, not 150 ms). If you’re not hitting that, your kernel is poorly written. Use a grid-stride loop:

https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

where each thread is writing either 4, 8, or 16 bytes (tune that to see what works best, but there shouldn’t be a huge difference among those choices). Don’t use memset to do this 4/8/16-byte write either; memset writes a byte at a time. bad bad bad
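A sketch of such a kernel, assuming the buffer is at least 16-byte aligned (cudaMalloc guarantees this) and, for simplicity, that its size is a multiple of 16 bytes:

```cuda
// Grid-stride zeroing kernel; each thread writes 16 bytes (one uint4) per
// iteration. Assumes num_bytes is a multiple of sizeof(uint4); handle any
// tail bytes separately if that doesn't hold.
__global__ void zero_kernel(uint4 *ptr, size_t num_elems) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < num_elems; i += stride) {
        ptr[i] = make_uint4(0, 0, 0, 0);
    }
}

// Launch, e.g. in its own stream so it can overlap the DtoH copy:
// zero_kernel<<<1024, 256, 0, stream_b>>>((uint4 *)buf, num_bytes / sizeof(uint4));
```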

Now, that assumes nothing else is going on. If you are simultaneously doing a cudaMemcpyAsync to or from device memory, that eats into the available bandwidth, so your memset kernel’s performance will be reduced by the bandwidth consumed by other concurrent operations.

Not a super-fan of cross-posting, but here are the results from the experiments I ran to benchmark different forms of clearing memory.

Looks like you responded with that idea while I was developing it. Glad to get confirmation that it’s a viable option, @Robert_Crovella!