When using a single CUDA stream and inspecting the execution with Nsight Visual Studio Edition, cudaMemsetAsync does not respect the order in which it was enqueued: it always executes during the CUDA runtime call, while the rest of the commands (memory copies, kernels, etc.) execute later.
Here is a minimal reproducer:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <iostream>

typedef unsigned int uint;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code,
                      const char *file,
                      int line,
                      bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void kernel_setAZero(uint* data) {
    *data = 0u;
}

void launch_setAZero(uint* data, cudaStream_t stream) {
    kernel_setAZero<<<1, 1, 0, stream>>>(data);
    gpuErrchk(cudaGetLastError());
}

int main() {
    uint *data;
    gpuErrchk(cudaMalloc(&data, sizeof(uint)));
    cudaStream_t stream;
    gpuErrchk(cudaStreamCreate(&stream));

    // Enqueue the kernel first, then the memset, on the same stream.
    launch_setAZero(data, stream);
    gpuErrchk(cudaMemsetAsync(data, 0, sizeof(uint), stream));

    gpuErrchk(cudaStreamSynchronize(stream));
    std::cout << "Executed!!" << std::endl;
    return 0;
}
As the code shows, on a single stream the kernel kernel_setAZero is enqueued first, and cudaMemsetAsync is enqueued afterwards on the same stream.
If you inspect the run with Nsight Visual Studio Edition, the visualized execution order is the inverse of the expected one.
Additionally, using cudaMemsetAsync prevents the overlap of transfers and computation under certain conditions, but I am not sure whether this is expected behavior on Windows.
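As a possible workaround (a sketch I have not verified under WDDM, and not part of any official recommendation), the memset can be replaced by a trivial fill kernel launched on the same stream, so the operation is ordered like any other kernel launch. The names fill_bytes and launch_fillBytes are my own:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Fill n bytes with value; each thread writes one byte.
__global__ void fill_bytes(unsigned char* data, unsigned char value, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Drop-in replacement for cudaMemsetAsync(data, value, n, stream):
// the fill is enqueued on the stream as a regular kernel launch.
void launch_fillBytes(void* data, unsigned char value, size_t n,
                      cudaStream_t stream) {
    const unsigned int threads = 256;
    const unsigned int blocks =
        (unsigned int)((n + threads - 1) / threads);
    fill_bytes<<<blocks, threads, 0, stream>>>(
        static_cast<unsigned char*>(data), value, n);
}

// In main() above, the cudaMemsetAsync line would become:
//     launch_fillBytes(data, 0, sizeof(uint), stream);
```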
System configuration:
Windows 10 Pro 17334.286
Visual Studio 2017 Update 8 (compiler version 14.15)
Quadro P4000 (also tested on Quadro M4000 and Quadro P6000)
Tested on both CUDA 9.1 (WDDM driver 391.03) and CUDA 10.0 (WDDM driver 411.63)
Nsight Visual Studio Edition 5.5 for CUDA 9.1
Nsight Visual Studio Edition 6.0 for CUDA 10.0