Hi, I am trying to profile my code, in which (hopefully) cudaMallocAsync calls overlap with another kernel's execution. When I profile the program with nsys, I can see the malloc call in the CUDA API row, but not in the corresponding stream.
cudaMallocAsync and cudaFreeAsync are host-side activities, just like regular cudaMalloc and cudaFree. They do not appear in the "CUDA HW" rows.
So I cannot overlap mallocs with the kernel computation?
You can, but this is no different from any other host work: if a kernel is running and you perform an allocation on the host, the two will overlap.
You can see in this picture that cudaMalloc and cudaMallocAsync execute concurrently with the kernel:
```cpp
__global__ void long_running_kernel() {
    // Busy-wait on the device (~100 ms total) so host-side calls can overlap.
    for (int i = 0; i < 10; i++) {
        __nanosleep(10000000);
    }
}

int main() {
    cudaSetDevice(0);
    long_running_kernel<<<1, 1>>>();  // kernel runs while the host continues

    void* ptr1;
    void* ptr2;
    cudaMalloc(&ptr1, 1024 * 1024);                       // host-side, overlaps the kernel
    cudaMallocAsync(&ptr2, 1024 * 1024, (cudaStream_t)0); // also host-side
    cudaFreeAsync(ptr2, (cudaStream_t)0);
    cudaFree(ptr1);
    cudaDeviceSynchronize();
}
```
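As a cross-check without a profiler, you can time how quickly cudaMallocAsync returns on the host while the kernel is still in flight. This is a sketch under the same setup as above (device 0, default stream, __nanosleep support); if the call returns in microseconds while the kernel runs for roughly 100 ms, the allocation clearly overlapped the kernel:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void long_running_kernel() {
    // Keeps the device busy for roughly 100 ms.
    for (int i = 0; i < 10; i++) {
        __nanosleep(10000000);
    }
}

int main() {
    cudaSetDevice(0);
    long_running_kernel<<<1, 1>>>();  // launch returns immediately

    auto t0 = std::chrono::steady_clock::now();
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, 1024 * 1024, (cudaStream_t)0);
    auto t1 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    // A return time far below the kernel's ~100 ms runtime shows the call
    // overlapped the kernel rather than waiting for it.
    printf("cudaMallocAsync returned after %lld us\n", us);

    cudaFreeAsync(ptr, (cudaStream_t)0);
    cudaDeviceSynchronize();
    return 0;
}
```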
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.