Hi, I am trying to profile my code, in which (hopefully) cudaMallocAsync calls overlap with another kernel's execution. When I profile the program with nsys, I can see the malloc call in the CUDA API row, but not in the corresponding stream.
cudaMallocAsync and cudaFreeAsync are host-side activities, just like regular cudaMalloc and cudaFree. They do not appear in the "CUDA HW" rows.
So I cannot overlap mallocs with the kernel computation?
You can, but this is no different from any other host work: if a kernel is running and you perform an allocation on the host, the two will overlap.
You can see in this picture that cudaMalloc and cudaMallocAsync execute concurrently with the kernel:
```cpp
__global__ void long_running_kernel() {
    // Busy-wait on the device (~100 ms total) so host-side calls can overlap.
    for (int i = 0; i < 10; i++) {
        __nanosleep(10000000);
    }
}

int main() {
    cudaSetDevice(0);
    long_running_kernel<<<1, 1>>>();  // kernel runs while the host continues

    void* ptr1;
    void* ptr2;
    cudaMalloc(&ptr1, 1024 * 1024);                       // host-side, overlaps the kernel
    cudaMallocAsync(&ptr2, 1024 * 1024, (cudaStream_t)0); // also host-side
    cudaFreeAsync(ptr2, (cudaStream_t)0);
    cudaFree(ptr1);
    cudaDeviceSynchronize();
}
```
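As a cross-check without a profiler, you can time how quickly cudaMallocAsync returns on the host while the kernel is still in flight. This is a sketch under the same setup as above (device 0, default stream, __nanosleep support); if the call returns in microseconds while the kernel runs for roughly 100 ms, the allocation clearly overlapped the kernel:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void long_running_kernel() {
    // Keeps the device busy for roughly 100 ms.
    for (int i = 0; i < 10; i++) {
        __nanosleep(10000000);
    }
}

int main() {
    cudaSetDevice(0);
    long_running_kernel<<<1, 1>>>();  // launch returns immediately

    auto t0 = std::chrono::steady_clock::now();
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, 1024 * 1024, (cudaStream_t)0);
    auto t1 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    // A return time far below the kernel's ~100 ms runtime shows the call
    // overlapped the kernel rather than waiting for it.
    printf("cudaMallocAsync returned after %lld us\n", us);

    cudaFreeAsync(ptr, (cudaStream_t)0);
    cudaDeviceSynchronize();
    return 0;
}
```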
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.