Asynchronous execution problem with cudaMalloc

Hi,
I read the post https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ and tried its example:

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();

    return 0;
}

I have a question about cudaMalloc. The CUDA C Programming Guide says that device memory allocation is synchronous, yet in this post it does not prevent the kernels from executing concurrently. However, if I add a single cudaFree, it breaks the concurrency and makes the program much slower.
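Concretely, the slow variant looks like this (a sketch of my modification; the added cudaFree is the only change to the loop above):

for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&data[i], N * sizeof(float));

    kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
    kernel<<<1, 1>>>(0, 0);

    // adding this single call destroys the concurrency and slows everything down
    cudaFree(data[i]);
}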

Can anyone help me?

Hi,
Sorry for such a late reply; I think my answer will still help people in the future.
cudaMalloc is asynchronous, but cudaFree is not.
cudaFree needs to be synchronous because of a possible run-time error.
Suppose a kernel is still using an array d_array when the programmer calls cudaFree(d_array) a few lines later in the host code. If cudaFree were asynchronous, the memory could already have been deallocated while the kernel was running, and the kernel would fault on freed memory. However, this cannot happen in CUDA because of cudaFree's synchronous behavior: the kernel finishes its execution, and only then does the deallocation happen. In this way, no run-time error can possibly occur.
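A minimal sketch of the hazard (the kernel and launch configuration here are hypothetical, just for illustration):

__global__ void scale(float *d_array, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) d_array[i] *= 2.0f;
}

// host code
scale<<<256, 256>>>(d_array, n); // the launch returns immediately; the kernel may still be running
cudaFree(d_array);               // synchronous: waits until it is safe to release d_array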
N.B. This is entirely my theory. I have not found any documentation regarding this.
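One possible workaround (assuming a recent toolkit; the stream-ordered memory allocator was added in CUDA 11.2) is to use cudaMallocAsync/cudaFreeAsync, which order the allocation and the free within a stream instead of synchronizing the whole device. A sketch based on the loop from the question:

for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);

    // allocation and free are ordered within streams[i], not device-wide
    cudaMallocAsync(&data[i], N * sizeof(float), streams[i]);
    kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
    cudaFreeAsync(data[i], streams[i]);
}
cudaDeviceSynchronize();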