What does cudaMalloc's implicit synchronization mean?

liuliang1 · June 17, 2025, 7:29am

I'm confused about the synchronization behavior of cudaMalloc (and of cudaMemcpy and cudaFree as well). The test code is below:
#include <pthread.h>
#include <stdio.h>

const int N = 1 << 20;

// Grid-stride loop: each thread handles elements tid, tid + stride, ...
__global__ void kernel(float* x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
    {
        x[i] = sqrt(pow(3.14159, i));
    }
}

// Each host thread allocates, launches on the default stream, then allocates again.
void* launch_kernel(void* dummy)
{
    float* data;
    cudaMalloc(&data, N * sizeof(float));
    // cudaStream_t stream;
    // cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    kernel<<<1, 64>>>(data, N);
    float* data1;
    cudaMalloc(&data1, N * sizeof(float));  // does this call wait for the kernel above?
    cudaStreamSynchronize(0);
    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    // Start all host threads; each one runs launch_kernel.
    for (int i = 0; i < num_threads; i++)
    {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0))
        {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    // Wait for every host thread to finish.
    for (int i = 0; i < num_threads; i++)
    {
        if (pthread_join(threads[i], NULL))
        {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}
The Nsight Systems report I captured is below:

So my questions are:
If cudaMalloc synchronized the device with the host, the second cudaMalloc in the CUDA API (host) timeline would have to end after the kernel finishes executing in the CUDA HW (device) timeline.
Does cudaMalloc imply an implicit cudaStreamSynchronize on the legacy default stream, an implicit cudaDeviceSynchronize across the whole device, or something else?
If I turn on the per-thread default stream compile option (--default-stream per-thread), does that mean cudaMalloc, cudaMemcpy and cudaFree run on the per-thread default stream? Can cudaMalloc, cudaMemcpy and cudaFree still block the host and other streams? Does the legacy default stream still exist when the per-thread default stream option is on?
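The first question could also be checked without reading it off the timeline. Here is a minimal sketch, assuming a long-running kernel on the default stream followed by an event; `spin_kernel` and `probe_malloc_sync` are hypothetical names of my own, not CUDA APIs, and the spin length is an arbitrary guess. If cudaMalloc waited for the device, the event would already be complete when cudaMalloc returns.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper kernel: spins for roughly `cycles` GPU clock ticks so
// that it is likely still running when the host reaches the next API call.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns 1 if the kernel was still running when cudaMalloc returned,
// 0 if it had already finished, and -1 on any CUDA error.
int probe_malloc_sync()
{
    cudaEvent_t done;
    if (cudaEventCreate(&done) != cudaSuccess) return -1;

    spin_kernel<<<1, 1>>>(1000000000LL);  // long-running work on the default stream
    cudaEventRecord(done, 0);             // completes only after the kernel finishes

    float* p = NULL;
    if (cudaMalloc(&p, 1 << 20) != cudaSuccess) return -1;

    // cudaEventQuery does not block: cudaErrorNotReady means the kernel
    // had not finished when cudaMalloc returned.
    cudaError_t status = cudaEventQuery(done);

    // Clean up only after the kernel has really finished.
    cudaDeviceSynchronize();
    cudaFree(p);
    cudaEventDestroy(done);

    if (status == cudaErrorNotReady) return 1;
    if (status == cudaSuccess) return 0;
    return -1;
}
```

Calling `probe_malloc_sync()` from `main` and printing the result, once with the default build and once with the per-thread default stream option, would show whether the answer differs between the two modes. The result may depend on the driver and GPU, so this is only a probe, not a guarantee of documented behavior.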

I would really appreciate a quick reply. Thanks!
