Multi-threaded CPU application is not asynchronous when using cudaFree

miguelL · November 25, 2013, 10:48pm

I created an application with multiple CPU threads, where each CPU thread creates a separate stream in the same context on my GPU (a Tesla K20c). I’m using Windows 7 64-bit and CUDA 5.5.

Here is my code:

#include "gpuCode.cuh"

__global__ void kernelAddConstant1(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
    for (int i = 0; i < 4000000.0; i++)
    {
        if (i%2 == 0)
        {
            g_a[idx] += 5;
        }
        else
        {
            g_a[idx] -= 5;
        }
    }
}


// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for (int i = 0; i < n; i++)
    {
        if (data[i] != i + b)
        {
            return 0;
        }
    }
    return 11;
}

int gpuDo()
{
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate( &stream );
    
    int *a;
    int *d_a;
    
    unsigned int n;
    unsigned int nbytes;
    
    int b;
    
    n = 2 * 8192/16;
    nbytes = n * sizeof(int);
    b = 7;      // value by which the array is incremented
    
    cudaHostAlloc( (void**)&a, nbytes, cudaHostAllocDefault ) ;
    cudaMalloc((void **)&d_a, nbytes);
    
    for (unsigned int i = 0; i < n; i++)
        a[i] = i;

    unsigned int nbytes_per_kernel = nbytes;
    dim3 gpu_threads(128);  // 128 threads per block
    dim3 gpu_blocks(n / gpu_threads.x);
    
    cudaMemsetAsync(d_a, 0, nbytes_per_kernel, stream);
    
    cudaMemcpyAsync(d_a, a, nbytes_per_kernel, cudaMemcpyHostToDevice, stream);
    
    
    kernelAddConstant1<<<gpu_blocks, gpu_threads, 0, stream>>>(d_a, b);

    cudaMemcpyAsync(a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize ( stream ) ;
    cudaStreamDestroy(stream);
    
    //cudaFree(d_a);
    
    int bResult = correctResult(a, n, b);

    //if (a)
        //cudaFreeHost(a); // free CPU memory
    
    return bResult;
}

void gpuEnd()
{
    cudaDeviceReset();
}

When I leave cudaFree and cudaFreeHost commented out, I get the following result:

[External media not available]

This is perfect, except that I have a memory leak because I’m not calling cudaFree and cudaFreeHost. When I do call cudaFree and cudaFreeHost, I get the following result:

[External media not available]

This is bad. When cudaFree is used, some streams wait for others to finish first, while other streams still run asynchronously. I’m assuming this is because cudaFree is not asynchronous, which is fine, but that doesn’t explain why the overlap sometimes works (as with the first three kernels launched) and sometimes doesn’t.

If cudaFree is called while the GPU is already busy doing something else, is it possible to have the CPU continue computing and let cudaFree occur automatically the first chance it gets? Is there another way to approach this issue? Thanks for any help you can give!
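One possible approach, sketched here as an illustration rather than a confirmed fix for this case, is to keep allocation and deallocation out of the repeated work entirely: each CPU thread allocates its stream and buffers once, reuses them for every launch, and frees them only after its last launch. This assumes that cudaFree (and, similarly, cudaMalloc and cudaHostAlloc) can implicitly synchronize the device, which would explain streams stalling on each other. The names `GpuWorker`, `workerInit`, `workerRun`, and `workerDestroy` are hypothetical and do not appear in the code above; the kernel is a shortened stand-in for kernelAddConstant1.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-thread resource bundle (not part of the original code).
struct GpuWorker {
    cudaStream_t stream;
    int *h_buf;        // pinned host buffer, allocated once
    int *d_buf;        // device buffer, allocated once
    unsigned int n;    // number of elements
};

// Shortened stand-in for kernelAddConstant1, with a bounds check added.
__global__ void addConstant(int *g_a, int b, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        g_a[idx] += b;
}

// Allocate once per CPU thread, before the repeated work begins.
bool workerInit(GpuWorker *w, unsigned int n)
{
    w->n = n;
    w->h_buf = NULL;
    w->d_buf = NULL;
    if (cudaSetDevice(0) != cudaSuccess) return false;
    if (cudaStreamCreate(&w->stream) != cudaSuccess) return false;
    if (cudaHostAlloc((void **)&w->h_buf, n * sizeof(int),
                      cudaHostAllocDefault) != cudaSuccess) return false;
    if (cudaMalloc((void **)&w->d_buf, n * sizeof(int)) != cudaSuccess) return false;
    return true;
}

// Reuse the same buffers on every call: no allocation and no free here.
// Returns 1 if every element equals its index plus b, otherwise 0.
int workerRun(GpuWorker *w, int b)
{
    size_t nbytes = w->n * sizeof(int);
    for (unsigned int i = 0; i < w->n; i++)
        w->h_buf[i] = (int)i;

    cudaMemcpyAsync(w->d_buf, w->h_buf, nbytes, cudaMemcpyHostToDevice, w->stream);

    unsigned int threads = 128;
    unsigned int blocks = (w->n + threads - 1) / threads;
    addConstant<<<blocks, threads, 0, w->stream>>>(w->d_buf, b, w->n);

    cudaMemcpyAsync(w->h_buf, w->d_buf, nbytes, cudaMemcpyDeviceToHost, w->stream);
    if (cudaStreamSynchronize(w->stream) != cudaSuccess)
        return 0;

    for (unsigned int i = 0; i < w->n; i++)
        if (w->h_buf[i] != (int)i + b)
            return 0;
    return 1;
}

// Free once, after this thread has finished all of its GPU work.
void workerDestroy(GpuWorker *w)
{
    cudaStreamDestroy(w->stream);
    cudaFree(w->d_buf);
    cudaFreeHost(w->h_buf);
}
```

In this sketch a CPU thread would call `workerInit` once at startup, call `workerRun` as many times as needed, and call `workerDestroy` just before it exits. The frees still block, but they happen after the thread's last launch, so they should no longer sit between launches that are meant to overlap.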
