Got wrong result when not using cudaDeviceSynchronize in threads.

Hi, I packed 2 cuda reduce kernels as a device function and tried to call the function inside another kernel. I find that a cudaDeviceSynchronize() function must be called after the cudaReduceSum() to get the correct result.

The stream of the kernels is set to 0 (the default), so my understanding is that the launches should run serially within each thread, and thus no explicit synchronization should be needed.

Could somebody tell me what is the real issue here? Thanks!

// NOTE(review): excerpt of a CDP (CUDA Dynamic Parallelism) kernel quoted in a
// forum post — several parameters (comps, dim, X, Pi, Mu, Sigma, weight,
// tol_spt) and the local maxArg are unused here, presumably consumed by code
// trimmed from the post.
__global__ void kernM(dim3 grid,dim3 block,
                          const unsigned int comps,
                          const unsigned int pts,
                          const unsigned int dim,
                          const float *X, float* loggamma,
                          float *working, float *Pi,float *Mu,
                          float* Sigma, const float weight,
                          const float tol_spt){

    const unsigned int tid = threadIdx.x;

    // One workspace slice per parent thread so the child kernels launched by
    // different threads do not share buffers.
    float *loggammaK=&loggamma[tid*pts];
    float *workingK=&working[tid*pts*dim*dim];

    // Device-side launch (CDP). Child grids execute asynchronously with
    // respect to the launching parent thread — even on stream 0 — so
    // "serial within each thread" does NOT hold for child kernels.
    cudaReduceSum(grid,block,pts,loggammaK,workingK,0);
    // Required: joins this thread's child grids before workingK[0] is read
    // below; without it the read races with the still-running child kernel.
    // NOTE(review): device-side cudaDeviceSynchronize() is deprecated since
    // CUDA 11.6 and removed in CUDA 12 — confirm the toolkit version in use.
    cudaDeviceSynchronize();

    const float maxArg=workingK[0];

    // Second reduction, in place over the partial-sum buffer.
    cudaReduceSum(grid,block,pts,workingK,workingK,0);
    cudaDeviceSynchronize();

}

// Block-level sum reduction: each block sums up to 2*blockDim.x elements of
// d_in (first add performed during the load) and writes its partial sum to
// d_out[blockIdx.x].
//
// Expectations (unchecked): launch with ceil(num / (2*blockDim.x)) blocks;
// blockDim.x must be a power of two with 32 <= blockDim.x <= 1024 (the static
// shared array holds 1024 floats, and the tail phase syncs a full warp).
// Out-of-range threads contribute 0, so num need not be a multiple of the tile.
__global__
void kernReduceSum(unsigned int num,const float *d_in, float *d_out)
{
    unsigned int tid = threadIdx.x;
    // Each thread covers two elements, blockDim.x apart.
    unsigned int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    volatile __shared__ float sdata[1024];
    sdata[tid] = 0;

    if (idx + blockDim.x < num) {
        sdata[tid] = d_in[idx] + d_in[idx + blockDim.x];
    }
    else if (idx < num) {
        sdata[tid] = d_in[idx];   // tail: only the first element is in range
    }
    __syncthreads();

    // Tree reduction in shared memory down to 64 surviving elements.
    if (blockDim.x >= 1024) { if (tid < 512) { sdata[tid] += sdata[tid + 512]; } __syncthreads(); }
    if (blockDim.x >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockDim.x >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockDim.x >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    // Final warp. FIX: the original relied on implicit warp-synchronous
    // execution (volatile shared memory, no barriers), which is not valid
    // under Volta+ independent thread scheduling and can produce wrong sums.
    // Each step now (a) restricts the writers to lanes below the offset, so no
    // lane writes a location another lane reads in the same step, and
    // (b) separates dependent steps with __syncwarp(), reached by all 32
    // lanes of warp 0.
    if (tid < 32) {
        if (blockDim.x >= 64) sdata[tid] += sdata[tid + 32];
        __syncwarp();
        if (tid < 16) sdata[tid] += sdata[tid + 16];
        __syncwarp();
        if (tid < 8) sdata[tid] += sdata[tid + 8];
        __syncwarp();
        if (tid < 4) sdata[tid] += sdata[tid + 4];
        __syncwarp();
        if (tid < 2) sdata[tid] += sdata[tid + 2];
        __syncwarp();
        if (tid < 1) sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Two-pass sum reduction driver: the first pass writes one partial sum per
// block into dOut; if more than one partial was produced, a second single-block
// pass folds them into dOut[0], in place.
//
// Preconditions (unchecked): block.x <= 1024 (size of kernReduceSum's static
// shared array); the second pass can only fully reduce up to 2*block.x
// partials — TODO confirm (grid.x+1)/2 never exceeds that at the call sites.
// NOTE(review): both launches are asynchronous and unchecked; callers — host
// or CDP device code — must synchronize before reading dOut, which is why the
// poster's kernel needed cudaDeviceSynchronize() after calling this.
__host__ __device__
void cudaReduceSum(const dim3 grid,const dim3 block,
                   const unsigned int num, const float* dIn,float *dOut,
                   cudaStream_t stream) {
    // First pass: half as many blocks as `grid` since each thread loads two
    // elements; `num` inputs in dIn, (grid.x+1)/2 partial sums out to dOut.
    kernReduceSum<<<(grid.x+1)/2,block,0,stream>>>(num,dIn,dOut);
    if((grid.x+1)/2!=1) {
        // n = grid.x/2 + grid.x%2 == ceil(grid.x/2) partials remain.
        const int n = (grid.x / 2) + (grid.x % 2);
        kernReduceSum << < 1, block ,0,stream>> > (n, dOut, dOut);
    }
}

Another issue is that I tried to create a stream in a host function, but cudaStreamCreate function is very slow. Any possible reason? Thanks!

// Times the process's first cudaStreamCreate() call.
// NOTE(review): the first CUDA runtime call triggers lazy context
// creation/initialization, so this measures CUDA start-up overhead rather
// than stream creation itself — issue another CUDA call before the timed
// region, or time a second cudaStreamCreate() for comparison (see reply below).
auto start = std::chrono::high_resolution_clock::now();

    cudaStream_t streams[1];

    cudaStreamCreate(&streams[0]);

auto end = std::chrono::high_resolution_clock::now();

    auto duration =
            std::chrono::duration_cast<std::chrono::duration<float>>(end - start);
    std::cout << "Time elapsed: " << duration.count() << "s" << std::endl;

   // result: Time elapsed: 3.15592s

First of all, I suggest providing a complete code that someone else could test if you want help with a question like this.

Second, a CDP (CUDA Dynamic Parallelism) code can be challenging. I recommend doing rigorous error checking, just as you would in host code, on device code kernel launches and use of the device runtime API. Without a complete code, it's impossible to tell how many outstanding kernel launches you will have, since you are launching kernels on each device thread. The machine has capacity limits here, including a pending launch limit. The cudaDeviceSynchronize() calls could possibly be keeping you from hitting pending launch limits when you are launching a sequence of kernels.

If you are having trouble with a CDP code, you might want to read the relevant section of the programming guide. It’s not all that long:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

Regarding the long time for stream creation, maybe this is just CUDA start-up overhead. If this is the first call, try issuing another CUDA call before timing this call, or else do a 2nd stream create and time that also.

Thanks! txbob.