Calling a kernel function from an OpenMP loop

Hello everyone,

I am new to parallel programming, and I am more of a theoretician than a programmer, so please forgive any obvious mistakes!

My goal: I want to call a kernel function inside an OpenMP loop.

My problem is that even when my kernel function kernel_normalization_voxels is empty, and even with the same number of threads, the code sometimes runs to the end and sometimes does not.

Here is my code:

#pragma omp parallel num_threads(num_slices_to_reconstruct)
{
    int current_slice = omp_get_thread_num();

    memset(normalization_voxels, 0, sizeof(float) * current_slice * (*NB_VOXELS) * nb_OS);

    int numero_subset, num_projection;

    for (numero_subset = 0; numero_subset < nb_OS; numero_subset++)
    {
        for (int i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            num_projection = Table_OS[numero_subset * NB_PROJECTIONS / nb_OS + i];
            kernel_normalization_voxels<<<nb_cuda_cores, 32>>>(...);
            cudaDeviceSynchronize();
        }
    }
}

  • normalization_voxels and NB_VOXELS are allocated with cudaMallocManaged;

  • nb_OS, nb_cuda_cores, num_slices_to_reconstruct, and the table Table_OS are defined on the CPU;

  • NB_PROJECTIONS is a #define in a .h file.
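
For context, here is roughly how this setup looks (the concrete values below are placeholders, not my real ones, and allocate is just an illustrative helper):

#include <cuda_runtime.h>
#include <stdlib.h>

#define NB_PROJECTIONS 360                   // really in a .h file; value assumed

int nb_OS                     = 8;           // placeholder value
int nb_cuda_cores             = 128;         // placeholder value
int num_slices_to_reconstruct = 4;           // placeholder value
int* Table_OS;                               // subset table, filled on the CPU

float* normalization_voxels;
int*   NB_VOXELS;

void allocate(int voxels_per_slice)          // illustrative helper
{
    Table_OS = (int*)malloc(sizeof(int) * NB_PROJECTIONS);

    // Managed memory is accessible from both host and device:
    cudaMallocManaged(&NB_VOXELS, sizeof(int));
    *NB_VOXELS = voxels_per_slice;

    cudaMallocManaged(&normalization_voxels,
                      sizeof(float) * num_slices_to_reconstruct * (*NB_VOXELS) * nb_OS);
}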

Even with an empty kernel kernel_normalization_voxels

__global__ void kernel_normalization_voxels(…)
{
/* … */
}

my program does not always run all the way through the code to the end.

Does anyone have an idea where the problem might come from?

Thanks a lot for your help!

You might try compiling the CUDA code with --default-stream per-thread and using
cudaStreamSynchronize(cudaStreamPerThread);

https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html

Then you don’t synchronize the entire device, but separately per thread.
I am not sure if this will fix the problem, but at least each OpenMP thread will then act more independently of the others.
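
Something along these lines, as a minimal sketch (the kernel, thread count, and launch configuration are just placeholders):

// Build with (gcc as host compiler assumed):
//   nvcc --default-stream per-thread -Xcompiler -fopenmp example.cu
#include <cuda_runtime.h>
#include <omp.h>

__global__ void dummy_kernel() { }   // stands in for your empty kernel

int main()
{
#pragma omp parallel num_threads(4)
    {
        // With --default-stream per-thread, each host thread launches into
        // its own default stream, so launches from different OpenMP threads
        // no longer serialize on a single stream.
        dummy_kernel<<<128, 32>>>();

        // Wait only for the work issued by this thread's default stream,
        // instead of cudaDeviceSynchronize(), which blocks on the whole device:
        cudaStreamSynchronize(cudaStreamPerThread);
    }
    return 0;
}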

Thank you very much for your response!
I have been working on it since Tuesday… I keep running into various issues that are not making my OpenMP/CUDA problem any easier to solve. First, I did not manage to get my .cu to run in console mode; I get this error message, which I have not been able to resolve:

error: asm operand type size(8) does not match type/size implied by constraint ‘r’

Here is what I tried: I followed your link and this one: GPU Pro Tip: CUDA 7 Streams Simplify Concurrency | NVIDIA Developer Blog. I added #define CUDA_API_PER_THREAD_DEFAULT_STREAM at the top of my program and added the option --default-stream per-thread under kernel.cu -> Properties -> CUDA C/C++ -> Command Line. Can this work?
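
If I understand the documentation correctly, the define only takes effect if it appears before any CUDA header, so the top of my kernel.cu looks like this:

// kernel.cu — the define must come before any CUDA header is included;
// the --default-stream per-thread compiler option achieves the same
// thing without modifying the source.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>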

Then I modified my code as follows:

  • creation of the streams;
  • the OpenMP loop, with four kernel calls, each issued on a specific stream.

cudaStream_t* streams = (cudaStream_t*)calloc(omp_get_max_threads(), sizeof(cudaStream_t));

for (int i = 0; i < omp_get_max_threads(); i++)
{
    cudaStreamCreate(&streams[i]);
}

#pragma omp parallel
{
    int index_thread = omp_get_thread_num();   // index of this thread's stream

#pragma omp for
    for (int current_slice = 0; current_slice < total_cuts; current_slice++)
    {
        for (1)
        {
            for (2)
            {
                for (3)
                {
                    kernel_projection<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
                    cudaDeviceSynchronize();
                }

                for (4)
                {
                    kernel_retroprojection<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
                    cudaDeviceSynchronize();
                }

                kernel_object_update<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
                cudaDeviceSynchronize();

                kernel_tab_to_redistribute<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
                cudaDeviceSynchronize();
            }
        }
        cudaStreamSynchronize(cudaStreamPerThread);
    }
}

kernel_retroprojection builds a table that kernel_object_update then uses. Afterwards, kernel_tab_to_redistribute resets the values of the table to 0.
I need those values reset to 0 on each pass through loop (3).
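
If it helps, my understanding is that kernels launched into the same stream execute in issue order, so the dependency between these three kernels should already be preserved by something like this (a sketch, with the real loops and arguments elided):

// Same stream, so the three kernels run one after the other, even
// without a cudaDeviceSynchronize() between them:
kernel_retroprojection<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
kernel_object_update<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);
kernel_tab_to_redistribute<<<nb_cuda_cores, 32, 0, streams[index_thread]>>>(...);

// Block this host thread only on its own stream:
cudaStreamSynchronize(streams[index_thread]);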

Does anyone see something obviously wrong? With the call to kernel_tab_to_redistribute, the result of my program is wrong. Without kernel_tab_to_redistribute, the result is still wrong, but a little closer to the correct one.