Simple array construction with CUDA runs even slower than the non-parallel CPU version

Hello everyone,

I’m new to GPU programming and I don’t understand why my simple code below runs so slowly (~20 min!).
If anyone has an idea… I would be happy to understand a little more about what’s behind all this.

In my main I have a simple loop in which I call a global function on 1 block of 32 threads:

for (indice_voxel = 0; indice_voxel < (*NB_VOXELS); indice_voxel++)
{
    if (Objet[indice_voxel])
    {
        // One tiny launch per non-empty voxel: 1 block of 32 threads
        kernel<<<1,32>>>(NB_VOXELS, Objet, nombre_projections_par_OS, Table_OS, U, S,
                         Table_LSF, normalisation_detecteurs, numero_subset,
                         projected_solution, indice_voxel);

        cudaDeviceSynchronize();
    }
}

Every parameter of the kernel has been allocated the same way, with cudaMallocManaged.
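For reference, a minimal sketch of what one of these allocations looks like (the size and error handling here are illustrative, not my real code):

int nb_voxels = 28224;   // placeholder size
float* Objet = nullptr;

// Managed memory is visible to both host and device,
// so the same pointer can be passed straight to the kernel.
cudaError_t err = cudaMallocManaged(&Objet, nb_voxels * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));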

The goal of my global function is to build an array, projected_solution. To do this, the information needed is retrieved from the different parameters given as input to the function.

My function works fine (the array is built properly and I have used CUDA error checking) but incredibly slowly. This is what I would like to understand.

__global__ void kernel(int* NB_VOXELS, float* Objet, int* nombre_projections_par_OS,
    int* Table_OS, int* U, int* S, float* Table_LSF, float* normalisation_detecteurs,
    int numero_subset, float* projected_solution, int indice_voxel)
{
    int distance, shift, indice_detecteur, numero_de_la_projection;
    float* pointeur_A, * pointeur_D;

    int indexThreadInGrid = blockIdx.x * blockDim.x + threadIdx.x;

    if (indexThreadInGrid < (*nombre_projections_par_OS))
    {
        numero_de_la_projection = Table_OS[numero_subset * (*nombre_projections_par_OS) + indexThreadInGrid];
        distance = U[indice_voxel + numero_de_la_projection * (*NB_VOXELS)];
        shift = S[indice_voxel + numero_de_la_projection * (*NB_VOXELS)];

        pointeur_A = &Table_LSF[distance * 3 * NB_DETECTEURS + shift];
        pointeur_D = &normalisation_detecteurs[numero_de_la_projection * NB_DETECTEURS];

        for (indice_detecteur = 0;indice_detecteur < NB_DETECTEURS;indice_detecteur++)
        {
            projected_solution[indexThreadInGrid * NB_DETECTEURS + indice_detecteur] += ((*pointeur_A++) / (*pointeur_D++)) * Objet[indice_voxel];
        }
    }
}

The size of all my parameters together (NB_VOXELS, Objet, nombre_projections_par_OS, Table_OS…) doesn’t exceed 76.5 MB.

Would anyone have an idea of what I have done wrong and why my code runs so slowly? It runs even slower than the non-parallel CPU version…

Thank you!

That is a bad idea. When doing CUDA programming, you want lots of threads for each kernel launch. Doing the work with 32 threads at a time, in a loop that is launching kernels, will be very disappointing from a performance standpoint.
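As a rough sketch (reusing your identifiers; the details are illustrative and not tested against your code), the loop over voxels can move into the launch itself, with one thread per voxel, so indice_voxel becomes the thread index instead of a kernel parameter:

// One launch covering all voxels instead of one launch per voxel.
int threadsPerBlock = 256;
int blocks = ((*NB_VOXELS) + threadsPerBlock - 1) / threadsPerBlock;  // round up

kernel<<<blocks, threadsPerBlock>>>(NB_VOXELS, Objet, nombre_projections_par_OS,
                                    Table_OS, U, S, Table_LSF,
                                    normalisation_detecteurs, numero_subset,
                                    projected_solution);
cudaDeviceSynchronize();

Inside the kernel, each thread then computes its own voxel index and exits early for empty voxels:

int indice_voxel = blockIdx.x * blockDim.x + threadIdx.x;
if (indice_voxel >= (*NB_VOXELS) || !Objet[indice_voxel]) return;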

All right, thank you very much for your response.
In this case, I can try to parallelize over either NB_DETECTEURS = 336 or, even better, over NB_VOXELS = 28224.
It will be a little more complicated for NB_VOXELS because I will have to manage the sharing of data between the different threads.

for all voxels
    for all projections
        for all detectors
            Table += (...)

The += on Table applies to the same projection and the same detector for all voxels. So if I parallelize over the voxels, I need to share Table between all threads. I think this means I need to store 28224 copies of Table and then only perform the final addition.
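Alternatively, a minimal sketch of the other option I see (names and sizes here are hypothetical): each thread could accumulate directly into a single shared Table with atomicAdd, instead of keeping one private copy per voxel:

// Hypothetical sketch: one thread per voxel, all threads accumulating
// into the same Table. atomicAdd serializes conflicting updates to the
// same entry, so the += stays correct across threads.
__global__ void accumulate(float* Table, const float* contribution,
                           int nb_voxels, int table_size)
{
    int voxel = blockIdx.x * blockDim.x + threadIdx.x;
    if (voxel >= nb_voxels) return;

    for (int k = 0; k < table_size; k++)
        atomicAdd(&Table[k], contribution[voxel * table_size + k]);
}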

Dear Robert Crovella,

I changed my code and tried to parallelize with 28224 threads.

My code works as follows:

for all slices
      for all iterations
            for all subsets
                  call to kernel function

In the main I call the GPU function on 882 blocks of 32 threads each:

            kernel_projection<<<882,32>>>(NB_VOXELS, Objet, (*nombre_projections_par_OS),
                Table_OS, U, S, Table_LSF, normalisation_detecteurs,
                (*device_subset), projected_solution);
            cudaDeviceSynchronize();

At the end I want the function kernel_projection to build an array projected_solution of size 16 * 336 * sizeof(float). I need to share this array between all threads.

To do that, I construct a big cache array in shared memory that will contain the small projected_solution arrays of all the threads. The size of my big cache is therefore number_of_threads_per_block * (16 * 336 * sizeof(float)), i.e. 32 * 16 * 336 * 4 = 688,128 bytes.

Each thread retrieves the information it needs to construct its own projected_solution array; we wait for all threads of the block to finish, and at that point our temporary cache has been filled. Now we can sum the values in it.

This step is done by reduction: each thread adds two of the values in cache and stores the result back to cache. Since each thread combines two entries into one, we complete the step with half as many entries as we started with. In the next step, we do the same thing on the remaining half, and we continue in this fashion until we have the sum of every entry in cache.
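In isolation, the pattern I am trying to follow looks like this (a minimal sketch for one value per thread, assuming a block of 32 threads; my real kernel is below):

// Minimal tree reduction: sums blockDim.x per-thread values in shared memory.
__global__ void reduce_block(const float* in, float* out)
{
    __shared__ float cache[32];
    int tid = threadIdx.x;

    cache[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Halve the number of active threads at each step.
    for (int i = blockDim.x / 2; i != 0; i /= 2)
    {
        if (tid < i)
            cache[tid] += cache[tid + i];
        __syncthreads();   // executed by all threads, not only the adders
    }

    if (tid == 0)
        out[blockIdx.x] = cache[0];   // one partial sum per block
}

My actual kernel is: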

__global__ void kernel_projection(int* NB_VOXELS, float* Objet, int nombre_projections_par_OS,
    int* Table_OS, int* U, int* S, float* Table_LSF, float* normalisation_detecteurs,
    int numero_subset, float* projected_solution)
{
    int indice_projection, distance, shift, indice_detecteur, numero_de_la_projection;
    float* pointeur_A, * pointeur_D;

    int indexThreadInGrid = blockIdx.x * blockDim.x + threadIdx.x;
    int cacheIndex = threadIdx.x;

    __shared__ float cache[32*16*336*sizeof(float)];

    if (indexThreadInGrid < (*NB_VOXELS) && Objet[indexThreadInGrid])
    {
        for (indice_projection = 0;indice_projection < nombre_projections_par_OS;indice_projection++)
        {
            numero_de_la_projection = Table_OS[numero_subset * nombre_projections_par_OS + indice_projection];
            distance = U[indexThreadInGrid + numero_de_la_projection * (*NB_VOXELS)];
            shift = S[indexThreadInGrid + numero_de_la_projection * (*NB_VOXELS)];

            pointeur_A = &Table_LSF[distance * 3 * NB_DETECTEURS + shift];
            pointeur_D = &normalisation_detecteurs[numero_de_la_projection * NB_DETECTEURS];
            for (indice_detecteur = 0;indice_detecteur < NB_DETECTEURS;indice_detecteur++)
            {
                cache[cacheIndex * NB_PROJECTIONS * NB_DETECTEURS + indice_projection * NB_DETECTEURS + indice_detecteur] = ((*pointeur_A++) / (*pointeur_D++)) * Objet[indexThreadInGrid];
            }
        }
        __syncthreads();

        int i = blockDim.x / 2;

        while (i != 0)
        {
            if (cacheIndex < i)
                cache[cacheIndex * NB_PROJECTIONS * NB_DETECTEURS + indice_projection * NB_DETECTEURS + indice_detecteur] += cache[cacheIndex * NB_PROJECTIONS * NB_DETECTEURS + indice_projection * NB_DETECTEURS + indice_detecteur + i];
            __syncthreads();
            i /= 2;
        }
    }
}

So here I should be left with only one projected_solution array of size 16 * 336 * sizeof(float) per block.
I get the following error:

Entry function '_Z17kernel_projectionPiPfiS_S_S_S0_S0_iS0_' uses too much shared data (0x2a0000 bytes, 0xc000 max)

Does this mean that the shared memory of my GPU can’t handle that much data, or that I have done something wrong somewhere?
I also find it surprising that 0xc000 bytes = 48 KB is the max; it seems very little…

I am trying to identify where exactly the problem is: am I lost in the CUDA world, or is it a limitation of my machine’s configuration?

Yes. You are limited to about 48KB of shared memory per block. On some GPUs, in some cases, you may be able to opt in to up to about 96KB of shared memory. But this:

__shared__ float cache[32*16*336*sizeof(float)];

is asking for 32 * 16 * 336 * sizeof(float) = 688,128 floats, i.e. the 0x2a0000 = 2,752,512 bytes in your compiler message: the sizeof(float) factor belongs in a byte count, not in the element count of a float array. Even the intended 688,128 bytes would be far over the limit. That’s not possible in shared memory.
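You can query the actual limits on your GPU; a short sketch (device 0 assumed; sharedMemPerBlockOptin requires a reasonably recent CUDA toolkit):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Static per-block limit (the 0xc000 in your compiler message)
    printf("sharedMemPerBlock:      %zu bytes\n", prop.sharedMemPerBlock);

    // Larger opt-in limit available on some newer GPUs
    printf("sharedMemPerBlockOptin: %zu bytes\n", prop.sharedMemPerBlockOptin);
    return 0;
}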

OK. Thank you very much for your help!