Building a table with a recursive function - very slow for 1 to 4 threads and does not work with >4 threads

Hello everyone,

I am new to GPU programming and I am more of a theoretician than a programmer, so please excuse any obvious mistakes!

Summary of my problem:

I want to build a table on the GPU using a recursive function.

The table is built correctly with 1 to 4 threads. With 5 or more threads it no longer works.

The problem in a little more detail:

I have a simple structure

struct T_Objects
{
int index;
int type;
int distance;
int shift;
int index_projection;
int num_subset;
int num_slice;
};

and a __global__ function that calls a recursive __device__ function, DFS_normalization_osem, a number of times:

__global__ void kernel(a lot of parameters)
{
    int num_subset;

    memset(normalization_voxels, 0, sizeof(float) * (*NB_VOXELS) * (*nb_OS));

    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = threadIdx.x;
        DFS_normalization_osem(a lot of parameters);
    }
}

The ultimate goal of this recursive function is to build a table, namely normalisation_voxels.
The function is a kind of depth-first search on a graph: for each object subset of type T_Objects, I look at the object's type ((*subset).type), retrieve some information, and call the function again, until normalisation_voxels is complete.

__device__ void DFS_normalization_osem(T_Objects* vertex, float* matrix_relevance, int* U, int* S, float* lsf, int* os_tab, int nb_OS, float* normalization_detectors, float* attenuation, float* normalisation_voxels)
{
    int i;
    float* proba_detector;
    float* val_norm_detector;
    float attenuation_value;
 
    switch ((*vertex).type)
    {
    case 1:
        for (i = 0; i < 168 * 168; i++)
        {
            if (matrix_relevance[i])
            {
                (*vertex).index = i;
                (*vertex).distance = U[i + (*vertex).index_projection * 168 * 168];
                (*vertex).shift = S[i + (*vertex).index_projection * 168 * 168];
                (*vertex).type = 2;
                DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
            }
        }
        break;
    case 2:
        for (i = 0; i < NB_DETECTEURS; i++)
        {
            proba_detector = &lsf[(*vertex).distance * 3 * NB_DETECTEURS + (*vertex).shift + i];
            val_norm_detector = &normalization_detectors[(*vertex).index_projection * NB_DETECTEURS + i];
            attenuation_value = attenuation[((*vertex).num_slice)* NB_PROJECTIONS * NB_DETECTEURS + (*vertex).index_projection * NB_DETECTEURS+ i];
            if ((*proba_detector))
            {
                normalisation_voxels[(*vertex).num_slice * nb_OS * 168 * 168 +(*vertex).index + (*vertex).num_subset * 168 * 168 ] += ((*proba_detector) / (*val_norm_detector)) / (attenuation_value);
            }
        }
        break;
    case 0:
        for (i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            (*vertex).index_projection = os_tab[(*vertex).num_subset * NB_PROJECTIONS / nb_OS + i];
            (*vertex).type = 1;
            DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
        }
        break;
    }
}

If ((*subset).type) is 0 or 1, I only gather information from the various inputs of the recursive function (the tables U, S, lsf, …). It is when ((*subset).type) is 2 that the table normalisation_voxels is built from the previously gathered information.
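To make the control flow concrete: for one subset, the recursion unrolls into the equivalent of three nested loops. This is a sketch for illustration only (loop bodies abbreviated), not a drop-in replacement:

// Sketch: equivalent control flow of DFS_normalization_osem for one subset
// (type 0 -> type 1 -> type 2); bodies abbreviated.
for (int p = 0; p < NB_PROJECTIONS / nb_OS; p++)     // type 0: projections of the subset, via os_tab
    for (int i = 0; i < 168 * 168; i++)              // type 1: voxels with matrix_relevance[i] != 0
        for (int d = 0; d < NB_DETECTEURS; d++)      // type 2: accumulate into normalisation_voxels
        { /* normalisation_voxels[...] += (proba / val_norm) / attenuation; */ }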

In the main function, I first launched my __global__ function with a single thread. It runs very slowly (memory access problems?) but the table normalisation_voxels is built properly. With 2, 3 or 4 threads it also works, but from 5 threads on, inside the __global__ function, the loop

    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = thread_slice;
        DFS_normalization_osem(a lot of parameters);
    }

is interrupted before going through all the num_subset values.

Would anyone have an idea where this problem might come from? I really don’t know much about this; could my GPU memory be overloaded?

Thank you a lot in advance for your help!

Typical suggestions here might be to make sure you are using proper CUDA error checking, and also to run your code with cuda-memcheck. The output may provide useful debug information.
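For reference, the widely used error-checking pattern looks roughly like this (a minimal sketch; wrap every CUDA runtime call with the macro):

#include <cstdio>
#include <cstdlib>

// Minimal "proper CUDA error checking" macro: wrap every CUDA runtime call.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// After a kernel launch:
// kernel<<<grid, block>>>(...);
// gpuErrchk(cudaPeekAtLastError());   // catches launch configuration errors
// gpuErrchk(cudaDeviceSynchronize()); // catches errors raised during execution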

Thank you very much for your suggestions.
I want to follow your advice but I encounter other problems along the way.
First of all I had to solve a PATH problem with the cl.exe file. I think this is okay now.
Now, before even trying cuda-memcheck, I run into a problem with just

nvcc kernel.cu

which gives me

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\vcruntime.h(197): error: invalid redeclaration of type name "size_t"
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\vcruntime_new_debug.h(34): error: first parameter of allocation function must be of type "size_t"
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\type_traits(103): error: class template "std::_Is_function" has already been defined
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\type_traits(138): error: class template "std::_Is_memfunptr" has already been defined

and so on until

Error limit reached.
100 errors detected in the compilation of "kernel.cu".
Compilation terminated.

I think it is important to point out that I get these errors with a simple Hello World CUDA project too.
Maybe I should have opened a new topic for this issue.

Do you have new suggestions to help me with this?

All this is very new for me, so thank you for your patience and your help!

I’m puzzled. Previously you stated:

" If I run on 2, 3 or 4 threads it also works "

Now you’re unable to even build the code? My suggestion: Go back to using whatever method you used to build the code that allowed you to make the statement " If I run on 2, 3 or 4 threads it also works "

In Visual Studio, the good old-fashioned way of building the code - clicking the green “Local Windows Debugger” button - works for me…
Sorry, as I said, I am more of a theoretician than a programmer, and I have to discover new things in a short time.
Anyway, I think I now know what is going on: my computer has two GPUs that each use shared memory, and I think my RAM might be overloaded.

Great.

Step 1: Add proper CUDA error checking. Then recompile, run, and see if you get any clues. Not sure what proper CUDA error checking is? Google “proper CUDA error checking”, take the first hit, study it, and apply it to your code.

Step 2: If the clues from step 1 don’t help much, locate the name of the built executable. This is an executable program ending in .exe; the name will be listed in the VS console output when you build the code (for example, Rebuild Project from the menus). Once you have located that executable file name and its location on your machine, open a Windows command prompt, use cd to change to that directory, and run the executable from the command prompt preceded by cuda-memcheck, like cuda-memcheck my.exe

If you get errors from cuda-memcheck in step 2, you can usually localize those errors to a single line of code using the methodology described here.
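In short, that methodology amounts to compiling with device line-number information so cuda-memcheck can attribute each error to a source line, roughly:

nvcc -lineinfo -o my.exe kernel.cu
cuda-memcheck my.exe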

Dear Robert Crovella,

I have taken your advice from this post and from the other one I published recently into account as much as possible: I think I now use proper CUDA error checking; I have changed the data over which I parallelize, using 336 threads; and I parallelize in such a way that I don’t need to share data between threads (because there is not enough shared memory for that).

My new code works as it should but is very slow; I would like to understand why.

In my main, I make a call to a recursive function:

for all slices

    for all iterations

        for all subsets

            call the recursive function

The GPU __global__ function is called from within the recursive function; the idea behind it is the following:

case 1:
    recover some data
    case becomes case 2
    call the recursive function

case 2:
    recover some data
    case becomes case 3
    call the recursive function

case 3:
    recover some data
    **call the kernel**

In case 3, the kernel is launched with 11 blocks of 32 threads per block (352 threads in total).

This parallel code works fine but takes much longer than the non-parallel version, in which case 3 did not call a __global__ function but instead ran a loop from 0 to 335.

I’d like to understand why I’m losing so much time… Does anyone have an idea?

A little more formally, here is what my code looks like.

I have a very simple structure

typedef struct
{
    int index;
    int type;
    int distance;
    int shift;
    int index_projection;
    int index_projection_OS;
    float coord_x;
    float coord_y;
    int num_subset;
}T_Objects;

In main I call the recursive function, named DFS_projection, passing as arguments an object (named subset) of type T_Objects and some other data, most of which are tables.

for all slices
     
     for all iterations
     
          for all id_subsets

              T_Objects* subset = Malloc(T_Objects);
              (*subset).num_subset = id_subset;
              (*subset).type = 1;
              DFS_projection(subset, Objet, projected_solution, U, S, Table_LSF, normalisation_detecteurs, Table_OS, nb_OS);
              free(subset);

Here is the recursive function DFS_projection.
Depending on the value of (*subset).type, the function retrieves some information, makes recursive calls, and finally launches the kernel.

void DFS_projection(T_Objects* vertex, float* matrix_relevance, float* sino_in_construction, int* U, int* S, float* lsf, float* normalization_detectors, int* os_tab, int nb_OS)
{
    int i;
    float* proba_detector;
    float* val_norm_detector;
    cudaError_t err;

    switch ((*vertex).type)
    {
    case 3:
        kernel<<<11, 32>>>(sino_in_construction, lsf, normalization_detectors, (*vertex).distance, (*vertex).shift, (*vertex).index_projection, (*vertex).index_projection_OS, matrix_relevance[(*vertex).index]);
        err = cudaGetLastError();
        if (err != cudaSuccess)
        {
            printf("Error: %s\n", cudaGetErrorString(err));
        }
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
        break;
    case 2:
        for (i = 0; i < 168 * 168; i++)
        {
            if (matrix_relevance[i])
            {
                (*vertex).index = i;
                (*vertex).type = 3;
                (*vertex).distance = U[i + (*vertex).index_projection * 168 * 168];
                (*vertex).shift = S[i + (*vertex).index_projection * 168 * 168];
                DFS_projection(vertex, matrix_relevance, sino_in_construction, U, S, lsf, normalization_detectors, os_tab, nb_OS);
            }
        }
        break;
    case 1:
        for (i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            (*vertex).index_projection = os_tab[(*vertex).num_subset * NB_PROJECTIONS / nb_OS + i];
            (*vertex).index_projection_OS = i;
            (*vertex).type = 2;
            DFS_projection(vertex, matrix_relevance, sino_in_construction, U, S, lsf, normalization_detectors, os_tab, nb_OS);
        }
        break;
    }
}

Finally, my __global__ function builds a table, sino_in_construction, from all the retrieved information:

__global__ void kernel(float* sino_in_construction, float* lsf, float* normalization_detectors, int vertex_distance, int vertex_shift, int vertex_index_projection, int vertex_index_projection_OS, float relevance)
{       
    float* proba_detector;
    float* val_norm_detector;
    int indexThreadInGrid = blockIdx.x * blockDim.x + threadIdx.x;

    if (indexThreadInGrid < NB_DETECTEURS)
    {
        proba_detector = &lsf[vertex_distance * 3 * NB_DETECTEURS + vertex_shift + indexThreadInGrid];
        val_norm_detector = &normalization_detectors[vertex_index_projection * NB_DETECTEURS + indexThreadInGrid];
        sino_in_construction[vertex_index_projection_OS * NB_DETECTEURS + indexThreadInGrid] += (*proba_detector) / (*val_norm_detector) * relevance;
    }
}

I know it’s a long post, sorry about that… I tried to explain my problem as clearly as possible.

I learn a lot here on the NVIDIA developer forums, so thank you very much!

Regards,
Sarah

Great. That is usually the point where you may want to consider using a profiler.

Studying the code you have posted, I can make the following observations:

  1. I wonder if you have transferred enough “work” to the GPU to make interesting use of it. Your GPU kernel seems quite simple to me (let’s say, arithmetically); there is almost no work being done per launch.

  2. Except for the very smallest of GPUs (e.g. Quadro K2000, GeForce GT640), a kernel launch of <<<11,32>>> is generally too small to make interesting, efficient use of the GPU. Most GPUs will be in a severely latency-bound situation with only 352 threads, considering that a single GPU SM can hold either 1024 or 2048 threads. Any larger GPU will be mostly idle with this kind of launch configuration (see the sketch after this list for one way to batch more work into a single, larger launch).

  3. It looks like you are doing indirect access in your kernel. That may well be unavoidable, but it can lead to performance issues on the GPU. GPUs work best when adjacent threads are accessing adjacent data in memory.
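To illustrate items 1 and 2, here is a sketch of the kind of restructuring that gives a single launch enough work to fill the machine. It is hypothetical: it assumes the per-voxel kernel bodies are independent, and names such as batched_kernel and num_voxels are illustrative, not from your code.

__global__ void batched_kernel(float* out, const float* in, int num_voxels, int num_detectors)
{
    // One launch covers every (voxel, detector) pair, instead of one
    // <<<11,32>>> launch per voxel.
    int total = num_voxels * num_detectors;

    // Grid-stride loop: correct for any grid size and any problem size.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < total;
         idx += blockDim.x * gridDim.x)
    {
        int voxel    = idx / num_detectors;
        int detector = idx % num_detectors;
        // ... per-(voxel, detector) work goes here ...
        out[idx] = in[voxel * num_detectors + detector];
    }
}

// Launched once, with enough blocks to occupy the GPU, e.g.:
// batched_kernel<<<256, 256>>>(d_out, d_in, num_voxels, NB_DETECTEURS);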

I would expect that if you profiled your code with Nsight Systems, you could make an initial estimate of whether item 1 above is correct, just by looking at the timeline. If you see an execution timeline where the kernel launches take up approximately zero percent of the timeline, then you know that asking why the GPU is slow doesn’t make sense. You would need to move more work to the GPU.

I would expect that if you profiled your kernel with Nsight Compute, you could make an initial estimate of whether items 2 and 3 above are correct, by looking at the SOL section and the bottleneck rule in that section (for item 2), and at some metrics that correspond to coalesced or efficient access (for item 3).
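From a command prompt, the command-line front ends can be invoked roughly like this (assuming the profilers are installed and on your PATH; my.exe stands for your built executable):

nsys profile my.exe
ncu my.exe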

I won’t be able to give tutorials on using the profilers here, but there are resources available on the web already for these.