Building a table with a recursive function - very slow for 1 to 4 threads and does not work with >4 threads

Hello everyone,

I am new to GPU programming and more of a theoretician than a programmer, so please excuse any obvious mistakes!

Summary of my problem:

I want to build a table on the GPU through a recursive function.

The table is built correctly with 1 to 4 threads. From 5 threads on, it no longer works.

The problem in a little more detail:

I have a simple structure

struct T_Objects
{
int index;
int type;
int distance;
int shift;
int index_projection;
int num_subset;
int num_slice;
};

and a GPU function that calls a recursive function, DFS_normalization_osem, a number of times:

__global__ void kernel(a lot of parameters)
{
    int num_subset;

    memset(normalization_voxels, 0, sizeof(float) * (*NB_VOXELS) * (*nb_OS));

    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = threadIdx.x;
        DFS_normalization_osem(a lot of parameters);
    }
}

The ultimate goal of this recursive function is to build a table, namely normalisation_voxels.
The function is a kind of depth-first search over a graph: for each object subset of type T_Objects, I look at the object's type ((*subset).type), retrieve some information, and call the function again, until normalization_voxels is complete.

__device__ void DFS_normalization_osem(T_Objects* vertex, float* matrix_relevance, int* U, int* S, float* lsf, int* os_tab, int nb_OS, float* normalization_detectors, float* attenuation, float* normalisation_voxels)
{
    int i;
    float* proba_detector;
    float* val_norm_detector;
    float attenuation_value;
 
    switch ((*vertex).type)
    {
    case 1:
        for (i = 0; i < 168 * 168; i++)
        {
            if (matrix_relevance[i])
            {
                (*vertex).index = i;
                (*vertex).distance = U[i + (*vertex).index_projection * 168 * 168];
                (*vertex).shift = S[i + (*vertex).index_projection * 168 * 168];
                (*vertex).type = 2;
                DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
            }
        }
        break;
    case 2:
        for (i = 0; i < NB_DETECTEURS; i++)
        {
            proba_detector = &lsf[(*vertex).distance * 3 * NB_DETECTEURS + (*vertex).shift + i];
            val_norm_detector = &normalization_detectors[(*vertex).index_projection * NB_DETECTEURS + i];
            attenuation_value = attenuation[(*vertex).num_slice * NB_PROJECTIONS * NB_DETECTEURS + (*vertex).index_projection * NB_DETECTEURS + i];
            if ((*proba_detector))
            {
                normalisation_voxels[(*vertex).num_slice * nb_OS * 168 * 168 + (*vertex).index + (*vertex).num_subset * 168 * 168] += ((*proba_detector) / (*val_norm_detector)) / attenuation_value;
            }
        }
        break;
    case 0:
        for (i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            (*vertex).index_projection = os_tab[(*vertex).num_subset * NB_PROJECTIONS / nb_OS + i];
            (*vertex).type = 1;
            DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
        }
        break;
    }
}

If ((*subset).type) is 0 or 1, I only gather information from the different arrays given as input to the recursive function (U, S, lsf, …). It is when ((*subset).type) is 2 that the table normalization_voxels is actually built, using the previously gathered information.

In the main function, I first called my __global__ function on a single thread. It runs very slowly (memory access problems?), but the table normalisation_voxels is built properly. With 2, 3 or 4 threads it also works, but from 5 threads on, the loop in __global__

    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = thread_slice;
        DFS_normalization_osem(a lot of parameters);
    }

is interrupted before going through all values of num_subset.

Would anyone have an idea where this problem might come from? I really don't know much about this - could my GPU memory be overloaded?

Thank you a lot in advance for your help!

Typical suggestions here would be to make sure you are using proper CUDA error checking, and also to run your code with cuda-memcheck. The output may provide useful debug information.
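A minimal sketch of such an error-checking macro (CUDA_CHECK is an illustrative name, not part of the CUDA API) could look like:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print file/line and abort on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Typical usage around a kernel launch:
//   kernel<<<1, nb_threads>>>(/* a lot of parameters */);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution
```

Checking cudaGetLastError() right after the launch and cudaDeviceSynchronize() afterwards is what turns a silently incomplete kernel into a visible error message.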

Thank you very much for your suggestions.
I want to follow your advice, but I ran into other problems along the way.
First of all I had to solve a PATH problem with cl.exe; I think that is okay now.
Now, before even trying cuda-memcheck, I get errors just from

nvcc kernel.cu

which gives me

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\vcruntime.h(197): error: invalid redeclaration of type name "size_t"
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\vcruntime_new_debug.h(34): error: first parameter of allocation function must be of type "size_t"
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\type_traits(103): error: class template "std::_Is_function" has already been defined
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\type_traits(138): error: class template "std::_Is_memfunptr" has already been defined

and so on until

Error limit reached.
100 errors detected in the compilation of "kernel.cu".
Compilation terminated.

I think it is important to point out that I get these errors with a simple Hello World CUDA project as well.
Maybe I should have opened a new topic for this issue.

Do you have new suggestions to help me with this?

All this is very new for me, so thank you for your patience and your help!

I’m puzzled. Previously you stated:

" If I run on 2, 3 or 4 threads it also works "

Now you're unable to even build the code? My suggestion: go back to whatever method you used to build the code when you were able to state "If I run on 2, 3 or 4 threads it also works".

In Visual Studio, the good old-fashioned way of building the code - clicking the green "Local Windows Debugger" button - works for me…
Sorry, as I said, I am more of a theoretician than a programmer, and I have to learn new things in a short time.
Anyway, I think I now know what is going on: my computer has two GPUs that each use shared memory, so I think my RAM might be overloaded.
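If memory exhaustion is the suspicion, one quick check is to print free vs. total device memory before and after your allocations, using the standard cudaMemGetInfo runtime call (a minimal standalone sketch):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query free vs. total memory on the current device to see whether
// allocations are actually exhausting the GPU.
int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("GPU memory: %zu MiB free of %zu MiB total\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Calling this before and after cudaMalloc for the large arrays (normalisation_voxels is num_slices * nb_OS * 168 * 168 floats) would confirm or rule out the overload hypothesis.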

Great.

Step 1: Add proper CUDA error checking, then recompile, run, and see if you get any clues. Not sure what proper CUDA error checking is? Google "proper CUDA error checking", take the first hit, study it, and apply it to your code.

Step 2: If the clues from step 1 don't help much, locate the name of the built executable (a program ending in .exe). The name will be listed in the VS console output when you build the code (for example, Rebuild Project from the menus). Once you have found that executable's name and location on your machine, open a Windows command prompt, use cd to change to that directory, and run the executable from the command prompt preceded by cuda-memcheck, like cuda-memcheck my.exe
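Concretely, the commands for step 2 might look like this (the path here is illustrative; use the output directory and executable name from your own VS console output):

```shell
# Change to the build output directory listed by Visual Studio,
# then run the executable under cuda-memcheck.
cd "C:\Users\you\source\repos\MyProject\x64\Debug"
cuda-memcheck my.exe
```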

If you get errors from cuda-memcheck in step 2, you can usually localize them to a single line of code using the methodology described here.