Hello everyone,

I am new to GPU programming and more of a theoretician than a programmer, so please excuse any obvious mistakes!

**Summary of my problem:**

I want to build a table on the GPU with a recursive function.

The table is built correctly with 1 to 4 threads. From 5 threads on, it no longer works.

**The problem in a little more detail:**

I have a simple structure

```
struct T_Objects
{
    int index;
    int type;
    int distance;
    int shift;
    int index_projection;
    int num_subset;
    int num_slice;
};
```

and a GPU kernel that calls a recursive function, `DFS_normalization_osem`, a number of times:

```
__global__ void kernel(/* a lot of parameters */)
{
    int num_subset;
    // Clear the table before accumulating into it.
    memset(normalization_voxels, 0, sizeof(float) * (*NB_VOXELS) * (*nb_OS));
    // One traversal per subset; the slice is given by the thread index.
    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = threadIdx.x;
        DFS_normalization_osem(/* a lot of parameters */);
    }
}
```

The ultimate goal of this recursive function is to build a table, namely `normalisation_voxels`.

The function is a kind of depth-first search on a graph: for each object `subset` of type `T_Objects`, I look at the type of the object (`(*subset).type`), retrieve some information, and call the function again, until `normalization_voxels` is complete.

```
__device__ void DFS_normalization_osem(T_Objects* vertex, float* matrix_relevance, int* U, int* S, float* lsf, int* os_tab, int nb_OS, float* normalization_detectors, float* attenuation, float* normalisation_voxels)
{
    int i;
    float* proba_detector;
    float* val_norm_detector;
    float attenuation_value;
    switch ((*vertex).type)
    {
    case 1:
        // Type 1: visit every relevant voxel of the current projection.
        for (i = 0; i < 168 * 168; i++)
        {
            if (matrix_relevance[i])
            {
                (*vertex).index = i;
                (*vertex).distance = U[i + (*vertex).index_projection * 168 * 168];
                (*vertex).shift = S[i + (*vertex).index_projection * 168 * 168];
                (*vertex).type = 2;
                DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
            }
        }
        break;
    case 2:
        // Type 2: accumulate the contribution of every detector into the table.
        for (i = 0; i < NB_DETECTEURS; i++)
        {
            proba_detector = &lsf[(*vertex).distance * 3 * NB_DETECTEURS + (*vertex).shift + i];
            val_norm_detector = &normalization_detectors[(*vertex).index_projection * NB_DETECTEURS + i];
            attenuation_value = attenuation[(*vertex).num_slice * NB_PROJECTIONS * NB_DETECTEURS + (*vertex).index_projection * NB_DETECTEURS + i];
            if (*proba_detector)
            {
                normalisation_voxels[(*vertex).num_slice * nb_OS * 168 * 168 + (*vertex).index + (*vertex).num_subset * 168 * 168] += ((*proba_detector) / (*val_norm_detector)) / attenuation_value;
            }
        }
        break;
    case 0:
        // Type 0: visit every projection of the current subset.
        for (i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            (*vertex).index_projection = os_tab[(*vertex).num_subset * NB_PROJECTIONS / nb_OS + i];
            (*vertex).type = 1;
            DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
        }
        break;
    }
}
```
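To make the memory layout explicit: the index used in case 2 means the table is laid out as [slice][subset][voxel]. Here is a minimal sketch of that offset computation. The helper names `voxel_offset` and `table_size` and the `HOST_DEVICE` macro are only for illustration and are not in my real code; `nb_voxels` stands for 168 * 168.

```cpp
#include <cassert>
#include <cstddef>

// Compiles as a host/device function under nvcc and as plain C++ elsewhere.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Offset of voxel `index` for a given slice and subset, following the
// [slice][subset][voxel] layout used in case 2 of DFS_normalization_osem.
HOST_DEVICE inline std::size_t voxel_offset(int num_slice, int num_subset, int index,
                                            int nb_OS, int nb_voxels)
{
    assert(num_subset >= 0 && num_subset < nb_OS);
    assert(index >= 0 && index < nb_voxels);
    return (static_cast<std::size_t>(num_slice) * nb_OS + num_subset) * nb_voxels + index;
}

// Number of floats the table needs when each thread handles one slice.
HOST_DEVICE inline std::size_t table_size(int nb_slices, int nb_OS, int nb_voxels)
{
    return static_cast<std::size_t>(nb_slices) * nb_OS * nb_voxels;
}
```

With this layout, running one slice per thread means the table must hold `nb_slices * nb_OS * 168 * 168` floats, so 5 threads write into 5 consecutive blocks of `nb_OS * 168 * 168` floats each.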

If `(*subset).type` is 0 or 1, I only collect information from the inputs of the recursive function (the tables `U`, `S`, `lsf`, …). It is when `(*subset).type` is 2 that the table `normalization_voxels` is filled in, using the information gathered before.
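To make the traversal explicit, the recursion for one slice and one subset should be equivalent to three nested loops (type 0, then type 1, then type 2). Here is a minimal CPU sketch in plain C++ that could serve as a reference to compare against the GPU result. The `Sizes` struct and the name `normalize_reference` are only for illustration, and I assume `NB_PROJECTIONS` is divisible by `nb_OS`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the sizes; in the real code these are
// 168 * 168, NB_DETECTEURS and NB_PROJECTIONS.
struct Sizes {
    int nb_voxels;
    int nb_detectors;
    int nb_projections;
};

// CPU reference for one (num_slice, num_subset) pair, without recursion.
void normalize_reference(const Sizes& sz, int num_slice, int num_subset, int nb_OS,
                         const std::vector<float>& matrix_relevance,
                         const std::vector<int>& U, const std::vector<int>& S,
                         const std::vector<float>& lsf,
                         const std::vector<int>& os_tab,
                         const std::vector<float>& normalization_detectors,
                         const std::vector<float>& attenuation,
                         std::vector<float>& normalisation_voxels)
{
    // Assumption: the projections split evenly into the subsets.
    assert(nb_OS > 0 && sz.nb_projections % nb_OS == 0);
    const int per_subset = sz.nb_projections / nb_OS;

    // Type 0: every projection of the subset.
    for (int p = 0; p < per_subset; ++p) {
        const int proj = os_tab[num_subset * per_subset + p];
        // Type 1: every relevant voxel.
        for (int v = 0; v < sz.nb_voxels; ++v) {
            if (matrix_relevance[v] == 0.0f) continue;
            const int distance = U[v + proj * sz.nb_voxels];
            const int shift = S[v + proj * sz.nb_voxels];
            // Type 2: every detector with a non-zero probability.
            for (int d = 0; d < sz.nb_detectors; ++d) {
                const float proba = lsf[distance * 3 * sz.nb_detectors + shift + d];
                if (proba == 0.0f) continue;
                const float norm = normalization_detectors[proj * sz.nb_detectors + d];
                const float att = attenuation[num_slice * sz.nb_projections * sz.nb_detectors
                                              + proj * sz.nb_detectors + d];
                normalisation_voxels[num_slice * nb_OS * sz.nb_voxels
                                     + num_subset * sz.nb_voxels + v] += (proba / norm) / att;
            }
        }
    }
}
```

Calling it once per slice and per subset on the host gives a table that the GPU output can be checked against, element by element.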

In the main function, I first launched my `__global__` function on a single thread. It runs very slowly (memory access problems?), but the table `normalisation_voxels` is built properly. With 2, 3 or 4 threads it also works, but from 5 threads on, in the `__global__` function, the loop

```
for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
{
    (*subset).num_subset = num_subset;
    (*subset).type = 0;
    (*subset).num_slice = thread_slice;
    DFS_normalization_osem(a lot of parameters);
}
```

is interrupted before going through all the `num_subset` values.

Would anyone have any idea where this problem might come from? I really don’t know much about this; could my GPU memory be overloaded?

Thank you very much in advance for your help!