Hello everyone,
I am new to GPU programming and I am more of a theoretician than a programmer, so please excuse any obvious mistakes!
Summary of my problem:
I want to build a table on the GPU with a recursive function.
The table is built correctly with 1 to 4 threads. From 5 threads on, it no longer works.
The problem in a little more detail:
I have a simple structure:
struct T_Objects
{
    int index;
    int type;
    int distance;
    int shift;
    int index_projection;
    int num_subset;
    int num_slice;
};
and a kernel that calls a recursive function, DFS_normalization_osem, a number of times:
__global__ void kernel(a lot of parameters)
{
    int num_subset;
    memset(normalization_voxels, 0, sizeof(float) * (*NB_VOXELS) * (*nb_OS));
    for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
    {
        (*subset).num_subset = num_subset;
        (*subset).type = 0;
        (*subset).num_slice = threadIdx.x;
        DFS_normalization_osem(a lot of parameters);
    }
}
The ultimate goal of this recursive function is to build a table, namely normalisation_voxels.
The function is a kind of depth-first search on a graph: for each object subset of type T_Objects, I look at the type of the object ((*subset).type), retrieve some information, and call the function again, and so on until normalization_voxels is complete.
__device__ void DFS_normalization_osem(T_Objects* vertex, float* matrix_relevance, int* U, int* S, float* lsf, int* os_tab, int nb_OS, float* normalization_detectors, float* attenuation, float* normalisation_voxels)
{
    int i;
    float* proba_detector;
    float* val_norm_detector;
    float attenuation_value;
    switch ((*vertex).type)
    {
    case 1:
        // Type 1: a projection. Visit every relevant voxel and look up its distance and shift.
        for (i = 0; i < 168 * 168; i++)
        {
            if (matrix_relevance[i])
            {
                (*vertex).index = i;
                (*vertex).distance = U[i + (*vertex).index_projection * 168 * 168];
                (*vertex).shift = S[i + (*vertex).index_projection * 168 * 168];
                (*vertex).type = 2;
                DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
            }
        }
        break;
    case 2:
        // Type 2: a voxel. Accumulate the contribution of every detector into the table.
        for (i = 0; i < NB_DETECTEURS; i++)
        {
            proba_detector = &lsf[(*vertex).distance * 3 * NB_DETECTEURS + (*vertex).shift + i];
            val_norm_detector = &normalization_detectors[(*vertex).index_projection * NB_DETECTEURS + i];
            attenuation_value = attenuation[((*vertex).num_slice) * NB_PROJECTIONS * NB_DETECTEURS + (*vertex).index_projection * NB_DETECTEURS + i];
            if ((*proba_detector))
            {
                normalisation_voxels[(*vertex).num_slice * nb_OS * 168 * 168 + (*vertex).index + (*vertex).num_subset * 168 * 168] += ((*proba_detector) / (*val_norm_detector)) / (attenuation_value);
            }
        }
        break;
    case 0:
        // Type 0: a subset. Visit every projection that belongs to it.
        for (i = 0; i < NB_PROJECTIONS / nb_OS; i++)
        {
            (*vertex).index_projection = os_tab[(*vertex).num_subset * NB_PROJECTIONS / nb_OS + i];
            (*vertex).type = 1;
            DFS_normalization_osem(vertex, matrix_relevance, U, S, lsf, os_tab, nb_OS, normalization_detectors, attenuation, normalisation_voxels);
        }
        break;
    }
}
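To make the traversal easier to follow, here is a minimal host-side sketch of what the recursion appears to compute for one (slice, subset) pair, written as three nested loops in plain C++ (which also compiles as CUDA host code). It is not the code from my project: the Dims struct, the function name normalize_subset and the use of std::vector are hypothetical names introduced only for illustration, and the sketch assumes that the type 0 -> 1 -> 2 chain is the only path through the switch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the sizes. The post hard-codes 168 * 168 voxels
// per slice and uses NB_DETECTEURS and NB_PROJECTIONS as constants.
struct Dims {
    int nb_voxels;       // 168 * 168 in the post
    int nb_detectors;    // NB_DETECTEURS
    int nb_projections;  // NB_PROJECTIONS
    int nb_os;           // nb_OS
};

// Loop-nest equivalent of the recursion for one (num_slice, num_subset) pair.
void normalize_subset(const Dims& d, int num_slice, int num_subset,
                      const std::vector<float>& matrix_relevance,
                      const std::vector<int>& U, const std::vector<int>& S,
                      const std::vector<float>& lsf,
                      const std::vector<int>& os_tab,
                      const std::vector<float>& normalization_detectors,
                      const std::vector<float>& attenuation,
                      std::vector<float>& normalisation_voxels)
{
    // case 0: every projection of the subset
    for (int p = 0; p < d.nb_projections / d.nb_os; ++p) {
        const int proj = os_tab[num_subset * d.nb_projections / d.nb_os + p];
        // case 1: every relevant voxel of that projection
        for (int v = 0; v < d.nb_voxels; ++v) {
            if (matrix_relevance[v] == 0.0f) continue;
            const int distance = U[v + proj * d.nb_voxels];
            const int shift = S[v + proj * d.nb_voxels];
            // case 2: every detector contributes to the voxel's entry
            for (int i = 0; i < d.nb_detectors; ++i) {
                const float proba = lsf[distance * 3 * d.nb_detectors + shift + i];
                if (proba == 0.0f) continue;
                const float norm = normalization_detectors[proj * d.nb_detectors + i];
                const float att = attenuation[num_slice * d.nb_projections * d.nb_detectors
                                              + proj * d.nb_detectors + i];
                const std::size_t out = static_cast<std::size_t>(num_slice) * d.nb_os * d.nb_voxels
                                      + static_cast<std::size_t>(num_subset) * d.nb_voxels + v;
                assert(out < normalisation_voxels.size());  // the table must cover every slice
                normalisation_voxels[out] += (proba / norm) / att;
            }
        }
    }
}
```

Written this way, the only state each (slice, subset) pair needs is a handful of local integers, which may make it easier to compare against what the recursive version does on the GPU.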
If (*subset).type is 0 or 1, I only gather information from the different inputs of the recursive function (the tables U, S, lsf, …). It is when (*subset).type is 2 that the table normalization_voxels is filled in, using the information gathered previously.
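For reference, the position written in the type 2 case can be isolated in a small helper. The sketch below is plain host-side C++; voxel_index, table_size and VOXELS_PER_SLICE are hypothetical names, and the index expression is copied from the function above. It assumes that each thread writes its own slice (num_slice = threadIdx.x), so the table would need nb_slices * nb_OS * 168 * 168 floats. Whether (*NB_VOXELS) in the kernel's memset already accounts for all slices is not something I have checked.

```cpp
#include <cassert>
#include <cstddef>

constexpr int VOXELS_PER_SLICE = 168 * 168;  // hard-coded in DFS_normalization_osem

// Flat position written in the type 2 case, same expression as in the post.
std::size_t voxel_index(int num_slice, int num_subset, int index, int nb_OS)
{
    assert(index >= 0 && index < VOXELS_PER_SLICE);
    assert(num_subset >= 0 && num_subset < nb_OS);
    return static_cast<std::size_t>(num_slice) * nb_OS * VOXELS_PER_SLICE
         + static_cast<std::size_t>(num_subset) * VOXELS_PER_SLICE
         + static_cast<std::size_t>(index);
}

// Number of floats the table needs if nb_slices threads each fill one slice.
std::size_t table_size(int nb_slices, int nb_OS)
{
    return static_cast<std::size_t>(nb_slices) * nb_OS * VOXELS_PER_SLICE;
}
```

For example, with 5 threads and nb_OS = 4, the last thread's final write lands at voxel_index(4, 3, VOXELS_PER_SLICE - 1, 4), which equals table_size(5, 4) - 1, so a table allocated for fewer slices would be written out of bounds.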
In the main function, I first launched my __global__ function on a single thread. It runs very slowly (memory access problems?) but the table normalisation_voxels is built properly. With 2, 3 or 4 threads it also works, but from 5 threads on, the loop in the __global__ function
for (num_subset = 0; num_subset < (*nb_OS); num_subset++)
{
    (*subset).num_subset = num_subset;
    (*subset).type = 0;
    (*subset).num_slice = thread_slice;
    DFS_normalization_osem(a lot of parameters);
}
is interrupted before going through all the values of num_subset.
Does anyone have an idea where this problem might come from? I really don't know much about this; could my GPU be running out of memory?
Thanks a lot in advance for your help!