Hello everyone,
I’m new to GPU programming and I don’t understand why my simple code below runs so slowly (~20 min!).
If anyone has an idea… I would be happy to understand a little more about what’s behind all this.
In my main, I have a simple loop in which I call a global function on 1 block of 32 threads:
for (indice_voxel = 0; indice_voxel < (*NB_VOXELS); indice_voxel++)
{
    // Only non-empty voxels are projected: one kernel launch per voxel.
    if (Objet[indice_voxel])
    {
        kernel<<<1,32>>>( NB_VOXELS, Objet, nombre_projections_par_OS, Table_OS, U, S,
            Table_LSF, normalisation_detecteurs, numero_subset, projected_solution, indice_voxel );
        cudaDeviceSynchronize();
    }
}
Every parameter of kernel_projection was allocated the same way, with cudaMallocManaged.
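For context, the allocation pattern looks roughly like the sketch below. The helper name `alloc_managed` is hypothetical and only there for illustration, it is not my actual code; `cudaMallocManaged` and `cudaFree` are the real CUDA runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: allocates `count` elements of T in unified (managed)
// memory, which is reachable from both host and device code.
// Returns nullptr if the allocation fails.
template <typename T>
T* alloc_managed(std::size_t count)
{
    T* ptr = nullptr;
    if (cudaMallocManaged(&ptr, count * sizeof(T)) != cudaSuccess)
        return nullptr;
    return ptr;
}
```

With such a helper, a scalar like NB_VOXELS would be `alloc_managed<int>(1)` and an array like Objet would be `alloc_managed<float>(n)` for some element count `n`, each released later with `cudaFree`.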
The goal of my global function is to build an array, projected_solution. To do this, the information needed is retrieved from the parameters passed to the function.
My function works fine (the array is built properly and I have used CUDA error checking) but incredibly slowly. This is what I would like to understand.
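To make the error checking and the slowness measurable, here is a minimal sketch of the usual pattern: a check macro around every runtime call, plus CUDA events to time a loop that launches and synchronizes once per iteration, like the loop in my main. The names `CUDA_CHECK`, `dummy_kernel` and `time_launches` are hypothetical, and the placeholder kernel stands in for the real one; this is not my actual code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical macro: prints the error with file/line and exits
// if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder kernel standing in for the real one: each of the 32 threads
// increments one element, so `data` must hold at least 32 floats.
__global__ void dummy_kernel(float* data)
{
    data[threadIdx.x] += 1.0f;
}

// Launches dummy_kernel `launches` times on 1 block of 32 threads,
// synchronizing after every launch, and returns the elapsed time in ms.
float time_launches(float* data, int launches)
{
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    for (int i = 0; i < launches; ++i) {
        dummy_kernel<<<1, 32>>>(data);
        CUDA_CHECK(cudaGetLastError());       // reports launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // reports execution errors
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}
```

Dividing the returned time by the number of launches gives a rough cost per launch-and-synchronize round trip, which can then be compared with the total run time of the real loop.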
__global__ void kernel(int* NB_VOXELS, float* Objet, int* nombre_projections_par_OS,
    int* Table_OS, int* U, int* S, float* Table_LSF, float* normalisation_detecteurs,
    int numero_subset, float* projected_solution, int indice_voxel)
{
    int distance, shift, indice_detecteur, numero_de_la_projection;
    float* pointeur_A, * pointeur_D;
    int indexThreadInGrid = blockIdx.x * blockDim.x + threadIdx.x;
    // One thread per projection of the current subset.
    if (indexThreadInGrid < (*nombre_projections_par_OS))
    {
        numero_de_la_projection = Table_OS[numero_subset * (*nombre_projections_par_OS) + indexThreadInGrid];
        distance = U[indice_voxel + numero_de_la_projection * (*NB_VOXELS)];
        shift = S[indice_voxel + numero_de_la_projection * (*NB_VOXELS)];
        pointeur_A = &Table_LSF[distance * 3 * NB_DETECTEURS + shift];
        pointeur_D = &normalisation_detecteurs[numero_de_la_projection * NB_DETECTEURS];
        // Accumulate this voxel's contribution on every detector.
        for (indice_detecteur = 0; indice_detecteur < NB_DETECTEURS; indice_detecteur++)
        {
            projected_solution[indexThreadInGrid * NB_DETECTEURS + indice_detecteur] += ((*pointeur_A++) / (*pointeur_D++)) * Objet[indice_voxel];
        }
    }
}
The size of all my parameters together (NB_VOXELS, Objet, nombre_projections_par_OS, Table_OS…) doesn’t exceed 76.5 MB.
Would anyone have any idea what I have done wrong and why my code runs so slowly? It runs even slower than the non-parallel CPU version…
Thank you!