Well I was trying to avoid posting this doubt, but I have spent all day researching and reading about it and I am completely overwhelmed.
I understood the idea of using arrays of arrays, user Letharion explained it well in another post (Arrays of Arrays). I have also found an implementation (although it seems badly implemented to me) in Stackoverflow.
What I want to implement is a 5688(dimblockdimgrid) array, where each element contains an array with 3584*3 float elements, and so I did this
//CPU ALLOCATION AND DEBUG
float*intersectionsy_h=(float*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float*));//ARRAY OF POINTERS
if (intersectionsy_h == NULL)
printf ( "!!!! host memory allocation error (array of pointers y)\n");
for(int i=0;i<dimGrid.x*dimBlock.x;i++){
for(int j=0;i<dimGrid.y*dimBlock.y;j++){
status=cudaMalloc((void**)&intersectionsy_h[i*dimGrid.x*dimBlock.x+j],(3584*3-j)*sizeof(float));
if (status != cudaSuccess)
printf ("!!!! device memory allocation error (intersections per thread)\n");
}
}
//GPU ALLOCATION AND DEBUG
float *intersectionsy_d;
status=cudaMalloc((void**)&intersectionsy_d,sizeof(float*)*dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y);
if (status != cudaSuccess)
printf ("!!!! device memory allocation error (intery)\n");
status=cudaMemcpy(intersectionsy_d,intersectionsy_h,dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float*),cudaMemcpyHostToDevice);
if (status != cudaSuccess)
printf ("!!!! could not copy intersection array to GPU (intery)\n");
//KERNEL CALL
//sysmat<<<dimGrid,dimBlock>>>(intersectionsy_d,xfocus,yfocus,zfocus, xbin, xbinsize,ybin,ybinsize,zbin,zbinsize,detectorXDim,detectorYDim,projecoes,detectorZDim,iiterationsu,jiterationsu,angle);
//The kernel is commented because I haven't changed it yet to work with arrays of arrays.
//COPY GPU RESULTS TO CPU AND DEBUG
status=cudaMemcpy(intersectionsy_h,intersectionsy_d,dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float),cudaMemcpyDeviceToHost);
if (status != cudaSuccess)
printf ("!!!! could not retrieve from GPU (interx)\n");
//FREE MEMORY AND DEBUG
for(int i = 0; i < dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y; i++){
status=cudaFree(intersectionsy_d[i]);
if (status != cudaSuccess)
printf ("!!!! device memory free error (interx)\n");
}
status=cudaFree(intersectionsy_d);
if (status != cudaSuccess)
printf ("!!!! device memory free error (interx)\n");
free(intersectionsy_h);
}
I know there are some pointer’s inconsistencies, but I swear I really tried to work around it, but with no success.
I have a segmentation fault and an error regarding the type float in cudaFree, so I would like to know the right method to implement this.
Also does this method brings any problems when working with libraries like Thrust, and are arrays of arrays performance limiters(I have read some posts supporting this)?
Thanks in advance and any help is appreciated since I am completely lost =/