Arrays of arrays in CUDA

Well, I was trying to avoid posting this question, but I have spent all day researching and reading about it and I am completely overwhelmed.

I understood the idea of using arrays of arrays; user Letharion explained it well in another post (Arrays of Arrays). I have also found an implementation on Stack Overflow (although it seems badly implemented to me).

What I want to implement is a 5688-element array (dimBlock * dimGrid), where each element points to an array of 3584*3 float elements, so I did this:

//CPU ALLOCATION AND DEBUG
float*intersectionsy_h=(float*)malloc(dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float*));//ARRAY OF POINTERS
if (intersectionsy_h == NULL)
    printf ("!!!! host memory allocation error (array of pointers y)\n");
for(int i=0;i<dimGrid.x*dimBlock.x;i++){
    for(int j=0;i<dimGrid.y*dimBlock.y;j++){
        status=cudaMalloc((void**)&intersectionsy_h[i*dimGrid.x*dimBlock.x+j],(3584*3-j)*sizeof(float));
        if (status != cudaSuccess)
            printf ("!!!! device memory allocation error (intersections per thread)\n");
    }
}

//GPU ALLOCATION AND DEBUG
float *intersectionsy_d;

status=cudaMalloc((void**)&intersectionsy_d,sizeof(float*)*dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y);
if (status != cudaSuccess)
    printf ("!!!! device memory allocation error (intery)\n");
status=cudaMemcpy(intersectionsy_d,intersectionsy_h,dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float*),cudaMemcpyHostToDevice);
if (status != cudaSuccess)
    printf ("!!!! could not copy intersection array to GPU (intery)\n");

//KERNEL CALL
//sysmat<<<dimGrid,dimBlock>>>(intersectionsy_d,xfocus,yfocus,zfocus, xbin, xbinsize,ybin,ybinsize,zbin,zbinsize,detectorXDim,detectorYDim,projecoes,detectorZDim,iiterationsu,jiterationsu,angle);
//The kernel is commented because I haven't changed it yet to work with arrays of arrays.

//COPY GPU RESULTS TO CPU AND DEBUG
status=cudaMemcpy(intersectionsy_h,intersectionsy_d,dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y*sizeof(float),cudaMemcpyDeviceToHost);
if (status != cudaSuccess)
    printf ("!!!! could not retrieve from GPU (interx)\n");

//FREE MEMORY AND DEBUG
for(int i = 0; i < dimGrid.x*dimGrid.y*dimBlock.x*dimBlock.y; i++){
    status=cudaFree(intersectionsy_d[i]);
    if (status != cudaSuccess)
        printf ("!!!! device memory free error (interx)\n");
}
status=cudaFree(intersectionsy_d);
if (status != cudaSuccess)
    printf ("!!!! device memory free error (interx)\n");
free(intersectionsy_h);
}

I know there are some pointer inconsistencies in there; I swear I really tried to work around them, but with no success.

I get a segmentation fault, and a type error on the cudaFree call (it is being passed a float rather than a pointer), so I would like to know the right way to implement this.

Also, does this method cause any problems when working with libraries like Thrust, and are arrays of arrays performance limiters? (I have read some posts claiming this.)

Thanks in advance and any help is appreciated since I am completely lost =/

It all starts going wrong on the very first line of code:

This is an array of floats:

float * intersectionsy_h=(float*)malloc(sz*sizeof(float*));//NOT AN ARRAY OF POINTERS

This is an array of pointers to floats:

float ** intersectionsy_h=(float**)malloc(sz*sizeof(float*));//ARRAY OF POINTERS

On a 64 bit system there is a very, very big difference between the two. Once you get that right, things might make a little more sense.

Hum… yes… I knew that float** represents an array of pointers; it is just that in all the examples I have seen, I never saw anyone use the double-pointer notation. And CUDA kernels don't support double pointers, or am I wrong? By now I have read so much information that I might be confusing things.

How about this or this?

The latter.

Thank you avidday, I do not know why I was so sure about CUDA kernels not supporting double pointers, now it is all a bit clearer to me.

P.S.: At least I think those posts confirm that double pointers should not be used, as they are hard to implement and hurt performance.

Absolutely. Whether it can be done and whether it should be done are very separate questions. There probably are some specific cases where using arrays of pointers is the optimal solution, but I can't say I have ever had cause to use them in any of my work with CUDA.