Allocating an array of pitched arrays

Is it possible to build a 2D array by looping mallocs over an allocated array of pointers, where each entry of the “2nd” dimension is a pitched 2D array instead of a linear array? Basically, I want to store several 2D pitched arrays. I tried allocating and freeing them on demand inside recursive calls, but that costs more time than the pitched memory saves.

Something like this?

size_t d_pitchBytes;

int dim[NumberOfArrays]; // This contains the size of each pitched array

/* allocate storage for an array of pointers */
double **a = (double **)malloc(NumberOfArrays * sizeof(double *));

/* for each pointer, allocate storage for a pitched array */
for (i = 0; i < NumberOfArrays; i++) {
    cudaMallocPitch((void**) &a[i], &d_pitchBytes, dim[i]*sizeof(double), dim[i]);
}

Update: This is my first attempt at putting the pitched arrays within another array.

size_t d_pitchBytes;

size_t memcpyPitch;

double **d_u;

double **d_res;

double **d_rhs;

int numGrids=12;

int dimU[numGrids];//These values are initialised elsewhere, but the array is included for completeness

	//Allocate Device Memory

	cudaMalloc((void**) &d_u, numGrids*sizeof(double));

	cudaMalloc((void**) &d_rhs, numGrids*sizeof(double));

	cudaMalloc((void**) &d_res, numGrids*sizeof(double));

	for(i=0;i<numGrids;i++){		

	cudaMallocPitch((void**) &d_u[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	if(i==0){

		memcpyPitch=d_pitchBytes;

	

	}

	cudaMallocPitch((void**) &d_rhs[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	cudaMallocPitch((void**) &d_res[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	Pitch[i]=d_pitchBytes/sizeof(double);

	}

It throws a cudaError_enum exception when performing the cudaMallocPitch calls. Any suggestions would be appreciated. Thanks.

You can’t access d_u[i], d_rhs[i] or d_res[i] from the host, since those pointer arrays live in device memory. You can get it to work by allocating as in your first post (a host array of pointers) and afterwards copying that pointer array to the device. But, as usual, pointers to pointers are a bad idea in CUDA. Do your pitched arrays have different sizes or not? If they all have the same size, you should be able to flatten them into a single pitched allocation with dim*NumberofArrays as the y-dimension. (You could also take a look at cudaMalloc3D, which also pitches the allocation.)
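For the equal-size case, the flattening idea could look roughly like the sketch below. The helper names allocStacked and stackedElem are made up for illustration, and the arrays are assumed to be square (dim x dim) and stacked along the y-dimension of one cudaMallocPitch allocation.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: one pitched allocation that holds numArrays
// stacked dim x dim arrays, all of the same size.
cudaError_t allocStacked(double **d_base, size_t *pitchBytes,
                         int dim, int numArrays)
{
    // Width is given in bytes, height in rows. The arrays are stacked
    // along y, so the height is dim * numArrays.
    return cudaMallocPitch((void **)d_base, pitchBytes,
                           dim * sizeof(double),
                           (size_t)dim * numArrays);
}

// Address of element (row, col) of array k inside the stacked allocation.
// Only pointer arithmetic is done here; nothing is dereferenced.
__host__ __device__ inline double *stackedElem(double *base, size_t pitchBytes,
                                               int dim, int k, int row, int col)
{
    // Consecutive rows are pitchBytes apart, so step in bytes first.
    char *rowStart = (char *)base + ((size_t)k * dim + row) * pitchBytes;
    return (double *)rowStart + col;
}
```

A kernel would then receive just the base pointer, the pitch and dim, and call stackedElem to reach its element, so no pointer-to-pointer is needed.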

The arrays are all different sizes. I was flattening them all into a 1D array, but I wanted to pad the arrays. Previously I was mallocing and freeing each array in recursive calls of a function in the program. This let me use padding, which improved kernel performance, but the gain was offset by the constant allocation and freeing of arrays. I want to see whether having to dereference the pointers will be faster or slower than constantly mallocing/freeing arrays on the device.

Latest attempt at the code; it still gives an enum error when performing the pitched malloc.

int dimU[numGrids];//These values are initialised elsewhere, but the array is included for completeness

	double **d_u; 

	double **d_rhs;

	double **d_res;

	double *d_uMalloc;

	double *d_rhsMalloc;

	double *d_resMalloc;

        size_t d_pitchBytes;

        size_t memcpyPitch;

        int *dimU;

        int *Pitch;

Pitch=(int *)malloc(numGrids*sizeof(int));

        dimU=(int *)malloc(numGrids*sizeof(int));

	cudaMalloc((void**) &d_u, numGrids*sizeof(double *));

	cudaMalloc((void**) &d_res, numGrids*sizeof(double *));

	cudaMalloc((void**) &d_rhs, numGrids*sizeof(double *));

	d_uMalloc=(double *)malloc(numGrids*sizeof(double));

	d_rhsMalloc=(double *)malloc(numGrids*sizeof(double));

	d_resMalloc=(double *)malloc(numGrids*sizeof(double));

	for(i=0;i<numGrids;i++){		

	cudaMallocPitch((void**) &d_uMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	if(i==0){

		memcpyPitch=d_pitchBytes;

		}

	cudaMallocPitch((void**) &d_rhsMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	cudaMallocPitch((void**) &d_resMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	Pitch[i]=d_pitchBytes/sizeof(double);

	}

Hi,

I’m not sure I understand fully, but I also have an array of arrays.

I’m not an expert on 2D arrays, but I would do something like this:

// vars

const int numGrids = 12;

double **d_res;           // device array of device arrays 2D

double **h_d_tmp;         // host array of device arrays 2D 

size_t pitchBytes[numGrids]; // cudaMallocPitch writes the pitch through a size_t*

int dimU[numGrids];

// init host/device memory

h_d_tmp = new double*[numGrids];

for (int i=0 ; i < numGrids ; i++)

{

     cudaMallocPitch((void**) &h_d_tmp[i], &pitchBytes[i], dimU[i] * sizeof(double), dimU[i]);

}

cudaMalloc((void**) &d_res, numGrids * sizeof(double*));

cudaMemcpy(d_res, h_d_tmp, numGrids * sizeof(double*), cudaMemcpyHostToDevice);

// do something by passing d_res to a kernel

/// ...

// free host/device memory

for (int i=0 ; i < numGrids ; i++)

{

     cudaFree(h_d_tmp[i]);

}

cudaFree(d_res);

delete[] h_d_tmp;
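A kernel that consumes the device pointer table might look roughly like the sketch below. It assumes the per-grid pitches and sizes have also been copied to device arrays (here called d_pitchBytes and d_dims), which the snippet above does not do; scaleAll and launchScaleAll are hypothetical names, and the launch uses one z-slice of the grid per array.

```cuda
#include <cuda_runtime.h>

// Sketch: scale every element of every grid. `grids` is the device-resident
// pointer table (d_res above); `pitchBytes` and `dims` are assumed to be
// device copies of the per-grid pitch and size arrays.
__global__ void scaleAll(double **grids, const size_t *pitchBytes,
                         const int *dims, int numGrids, double factor)
{
    int g   = blockIdx.z;                              // one z-slice per grid
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (g >= numGrids || row >= dims[g] || col >= dims[g]) return;

    // Rows are pitchBytes[g] apart: step in bytes, then index within the row.
    double *rowPtr = (double *)((char *)grids[g] + row * pitchBytes[g]);
    rowPtr[col] *= factor;
}

// Hypothetical launcher; maxDim is the largest entry of the dims array.
void launchScaleAll(double **d_grids, const size_t *d_pitchBytes,
                    const int *d_dims, int numGrids, int maxDim, double factor)
{
    dim3 block(16, 16);
    dim3 grid((maxDim + block.x - 1) / block.x,
              (maxDim + block.y - 1) / block.y,
              numGrids);
    scaleAll<<<grid, block>>>(d_grids, d_pitchBytes, d_dims, numGrids, factor);
}
```

Sizing the launch for the largest grid wastes some threads on the smaller ones; whether that beats one launch per grid is something to measure rather than assume.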

I think it’s the malloc of the host array that’s screwing me up. Should it be:

double **d_uMalloc;

d_uMalloc=(double *)malloc(numGrids*sizeof(double));

Or am I using too many/few dereferencing operators?

This is what I’ve got now. Allocations seem to work, but I get enum errors at the cudaMemcpy2D calls.

int dimU[numGrids];//These values are initialised elsewhere, but the array is included for completeness

        double **d_u; 

        double **d_rhs;

        double **d_res;

        double **d_uMalloc;

        double **d_rhsMalloc;

        double **d_resMalloc;

        size_t h_pitchBytes = dimension*sizeof(double);

	size_t memcpyPitch;

	

	cudaMalloc((void**) &d_u, numGrids*sizeof(double *));

	cudaMalloc((void**) &d_res, numGrids*sizeof(double *));

	cudaMalloc((void**) &d_rhs, numGrids*sizeof(double *));

	d_uMalloc=(double**)malloc(numGrids*sizeof(double*));

	d_rhsMalloc=(double**)malloc(numGrids*sizeof(double*));

	d_resMalloc=(double**)malloc(numGrids*sizeof(double*));

	for(i=0;i<numGrids;i++){		

	cudaMallocPitch((void**) &d_uMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	if(i==0){

		memcpyPitch=d_pitchBytes;

		printf("%i \n",memcpyPitch);

	}

	cudaMallocPitch((void**) &d_rhsMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	cudaMallocPitch((void**) &d_resMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);

	Pitch[i]=d_pitchBytes/sizeof(double);

	}

	printf("%s \n","Device Memory Allocation Completed");

	

cudaMemcpy(d_u,d_uMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);	

cudaMemcpy(d_res,d_resMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);	

cudaMemcpy(d_rhs,d_rhsMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);	

cudaMemcpy2D(d_u[0], memcpyPitch, h_u, h_pitchBytes,dimU[0]*sizeof(double), dimU[0], cudaMemcpyHostToDevice);

cudaMemcpy2D(d_rhs[0], memcpyPitch, h_rhs, h_pitchBytes,dimU[0]*sizeof(double), dimU[0], cudaMemcpyHostToDevice);

cudaMemcpy2D(d_res[0], memcpyPitch, h_rhs, h_pitchBytes,dimU[0]*sizeof(double), dimU[0], cudaMemcpyHostToDevice);

You can’t dereference device memory from the host: d_u[0], d_rhs[0] and d_res[0] are not valid in host code.

On the host you can manipulate (assign, copy) pointers to device memory, but you can’t access what they point to.

But what are those last 3 lines supposed to do?

I want to copy host arrays into d_u[0], d_rhs[0] and d_res[0]. Is it possible to do this? If d_u, d_res and d_rhs were 1D I could.

int dimU[numGrids];//These values are initialised elsewhere, but the array is included for completeness

        double **d_u; 

        double **d_rhs;

        double **d_res;

        double **d_uMalloc;

        double **d_rhsMalloc;

        double **d_resMalloc;

        size_t h_pitchBytes = dimension*sizeof(double);

        size_t memcpyPitch;

cudaMalloc((void**) &d_u, numGrids*sizeof(double *));

        cudaMalloc((void**) &d_res, numGrids*sizeof(double *));

        cudaMalloc((void**) &d_rhs, numGrids*sizeof(double *));

        d_uMalloc=(double**)malloc(numGrids*sizeof(double*));

        d_rhsMalloc=(double**)malloc(numGrids*sizeof(double*));

        d_resMalloc=(double**)malloc(numGrids*sizeof(double*));

        for(i=0;i<numGrids;i++)
        {
             // each cudaMallocPitch returns its own pitch; use it for the copy that follows
             cudaMallocPitch((void**) &d_uMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);
             cudaMemcpy2D(d_uMalloc[i], d_pitchBytes, h_u[i], h_pitchBytes,dimU[i]*sizeof(double), dimU[i], cudaMemcpyHostToDevice);

             if(i==0){
                memcpyPitch=d_pitchBytes;
                printf("%zu \n",memcpyPitch);
             }

             cudaMallocPitch((void**) &d_rhsMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);
             cudaMemcpy2D(d_rhsMalloc[i], d_pitchBytes, h_rhs[i], h_pitchBytes,dimU[i]*sizeof(double), dimU[i], cudaMemcpyHostToDevice);

             cudaMallocPitch((void**) &d_resMalloc[i], &d_pitchBytes, dimU[i]*sizeof(double), dimU[i]);
             cudaMemcpy2D(d_resMalloc[i], d_pitchBytes, h_res[i], h_pitchBytes,dimU[i]*sizeof(double), dimU[i], cudaMemcpyHostToDevice);

             // pitch in elements for grid i (taken from the last allocation)
             Pitch[i]=d_pitchBytes/sizeof(double);
        }

        printf("%s \n","Device Memory Allocation Completed");

cudaMemcpy(d_u,d_uMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);    

cudaMemcpy(d_res,d_resMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);        

cudaMemcpy(d_rhs,d_rhsMalloc,numGrids*sizeof(double *), cudaMemcpyHostToDevice);

Ok sorry, the day was long …

I want to set the values of the pitched arrays at d_u[0], d_rhs[0] and d_res[0] to the values stored in the host arrays h_u, h_rhs and h_res.

Thanks. If I wanted to pass one of the pitched arrays to a kernel would I pass d_u[i] or d_uMalloc[i]? Thanks for your help. I’m still getting used to device and host pointers.

It depends on which element you want to pass:

// You can pass to a kernel

d_u / d_res / d_rhs                        // device arrays of device pointers

d_uMalloc[X] / d_rhsMalloc[X] / d_resMalloc[X]   // a single device pointer, read on the host

// You can't pass

d_u[X] / d_rhs[X] / d_res[X]               // this would dereference device memory on the host

d_uMalloc / d_rhsMalloc / d_resMalloc      // these are host arrays; a kernel can't read them

And to free the device memory, you need to do it like this:

for (int i=0 ; i < numGrids ; i++)

{

     cudaFree(d_uMalloc[i]);

     cudaFree(d_rhsMalloc[i]);

     cudaFree(d_resMalloc[i]);

}

cudaFree(d_u);

cudaFree(d_rhs);

cudaFree(d_res);

free(d_uMalloc);

free(d_rhsMalloc);

free(d_resMalloc);
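As a rough illustration of the "you can pass d_uMalloc[X]" case, the sketch below shows a host function that receives the host-side table of device pointers and launches a kernel on one pitched array. The names fillGrid and fillOne are made up, and the pitches are assumed to be kept in bytes in a host array (the thread's Pitch[] stores them in elements instead).

```cuda
#include <cuda_runtime.h>

// Kernel working on a single pitched dim x dim array.
__global__ void fillGrid(double *grid, size_t pitchBytes, int dim, double value)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= dim || col >= dim) return;

    // Rows are pitchBytes apart: step in bytes, then index within the row.
    double *rowPtr = (double *)((char *)grid + row * pitchBytes);
    rowPtr[col] = value;
}

// Hypothetical host function. hostTable is the host array of device
// pointers (d_uMalloc in this thread); pitchBytes and dims are host arrays.
void fillOne(double **hostTable, const size_t *pitchBytes,
             const int *dims, int i, double value)
{
    dim3 block(16, 16);
    dim3 grid((dims[i] + block.x - 1) / block.x,
              (dims[i] + block.y - 1) / block.y);
    // hostTable[i] is read on the host; its value is a device address,
    // which is passed to the kernel by value.
    fillGrid<<<grid, block>>>(hostTable[i], pitchBytes[i], dims[i], value);
}
```

With this layout the function only needs the existing host table plus an index, so no extra array of pointers has to be built per call.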

Ok, so let’s say I have a function that needs to be able to pass any of the d_uMalloc arrays to a kernel within said function. Would I have to declare an array of pointers that points to each d_uMalloc for every x, and then pass that array to my function?