Allocating multi-dimension array (An array of arrays of different lengths)

Hello everyone.

I want allocate multi-dimension array.

And, it has different lengths second dimension.

It means…

array -> array[0][0 ~ 100], array[1][0 ~ 200], array[2][0 ~ 300]…

I saw a article from web-site

In my case, I can’t use cudaMallocPitch.

So, now, I had to allocation memory single dimension.

array[0 ~ 99, 100 ~ 299, 300 ~ ]

In this case, actual memory address has to allocated continuously. because, I use pointer access in kernel function.

Please help me…


maybe you should be more precise on what you want to achieve.
For example, I do not see why you don’t want to use a one dimensional array.

I am not sure if universal memory would help your cause, I don’t think so.

So in both situations these arrays will have to be allocated manually.

For example, allocate them both on the cpu and on the gpu side.

This will return pointers on cpu and gpu side.

The pointers which were returned for the gpu side on the cpu side, can then be transferred to the gpu side where they make sense.

So for example:

Use some cuda allocation routine, to allocate memory on the gpu, this returns a gpu pointer (on cpu side). Then use cuda transfer routines to transfer the gpu pointer, to the gpu itself.

There is also another possibility, allocating the memory inside the kernel itself. I wouldn’t recommend doing that… cause it may be buggy or cause other problems. And it also depends if you need memory on both sides ;).

So you will have 1 main pointer for the first dimension, then many pointers for each secondary dimension. All these pointers will have to be duplicated/transferred to gpu side.

Use a structure for that… and then initialize/rebuild the structure on gpu side… initialize main array with those pointers that were transferred.

Hope this gives you some ideas.

Hi. Thank you for your advice.

Actually, I wanna do parallel quick sorting multiple data(has different length).

I was tried simple Quick Sorting of CUDA example. And I was modify it.

That example use dynamic parallelism for quick sorting. Am I right??

Note that, I’m using GTX 780-Ti.

So, I was create data like below.


const int arrays = 100;
int* h_count = (int*)malloc(arrays * sizeof(int));
int** h_data = (int**)malloc(arrays * sizeof(int*));

for (int i=0 ; i<arrays ; i++)
h_count[i] = rand() % 10000;

int sum_length = 0;
for (int i=0 ; i<arrays ; i++)
sum_length += h_count[i];

h_data[i] = (int*)malloc(count[i] * sizeof(int));

for (int j=0 ; j<h_count[i] ; j++)
	h_data[i][j] = rand() % h_count[i];


int* d_data;
cudaMalloc((void**)&d_data, sum_length * sizeof(int));
int* d_count;
cudaMalloc((void**)&d_count, arrays * sizeof(int));

int offset = 0;
for (int i=0 ; i<arrays ; i++)
cudaMemcpy(d_data + offset, h_data[i], h_count[i] * sizeof(int), cudaMemcpyHostToDevice);
offset += h_count[i];
cudaMemcpy(d_count, h_count, arrays * sizeof(int), cudaMemcpyHostToDevice);

// Prepare CDP for the max depth ‘MAX_DEPTH’.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH);

// call kernel function
SortQuickParallel<<<1, arrays>>>(d_data, d_count, arrays, 0, 0, 0, 0);


offset = 0;
for (int i=0 ; i<arrays ; i++)
cudaMemcpy(h_data[i], d_data + offset, h_count[i] * sizeof(int), cudaMemcpyDeviceToHost);
offset += h_count[i];

for (int i=0 ; i<arrays ; i++)



It’s working. But, I still want use multiple array.

And, I have another question.

The result showing CPU faster than GPU same function. I don’t know way.

I’m waiting for advice…plz…

Concerning your second question: Why is the CPU faster than the GPU?

In your example you call the kernel with <<<1,100>>> that means 1 block with 100 threads. To fully utilize your GPU (with 2880 cores) you need much more active threads. I would highly recommend that you read some introductory tutorial on when the GPU architecture is powerful to use (for example the first chapters of the programming guide)!
If you cannot modify your problem such that you utilize the GPU much better you should keep to the CPU code.

Thanks Mr.hadschi118.

I’d change code <<<1,100>>> to <<<100,1>>> and threadIdx.x to blockIdx.x inside kernel.

After then each thread has calling kernel<<<1,1>>>(…) like sample project. (Recursive)

But, It still take long time more than CPU code.

What I did wrong?? I can modify my program anytime.

Do I have to keep use CPU code? Is there any other way??

Ok, sorry. I missed that you spawn threads dynamically within the kernel. Then your first choice might be the correct one, depending on the implementation of your “SortQuickParallel” kernel.

Without more details it is hard to tell where the performance problem comes from… Another guess would be that your arrays are to small to see a good performance…

Hi. Mr.hadschi118.

You always reply to my question. Thanks. It’s very helpful for me.

I’m sorry. I had have to upload my whole code at first.

The kernel is implemented like below…

Please look at it and make a right. Thanks again. Have a good day.


global void SortQuickParallel(T* data, int *length, int data_num, int left, int right, int depth, int sidx)
// Do someting at first
if (depth == 0)
left = 0;
right = length[threadIdx.x] - 1; // >> blockIdx.x

	for (int i=0 ; i<threadIdx.x ; i++) // >> blockIdx.x
		sidx += length[i];

// If we're too deep or there are few elements left, we use an insertion sort...
if ((depth >= MAX_DEPTH) || ((right - left) <= SELECTION_SORT))
    SortSelection(data, sidx + left, sidx + right);

T* lptr = data + sidx + left;
T* rptr = data + sidx + right;
T pivot = data[sidx + ((left + right) / 2)];

while (lptr <= rptr)
	T lval = *lptr;
	T rval = *rptr;

	// Move the left pointer as long as the pointed element is smaller than the pivot.
	while (lval < pivot)
		lval = *lptr;

	// Move the right pointer as long as the pointed element is larger than the pivot.
	while (rval > pivot)
		rval = *rptr;

	// If the swap points are valid, do the swap!
    if (lptr <= rptr)
        *lptr++ = rval;
        *rptr-- = lval;

// Now the recursive part
int nleft  = (int)(lptr - data);
int nright = (int)(rptr - data);

nleft = nleft - sidx;
nright = nright - sidx;

if (left < nright)
	cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    SortQuickParallelMultiVirtualLength<<<1, 1, 0, s>>>(data, length, data_num, left, nright, depth+1, sidx);

if (nleft < right)
	cudaStream_t s1;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    SortQuickParallelMultiVirtualLength<<<1, 1, 0, s1>>>(data, length, data_num, nleft, right, depth+1, sidx);



Sorry, I don’t have the time to fix your code (and I am no expert to dynamic parallelism). I would propose that you use the AdvancedQuickSort from the CUDA samples and just use it as it is, i.e. call the quicksort kernel for each one-dimensional array. If the arrays are large enough this should give you already a good performance. I feel that trying to quicksort all your different-sized arrays in parallel might be hard to implement in an efficient way…

If your arrays are too small for good efficiency you might try to run the kernels for each of the one-dimensional arrays concurrently ( I have no experience on this…

Try a vectorized sort using Thrust:!topic/thrust-users/BoLsxO6b4FY

Thanks for reply.

I will try and report. Thanks again.