Allocating a multi-dimensional array (an array of arrays of different lengths)

ahntw80 · June 23, 2014, 8:09am

Hello everyone.

I want to allocate a multi-dimensional array.

Its second dimension has a different length for each row.

That is:

array → array[0][0 ~ 99], array[1][0 ~ 199], array[2][0 ~ 299]…

I saw an article about this at http://www.stevenmarkford.com/allocating-2d-arrays-in-cuda/

In my case, I can’t use cudaMallocPitch.

So for now I have to allocate the memory as a single one-dimensional array:

array[0 ~ 99, 100 ~ 299, 300 ~ ]

In this case the memory has to be allocated contiguously, because I use pointer access in the kernel function.

Please help me…

hadschi118 · June 23, 2014, 12:14pm

Hi,

Maybe you should be more precise about what you want to achieve.
For example, I do not see why you don’t want to use a one-dimensional array.

Skybuck · June 24, 2014, 1:02am

I am not sure whether unified memory would help your cause; I don’t think so.

So in both situations these arrays will have to be allocated manually.

For example, allocate them both on the cpu and on the gpu side.

This will return pointers on cpu and gpu side.

The pointers returned for the GPU side live on the CPU side, and can then be transferred to the GPU side, where they make sense.

So for example:

Use a CUDA allocation routine to allocate memory on the GPU; this returns a GPU pointer (on the CPU side). Then use the CUDA transfer routines to copy that GPU pointer to the GPU itself.

There is also another possibility: allocating the memory inside the kernel itself. I wouldn’t recommend doing that… because it may be buggy or cause other problems. It also depends on whether you need the memory on both sides ;).

So you will have one main pointer for the first dimension, then one pointer for each second-dimension array. All these pointers will have to be duplicated/transferred to the GPU side.

Use a structure for that… then initialize/rebuild the structure on the GPU side… and initialize the main array with the pointers that were transferred.

Hope this gives you some ideas.

ahntw80 · June 24, 2014, 7:01am

Hi. Thank you for your advice.

Actually, I want to quicksort multiple data sets (of different lengths) in parallel.

I tried the simple quicksort from the CUDA examples and modified it.

That example uses dynamic parallelism for the quicksort. Am I right?

Note that I’m using a GTX 780 Ti.

So I created the data as shown below.

================================================================================================

// Requires <cstdlib>, <ctime>, and the CUDA runtime header; MAX_DEPTH is defined elsewhere.
const int arrays = 100;
int* h_count = (int*)malloc(arrays * sizeof(int));
int** h_data = (int**)malloc(arrays * sizeof(int*));

// Seed once; reseeding inside the loop would repeat the same sequence.
srand(time(NULL));
for (int i=0 ; i<arrays ; i++)
{
    h_count[i] = rand() % 10000;
}

// Fill each host array with random values and accumulate the total length.
int sum_length = 0;
for (int i=0 ; i<arrays ; i++)
{
    sum_length += h_count[i];

    h_data[i] = (int*)malloc(h_count[i] * sizeof(int));

    for (int j=0 ; j<h_count[i] ; j++)
        h_data[i][j] = rand() % h_count[i];
}

// One contiguous device buffer for all arrays, plus the per-array lengths.
int* d_data;
cudaMalloc((void**)&d_data, sum_length * sizeof(int));
int* d_count;
cudaMalloc((void**)&d_count, arrays * sizeof(int));

// Copy each host array to its slot in the device buffer.
int offset = 0;
for (int i=0 ; i<arrays ; i++)
{
    cudaMemcpy(d_data + offset, h_data[i], h_count[i] * sizeof(int), cudaMemcpyHostToDevice);
    offset += h_count[i];
}
cudaMemcpy(d_count, h_count, arrays * sizeof(int), cudaMemcpyHostToDevice);

// Prepare CDP for the max depth 'MAX_DEPTH'.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH);

// Call the kernel: one thread per array.
SortQuickParallel<<<1, arrays>>>(d_data, d_count, arrays, 0, 0, 0, 0);

cudaDeviceSynchronize();

// Copy the sorted arrays back to the host.
offset = 0;
for (int i=0 ; i<arrays ; i++)
{
    cudaMemcpy(h_data[i], d_data + offset, h_count[i] * sizeof(int), cudaMemcpyDeviceToHost);
    offset += h_count[i];
}

for (int i=0 ; i<arrays ; i++)
{
    free(h_data[i]);
}
free(h_data);
free(h_count);

cudaFree(d_data);
cudaFree(d_count);

================================================================================================

It works. But I would still like to use multiple arrays.

And I have another question.

The results show that the CPU is faster than the GPU for the same function. I don’t know why.

I’m waiting for advice… please…

hadschi118 · June 24, 2014, 7:45am

Concerning your second question: Why is the CPU faster than the GPU?

In your example you call the kernel with <<<1,100>>>, which means 1 block with 100 threads. To fully utilize your GPU (with 2880 cores) you need many more active threads. I would highly recommend that you read an introductory tutorial on when the GPU architecture pays off (for example the first chapters of the programming guide)!
If you cannot modify your problem so that it utilizes the GPU much better, you should stick with the CPU code.

ahntw80 · June 24, 2014, 11:20am

Thanks, Mr. hadschi118.

I changed the code from <<<1,100>>> to <<<100,1>>>, and threadIdx.x to blockIdx.x inside the kernel.

After that, each thread calls kernel<<<1,1>>>(…) recursively, like the sample project.

But it still takes longer than the CPU code.

What did I do wrong? I can modify my program at any time.

Do I have to keep using the CPU code? Is there any other way?

hadschi118 · June 24, 2014, 12:52pm

Ok, sorry. I missed that you spawn threads dynamically within the kernel. Then your first choice might be the correct one, depending on the implementation of your “SortQuickParallel” kernel.

Without more details it is hard to tell where the performance problem comes from… Another guess would be that your arrays are too small to show good performance…

ahntw80 · June 25, 2014, 12:18am

Hi, Mr. hadschi118.

You always reply to my questions. Thanks. It’s very helpful for me.

I’m sorry. I should have posted my whole code in the first place.

The kernel is implemented as shown below…

Please look at it and correct it. Thanks again. Have a good day.

==========================================================================================

template <typename T>
__global__ void SortQuickParallel(T* data, int *length, int data_num, int left, int right, int depth, int sidx)
{
// Top-level launch only: select this thread's array and compute its start offset (sidx).
if (depth == 0)
{
left = 0;
right = length[threadIdx.x] - 1; // >> blockIdx.x

	for (int i=0 ; i<threadIdx.x ; i++) // >> blockIdx.x
	{
		sidx += length[i];
	}
}

// If we're too deep or there are few elements left, we use a selection sort...
if ((depth >= MAX_DEPTH) || ((right - left) <= SELECTION_SORT))
{
    SortSelection(data, sidx + left, sidx + right);
    return;
}

T* lptr = data + sidx + left;
T* rptr = data + sidx + right;
T pivot = data[sidx + ((left + right) / 2)];

while (lptr <= rptr)
{
	T lval = *lptr;
	T rval = *rptr;

	// Move the left pointer as long as the pointed element is smaller than the pivot.
	while (lval < pivot)
	{
		lptr++;
		lval = *lptr;
	}

	// Move the right pointer as long as the pointed element is larger than the pivot.
	while (rval > pivot)
	{
		rptr--;
		rval = *rptr;
	}

	// If the swap points are valid, do the swap!
    if (lptr <= rptr)
    {
        *lptr++ = rval;
        *rptr-- = lval;
    }
}

// Now the recursive part
int nleft  = (int)(lptr - data);
int nright = (int)(rptr - data);

nleft = nleft - sidx;
nright = nright - sidx;

// Launch a child kernel for the left partition in its own stream.
// (The recursive calls use this kernel's own name.)
if (left < nright)
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    SortQuickParallel<<<1, 1, 0, s>>>(data, length, data_num, left, nright, depth+1, sidx);
    cudaStreamDestroy(s);
}

// Launch a child kernel for the right partition in its own stream.
if (nleft < right)
{
    cudaStream_t s1;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    SortQuickParallel<<<1, 1, 0, s1>>>(data, length, data_num, nleft, right, depth+1, sidx);
    cudaStreamDestroy(s1);
}

}

==========================================================================================

hadschi118 · June 25, 2014, 7:59am

Sorry, I don’t have the time to fix your code (and I am no expert in dynamic parallelism). I would propose that you use the AdvancedQuickSort from the CUDA samples as it is, i.e. call the quicksort kernel for each one-dimensional array. If the arrays are large enough this should already give you good performance. I feel that trying to quicksort all your different-sized arrays in parallel might be hard to implement efficiently…

If your arrays are too small for good efficiency you might try to run the kernels for each of the one-dimensional arrays concurrently (see the Programming Guide in the CUDA Toolkit Documentation). I have no experience with this…

Robert_Crovella · June 26, 2014, 2:34pm

Try a vectorized sort using Thrust:

(link to a Google Groups thread)

ahntw80 · July 1, 2014, 12:30am

Thanks for the reply.

I will try it and report back. Thanks again.
