Hi. Thank you for your advice.
Actually, I want to quicksort multiple arrays (of different lengths) in parallel.
I tried the simple quicksort example from the CUDA samples and modified it.
That example uses dynamic parallelism for the quicksort, am I right?
Note that I'm using a GTX 780 Ti.
So I created the data like below.
================================================================================================
const int arrays = 100;
int* h_count = (int*)malloc(arrays * sizeof(int));
int** h_data = (int**)malloc(arrays * sizeof(int*));
srand(time(NULL));   // seed once, not again inside the loops
for (int i = 0; i < arrays; i++)
{
    // +1 so a length is never 0 (rand() % 0 below would be undefined)
    h_count[i] = rand() % 10000 + 1;
}
int sum_length = 0;
for (int i = 0; i < arrays; i++)
{
    sum_length += h_count[i];
    h_data[i] = (int*)malloc(h_count[i] * sizeof(int));   // was count[i]: typo
    for (int j = 0; j < h_count[i]; j++)
        h_data[i][j] = rand() % h_count[i];
}
int* d_data;
cudaMalloc((void**)&d_data, sum_length * sizeof(int));
int* d_count;
cudaMalloc((void**)&d_count, arrays * sizeof(int));
// flatten the jagged host arrays into one contiguous device buffer
int offset = 0;
for (int i = 0; i < arrays; i++)
{
    cudaMemcpy(d_data + offset, h_data[i], h_count[i] * sizeof(int), cudaMemcpyHostToDevice);
    offset += h_count[i];
}
cudaMemcpy(d_count, h_count, arrays * sizeof(int), cudaMemcpyHostToDevice);
// Prepare CDP for the max depth ‘MAX_DEPTH’.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH);
// call the kernel: one block of 'arrays' threads, one thread per array
SortQuickParallel<<<1, arrays>>>(d_data, d_count, arrays, 0, 0, 0, 0);
cudaDeviceSynchronize();
// copy each sorted segment back to its host array
offset = 0;
for (int i = 0; i < arrays; i++)
{
    cudaMemcpy(h_data[i], d_data + offset, h_count[i] * sizeof(int), cudaMemcpyDeviceToHost);
    offset += h_count[i];
}
for (int i = 0; i < arrays; i++)
{
    free(h_data[i]);
}
free(h_data);
free(h_count);
cudaFree(d_data);
cudaFree(d_count);
================================================================================================
It's working. But I still want to handle the data as multiple separate arrays, not one flattened buffer.
And I have another question.
For the same function, the CPU result is faster than the GPU, and I don't know why.
Any advice would be appreciated, please.