CUDA Mergesort algorithm idea.

I am currently working on a merge sort algorithm in CUDA. I have some doubts about how to develop the logic to do it in parallel. I have already implemented the CPU version.

My input array size can range from 1 to 2^20. In CUDA, how can I take parts of a large array, sort them, and then merge the sorted parts in progressively larger steps?

For example, if my array has 32 elements, I would first use 8 threads to sort 4 elements each, and then grow the slice size to 8, 16, and 32. How can I do this effectively in CUDA?
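For reference, the doubling scheme described above is usually implemented bottom-up: a kernel merges pairs of adjacent sorted runs of width `w` into runs of width `2w`, and a host loop doubles `w` each pass. Here is a minimal sketch (names like `mergePassKernel` are illustrative, not from this thread; it starts at width 1 rather than pre-sorting 4-element chunks, but the pattern is the same):

```cuda
// One merge pass: each thread merges one pair of adjacent sorted runs
// of length `width` (in `src`) into a single run of length 2*width (in `dst`).
__global__ void mergePassKernel(const int *src, int *dst, int n, int width)
{
    int pair = blockIdx.x * blockDim.x + threadIdx.x;  // one run-pair per thread
    int lo = pair * 2 * width;
    if (lo >= n) return;
    int mid = min(lo + width, n);
    int hi  = min(lo + 2 * width, n);

    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

// Host side: double the run width each pass, ping-ponging between buffers.
void mergeSortGPU(int *d_a, int *d_b, int n)
{
    int *src = d_a, *dst = d_b;
    for (int width = 1; width < n; width *= 2) {
        int pairs   = (n + 2 * width - 1) / (2 * width);
        int threads = 256;
        int blocks  = (pairs + threads - 1) / threads;
        mergePassKernel<<<blocks, threads>>>(src, dst, n, width);
        int *tmp = src; src = dst; dst = tmp;  // swap roles for the next pass
    }
    // After the loop, `src` holds the sorted data; copy back if needed.
    if (src != d_a)
        cudaMemcpy(d_a, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
}
```

One thread per run-pair is simple but leaves threads idle in the last passes; production implementations split large merges across many threads (e.g. with merge-path partitioning), which is one direction to look once this version works.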

That’s pretty much everyone’s problem when porting CPU code to the GPU, and it can take weeks to figure out one possible solution. My suggestion would be to first check whether this problem has already been solved, and then search and read as many similar problems as possible and borrow the basic concepts that parallel implementations often share.

Personally I don’t think people will just drop some code here unless they already have it.

Hello,

Yes, I have solved that problem and am working to finalize it with the help of some available resources, but now there is a problem in the code: how do I call a CUDA `__global__` function from another CUDA function?

It gives me an error saying this is only allowed on compute architecture compute_35 or above. So I tried changing the command-line option to --gpu-architecture=compute_35 or -arch=compute_35, but I still get the same error.

I have an NVIDIA MX150 graphics card, and some online posts say it has compute capability 6.1 (compute_61), but it is not working.

Maybe someone has an idea of how to deal with this problem?

-arch=sm_35 -rdc=true -lcudadevrt
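Putting those flags into a full build command would look something like this (the file name `mergesort.cu` is illustrative; `sm_61` matches the MX150's compute capability, while `sm_35` is the minimum that supports launching kernels from device code):

```shell
# -rdc=true enables relocatable device code, required for device-side
# kernel launches; -lcudadevrt links the device runtime library.
nvcc -arch=sm_61 -rdc=true mergesort.cu -o mergesort -lcudadevrt
```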

I solved the problem, but I think it was something else. I changed the flags as you suggested but still got an error.

So I changed the execution space specifier from `__global__` to `__device__`, and then called the `__device__` function from the `__global__` function; it compiles without errors.
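For reference, that pattern works on any compute capability and needs no special compiler flags. A minimal sketch (the function names are illustrative):

```cuda
// A __device__ function is an ordinary function callable from device
// code (kernels or other __device__ functions), not launchable from the host.
__device__ int clampToRange(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clampKernel(int *data, int n, int lo, int hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clampToRange(data[i], lo, hi);  // plain function call, no <<<...>>>
}
```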

I think there is no way to call a `__global__` function from a `__global__` function. Is there?

Yes, you can call a `__global__` function from a `__global__` function.

It is called CUDA Dynamic Parallelism, and there is a whole section of the programming guide that explains it.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

If you call a `__global__` function from device code, the launch must be properly configured with `<<<…>>>`, just as you would when calling it from host code.
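A minimal sketch of such a device-side launch (kernel names are illustrative; this requires compute capability 3.5+ and compilation with `-rdc=true -lcudadevrt`, as noted above):

```cuda
// Child kernel: launched from the parent kernel rather than from the host.
__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent kernel: a __global__ function launching another __global__
// function, with a full <<<...>>> execution configuration.
__global__ void parentKernel(int *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
    }
}
```

Note that the child grid is launched asynchronously; the programming guide linked above describes how parent and child grids are synchronized and what memory-consistency guarantees apply between them.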