Does anyone know if, in the near future, CUDA will support device memory allocation handled by the GPU kernel itself? As far as I know, the GPU is already able to create new triangles under DirectX 10 (is this correct?).
Sorry if this question has been asked before; I couldn’t find any thread answering it :-)
I am trying to implement arbitrary-precision integer arithmetic in CUDA. If the GPU kernel were able to allocate its own device memory, that would simplify many of the algorithms I’m using, such as division and multiplication of bigints. The algorithms are based on Knuth’s “The Art of Computer Programming” and sometimes need to allocate or reallocate memory.
At the moment, I’m precalculating all the needed data on the CPU and transferring it to the GPU along with the bigints, which creates a transfer overhead I would like to avoid.
I take your reply as an indirect answer to my question, i.e. that there is no such thing in development, right? :)
The only time one could ever require dynamic malloc on the GPU is in the case of an unpredictable kernel - one whose result size depends on the internal GPU clock or on some external ‘random’ property that the CPU can’t easily predict (to the point where it would be faster just to run the whole computation on the CPU).
I created my own on-device memory manager, mostly to handle partial work results and dynamically generated subtasks.
It’s not too hard; the basic trick is to use global atomics to hand out pieces of a pre-allocated large block of memory as needed.
The problem with this is that it uses atomics, which are SLOW, so you can try to have suballocators at the block level: a block “grabs” a large chunk and its threads ask their own block for pieces of that chunk.
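To make the trick concrete, here is a minimal sketch of such a bump allocator, assuming a single pre-allocated pool. The names (g_pool, g_pool_offset, pool_alloc) and the 1 MB pool size are placeholders for illustration, not my actual code:

```
// Minimal sketch of a "global atomics bump allocator": one big
// pre-allocated buffer plus a global offset that atomicAdd pushes forward.
#define POOL_BYTES (1u << 20)                          // 1 MB pool (placeholder size)

__device__ unsigned long long g_pool[POOL_BYTES / 8];  // 8-byte-aligned backing store
__device__ unsigned int       g_pool_offset = 0;       // next free byte in the pool

// Hand out nbytes from the pool; returns nullptr once the pool is exhausted.
// There is no free(): the host resets g_pool_offset to 0 between launches
// (e.g. via cudaMemcpyToSymbol) to recycle the whole pool at once.
__device__ void* pool_alloc(unsigned int nbytes)
{
    nbytes = (nbytes + 7u) & ~7u;                      // keep returned pointers 8-byte aligned
    unsigned int old = atomicAdd(&g_pool_offset, nbytes);
    if (old + nbytes > POOL_BYTES)
        return nullptr;                                // out of pool space
    return (void*)((unsigned char*)g_pool + old);
}

__global__ void demo_kernel(int* n_ok)
{
    // Every thread grabs 16 bytes of scratch space from the shared pool.
    unsigned int* scratch = (unsigned int*)pool_alloc(16);
    if (scratch) {
        scratch[0] = threadIdx.x;
        atomicAdd(n_ok, 1);
    }
}
```

The block-level suballocator is the same idea one level down: one thread per block calls pool_alloc() once for a big chunk, publishes the pointer in __shared__ memory, and the rest of the block carves pieces out of it with a shared-memory counter instead of hammering the global one.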
Another type of memory allocator could use a linked list of chunks, allowing free as well as alloc.
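As a sketch of what alloc plus free could look like: rather than a true linked list, the version below just keeps a fixed pool of equal-sized chunks with a per-chunk “in use” flag claimed via atomicCAS. It’s less flexible, but it sidesteps the ABA problems a lock-free list has to deal with. All the names and sizes here are invented for illustration:

```
// Fixed pool of equal-sized chunks with per-chunk "in use" flags.
#define NUM_CHUNKS  4096
#define CHUNK_BYTES 256

__device__ unsigned long long g_chunks[NUM_CHUNKS * CHUNK_BYTES / 8]; // chunk storage
__device__ int                g_chunk_used[NUM_CHUNKS] = {0};         // 0 = free, 1 = in use

// Claim the first free chunk; returns nullptr if every chunk is taken.
// Starting the scan at a hash of the thread index instead of 0 would
// reduce contention on the low-numbered chunks.
__device__ void* chunk_alloc()
{
    for (int i = 0; i < NUM_CHUNKS; ++i) {
        // atomicCAS returns the old value, so 0 means this thread won the chunk.
        if (atomicCAS(&g_chunk_used[i], 0, 1) == 0)
            return (void*)((unsigned char*)g_chunks + i * CHUNK_BYTES);
    }
    return nullptr;
}

// Return a chunk to the pool so another thread can claim it later.
__device__ void chunk_free(void* p)
{
    int i = (int)(((unsigned char*)p - (unsigned char*)g_chunks) / CHUNK_BYTES);
    atomicExch(&g_chunk_used[i], 0);
}
```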
So it is possible… but it’s likely not too useful. It’s a lot of annoying bookkeeping, and atomics are not fast. I am looking for better ways to handle my sub-task problem without using the system I created, even though it’s working.
You can try allocating an extra buffer up front, say the maximum you would reasonably use, and then just use pieces of it. I’m not sure if this method is suited to your specific application, and most likely you won’t get the luxury of coalescing, but at least you have a place to store any extra precision you need.
Let’s say you use 32-bit ints as your starting point. You could allocate four 32-bit ints per variable and use a few bits of the first int as a flag indicating how many of the four you actually used.
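If it helps, here is one way that layout could look. The names are just for illustration; it assumes four 32-bit limbs per number, with the top two bits of the first limb storing how many limbs are in use:

```
#include <cstdint>

// Fixed-size bigint: four 32-bit limbs per variable. The top two bits of
// limb[0] store (used limb count - 1), so limb[0] only carries 30 value bits.
struct FixedBigInt {
    uint32_t limb[4];                 // limb[0] = least-significant word
};

#define COUNT_SHIFT 30u
#define COUNT_MASK  (3u << COUNT_SHIFT)
#define VALUE_MASK  (~COUNT_MASK)

// How many of the four limbs are actually in use (1..4).
__host__ __device__ inline uint32_t fb_count(const FixedBigInt& x)
{
    return ((x.limb[0] & COUNT_MASK) >> COUNT_SHIFT) + 1u;
}

__host__ __device__ inline void fb_set_count(FixedBigInt& x, uint32_t n)
{
    x.limb[0] = (x.limb[0] & VALUE_MASK) | ((n - 1u) << COUNT_SHIFT);
}

// Pack a small (30-bit) value into the fixed layout, marking one limb used.
__host__ __device__ inline FixedBigInt fb_from_u32(uint32_t v)
{
    FixedBigInt x = { { v & VALUE_MASK, 0u, 0u, 0u } };
    fb_set_count(x, 1u);
    return x;
}
```

One thing to keep in mind: stealing bits from limb[0] means the first limb only holds 30 value bits, which complicates carry propagation; keeping the count in a separate word or byte costs a little more memory but keeps the arithmetic simpler.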
An implementation of this sort, while a memory hog, should be faster than dynamic on-GPU allocation.