how to create a node dynamically in kernel

Hi,

If I wanted to implement say a b-tree where the size of the b-tree node is unpredictable and not constrained by any artificial limit, how can dynamically create a node (branch)? In CPU, we can create a node/branch using malloc but I suppose this functionality is not available for code running in device. So how to accomplish this?

I think that would make it pretty hard. If constant size allocations suffice, keep reading

I don’t know much about memory allocation structures, but from an algorithm design viewpoint, you want to implement a structure that allows efficient add(free) and remove(malloc) on the set. One thing to ask is how frequent mallocs will be compared to free. If mallocs dominate, the easiest think I can think of is to allocate 1 large array and a counter that points to the next free slot, that each thread atomically increments. Frees will take O[n] time to complete by scanning the entire array for a slot.

If mallocs and frees happen with same frequency, maybe you can build a static, complete tree, with the leaves pointing to pre allocated storage. For each vertex’s children, have a bit that indicates

if there’s a path from this vertex to an unoccupied leaf. Allocation will just follow edges set to 1 and change the edge to 0 when no more free space can be reached from that node. Free would work similarly.

malloc cost: Log[n]

free cost: Log[n]

Probably not practical, but something I’d say in an interview.

I have the following structure:

template <int size> 

class MemoryTracker {

	private:

		int fieldEmpty;

		int fieldPtr;

		int freeField;

		int highestNode;

	public:

		__device__ inline void init() {

			if (threadIdx.x<32) {

				freeField=0;

				highestNode=0;

			}

			for (int tid=threadIdx.x;tid<size;tid+=blockDim.x)

				fieldPtr[tid]=tid;

		}

		__device__ inline void reset();

		__device__ inline bool isFree(int idx){return idx>=highestNode || fieldEmpty[idx]==1;}

		__device__ inline int getSize() {return highestNode;}

		__device__ inline int add() {

			int idx=atomicAdd(&freeField,1);

			atomicMax(&highestNode,idx+1);

			idx=fieldPtr[idx];

			fieldEmpty[idx]=0;

			return idx;

		}

		__device__ int add_block(int amount) {

			__shared__ int bidx;

			if (threadIdx.x==0) {

				bidx=atomicAdd(&freeField,amount);

				atomicMax(&highestNode,bidx+amount);

			}

			__syncthreads();

			int idx=-1;

			if (threadIdx.x<amount) {

				idx=fieldPtr[bidx+threadIdx.x];

				fieldEmpty[idx]=0;

			}

			return idx;

		}

		__device__ inline void remove(int idx) {

			fieldEmpty[idx]=1;

		}

		__device__ inline void compact(int *sharr) {

			//TODO: you might want to implement something without atomic operations

			if (threadIdx.x<32)

				freeField=highestNode;

			__syncthreads();

			for (int i=threadIdx.x;i<highestNode;i+=blockDim.x) {

				if (fieldEmpty[i]==1) {

					int addr=atomicSub(&freeField,1);

					fieldPtr[addr]=i;

				}

			}

		}

};

template <typename T, int size>

class TrackedMemoryManager : public MemoryTracker<size> {

  private:

	T data;

	public:

		__device__ inline T &operator[](int idx){return data[idx];};

};

It is not very optimised, and it was not fully tested, so use it with care! The components are:

fieldEmpty array - just stores information is cell x is empty or not

fieldPtr array - for x>=freeField it holds an unique pointer to an empty cell

freeField - holds an index to the first pointer in fieldPtr which points to empty cells

highestNode - holds an index to a possibly highest nonempty cell (so that, for example, you can launch a kernel over all items stored)

void init() - you need to call it once in your program

int add() - allocates a single element. (one element per thread)

int add_block(int amount) - allocates amount elements. Whole block must call this one.

void remove(int idx) - marks given cell that it is now empty.

void compact(int *sharr) - call it with a separate kernel. It will compact freePtr array and reduce freeField by the elements which were deallocated, so that they can be reused in subsequent calls to add and add_block. The function as it is now is extremely slow. I just created it just to be working, but should be replaced by some reduction algorithm.

Use MemoryTracker if the data is stored in a separate array and you just need to know which indexes to acces.

Use TrackedMemoryManager if you do not allocate data storage yourself.

I see, you present the illusion of having a contiguous block of free space by remapping fragmented space using fieldPtr. I suppose you don’t care about the amortized cost of delete because they’re so few.

I am remapping the free space, not the one being allocated, so that it can be reused. I do not want to move allocated memory, because there may be some pointers/indexes beyond the range of this class, which it cannot control.

Deletion is very simple O(1) operation. It is this compact which is currently very slow, but you don’t have to launch it upon each deletion! You could launch that when you predict a big amout of memory can be saved that way (e.g. r=n/5) (hence, not that few deletions before).
I am sure one could simply apply a O(log(n)) reduction algorithm to do exactly the same work without using atomics.
Total time for d deletions (d<n) would be then dO(1) + (d/r) * O(log(n)) = O(d) + 5(d/n) * O(log(n)) = O(d), so for amortised time for one deletion is still O(1), no matter how many deletions you make.

ignore

I doubted this at 1st, but it seems your analysis is solid. The reason I doubted deletion can be done in O[1] amortized time is because your malloc is O[1] and that seems like having the infeasible data structure that allows O[1] adds and O[1] delete min (hypothetically allows sorting in O[n] time).

Also, I always had the impression C’s malloc/free aren’t both O[1], but since we’re talking about constant sized allocations, it might be easier.

But if you delay compaction so long by making r proportional to n, you will run out of space quickly, unless you allocate more space than you intend to actually use. So you’re making a space, time trade-off.

Reduction has O[n] work, not O[log[N]]. Having #processors = #elements isn’t practical.

Maybe, but my structure does not guarantee that add_block() will return a consecutive portion of memory. malloc, realloc, free are much harder to implement.

I stand corrected. Fortunately it won’t change my result :)

Of course. Memory is “cheaper” than time.

Usually, in dynamic problems I don’t know exactly how much n will be anyway I have to preallocate a bit more than some guessed value. In worst case scenario, I can issue the compaction when memory is full.

If I know n from somewhere, I need only n+r cells to hold those n elements, I believe this is a reasonable trade.