CUDA : Bad performance when allocating heap inside kernel & RTree

I have two problems which make me so confuse and try hard these days

  1. As title, I ran a performance test with my NVDIA Tesla M2090. The only thing I did was create new object insider kernel, nothing more. And the result I got was terrible which make me so surprise. Here’s my code:
template <class T> 
class TArray
{
protected:
    int m_Size;
    int m_Capacity;
    int m_GrowBy;
    T* m_pData;
public:
    template <class T> 
    __host__ __device__ TArray<T>::TArray(int growby)
    {
        m_Size = 0;             // empty to start
        m_Capacity = 0;         // remember the size
        m_GrowBy = growby;
        m_pData = NULL;
    }
}

struct Rect2D
{
    real64 left;
    real64 top;
    real64 right;
    real64 bottom;
};

class GERect2D : public Rect2D  
{
public:
    __host__ __device__ GERect2D() {
        left   = 0;
        top    = 0;
        right  = 0;
        bottom = 0;
    }
}

class GELineString
{
protected:
    GERect2D* m_pMBR;
    TArray<Point2D>* m_parrPoint;
    bool m_bContain;
    bool m_bRTree;
    TRTree<int32, real64, 2> *m_pRT;
public:
    __host__ __device__ GELineString(int32 growby = 100) {
        m_parrPoint = new TArray<Point2D>(growby);
        m_pMBR = new GERect2D();
        m_bContain = false;
        m_bRTree = false;
        m_pRT = NULL;
    }
}

//My kernel
__global__ void kernel_BuildTree() {
    int tid = threadIdx.x + blockIdx.x* blockDim.x;
    GELineString *pLs = NULL;

    for (int i = 0; i < 1000; i++)
    {
        pLs = new GELineString(17);
        delete pLs;
    }
}

void main() {
    kernel_BuildTree<<<1,128>>>();
    cudaDeviceSynchronize();
}

The result which I never think about when I ran it is about 18 seconds, while it’s only not over 1 milisecond when I ran in CPU. Moreover, when I increased the number in for loop to 10,000 (just 10 times from 1000), the time I got was not 180 seconds (10 times from 18 seconds) as I was expected, it was about 10 minutes

I’ve read a lot about the warp divergence, occupancy, memory hierarchy of GPU but still can’t explain it, so confuse. Maybe there is a limitation when using heap in GPU?

  1. I want to port the RTree structure to GPU. I’ve read a lot of papers about RTree implemented on GPU but seem like all of it based on array, not pointer. So maybe pointer is not well-suited for GPU? Do you know any library about tree structure based on pointers on GPU?

Hope to see your help!

Any idea?

your GPU can execute 24K threads simultaneously. now imagine all these threads trying to simultaneously increment the single atomic variable controlling the memory allocation

the standard approach is to alloc from heap in large blocks and use sub-allocators shared by a few dozens or hundreds of threads. googling for “cuda memory allocator” gives us halloc, f.e.