CUDA quadtree!!!!

Hello All,

I’m trying to build a quadtree in CUDA and I’m having some trouble with it.

I’ve learned that cudaMalloc() is a host function so I can’t put allocating new children on the GPU so the only thing I really can do is thread the point location. Granted, this isn’t my actual project but it’s exactly similar, to explain my motivations.

The only issue is, I don’t know how to make recursive calls. I want a parent thread to send more threads to search each child (assuming it’s valid, i.e. point is contained somewhere) but I’m starting to think my GPU isn’t advanced enough (GTX 460).

My compile line is : nvcc -arch sm_35 -o qtree qtree.cu

And my code is (also note, I do expect undefined behaviour but this won’t even compile):

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

struct node {

   unsigned int level;

   struct node *parent;
   struct node *children[4];
};

__global__ void memoryCheck(struct node *root) {

   if (root) {

      root->level = 123;
   }

   if (root->children[0]) memoryCheck<<<1, 1>>>(root->children[0]);

   return;
}

int main(void) {

   struct node *root;

   if (cudaMalloc((void**) &root, sizeof(*root))) printf("Allocation failed\n");
   if (cudaMalloc((void**) &root->children[0], sizeof(*root))) printf("Allocation failed\n");

   memoryCheck<<<1, 1>>>(root);

   struct node *host_root = (struct node*) malloc(sizeof(*host_root));
   host_root->level = 0;

   cudaMemcpy(host_root, root, sizeof(*root), cudaMemcpyDeviceToHost);

   printf("%d\n", host_root->level);

   cudaFree(root->children[0]);
   cudaFree(root);
   free(host_root);

   cudaDeviceReset();

   return 0;
}

But I get this from the command line :

ptxas fatal : Unresolved extern function ‘cudaGetParameterBuffer’

Does anyone have any examples of quadtrees and CUDA? I’ve learned that I must build the tree on the host and that I must search it on the device but that search hurdle, man.

The compute capability of a 460 is 2.x, but you are using 3.5, which is for the new Kepler (Titan, K20) GPUs.

Also keep in mind calling kernels from kernels is generally not a good idea unless you have a newer GPU with DP.

Start with a simpler typical CUDA project, like vector add and the get into scans and reductions.

A quad tree is a more advanced project.

I think a quadtree is really just need to be expressed as a pointer set. I have one pointer that points to an array of pointers, each one of those pointing to a unique set of children.

Basically, the 0th element points to the root, the 1st points to the first 4 kids, the 2nd points to the 16 children and so on and so forth.

I was hoping I could just use recursive traversing but I can’t.

I guess I’ll try doing it by arrays.

And while I’m tinkering, when I write C code I like to make sure I’m doing it in the best way possible so I always compile code with the -Wall, -Wextra flags.

Are there equivalents for those with nvcc? I’ve tried looking it up and I got info about a bunch of other options that don’t seem to be relevant to my inquiries.

There is a quad tree sample in your CUDA 5.x SDK, so why not look at that?

It requires a compute capability of 3.5, but the general idea should be the same.

http://docs.nvidia.com/cuda/cuda-samples/index.html#advanced

scroll down it is need the bottom of that section.

Also keep in mind that parallel computing has significant differences to serial computing, and you need to understand these differences before you start to adapt tree/graph algorithms and data structures.

You need to learn to walk before you can run.

Again understand concepts like scans, reductions…

Okay, so I’ve been trying to learn how to write a 2D array in CUDA and use it with structures. So what the code I have is, it takes a pointer to a set of pointers to respective structures and there’s some kernel that changes a parameter in each struct.

But I’m having trouble with the free’ing procedure. Here’s what I have below :

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

struct node {

   unsigned int level;

   struct node *parent;
   struct node *children[4];
};

__global__ void memoryCheck(void **root) {

   int x = threadIdx.x;

   struct node *tmp = (struct node*) &root[x];
   tmp->level = 123 + x;

   return;
}

int main(void) {

/* Host version of struct set */

   void **root = (void**) malloc(4*sizeof(*root));

   for (int i=0; i<4; i++)
      root[i] = malloc(sizeof(struct node));

/* Device version of struct set */

   void **dev_root;

   cudaMalloc((void**) &dev_root, 4*sizeof(void*));

   for (int i=0; i<4; i++)
      cudaMalloc(&root[i], sizeof(struct node));

/* We copy the host set to the device set, launch the kernel and
   then re-write the device set into the host set */

   cudaMemcpy(dev_root, root, 4*sizeof(*dev_root), cudaMemcpyHostToDevice);

   memoryCheck<<<1, 4>>>(dev_root);

   cudaMemcpy(root, dev_root, 4*sizeof(*root), cudaMemcpyDeviceToHost);

   for (int i=0; i<4; i++) {

      struct node *tmp = (struct node*) &root[i];
      printf("%d\n", tmp->level);
   }

/* Attempts to free all allocated memory */

   //for (int i=0; i<4; i++)
      //cudaFree(root[i]);

   free(root);   
   cudaFree(dev_root);

   return 0;
}