You have at least three options:
- Compile for device emulation (if your toolkit version still supports it) and then run the emulated program under valgrind.
- Use cuda-gdb, the CUDA port of GDB.
- Run on Ocelot ( http://code.google.com/p/gpuocelot/ ), a GPU emulator that detects memory errors much like valgrind does. (I helped write this, so my opinion here is biased.)
You might also want to consider using a wrapper library like Thrust ( http://code.google.com/p/thrust/ ) for allocating memory (make everything a thrust::device_vector). Thrust lets you access individual elements of vectors stored in device memory directly from host code, which should allow you to inspect your data structures after each update.
And yes, it is possible to pass tree structures to the device, though it helps to lay them out in one sequential memory block so that you can copy them to the device easily. There are some typical layout strategies for trees. For example, you can store the data as a sorted vector and then use a recursive binary search to determine indices into the array. This ensures that all of the data is stored contiguously, and as long as you don't modify the data structure very often, it has very good spatial locality.
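
As a rough host-side sketch of the sorted-vector layout described above (the `Node` type and the names `build_flat_tree` and `find_index` are made up for illustration, not taken from any library): the nodes live by value in one contiguous array, and a recursive binary search turns a key into an index. Because `Node` is trivially copyable, the whole array could presumably be copied to the device with a single cudaMemcpy, and the search function could be marked `__device__` when compiled with nvcc.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical node: a key plus a payload, stored by value (no pointers),
// so the whole "tree" is one contiguous block.
struct Node {
    int key;
    float value;
};

// No pointers or virtual members, so a raw byte copy of the array is safe.
static_assert(std::is_trivially_copyable<Node>::value,
              "Node must be copyable as raw bytes");

// Build the flat tree by sorting the nodes by key.
std::vector<Node> build_flat_tree(std::vector<Node> nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const Node& a, const Node& b) { return a.key < b.key; });
    return nodes;
}

// Recursive binary search over the half-open range [lo, hi).
// Returns the index of the node with the given key, or -1 if absent.
long find_index(const Node* data, long lo, long hi, int key) {
    assert(lo >= 0);
    if (lo >= hi) return -1;                 // empty range: not found
    long mid = lo + (hi - lo) / 2;           // the "root" of this subtree
    if (data[mid].key == key) return mid;
    if (data[mid].key < key)
        return find_index(data, mid + 1, hi, key);  // right subtree
    return find_index(data, lo, mid, key);          // left subtree
}
```

The midpoint of each range plays the role of a subtree root, so no child pointers are needed; the trade-off is that inserting or deleting a node means re-sorting or shifting the array, which is why this layout suits trees that change rarely.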