CUDA PTX: advice wanted on making a library of sorts for GPU structures

I am working on a GPU-structures library for use in kernels, providing simple memory allocation, spin-locks, and more advanced data structures.

In a separate project I am working on a real-time simulator implemented completely in device code (to support a neural network), and it will require these data structures.

I know that memory allocation, hash tables, and spin-locks in device code are typically frowned upon, and that the usual kernel development approach is to serialize on the host, parallelize on the device, and shuffle data between the two to reach a solution. But my need is great and my data sets are large, so these objects have to stay and execute in device logic and memory.
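
To make concrete what I mean by device-side primitives, here is a minimal sketch in CUDA C. It is not the library's actual code: the names `lockAcquire`, `lockRelease`, `Pool`, `poolAlloc`, and `runDemo` are made up for illustration, and it assumes a GPU with 32-bit global-memory atomics. Only one thread per block contends for the lock, since several threads of the same warp spinning on one lock can hang on some hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical spin-lock word in global memory: 0 = free, 1 = held.
__device__ void lockAcquire(int* lock) {
    while (atomicCAS(lock, 0, 1) != 0) { /* spin until we swap 0 -> 1 */ }
    __threadfence();  // order our later reads after taking the lock
}

__device__ void lockRelease(int* lock) {
    __threadfence();      // publish our writes before freeing the lock
    atomicExch(lock, 0);
}

// Hypothetical bump allocator over a buffer preallocated by the host.
// It never frees; a failed request still advances the offset.
struct Pool {
    unsigned char* base;
    unsigned int   capacity;  // size of the buffer in bytes
    unsigned int   offset;    // next free byte, advanced atomically
};

__device__ void* poolAlloc(Pool* pool, unsigned int bytes) {
    bytes = (bytes + 7u) & ~7u;  // round up to 8-byte alignment
    unsigned int start = atomicAdd(&pool->offset, bytes);
    if (start > pool->capacity || bytes > pool->capacity - start) return NULL;
    return pool->base + start;
}

__global__ void demo(int* lock, Pool* pool, int* counter) {
    if (threadIdx.x != 0) return;  // one contender per block
    int* cell = (int*)poolAlloc(pool, sizeof(int));
    lockAcquire(lock);
    if (cell != NULL) {
        *cell = blockIdx.x;
        volatile int* c = counter;  // volatile: bypass any cached copy
        *c = *c + 1;                // plain read-modify-write under the lock
    }
    lockRelease(lock);
}

// Host helper: returns how many blocks allocated a cell and bumped the counter.
int runDemo(int blocks) {
    int *lock = NULL, *counter = NULL;
    Pool* pool = NULL;
    unsigned char* buf = NULL;
    int result = -1;
    cudaMalloc((void**)&lock, sizeof(int));
    cudaMalloc((void**)&counter, sizeof(int));
    cudaMalloc((void**)&pool, sizeof(Pool));
    cudaMalloc((void**)&buf, blocks * 8);
    cudaMemset(lock, 0, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    Pool h = { buf, (unsigned int)(blocks * 8), 0u };
    cudaMemcpy(pool, &h, sizeof(Pool), cudaMemcpyHostToDevice);
    demo<<<blocks, 32>>>(lock, pool, counter);
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(lock); cudaFree(counter); cudaFree(pool); cudaFree(buf);
    return result;
}

int main() {
    printf("blocks that incremented the counter: %d\n", runDemo(64));
    return 0;
}
```

The pool is sized so that every block gets a cell, so the counter should end up equal to the number of blocks. A real allocator would also need a free path, which is where the more advanced structures come in.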

From what I have learned so far, the only practical way of stitching this code together is in PTX with .func entities, because CUDA automatically inlines all function calls. There also seems to be no effective way of mixing .cu files with .func entities, or of combining multiple PTX files, since there is no preprocessor for PTX files. I may be way off here, so please correct me if I am wrong.
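
For comparison, the closest thing I know of on the CUDA C side is the `__noinline__` function qualifier. The sketch below is only an illustration under that assumption: `hashSlot`, `slots`, and `countMismatches` are hypothetical names, and `__noinline__` is just a hint, so whether the compiler emits a real `.func` call depends on the toolkit version and target architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper. __noinline__ asks the compiler not to expand this
// at each call site; the compiler is free to ignore the hint.
__host__ __device__ __noinline__ int hashSlot(int key, int tableSize) {
    // Multiplicative hash, reduced to a slot index in [0, tableSize).
    return (int)(((unsigned int)key * 2654435761u) % (unsigned int)tableSize);
}

__global__ void slots(const int* keys, int* out, int n, int tableSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = hashSlot(keys[i], tableSize);
}

// Host helper: runs the kernel and counts results that differ from the
// same function evaluated on the host.
int countMismatches(int n, int tableSize) {
    int* hKeys = new int[n];
    int* hOut = new int[n];
    for (int i = 0; i < n; ++i) { hKeys[i] = i * 7 + 3; hOut[i] = -1; }
    int *dKeys = NULL, *dOut = NULL;
    cudaMalloc((void**)&dKeys, n * sizeof(int));
    cudaMalloc((void**)&dOut, n * sizeof(int));
    cudaMemcpy(dKeys, hKeys, n * sizeof(int), cudaMemcpyHostToDevice);
    slots<<<(n + 127) / 128, 128>>>(dKeys, dOut, n, tableSize);
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hashSlot(hKeys[i], tableSize)) ++bad;
    cudaFree(dKeys); cudaFree(dOut);
    delete[] hKeys; delete[] hOut;
    return bad;
}

int main() {
    printf("mismatches: %d\n", countMismatches(1000, 257));
    return 0;
}
```

Compiling the file with `nvcc -ptx` and reading the generated PTX shows whether a `.func` was actually emitted for `hashSlot` or whether it was inlined anyway.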

Project lives here:
Google Code Archive - Long-term storage for Google Code Project Hosting.

It is currently a Visual Studio 2008 project. I had to clone the CUDA build.rules file and modify the copy, adding the .PTX extension, to get Visual Studio to compile my .PTX files. Is there a better way of doing this?

I also plan to eventually make a parallel implementation for the ATI line using CAL files. Sorry NVIDIA, yours is the more dominant platform, though.

Please provide any feedback.

Thank you,
Sky Morey