Using CUDA libraries with dynamic parallelism

Hey

I’d like to write code that performs multiple affine warps. A single warp operation would not fully utilize the card's capabilities, but many warps would.

I have been thinking about using the NPP warp affine code with dynamic parallelism.

I wanted to create N threads, where each thread would call the NPP warp affine function to perform its warp operation.

[ That would make my CUDA coding much simpler! ]

However, the NPP functions are defined for host usage only.

What is the optimal way to code such a problem, especially when a single warp would not fully utilize the card but many warps would? I’d also like to avoid writing the warp function myself.

Is there any future toolkit release in which the NPP functions will be accessible from both device and host?

Thanks in advance

S