‘No inlining’ doesn’t mean that recursive functions are supported. There’s no stack, and without one you can’t do recursion (or any of the other things that the original poster wanted).
The compiler depends on having a call graph that can be predicted entirely in advance, so that locals can be assigned to registers. Off-chip memory is too slow, and on-chip memory too small, to hold a traditional ‘stack per thread’ (even a small one).
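To make the register point concrete, here is a rough sketch of the usual workaround: rewrite the recursion as a loop whose locals are fixed in number. It is plain C++ so it can be compiled on the host. The names `pow_recursive` and `pow_iterative` are made up for illustration, and I am assuming that in a real kernel the iterative version would simply be marked `__device__`.

```cpp
#include <cstdint>

// Recursive form: the call depth grows with the exponent, so every
// level needs its own frame. This is the shape that needs a stack.
std::uint64_t pow_recursive(std::uint64_t base, unsigned exp) {
    if (exp == 0) return 1;
    std::uint64_t half = pow_recursive(base, exp / 2);
    return (exp % 2) ? half * half * base : half * half;
}

// Iterative form: exactly three locals (base, exp, result) no matter
// how large exp is, so a compiler can keep everything in registers.
std::uint64_t pow_iterative(std::uint64_t base, unsigned exp) {
    std::uint64_t result = 1;
    while (exp > 0) {
        if (exp & 1u) result *= base;  // fold in the current bit of exp
        base *= base;                  // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

Both compute the same value; only the second has a call graph and a set of locals that are known completely at compile time. Not every recursion flattens this neatly, so treat it as an illustration rather than a general recipe.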
seibert is otherwise completely correct. It’s better to learn data-parallel programming from scratch than to imagine that you can just port your existing C code across. Those of us who came to CUDA via GPGPU on GLSL/HLSL have very different impressions about CUDA’s ‘ease of use’ and generality, incidentally.
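As a small illustration of what ‘from scratch’ means in practice, here is a sketch contrasting a serial sum with a tree-shaped reduction. It is host-side C++ that only simulates the parallel steps with an inner loop; `sum_serial` and `sum_tree` are hypothetical names, and I am assuming that on a GPU each iteration of the inner loop would be handled by its own thread, with a synchronisation between passes.

```cpp
#include <cstddef>
#include <vector>

// Serial formulation: each iteration depends on the previous one
// through acc, so there is nothing to hand out to parallel threads.
int sum_serial(const std::vector<int>& v) {
    int acc = 0;
    for (int x : v) acc += x;
    return acc;
}

// Data-parallel formulation, simulated on the host. Within one pass,
// every index i reads v[i + half] and writes v[i]; the read and write
// ranges do not overlap, so the inner-loop iterations are independent.
int sum_tree(std::vector<int> v) {
    if (v.empty()) return 0;
    std::size_t n = v.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;  // round up so odd sizes work
        for (std::size_t i = 0; i + half < n; ++i) {
            v[i] += v[i + half];
        }
        n = half;                        // the live range shrinks each pass
    }
    return v[0];
}
```

The serial loop is what you would port straight from C. The tree version does the same additions but arranges them so that each pass is a batch of independent operations, which is the kind of restructuring that data-parallel hardware rewards.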