How should CUDA functions be called?

Hi! I want to perform multiple operations on a single piece of data. In my C program I use different functions for the different operations. For CUDA, should I make each thread call other CUDA functions, or should I write a C program that calls a separate kernel for each operation that needs to be done on the data?

For example, say I have a data set D1, D2, D3, … Dn, and I want to perform operations Op1(D), Op2(D), Op3(D), … on it.

So in my C program I do this…

for i = 1 to n
    Op1(Di)
    Op2(Di)
    Op3(Di)

For CUDA, can I do this?

kernel for all i
    Op1(Di)
    Op2(Di)
    Op3(Di)

or do I have to do this:

kernelOp1(Di) for all i
wait for kernel to finish
kernelOp2(Di) for all i
wait for kernel to finish
kernelOp3(Di) for all i
wait for kernel to finish

And when I call a function from a kernel, do I declare that function like a kernel? (Kind of confused about the function qualifiers.)

Sorry for the disorganized post and I appreciate the help.
Matt

If I understood this correctly, the question is whether to put the for-loop inside a kernel or to wrap it around a kernel.

Basically you can do both. But you should be aware of the execution time limit on devices with a display attached.

As the Linux CUDA release notes point out, a watchdog terminates kernels that run for more than a few seconds on a GPU that is driving a display. So if your kernel is going to run longer than that, you should use the second option and split the work into multiple shorter kernel launches.
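A sketch of the two structures being discussed; the operations, names, and data layout here are made up for illustration, not from the thread:

```cuda
// Option 1: a single kernel applies all three operations to each element.
__global__ void allOpsKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d[i] = d[i] * 2.0f;   // stand-in for Op1
        d[i] = d[i] + 1.0f;   // stand-in for Op2
        d[i] = d[i] * d[i];   // stand-in for Op3
    }
}

// Option 2: one short kernel per operation, launched in sequence.
__global__ void op1Kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}
// op2Kernel and op3Kernel would follow the same pattern. On the host,
// kernels launched on the same stream execute in order, so each one
// finishes before the next starts; an explicit wait is only needed
// before reading the results back:
//
//   op1Kernel<<<blocks, threads>>>(d_data, n);
//   op2Kernel<<<blocks, threads>>>(d_data, n);
//   op3Kernel<<<blocks, threads>>>(d_data, n);
//   cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
```

Each launch in option 2 is short, which keeps you under the watchdog limit.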

Functions that you call from a kernel must be device functions, and you call them like normal functions from within the kernel (see Appendix B of the NVIDIA CUDA Programming Guide).
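A minimal sketch of the qualifiers involved (the function names are made up):

```cuda
// A device function: callable only from device code, and invoked with
// plain C function-call syntax inside the kernel.
__device__ float scale(float x, float s)
{
    return x * s;
}

// A kernel: callable from the host with the <<<...>>> launch syntax.
__global__ void scaleAll(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = scale(d[i], 2.0f);  // ordinary call, no launch syntax
}
```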

Lightenix

This is what I was trying to figure out. So I can write functions to call from my kernel, even though each of them may only operate on a single piece of data per invocation, whereas the kernel can be thought of as operating on many pieces of data at once.

However, be careful: “for” loops can be costly, because each iteration has to evaluate the loop condition (n < 5, say) and wait for the outcome of that branch.

Doesn’t the compiler automatically unroll for loops?

Matt

Sometimes it does. See Section 6.2, Branch Predication, in the Best Practices Guide.

Lightenix

Generally yes, but we’ve been surprised in the past by cases where we assumed a loop would be unrolled and it wasn’t. You can force an unroll with “#pragma unroll” just before the loop to be sure it happens.
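The pragma goes immediately before the loop it applies to. A small sketch (the kernel and data are invented for illustration):

```cuda
__global__ void smallLoopKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // The trip count is a compile-time constant, so the compiler can
    // replace the loop with four straight-line iterations, removing
    // the per-iteration condition check.
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        acc += d[i] * (float)(k + 1);
    d[i] = acc;
}
```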

Is it possible to call a kernel from inside a kernel, or can you only call other functions? In my program I multiply a LOT of 4x4 matrices, and it wouldn’t make sense to write a kernel that parallelizes the matrix multiplication for each one but have the algorithm call it from the host: copying data to the card, running the 4x4 matrix multiply, then copying back would be slower than running it on the CPU. So instead, I want to move my whole algorithm onto the card.

Using a device function to do the matrix multiply would probably be okay, but I don’t think it would maximize parallelism. The biggest part of the parallelism is running a particular algorithm on multiple sets of data at once, but I was thinking maybe I can take that further by also parallelizing subroutines of the algorithm.
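One common compromise for this situation is to keep the per-matrix multiply as a device function and let each thread handle one 4x4 product, so the parallelism comes from the number of matrices rather than from inside each multiply. A sketch under those assumptions (all names hypothetical, matrices stored row-major as 16 consecutive floats):

```cuda
// One thread multiplies one pair of 4x4 matrices: c = a * b.
__device__ void mat4Mul(const float *a, const float *b, float *c)
{
    for (int r = 0; r < 4; ++r)
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[r * 4 + k] * b[k * 4 + col];
            c[r * 4 + col] = sum;
        }
}

// The kernel parallelizes over the number of matrix pairs, not over
// the elements of a single multiply.
__global__ void mulManyMat4(const float *a, const float *b,
                            float *c, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        mat4Mul(a + 16 * i, b + 16 * i, c + 16 * i);
}
```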

Matt