What exactly runs where?

I’m still trying to figure out a few things and require a little assistance for some clarification.

When you look at .cu files, like the ones in the SDK, are those programs running on the GPU? Or partly on the CPU, with the kernel part transferred to the GPU?

If it's completely GPU, how would I go about making a C program, forwarding some math with loops to the GPU, and getting the results back into the C program for final touch-ups, etc.?

My card has 256 MB of RAM (8600 GTS). If I want a lot of global memory shared between all threads, will it use that 256 MB? If it needs more, will it go to system RAM? [It will only be used read-only, to know where 3D objects are located.]

If all the code in the .cu file runs on the GPU, is there a time-to-live feature? I.e., can a thread only run for a maximum of X time?

If running CPU and GPU code at the same time is possible, is there a way for me to tell the threads to pass back what they have now and quit? (So I can stop a calculation at its current accuracy.)

If it matters, I am doing this under linux, but would like it to work with windows as well.

Thank you in advance.

Not quite. Only routines declared with the __global__ or __device__ qualifiers run on the GPU; everything else is executed on the host (the CPU). Please read the CUDA Programming Guide carefully; this is covered there.
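To make the host/device split concrete, here is a minimal sketch (error checking omitted for brevity): a __global__ kernel squares an array on the GPU, while main() runs on the CPU, copies data over, launches the kernel, and copies the results back for final touch-ups.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Runs on the GPU: each thread squares one element.
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i];
}

// Everything below runs on the CPU (the host).
int main(void)
{
    const int n = 256;
    float host[256], *dev;

    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    cudaMalloc((void **)&dev, n * sizeof(float));                    // allocate GPU global memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    square<<<(n + 127) / 128, 128>>>(dev, n);                        // launch kernel on the GPU

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[3] = %f\n", host[3]);                               // final touch-ups on the CPU
    return 0;
}
```

Compile with nvcc; the compiler splits the file itself, sending the __global__ function to the GPU and the rest through the host compiler.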

It is up to you: if you allocate the memory once and pass pointers that refer to it, that memory is shared between all the threads within one kernel. If you need more device RAM than is available on the device, cudaMalloc will simply fail to allocate it; it will not spill over into system RAM.
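Since cudaMalloc fails rather than falling back to system RAM, it is worth checking its return value explicitly. A small sketch:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    float *buf;
    size_t bytes = 200u * 1024 * 1024;   // e.g. request 200 MB of device global memory

    cudaError_t err = cudaMalloc((void **)&buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        // fall back: process the data in smaller chunks, or keep it in host RAM
        return 1;
    }

    // ... launch kernels that share buf across all threads ...

    cudaFree(buf);
    return 0;
}
```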

On Windows, as far as I know, the maximum kernel execution time on the device is about 5 seconds (the display watchdog timer). On GNU/Linux the same limit applies if and only if an X server is running on the card; otherwise, the 5-second limit does not apply.

Yes, there is. The actual implementation, though, depends on the problem at hand.
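One common pattern (a sketch, not the only way) is to split the computation into many short kernel launches, so the host can decide between slices whether to stop and take the result at its current accuracy. In the snippet below, refine_step and user_wants_stop are hypothetical placeholders for your own kernel and stop condition.

```cuda
// Host-side loop: run the computation in short slices so the host
// can stop after any slice and keep whatever accuracy was reached.
for (int slice = 0; slice < MAX_SLICES && !user_wants_stop(); ++slice) {
    refine_step<<<blocks, threads>>>(dev_result, n);  // hypothetical kernel: one refinement pass
    cudaThreadSynchronize();                          // wait for this slice to finish
}
cudaMemcpy(host_result, dev_result, bytes, cudaMemcpyDeviceToHost);
```

This also keeps each launch well under the watchdog limit mentioned above. Signalling a running kernel via a flag in device memory is possible in principle, but on compute capability 1.x cards like the 8600 GTS you cannot copy to the device while a kernel is executing, so short launches are the safer approach.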

If I may, I'd advise you to read the CUDA Programming Guide carefully. Most of your questions are answered there.