1. On a single-GPU system, how do I call six different (or the same) kernels from six different host threads? (Assume I have a device capable of executing multiple kernels concurrently.)
2. I understand that one of the features in CUDA 4.0 is sharing a GPU across multiple host threads. Is there any sample code demonstrating this feature?
You can use cuCtxCreate to create a new context on each of the threads. I think you may need to use cuCtxSynchronize for synchronization rather than cudaDeviceSynchronize(). Those should be the only requirements when using multiple threads.
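A minimal sketch of that approach, assuming a pthreads host program and a kernel already compiled to a module (the module-loading and cuLaunchKernel details are elided here and the names are placeholders):

```cuda
// Sketch: each host thread creates its own driver-API context on device 0.
#include <cuda.h>
#include <pthread.h>
#include <stddef.h>

static void *threadFunc(void *arg)
{
    CUcontext ctx;
    CUdevice dev;
    cuDeviceGet(&dev, 0);        // single-GPU system: always device 0
    cuCtxCreate(&ctx, 0, dev);   // per-thread context, as described above
    // ... cuModuleLoad / cuModuleGetFunction / cuLaunchKernel here ...
    cuCtxSynchronize();          // driver-API synchronization for this context
    cuCtxDestroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t threads[6];
    cuInit(0);                   // must be called before any other driver API call
    for (int i = 0; i < 6; ++i)
        pthread_create(&threads[i], NULL, threadFunc, NULL);
    for (int i = 0; i < 6; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```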
However, if you only want to launch multiple kernels at the same time, you do not need multiple host threads at all: you can do it from a single thread using CUDA streams.
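A minimal single-threaded sketch using the runtime API (myKernel is a placeholder; whether the launches actually overlap depends on the hardware and on each kernel leaving resources free):

```cuda
// Sketch: launching the same kernel six times into six different streams.
#include <cuda_runtime.h>

__global__ void myKernel(int id)
{
    // ... work for launch 'id' ...
}

int main(void)
{
    const int N = 6;
    cudaStream_t streams[N];
    for (int i = 0; i < N; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < N; ++i)
        myKernel<<<1, 64, 0, streams[i]>>>(i);  // may run concurrently on CC >= 2.0

    cudaDeviceSynchronize();                    // wait for all streams to finish
    for (int i = 0; i < N; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```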
You can find examples of both cases in the CUDA SDK.
You will need a card with compute capability >= 2.0 that is capable of concurrent kernel execution, though.
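You can check this at runtime by querying the device properties; the concurrentKernels field reports whether the device supports it:

```cuda
// Sketch: query compute capability and concurrent-kernel support for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability %d.%d, concurrentKernels = %d\n",
           prop.major, prop.minor, prop.concurrentKernels);
    return 0;
}
```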