Using Multiple Devices

I am only using one of the two C870 devices in my Tesla at the moment. How do I get both of them set up and working, and how would I execute kernels on them?

Read about cudaSetDevice() in the reference manual.
You have to run a separate host thread for each device.
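
A minimal sketch of that pattern, assuming pthreads for the host threads and a placeholder kernel (myKernel and deviceThread are made-up names for illustration only):

    #include <cuda_runtime.h>
    #include <pthread.h>

    __global__ void myKernel(float *data)
    {
        data[threadIdx.x] = (float)threadIdx.x;
    }

    // One host thread per GPU: bind the thread to a device, then launch.
    void *deviceThread(void *arg)
    {
        int device = *(int *)arg;
        cudaSetDevice(device);                  // bind this thread to one GPU
        float *d_data;
        cudaMalloc((void **)&d_data, 256 * sizeof(float));
        myKernel<<<1, 256>>>(d_data);           // executes on the device set above
        cudaThreadSynchronize();                // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }

    int main(void)
    {
        int ids[2] = { 0, 1 };
        pthread_t threads[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&threads[i], 0, deviceThread, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(threads[i], 0);
        return 0;
    }

Each host thread owns its own CUDA context, so the cudaSetDevice() call in one thread does not affect the other.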

I’m more interested in cutStartThread, which is not mentioned in the reference manual, the nvcc manual, or the programming guide. The simpleMultiGPU example uses it to start a thread in a loop, which I understand, but how does the kernel know what data is being passed to it?

From that example:

    for (i = 0; i < GPU_N; i++)
        threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));
    cutWaitForThreads(threadID, GPU_N);

the function solverThread, which sets up the device and launches the kernel, is defined as taking a user-defined structure TGPUplan, whose data is passed on to the kernel. Regarding solverThread: if several parameters were passed to it, for example

    static CUT_THREADPROC solverThread(TGPUplan *plan, int *X, float *Y)
    {
    }

then what would the call to it be? The call in the example, with the single TGPUplan parameter, is currently

threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

I know I could put X and Y into the TGPUplan; I’m just curious how cutStartThread would be called in that case.

I also understand that several different kernels could be executed concurrently by allocating them to different devices, with cutWaitForThreads acting like an OpenMP barrier.

cutStartThread() is just a wrapper around the CreateThread() function (check SDK\common\src\multithreading.cpp), so there’s no way of supplying more than one parameter, other than packing them all into a single struct.
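
For illustration, one way that packing could look (SolverArgs is a made-up name here; TGPUplan, cutStartThread, CUT_THREADROUTINE, CUT_THREADEND, and MAX_GPU_COUNT are as used in the simpleMultiGPU sample and cutil):

    typedef struct
    {
        TGPUplan *plan;   // per-device plan, as in the sample
        int      *X;      // the extra parameters you wanted to pass
        float    *Y;
    } SolverArgs;

    static CUT_THREADPROC solverThread(SolverArgs *args)
    {
        TGPUplan *plan = args->plan;
        int      *X    = args->X;
        float    *Y    = args->Y;
        // ... set the device, copy data, and launch the kernel as before ...
        CUT_THREADEND;
    }

    // Fill one SolverArgs per device and hand its address to cutStartThread:
    SolverArgs args[MAX_GPU_COUNT];
    for (i = 0; i < GPU_N; i++)
    {
        args[i].plan = plan + i;
        args[i].X    = X;
        args[i].Y    = Y;
        threadID[i]  = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(args + i));
    }
    cutWaitForThreads(threadID, GPU_N);

Note that the args array has to stay alive until cutWaitForThreads() returns, since each thread reads through the pointer it was given.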