Multiple GPUs host thread safety?

I’m running a simple test on a machine with two GTX 275 cards. I’m doing a simple test to see that I can work with the two gpus. here’s the code:

#include <cuda_runtime.h>

#include <QThread>

#include <windows.h>

void testAlloc(int dev)


	cudaError ret;

	int N = 256;

	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

	cudaExtent extent = make_cudaExtent(N, N, N);

	size_t avail = 0, total = 0;

	ret = cudaMemGetInfo(&avail, &total);

	printf("%d: avail = %d, total = %d\n", dev, avail, total);

	cudaArray *arr;

	ret = cudaMalloc3DArray(&arr, &channelDesc, extent);

	size_t avail2 = 0, total2 = 0;

	ret = cudaMemGetInfo(&avail2, &total2);

	printf("%d: avail = %d, total = %d\n", dev, avail2, total2);

	printf("%d: alloc = %d, asked=%d\n", dev, avail-avail2, N*N*N*sizeof(float));


class TestMalloc : public QThread



	TestMalloc(int device) : m_device(device) {}

	virtual void run()


		cudaError ret;

		ret = cudaSetDevice(m_device);

		int dev = -1;


		printf("dev=%d\n", dev);



	int m_device;


int main(int argc, char *argv[])


	cudaError ret;

	TestMalloc tm0(0);

	TestMalloc tm1(1);





	return 0;

for threading I use the Qt framework (QThread).

When I’m running this the thread of device 0 runs ok but the thread of device 1 is deadlocked in the first call to cudaMemGetInfo()

If I add a Sleep(m_device*100) at the beginning of the thread, allowing the first to finish before the second starts, everything works ok.

So what’s going on here? Is the CUDA API not thread safe? do I need to start guarding every cuda usage with mutexes?

The runtime API is thread safe - there are plenty of examples floating around using either pthreads or Boost threads which work perfectly. The only thing you have to be careful about is that contexts are bound to both devices and threads and have the lifetime of the thread that established them. So you need to make sure each thread only ever interacts with its own GPU, and that you do something to keep the threads alive for as long as you need the context the thread holds.

I am not familiar with the Qt threading API so it is pretty hard to parse what the start() and wait() methods your code calls do (or where the run() method which interacts with the GPU is actually called from).

What happens if you remove the cudaMemGetInfo calls from the testAlloc function? Does it remove the deadlock?


what actually removes the deadlock is if I add:

int* dummy = NULL;

	cudaMalloc(&dummy, 10);


at the very beginning of testAlloc()

I guess this establishes the thread context appropriately and then cudaMemGetInfo() can do its job.

The only question that remains is why cudaMemGetInfo() doesn’t establish the context itself.

For those struggling with the threading API for some reason, run() is the actual code of the that the thread runs, start(), surprisingly enough starts the thread, and wait(), waits for it to finish().

Thats obviously not a good solution… :)

Try to put the cudaSetDevice instead of the dummy code.

Anyway the cudaMemGetInfo is a CUTIL function (which is not supported) and requires a context before called.

If i remember correctly in my code, if I dont initialize CUDA before calling cudaMemGetInfo I get zero as the card’s memory.


cudaMemGetInfo() is not infact a function of cutil. It is in the cudart dll and is documented in the official cuda runtime documentation.
That documentation says nothing about a valid context and infact, for a user of the cudart api, the concept of contextes should be transparent.
This is clearly a bug in cudart.

I stand corrected. You are right.