My first CUDA test, and some questions about synchronization and threads with CUDA

Hi, I’m new to CUDA. Today I designed a test to see whether multiple threads or processes can share a single GPU. With up to 8 threads sharing the one GPU device it is very unstable: sometimes the memory copy from device back to host fails, and sometimes it hangs my computer!

The same problem occurs when I run multiple processes that use CUDA to do calculations. Of course, I have only one GPU card!


I wrote up my test in a Word document, which describes the test design and the problem output. Please read it if you have time and are willing to help me :-) Thanks a lot!

Can anybody provide some suggestions on this usage?

My system is a notebook with a GeForce 8600M (256 MB video memory), running Red Hat Enterprise Linux 5.
CUDATest.doc (208 KB)

To make sure a global function (i.e., a kernel) has finished, you can call cudaThreadSynchronize() on the host. This function returns once the device has completed all preceding work.
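A minimal sketch of that pattern, assuming the CUDA 1.x runtime API; the kernel name and sizes here are illustrative, not from the original test:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Kernel launches are asynchronous: this call returns immediately.
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);

    // Block the host until the device has finished all preceding work,
    // then pick up any error the kernel produced.
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

(In much later CUDA releases this call was renamed cudaDeviceSynchronize, but cudaThreadSynchronize was the documented API at the time of this thread.)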

When using one device from a multi-threaded program, you have to be careful about sharing device memory references across threads. It has been reported that this doesn’t work, so the safest approach is to dedicate a single thread to all GPU-related calls.
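A sketch of that dedicated-GPU-thread pattern with POSIX threads; the kernel and sizes are hypothetical, and the point is only that every CUDA call happens in one worker thread:

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Illustrative kernel: halves each element.
__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= 0.5f;
}

// All CUDA calls live in this one thread.  In CUDA 1.x the device
// context is bound to the host thread that first touches the device,
// so device pointers must not be used from other host threads.
static void *gpuWorker(void *arg)
{
    int n = *(int *)arg;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    scaleKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaThreadSynchronize();
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    int n = 4096;
    pthread_t gpuThread;
    pthread_create(&gpuThread, NULL, gpuWorker, &n);
    /* other host threads do CPU-only work here */
    pthread_join(gpuThread, NULL);
    return 0;
}
```

Other threads that need GPU results would hand work to this thread (e.g. through a queue) rather than calling CUDA themselves.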

I have no experience with multiprocess GPU usage - someone else might be better suited to answer those questions.

Thanks very much, but I still have problems when running multiple processes! For example, I have a program named syncTest that uses CUDA to do a calculation, and I wrote a script like:

> output

for ((i=1;i<=5;i++)); do
  ./syncTest >> output 2>&1 &
done


Very strangely, as the number of processes increases, the system becomes unstable, the CPU is very busy, and sometimes the results are wrong!

If it helps, I can provide my code.

Conclusion and Question

From this test, I learned:

  1. All calls from host to device are asynchronous, except the memory copy functions; this is what makes parallel computing possible.
  2. The memory available on the video card is very important. Check it, or change the algorithm that uses CUDA, to avoid huge or unmanaged memory usage.
  3. It is better to feed the GPU jobs serially, not in parallel.
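Point 1 above can be seen in a small sketch (names and sizes are illustrative): the launch returns immediately, so the host can do CPU work while the kernel runs, and the subsequent cudaMemcpy blocks until the kernel has finished:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: adds one to each element.
__global__ void addOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] += 1.0f;
}

void example(float *h_buf, int n)
{
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

    // Asynchronous: this launch returns at once.
    addOne<<<(n + 255) / 256, 256>>>(d_buf, n);

    /* ... the CPU is free to do other work here, in parallel ... */

    // cudaMemcpy is synchronous: it waits for the kernel, then copies.
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```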

Still some questions:

  1. Is there a way to find out how much memory is available on the device?
  2. Is there a way to check whether a device job has finished, without blocking?
  3. Why can’t multiple processes that use CUDA on the GPU run at the same time? If they can’t, how do I avoid that?
  4. Why does it cost a lot of CPU time to run multiple processes that use CUDA on the GPU?
  5. Is there a better way to run CUDA code in multiple threads sharing one GPU? Or can the GPU only be used by one task at a time?

The GPU cannot run several kernels in parallel; they are serialized before starting.


  1. Yes. The Programming Manual contains the answer.

  2. No.

  3. They can, or at least they should.

  4. This is probably a limitation of CUDA 1.0; some synchronization issues.
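For question 1, one way to query free device memory is the driver API call cuMemGetInfo; a sketch, with error handling omitted (note: in early CUDA releases the two out-parameters were unsigned int rather than size_t):

```cuda
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    // Driver API initialization: create a context on device 0.
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // How much device memory is free / total right now.
    size_t freeBytes, totalBytes;
    cuMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %lu bytes, total: %lu bytes\n",
           (unsigned long)freeBytes, (unsigned long)totalBytes);

    cuCtxDestroy(ctx);
    return 0;
}
```

The amount reported as free reflects allocations from all contexts, which is useful when several processes are sharing the one card.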