Host->Device memcpy failure in forked process (valgrind output included)


I’m trying to see the gains I can realize by forking processes to run on separate cores (quad-core processor AMD Phenom9600 on an ASUS M3A32-MVP, 2GB DRAM) against separate GPUs (1x8800GT, 1x8800GTX). I allocate memory on the host, and the 2nd process always aborts on the memcpy from host to device. This is in CentOS 4.6 and the 169 NVIDIA driver is installed, and SDK/Toolkit 1.1 are in use.

I’m attaching two .tar files. The first, improvedParallel.tar, has the code in question (you can untar it in the projects dir, cd to improvedParallel, do a make etc. from there) and also contains a large valgrind report. What I am seeing is that the 2nd process (corresponding to the 2nd GPU card, as it happens, but this fails regardless of which card is used by the 2nd process) tries its copy and valgrind says the address is not recently malloc’d, free’d, etc. The problem occurs at line 77 of the code.

I’m also including workingBandwidth.tar, which is a modified bandwidthTest that also forks two children, each running against a separate GPU. There I see that the bandwidth numbers are lower than if they run alone, but at least the memcpys all seem to work!

I noticed immediately that the device-side arrays, when allocated in each child process, had the same address value.

My questions boil down to this:

1 - What am I doing wrong in the improvedParallel code? I don’t see a difference in the bandwidthTest handling of data allocation on the host or device, and it sure seems to work. It does not seem to be a function of size as when I crank down the allocation sizes I still get the 2nd process dying out.

2 - The identical device address allocations make me worry that somehow I’m NOT running on separate GPUs at that point – the coincidence of matching addresses happens in the bandwidth test code also when THAT runs. Is that just a coincidence?

3 - Is this a fruitless quest? I have it in my head that 2 processes on separate cores, using different memory areas and separate GPUs, should complete things in parallel, i.e. task A completes in what approaches half the time in this scenario. Am I off-base here? Is there a hard bottleneck (motherboard bus sharing, etc.)? The bandwidthTest results being lower in parallel mode have me curious on this point.

4 - I’ve ruled out (mostly) the fact that my cards are not identical; I’m using no 1.1-specific functionality and the code works against each in single-mode runs.

I appreciate any insights folks have to offer! BTW the code when compiled will take -0 (run against GPU0), -1 (run against GPU1), -a (run against both), or no args (same as -a). -pinned is accepted as well to force pinned host memory. The code HAS been stripped down and simplified to demonstrate the behavior more clearly, but is therefore pretty inflexible, i.e. you had better have at least a coreDuo and 2 GPU cards.

I have run this against an EVGA board with an Intel CoreDuo, same hardware otherwise, and had the same results.

I welcome your comments/suggestions. Cheers,

improvedParallel.tar (40 KB)
workingBandwidth.tar (40 KB)

I can’t answer all your questions since I don’t have a 2 GPU system to play with :( But I can help a little.

seibert found an interesting behavior when using fork with CUDA. You may be experiencing the same thing. Maybe you should try a pthreads version of your test. That is known to work.

And I can answer (3): Two threads/processes should be able to use both GPUs simultaneously to double your performance, assuming you can split your problem into 2 halves. This is not a fruitless quest. However, there are a few motherboard details that will come into play. Each GPU should be able to execute kernels independently and at full speed. But when it comes to transferring memory CPU<->GPU, the RAM speed and PCIe connections will likely cause slowdowns.

How much of a slowdown? That depends on your system. I looked up your MB and it lists that with dual cards you still have PCIe x16 on each, so that is good. But you are going to be limited by 1) the maximum bandwidth of your RAM and 2) the CPU and HyperTransport link to the PCIe controller (if you aren’t aware, AMD CPUs have the memory controller on the CPU). I didn’t look up the detailed block diagram for your board to see whether those are a big bottleneck or not.

Thank you for the very prompt response!

I had looked at the seibert discussion a day or two ago (honestly!) but because the behavior seemed to be an unintentional one, it wasn’t clear if this was an “accepted” use - i.e. if a later CUDA SDK wouldn’t take it away from me. Thoughts?

Also, if you have to fill the device memory before the fork(), doesn’t that mean you have to load the 2 GPU devices serially? I.e. the memory copy is no longer parallel? Maybe I am misunderstanding the implications here.

Lastly, re: pthreads – I did in fact get a version of this running in pthreads (the stripped code I posted is a version of code which lets you run against either fork or thread mode). While it runs, there is no way to force a thread to run on a separate CPU - my understanding is that threads remain on the same CPU as the process owning them. fork() seemed the only way to exploit the additional CPUs. Does that sound right?

I’m much obliged for your quick reply. Your ideas on the above comments I’ve made would be appreciated as well if you had any. Thank you again!

My interpretation of seibert’s results is that forking a CUDA process is doing something funny and shouldn’t be used. The CUDA runtime is set up so that each individual thread/process has its own GPU context and cannot share GPU memory (even on the same GPU) with another thread. This is analogous to protected memory on the CPU. Since this is the intended behavior of the CUDA runtime, and fork doesn’t seem to follow this behavior, something isn’t working right with fork.

Along with the whole 1 GPU context per thread, you can’t even load each GPU’s memory in serial if you wanted to. Once you call the cudaSetDevice function to choose the GPU that is used by the current thread, it can’t see that any other GPUs exist any more. All cudaMalloc/memcpy/etc calls will be performed on the device that is selected.
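One arrangement that stays inside the one-context-per-process model described above is to fork first and make every CUDA call afterwards, inside the child. The sketch below only shows that process structure, in plain C. It is an assumption about how such a program could be laid out, not a diagnosis of the posted code. The names device_work and run_children are hypothetical, and device_work is an empty placeholder for the cudaSetDevice/cudaMalloc/cudaMemcpy sequence.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-GPU work. In a real program this is
 * where cudaSetDevice(dev), cudaMalloc and cudaMemcpy would go, so that
 * no CUDA call is made before the fork. Returns 0 on success. */
int device_work(int dev)
{
    (void)dev;
    return 0;
}

/* Fork one child per device, run work(dev) in each child, and wait for
 * all of them. Returns the number of children that exited with status 0. */
int run_children(int num_devices, int (*work)(int))
{
    int started = 0;
    int ok = 0;

    assert(num_devices >= 0 && work != NULL);

    for (int dev = 0; dev < num_devices; ++dev) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            break;
        }
        if (pid == 0) {
            /* Child: do the device work, then leave without running
             * the parent's exit handlers. */
            _exit(work(dev) == 0 ? 0 : 1);
        }
        ++started;
    }

    /* Parent: collect every child that was actually started. */
    for (int i = 0; i < started; ++i) {
        int status = 0;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ++ok;
    }
    return ok;
}
```

Calling run_children(2, device_work) from main would roughly correspond to the -a mode mentioned earlier; a return value below 2 means at least one child failed.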

I guess I should have read your code, then I would have seen that you had a pthreads version. Sorry.

Anyways, while pthreads itself has no portable way to force a thread onto a 2nd processor, the OS scheduler is reasonably intelligent. If there are 2 threads actively running, it will put them on different CPUs. Now, it may swap which thread is on which CPU from time to time, but that really only causes a problem if your CPU code is extremely sensitive to the cache. Under most circumstances just leaving the OS scheduler to do its job will be fine.
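For cases where the scheduler’s placement does matter, Linux with glibc offers a non-portable affinity call, sched_setaffinity, which with a pid of 0 applies to the calling thread. The following is a minimal sketch assuming a reasonably recent glibc; the helper names first_allowed_cpu, pin_calling_thread and is_pinned_to are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* required for the glibc CPU affinity extensions */
#endif
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU. A pid of 0 means "the
 * calling thread", so each worker thread would call this itself.
 * Returns 0 on success, -1 on failure (errno is set). */
int pin_calling_thread(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return 1 if the calling thread's affinity mask is exactly {cpu}. */
int is_pinned_to(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return 0;
    return CPU_COUNT(&set) == 1 && CPU_ISSET(cpu, &set);
}
```

In a two-thread program, one thread would call pin_calling_thread with one CPU number and the second thread with another, each at the top of its thread function. Whether this beats leaving placement to the scheduler is something to measure rather than assume.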

Oh, it’s not in the code I posted - I stripped it out. No reason for you to apologize, you’re helping me! :)

I confess I was unaware of this - in the threading examples that I’ve done, there was always a base amount of time that was (to my mind) indicative of the CPU-side activity having to be serialized when two threads were running. I.e. task A takes 5 seconds, task B takes 5 seconds, and doing both in threads yielded 7-8 seconds, not 5 seconds. I chalked it up to the threads being on the same processor, with no way to force one over to the other core, and thus the CPU-side activities had to be serialized.

I’ll have to go back to the books on how the Linux kernel assigns threads to CPUs!

Thanks, again.