I’m trying to see what gains I can realize by forking processes onto separate cores (quad-core AMD Phenom 9600 on an ASUS M3A32-MVP, 2GB DRAM), each running against a separate GPU (1x8800GT, 1x8800GTX). I allocate memory on the host, and the 2nd process always aborts on the memcpy from host to device. This is on CentOS 4.6 with the 169 NVIDIA driver installed, and SDK/Toolkit 1.1 in use.
I’m attaching two .tar files. improvedParallel.tar has the code in question (untar it in the projects dir, cd to improvedParallel, do a make, etc.) and also contains a large Valgrind report. What I see is that the 2nd process (which happens to correlate to the 2nd GPU card, though it fails regardless of which card the 2nd process uses) attempts its copy, and Valgrind reports that the address was not recently malloc’d, free’d, etc. The problem occurs at line 77 of gpuProcess.cu.
I’m also including workingBandwidth.tar, a modified bandwidthTest that likewise forks two children, each running against a separate GPU. There I see that the bandwidth numbers are lower than when each runs alone, but at least the memcpys all seem to work!
I noticed immediately that the device-side arrays, when allocated in each child process, had the same address value.
My questions boil down to this:
1 - What am I doing wrong in the improvedParallel code? I don’t see a difference from the bandwidthTest handling of data allocation on the host or device, and that sure seems to work. It does not seem to be a function of size: even when I crank down the allocation sizes, the 2nd process still dies.
2 - The identical device address allocations make me worry that somehow I’m NOT running on separate GPUs at that point. The same coincidence of matching addresses also happens when the bandwidthTest code runs. Is that just a coincidence?
3 - Is this a fruitless quest? I have it in my head that 2 processes on separate cores, using different memory areas and separate GPUs, should complete things in parallel, i.e. task A completes in what approaches half the time in this scenario. Am I off-base here? Is there a hard bottleneck (motherboard bus sharing, etc.)? The lower parallel-mode numbers from the bandwidthTest code have me curious on this point.
4 - I’ve (mostly) ruled out my cards not being identical as a cause; I’m using no 1.1-specific functionality, and the code works against each card in single-GPU runs.
I appreciate any insights folks have to offer! BTW, the compiled code takes -0 (run against GPU0), -1 (run against GPU1), -a (run against both), or no args (same as -a); -pinned is accepted as well to force pinned memory. The code HAS been stripped down and simplified to demonstrate the behavior more clearly, and is therefore pretty inflexible, i.e. you’d better have at least a Core Duo and 2 GPU cards.
I have run this against an EVGA board with an Intel Core Duo, same hardware otherwise, with the same results.
I welcome your comments/suggestions. Cheers,