Monster machine: MultiGpu and some other examples from SDK 3.2 do not run

twofisher · December 29, 2010, 8:35am

Hi!
I recently got my hands on a monster machine:
2 hexacore Xeons, 8 GTX 460, lots of ram in one rackable box.
There are 8 regular PCI-Slots, no riser-cards or anything like that.

The system is really cool but sounds like a starfighter.

OS: OpenSuse 11.2, 64 bit.

Cuda-SDK 3.2 (final) has been installed on it.

deviceQuery runs OK and lists all 8 GPUs.

MonteCarloMultiGPU fails on 8 gpus (L1 norm: NAN)
cuda-memcheck MonteCarloMultiGPU
says that MonteCarloReduce() execution failed in MonteCarlo_kernel.cuh(265)

simpleMultiGPU fails on 8 gpus (GPU sum: inf)
cuda-memcheck simpleMultiGPU
says:
Invalid global write of size 4 at 0x000000f0 in reduceKernel
by thread (115,0,0) in block (31,0)
Address 0xf801017dcc is out of bounds

dmesg spits out:
simpleMultiGPU[2629]: segfault at 40 ip 00007f65946f000 error 4 in libcudart.so.3.2.16[7f65751df000+4b000]
That does not look good to me.

The eigenvalues example seems to work on all 8 GPUs (tested with all 8 -device= options), but BlackScholes does not behave the same on every device:
cuda-memcheck BlackScholes -device=2 complains about an invalid global write of size 4 (different threadIds and blockIds).
cuda-memcheck BlackScholes -device=5 complains about an invalid global write of size 4 (different threadIds and blockIds).
cuda-memcheck BlackScholes -device=3 yields an unspecified launch failure in BlackScholes.cu(171)
cuda-memcheck BlackScholes -device=4 yields an unspecified launch failure in BlackScholes.cu(171)

cuda-memcheck BlackScholes -device=0 runs ok.
cuda-memcheck BlackScholes -device=1 runs ok.
cuda-memcheck BlackScholes -device=6 runs ok.
cuda-memcheck BlackScholes -device=7 runs ok.
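
A per-device sweep like the one above could be scripted so it is easy to repeat after swapping cards or drivers. This is only a rough sketch: the helper names (classify, check_devices) are made up for illustration, the sample path ./BlackScholes is an assumption, and the marker strings are simply the phrases quoted in this post, so the exact cuda-memcheck wording may differ between versions.

```python
import subprocess

# Substrings taken from the messages quoted in this thread; the exact
# cuda-memcheck wording may differ between versions (assumption).
FAILURE_MARKERS = {
    "invalid global write": "invalid write",
    "unspecified launch failure": "launch failure",
}


def classify(output):
    """Return a short label for the output of one cuda-memcheck run."""
    text = output.lower()
    for marker, label in FAILURE_MARKERS.items():
        if marker in text:
            return label
    # Only means that none of the known markers appeared, not a proof of health.
    return "no marker found"


def check_devices(sample="./BlackScholes", devices=range(8)):
    """Run the sample under cuda-memcheck once per device and label each run.

    The sample path and the device count of 8 are assumptions for this machine.
    """
    results = {}
    for dev in devices:
        proc = subprocess.run(
            ["cuda-memcheck", sample, "-device={}".format(dev)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        results[dev] = classify(proc.stdout)
    return results
```

Calling check_devices() on the box itself returns a dict such as device number to label, which makes it easier to see whether the failing devices stay the same between runs.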

nvidia-smi -d | fgrep Temp shows that all 8 GPUs are below 30 °C (they are basically idle).

dmesg tells me that the 8 cards share 4 IRQs (two cards per IRQ).
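
The IRQ sharing can also be cross-checked against /proc/interrupts instead of reading it out of the kernel log. The sketch below is hedged: nvidia_irq_counts is a hypothetical helper, and it assumes the driver's handlers show up under the name "nvidia" in the last column of /proc/interrupts, with several handlers on one IRQ listed comma-separated.

```python
import re


def nvidia_irq_counts(interrupts_text):
    """Map IRQ number -> how many handlers named 'nvidia' are attached to it.

    Assumes the /proc/interrupts layout 'IRQ: counts... chip handler, handler'.
    """
    counts = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep:
            continue  # header line with CPU names has no colon
        n = len(re.findall(r"\bnvidia\b", rest))
        if n:
            counts[irq.strip()] = n
    return counts
```

On the machine itself this would be called as nvidia_irq_counts(open("/proc/interrupts").read()); with two cards per IRQ, four entries with a count of 2 would be expected.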

Any ideas where the problem could be?
Is it hardware, OS, driver, software or anything else?

Thanks for any hints in advance
Martin

gogogpu · December 29, 2010, 2:55pm

Hi Martin, sounds like a killer system.

Here’s a trick to getting more debug info out of the nvidia driver - set the NVreg_ResmanDebugLevel module parameter to 0.

modprobe nvidia NVreg_ResmanDebugLevel=0

This will output a lot more info, to the point where it impacts system performance.

Can you try running the bandwidth test? This won’t run any kernels, just allocate and transfer memory, so we can see if that functions at least.

Which driver are you using? It should be 260.19.* for cuda 3.2 final.

The shared IRQs should be ok, that’s the way it is in my multi-gpu configurations. You could try NVreg_EnableMSI=1 though to try MSI interrupts instead.

twofisher · December 29, 2010, 3:24pm

Yes it kills my nerves.

Here’s a trick to getting more debug info out of the nvidia driver - set the NVreg_ResmanDebugLevel module parameter to 0.
modprobe nvidia NVreg_ResmanDebugLevel=0
This will output a lot more info, to the point where it impacts system performance.

I tried that. I get 6 lines like this in /var/log/messages when trying BlackScholes -device=2:

NVRM osCallACPI_DSM: Error during 0x0 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x1 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x2 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x3 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x4 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x5 DSM subfunction 0x0! status=0x2f

Yes, the bandwidth test works - at least it spits out “PASSED”.

Currently I use 260.19.26. Originally I used 260.24, which did not work either.

Thanks for your help!

Martin

gogogpu · December 29, 2010, 3:55pm

NVRM osCallACPI_DSM: Error during 0x0 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x1 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x2 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x3 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x4 DSM subfunction 0x0! status=0x2f

NVRM osCallACPI_DSM: Error during 0x5 DSM subfunction 0x0! status=0x2f

Those messages are ok, I get that too.

Have you tried the system with fewer GPUs installed? What motherboard or backplane are you using? Which PCIe switch is upstream of the GPUs?

twofisher · December 30, 2010, 8:32am

I will try removing some of the GPUs.

When I have results I will post them.

Since I am on vacation until Sunday, I will post again on Monday.

Thank you very much for your help!

Martin

PS: Did you change anything with linux’s kernel settings on your machine?

How many gpus does your machine have?

gogogpu · December 31, 2010, 3:32pm

Nope, I just use the kernel that comes with the distributions. Sometimes I will experiment with the PREEMPT_RT kernel though, and that works fine too.

I don’t really use OpenSUSE for CUDA. In my experience RHEL and Ubuntu are the most stable for cuda development.

My system only has 3 GPUs. Looks tiny compared to yours.

ImNutz4NvSLI · December 31, 2010, 4:05pm

Sort of off topic, would you mind posting some pics of your machine? I have a similar project I am working on for a customer and am looking for any and all viewable configs.

Thanks in advance and good luck with her, she sounds like a data crunching beast!

~Nutz

twofisher · January 5, 2011, 10:05am

Hi and a happy new year!

the machine is a modified version of this beast:

Typhoon

Since the power consumption should allow it, we replaced the 8 Teslas with GTX 460 GPUs.

Regards

Martin

twofisher · January 10, 2011, 3:11pm

Hi,

this seems to be an SDK 3.2 thing:
I downgraded to SDK 3.1 (leaving the installed driver untouched) and the examples work as they should.

At least simpleMultiGPU and MonteCarloMultiGPU spit out “PASSED” on all 8 devices.

I tried the examples with the 260.24 and 260.19.26 drivers.

Has anybody seen those problems, too?

Regards
Martin