CUDA 2.0 SDK Examples not working in RHEL4

jam834 · November 19, 2008, 10:21pm

Hi, I’ve gotten the CUDA 2.0 SDK examples compiled, but the majority of them report “TEST FAILED” or library errors when run. I’m running RHEL 4.7 (2.6.9-78.0.8.ELsmp) with a GeForce GTX 280, and I have tried both the 177.73 and 177.82 drivers (I have 2 systems to test with, and both give the same errors).

Some examples of the output are:

[storm0 release]$ ./alignedTypes

Using device 0: GeForce GTX 280

Allocating memory...

Generating host input data array...

Uploading input data to GPU memory...

Testing misaligned types...

uint8...

Avg. time: 283.634583 ms / Copy throughput: 0.164176 GB/s.

TEST FAILED

uint16...

Avg. time: 282.249329 ms / Copy throughput: 0.164982 GB/s.

TEST FAILED

RGBA8_misaligned...

Avg. time: 281.692932 ms / Copy throughput: 0.165308 GB/s.

TEST FAILED

LA32_misaligned...

Avg. time: 282.705109 ms / Copy throughput: 0.164716 GB/s.

TEST FAILED

RGB32_misaligned...

Avg. time: 282.121643 ms / Copy throughput: 0.165057 GB/s.

TEST FAILED

RGBA32_misaligned...

Avg. time: 282.891663 ms / Copy throughput: 0.164608 GB/s.

TEST FAILED

Testing aligned types...

RGBA8...

Avg. time: 283.331879 ms / Copy throughput: 0.164352 GB/s.

TEST FAILED

I32...

Avg. time: 282.897919 ms / Copy throughput: 0.164604 GB/s.

TEST FAILED

LA32...

Avg. time: 282.514648 ms / Copy throughput: 0.164827 GB/s.

TEST FAILED

RGB32...

Avg. time: 283.734222 ms / Copy throughput: 0.164119 GB/s.

TEST FAILED

RGBA32...

^C

[storm0 release]$ ./simpleCUFFT

Using device 0: GeForce GTX 280

cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.0/cufft/src/config.cu, line 106

cufft: ERROR: CUFFT_INTERNAL_ERROR

cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.0/cufft/src/cufft.cu, line 115

cufft: ERROR: CUFFT_INVALID_PLAN

cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.0/cufft/src/cufft.cu, line 115

cufft: ERROR: CUFFT_INVALID_PLAN

cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.0/cufft/src/cufft.cu, line 115

cufft: ERROR: CUFFT_INVALID_PLAN

Test FAILED

cufft: ERROR: /root/cuda-stuff/sw/rel/gpgpu/toolkit/r2.0/cufft/src/cufft.cu, line 94

cufft: ERROR: CUFFT_INVALID_PLAN

[storm0 release]$ ./simpleCUBLAS

simpleCUBLAS test running..

!!!! CUBLAS initialization error

Some examples do run, but the bandwidth reported by bandwidthTest seems really low:

[storm0 release]$ ./deviceQuery 

There is 1 device supporting CUDA

Device 0: "GeForce GTX 280"

  Major revision number:						 1

  Minor revision number:						 3

  Total amount of global memory:				 1073020928 bytes

  Number of multiprocessors:					 30

  Number of cores:							   240

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 16384

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.35 GHz

  Concurrent copy and execution:				 Yes

Test PASSED

[storm0 release]$ ./bandwidthTest 

Running on......

	  device 0:GeForce GTX 280

Quick Mode

Host to Device Bandwidth for Pageable memory

Transfer Size (Bytes)	Bandwidth(MB/s)

 33554432		111.2

Quick Mode

Device to Host Bandwidth for Pageable memory

Transfer Size (Bytes)	Bandwidth(MB/s)

 33554432		116.9

Quick Mode

Device to Device Bandwidth

Transfer Size (Bytes)	Bandwidth(MB/s)

 33554432		212.0

&&&& Test PASSED

I also get this in dmesg when the module loads:

NVRM: loading NVIDIA UNIX x86_64 Kernel Module 177.82 Tue Nov 4 16:50:05 PST 2008

NVRM: PAT configuration unsupported, falling back to MTRRs.

I’ve read online that if the nvidia driver falls back to MTRRs instead of using PAT, it may break CUDA support. Does anyone know if this is true? Thanks.
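For anyone wanting to check their own kernel log for the same message, here is a minimal sketch in Python. It only searches text for the exact NVRM wording quoted above; the function name `find_pat_fallback` and the sample input are illustrative, and other driver versions may word the message differently.

```python
# Message text copied from the dmesg output quoted in the post above.
PAT_FALLBACK = "PAT configuration unsupported, falling back to MTRRs"


def find_pat_fallback(log_text):
    """Return the NVRM log lines that report the PAT-to-MTRR fallback."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if "NVRM:" in line and PAT_FALLBACK in line
    ]


# The two module-load lines from the post, used here as sample input.
sample_dmesg = """\
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 177.82 Tue Nov 4 16:50:05 PST 2008
NVRM: PAT configuration unsupported, falling back to MTRRs.
"""

for hit in find_pat_fallback(sample_dmesg):
    print("PAT fallback reported:", hit)
```

To use it on a real system, the saved output of `dmesg` could be passed in as `log_text` in place of the sample string.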

netllama · November 19, 2008, 10:35pm

Please generate and attach an nvidia-bug-report.log along with the output from lspci . You should also verify that you’re using the latest motherboard BIOS.

jam834 · November 19, 2008, 11:27pm

Here’s nvidia-bug-report.log; I’ve appended the lspci output to the end. These are HP xw8600 workstations, and it looks like a new BIOS was released on Oct 28 (01.32 Rev. a). I’ll try to get that installed this week.
nvidia_bug_report.log.txt (157 KB)

balachmar · November 24, 2008, 9:02pm

I am experiencing similar problems.
A few of the programs in the SDK work perfectly. However, others like oceanFFT and async?? don’t.
I’m using Ubuntu 8.10 (with older GCC) and running 177 nvidia drivers.
nvidia_bug_report.log.txt (129 KB)

netllama · November 24, 2008, 9:16pm

Do these problems persist with the CUDA_2.1-beta release ?

balachmar · November 24, 2008, 9:53pm

I personally don’t really fancy upgrading my graphics drivers, since I have had too much trouble with that in the past. And unfortunately, I assume CUDA will not run in a VM.

So I use 177 and, as far as I know, it is CUDA 2.0 beta. I will try to find some Ubuntu 8.10 64-bit packages of 180.60…

jam834 · November 25, 2008, 4:30pm

They do; the only difference is that some of the SDK examples no longer compile. Any ideas on the PAT/MTRR issue? I took a look at an older system running the same OS, but with CUDA 1.1 and a GeForce 8800. With CUDA 1.1 and the 169.09 driver installed, everything works, and when the nvidia module loads I get warnings about PAT already being configured instead of PAT not being supported. With CUDA 2.0 and the 177.73 driver installed, the PAT errors return and the examples stop working (although I don’t know if CUDA 2.0 is supported on this card, so feel free to ignore this).

jam834 · December 2, 2008, 8:42pm

I’ve fixed my problem. It was a conflict between the nvidia kernel module and the Myrinet 10G Ethernet card module, where they were trying to enable write-combining on two different PAT indexes (whatever that means).

From myrinet’s documentation:

I added the line: options myri10ge myri10ge_pat_idx=1

to my /etc/modprobe.conf, made a new initrd image and rebooted.

The “NVRM: PAT configuration unsupported, falling back to MTRRs.” error has gone away and everything is now working like it should. This is running the 180.06 driver with the CUDA 2.1 beta, but I’m assuming it will also work with CUDA 2.0 and its appropriate driver.
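As a rough way to confirm that the option line is actually present before rebuilding the initrd, here is a minimal Python sketch that parses modprobe.conf-style text. The helper name `module_options` and the `alias` line in the sample are hypothetical; only the `options myri10ge myri10ge_pat_idx=1` line comes from this thread, and the parsing handles just the simple `options <module> key=value` form.

```python
def module_options(conf_text, module):
    """Collect key=value options set for `module` in modprobe.conf-style text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            for item in fields[2:]:
                key, sep, value = item.partition("=")
                found[key] = value if sep else None
    return found


# Sample contents: the alias line is made up, the options line is from the post.
sample_conf = """\
alias eth2 myri10ge
options myri10ge myri10ge_pat_idx=1
"""

opts = module_options(sample_conf, "myri10ge")
print("myri10ge_pat_idx =", opts.get("myri10ge_pat_idx"))
```

On a real system the text would come from reading /etc/modprobe.conf. Note that this only checks the file; the option takes effect after the new initrd is built and the machine is rebooted, as described above.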