Tesla installs, deviceQuery OK, bandwidthTest hangs (100% CPU)

I’ve used CUDA and OpenCL on a couple of cards, on OpenSuse and CentOS, but am unable to get a new Tesla C1060 working correctly on CentOS 5.4.

I have installed:
NVIDIA-Linux-x86_64-190.53-pkg2.run
cudatoolkit_2.3_linux_64_rhel5.3.run
cudasdk_2.3_linux.run

I have also added the script described at http://acceleratedcomputing.blogspot.com/2…060-on-mac.html and changed xorg.conf to use the GPU on the motherboard. When I reboot, everything comes up as expected.

Running deviceQuery from the SDK gives normal results, as does lspci. Everything looks just fine.

But when I run bandwidthTest, it hangs with one of the CPU cores at 100%. Since we only have one device, I wondered if “device to device transfer” was asking the impossible, but I see similar problems with other tests (for example, “transpose” prints nothing), and I am 99% certain I have run that before with other (single) cards without anything hanging. Nothing extra is printed if I compile the SDK examples with “dbg=1” and run the code from the debug directory.

There are no errors in /var/log/messages. This is a 64-bit machine. I had no problems at all with OpenCL on this machine with a random 9800 card and the OpenCL-compatible driver.

Any ideas? We were hoping the Tesla would give us a significant speedup, but so far it has just failed to work at all :( Happy to provide more data, and apologies if this is explained somewhere (I checked the release notes and couldn’t find anything on a search).

Thanks,
Andrew

PS Output from some tests:

[andrew@schist debug]$ ./bandwidthTest
Running on…
device 0:Tesla C1060
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3857.0

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3050.9

Quick Mode
Device to Device Bandwidth
[hangs here]

[andrew@schist debug]$ ./transpose
[hangs here]
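
One way to collect results like the ones above without a hung sample blocking the session is to run each SDK binary under a timeout. What follows is only a minimal Python sketch, not part of the SDK. The helper name `run_with_timeout` and the 60-second default are assumptions made for illustration, and it assumes the hung process can still be killed.

```python
import subprocess

def run_with_timeout(cmd, seconds=60):
    """Run cmd (a list of strings) and return (status, output).

    status is "ok" (exit code 0), "failed" (non-zero exit code)
    or "hung" (no exit within `seconds`; the child is then killed).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # samples that wait for ENTER see EOF
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep error messages with the output
            timeout=seconds,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired as exc:
        # Whatever was printed before the hang, if anything was captured.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "hung", partial
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Calling, for example, `run_with_timeout(["./bandwidthTest"], 120)` from the debug directory should then report `"hung"` together with any output captured before the hang, rather than leaving the terminal stuck.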

At last, an error message:

[andrew@schist debug]$ ./matrixMul
cutilCheckMsg cudaThreadSynchronize error: Kernel execution failed in file <matrixMul.cu>, line 116 : unspecified launch failure.

Don’t know if that is useful; googling now…

Andrew

If we remove the Tesla card and put the old 9800 card back, then everything works again. Does this mean that the hardware is broken?

Thanks,

Andrew

Another snippet of data: because of some other cards in the same machine, the Tesla is (according to the docs we have for the motherboard) installed in a PCIe x8 slot. This isn’t ideal, I know, but is it going to break things? Thanks.

More data on that from an email:

Our motherboard is an ASUS Z8PE with 4 PCIEx16 slots, one with an x16 link and 3 with x8 links. According to the documentation (E4766 Second Edition V2, ASUS Motherboard Z8PE-D12 Series, page 2-21):

"2.5.4 MIO PCIE slot. The MIO PCIE slot only supports a MIO audio card, which offers great sound quality to complement the robust video power. This slot does not support PCI-E x1 cards.

2.5.5 PCI Express x16 slots(x16 link; x8 link). The onboard PCI Express x16 slots provide one x16 link and three x8 links to Intel 5520 IOH chipset. These slots support VGA cards and various server class high performance add-on cards. The x16 links switched to x8 link automatically if the slot location 5 is occupied."

Our setup with the 9600 card is as follows:

Slot 1(MIO PCIE slot): unoccupied

Slot 2(PCIEx16 x16 link): 9600 card

Slot 3(PCIEx16 x8 link): RAID card

Slot 4(PCIEx16 x8 link): RAID card

Slot 5(PCIEx16 x8 link): unoccupied

Setup with the Tesla is:

Slot 1(MIO PCIE slot): unoccupied

Slot 2(PCIEx16 x16 link): Tesla

Slot 3(PCIEx16 x8 link): slot physically blocked by Tesla

Slot 4(PCIEx16 x8 link): RAID card

Slot 5(PCIEx16 x8 link): RAID card
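
For a sense of what an x8 versus an x16 link means for transfers, here is a rough Python sketch of the theoretical per-direction ceiling. It assumes PCIe 2.0 signalling (5 GT/s per lane with 8b/10b encoding), which is not stated in the posts above, and the function name is made up for illustration.

```python
# Theoretical per-direction PCIe bandwidth for a given link width.
# Assumption: PCIe 2.0 signalling (5 GT/s per lane, 8b/10b encoding).

GT_PER_S_GEN2 = 5.0e9          # raw transfers (bits) per second per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b: 8 data bits per 10 bits on the wire

def pcie2_bandwidth_mb_s(lanes):
    """Peak payload bandwidth in MB/s (10**6 bytes/s) for one direction."""
    bits_per_s = GT_PER_S_GEN2 * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e6

for lanes in (8, 16):
    print("x%d link: %.0f MB/s" % (lanes, pcie2_bandwidth_mb_s(lanes)))
```

Under those assumptions the x8 link halves the ceiling (about 4000 MB/s instead of 8000 MB/s). Since PCIe devices negotiate the link width, a narrower link would normally be expected to cost bandwidth rather than cause hangs.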

I haven’t seen this specific problem, but check your BIOS IGP settings. I believe what you want (if you have the option) is to enable both GPUs, but default to the IGP. This is not guaranteed to fix the problem, but it’s something to try.

Hi, thanks for the comment. There’s nothing in the BIOS, just a jumper on the board to enable the IGP (which we have enabled).

We have now moved to software RAID for one set of disks and removed the card from Slot 5 (see previous comment). That means the slot containing the Tesla is now PCIe x16.

After installing the 3.0 beta drivers:

  • oclDeviceQuery takes ~5s to start, but appears OK.
  • oclBandwidthTest gives a -5 (out of resources) error after the host-to-device transfer.
  • oclVectorAdd hangs.

We have put a huge amount of work into this (moving from hardware to software RAID is not fun). We have a customer waiting for results. This is extremely frustrating…

Andrew

./oclDeviceQuery
oclDeviceQuery.exe Starting…

OpenCL SW Info:

CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.0 CUDA 3.0.1
OpenCL SDK Version: 4954966

OpenCL Device Info:

1 devices found supporting OpenCL:


Device Tesla C1060

CL_DEVICE_NAME: Tesla C1060
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 195.17
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1023 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 4095 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_IMAGE 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64

CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.3
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 16384
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_FALSE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_ CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1

oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.0.1, SDK Version = 4954966, NumDevs = 1, Device = Tesla C1060

System Info:

Local Time/Date = 14:17:53, 03/11/2010
CPU Name: Intel® Xeon® CPU L5520 @ 2.27GHz

# of CPU processors: 16

Linux version 2.6.18-164.11.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Wed Jan 20 07:32:21 EST 2010

TEST PASSED

Press <Enter> to Quit…

./oclBandwidthTest
./oclBandwidthTest Starting…

Running on…

Device Tesla C1060
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3874.7

!!! Error # -5 (CL_OUT_OF_RESOURCES) at line 617 , in file oclBandwidthTest.cpp !!!

Exiting…

./oclVectorAdd
./oclVectorAdd Starting…

# of float elements per Array = 11444777

Global Work Size = 11444992
Local Work Size = 256

# of Work Groups = 44707

Allocate and Init Host Mem…
clGetPlatformID…
clGetDeviceIDs…
clCreateContext…
clCreateCommandQueue…
clCreateBuffer…
oclLoadProgSource (VectorAdd.cl)…
clCreateProgramWithSource…
clBuildProgram…
clCreateKernel (VectorAdd)…
clSetKernelArg 0 - 3…

clEnqueueWriteBuffer (SrcA and SrcB)…
clEnqueueNDRangeKernel (VectorAdd)…

[hung at this point]
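
The work sizes printed in the log above are consistent with the element count being rounded up to a multiple of the local work size. This small Python sketch reproduces the arithmetic; the helper name `round_up` is an assumption for illustration, not taken from the SDK source.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

num_elements = 11444777   # element count per array, from the log
local_size = 256          # "Local Work Size" from the log

global_size = round_up(num_elements, local_size)
num_groups = global_size // local_size
padding = global_size - num_elements   # work-items with no element to process

print(global_size)  # 11444992, matches "Global Work Size"
print(num_groups)   # 44707, matches the number of work groups
print(padding)      # 215
```

So the launch configuration itself looks ordinary. The 215 padding work-items are the usual reason a kernel launched this way needs a bounds check, but nothing in these numbers points to an unusual launch as the cause of the hang.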

Also, to complete the catalogue of frustration and fail, exactly the same results with the old CUDA libs mentioned in the first post:

./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: “Tesla C1060”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…
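
As a quick cross-check, the CUDA deviceQuery figures above can be compared with the earlier oclDeviceQuery output. This is a hedged Python sketch of the arithmetic only. It assumes the OpenCL sample truncates sizes to whole MBytes of 1024 × 1024 bytes, and that the maximum allocation is a quarter of global memory, which matches the numbers but is not confirmed by the posts.

```python
global_mem_bytes = 4294705152   # deviceQuery: total amount of global memory
multiprocessors = 30            # deviceQuery: number of multiprocessors
cores_reported = 240            # deviceQuery: number of cores

MIB = 1024 * 1024
global_mem_mib = global_mem_bytes / MIB
cores_per_mp = cores_reported // multiprocessors

print(global_mem_mib)            # 4095.75
print(int(global_mem_mib))       # 4095, as in CL_DEVICE_GLOBAL_MEM_SIZE
print(int(global_mem_mib / 4))   # 1023, as in CL_DEVICE_MAX_MEM_ALLOC_SIZE
print(cores_per_mp)              # 8 cores per multiprocessor
```

Both software stacks therefore describe the same device configuration, which suggests enumeration is fine and the failures only appear once transfers or kernels actually run.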

./bandwidthTest
Running on…
device 0:Tesla C1060
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3866.8

Quick Mode
Device to Host Bandwidth for Pageable memory

[hangs at this point]
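
When comparing runs like these across driver versions, it can help to record which bandwidthTest phases produced a result and which started but never finished. The following Python sketch parses the text format shown above; the function name and the regular expressions are assumptions based only on these logs, not on the SDK source.

```python
import re

def parse_bandwidth_log(text):
    """Map each 'X to Y' section to its MB/s value, or to None if the
    section header was printed but no result line followed it."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"(\w+ to \w+) Bandwidth", line)
        if header:
            current = header.group(1)
            results[current] = None      # started, no result seen yet
            continue
        row = re.match(r"(\d+)\s+([\d.]+)$", line)
        if row and current is not None:
            results[current] = float(row.group(2))
    return results

log = """Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3866.8

Quick Mode
Device to Host Bandwidth for Pageable memory
"""

parsed = parse_bandwidth_log(log)
print(parsed)  # {'Host to Device': 3866.8, 'Device to Host': None}
```

A `None` entry marks the phase where the run stopped, which here is the device-to-host transfer.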

I am using the following driver/toolkit/SDK combination on CentOS 5.4, supporting GTX 275 and Tesla C1060 GPUs:

cudadriver_2.3_linux_64_190.18.run
cudatoolkit_2.3_linux_64_rhel5.3.run
cudasdk_2.3_linux.run

When I install the driver, I select NO when asked whether to modify the X config. If I were you, I would try reinstalling the driver first (the version above if you are superstitious) and run bandwidthTest again. If you get the same result, I would try installing another CUDA-compatible GPU to see whether the problem is your C1060.

Hmmm. We haven’t tried that exact driver (I have been reinstalling drivers frequently, but have stayed with the two OpenCL compatible releases I know of - the first stable and the recent CUDA 3 beta). We have used another GPU with that hardware (a 9800) and it works just fine. Unfortunately that card is the one now installed, since it works, so I can’t try the particular driver you have used until Monday when we can swap the Tesla back (I don’t have physical access to the machine). And I guess that’s pre-OpenCL? Still, we can at least try it.

We’ve also got a ticket raised via our supplier that we’re hoping will get us a second card to check against (or just our money back: we need 64-bit for a client, so are considering “other options”).

To follow up on this - we FINALLY got a replacement card, which appears to work just fine.

The support from NVidia, via Silicon Mechanics, has been appalling.

And it turns out that the Tesla C1060 is only about 3x as fast as a $200 9800 GT card from newegg.

Thankfully I’m using OpenCL, so hopefully we can switch to AMD/ATI in the future.

Andrew