Tesla installs, deviceQuery OK, bandwidthTest hangs (100%CPU)

andrew_cooke · February 25, 2010, 5:04pm

I’ve used CUDA and OpenCL on a couple of cards, on OpenSuse and CentOS, but am unable to get a new Tesla C1060 working correctly on CentOS 5.4.

I have installed:
NVIDIA-Linux-x86_64-190.53-pkg2.run
cudatoolkit_2.3_linux_64_rhel5.3.run
cudasdk_2.3_linux.run

I have also added the script described at “Accelerated Computing: Installing Nvidia Tesla C1060 on a Mac Pro with CentOS” (http://acceleratedcomputing.blogspot.com/2...060-on-mac.html) and changed the xorg.conf to use the GPU on the motherboard. When I reboot, everything comes up as expected.

Running deviceQuery from the SDK gives normal results, as does lspci. Everything looks just fine.

But when I run bandwidthTest, it hangs with one of the CPU cores at 100%. Since we only have one device, I wondered if “device to device transfer” was asking the impossible, but I see similar problems with other tests (for example, “transpose” prints nothing), and I am 99% certain I have run that before with other (single) cards without anything hanging. Nothing extra is printed if I compile the SDK examples with “dbg=1” and run the code from the debug directory.

There are no errors in /var/log/messages. This is a 64 bit machine. I had no problems at all with OpenCL on this machine with a random 9800 card and the OpenCL compatible driver.

Any ideas? We were hoping the Tesla would give us a significant speedup, but so far it’s just failed to work at all :( Happy to provide more data, and apologies if this is explained somewhere (I checked the release notes, and couldn’t find anything on a search).

Thanks,
Andrew

PS Output from some tests:

[andrew@schist debug]$ ./bandwidthTest
Running on…
device 0:Tesla C1060
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3857.0

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3050.9

Quick Mode
Device to Device Bandwidth
[hangs here]

[andrew@schist debug]$ ./transpose
[hangs here]
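
A hang like the one above is easier to triage if each SDK sample is run under a timeout, so that a stuck binary is reported instead of left spinning. The following is a minimal Python sketch and not part of the CUDA SDK; the helper name `run_with_timeout` and the status labels are assumptions made for illustration.

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run cmd and return ("ok" | "failed" | "hung", captured output)."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # samples that wait for ENTER see EOF
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep errors in the same stream
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising the timeout.
        partial = exc.output or b""
        return "hung", partial.decode(errors="replace")
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout.decode(errors="replace")
```

For example, `run_with_timeout(["./bandwidthTest"], 60)` would be expected to return `"hung"` together with whatever the test printed before stalling. A process blocked inside the driver may not respond to the kill, in which case the wrapper itself can stall.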

andrew_cooke · February 25, 2010, 5:06pm

At last, an error message:

[andrew@schist debug]$ ./matrixMul
cutilCheckMsg cudaThreadSynchronize error: Kernel execution failed in file <matrixMul.cu>, line 116 : unspecified launch failure.

Don’t know if that is useful; googling now…

Andrew

andrew_cooke · February 25, 2010, 7:20pm

If we remove the Tesla card and put the old 9800 card back, then everything works again. Does this mean that the hardware is broken?

Thanks,

Andrew

andrew_cooke · February 25, 2010, 7:31pm

Another snippet of data: because of some other cards in the same machine, the Tesla is (according to the docs we have for the motherboard) installed in a PCIe x8 slot. This isn’t ideal, I know, but is it going to break things? Thanks.

andrew_cooke · February 25, 2010, 8:01pm

More data on that from an email:

Our motherboard is an ASUS Z8PE with 4 PCIEx16 slots, one with a x16 link, 3 with x8 link. According to the documentation (E4766 Second Edition V2 ASUS Motherboard Z8PE-D12 Series, page 2-21):

"2.5.4 MIO PCIE slot. The MIO PCIE slot only supports a MIO audio card, which offers great sound quality to complement the robust video power. This slot does not support PCI-E x1 cards.

2.5.5 PCI Express x16 slots (x16 link; x8 link)

The onboard PCI Express x16 slots provide one x16 link and three x8 links to Intel 5520 IOH chipset. These slots support VGA cards and various server class high performance add-on cards. The x16 links switched to x8 link automatically if the slot location 5 is occupied."

Our setup with the 9600 card is as follows:

Slot 1(MIO PCIE slot): unoccupied

Slot 2(PCIEx16 x16 link): 9600 card

Slot 3(PCIEx16 x8 link): RAID card

Slot 4(PCIEx16 x8 link): RAID card

Slot 5(PCIEx16 x8 link): unoccupied

Setup with the Tesla is:

Slot 1(MIO PCIE slot): unoccupied

Slot 2(PCIEx16 x16 link): Tesla

Slot 3(PCIEx16 x8 link): slot physically blocked by Tesla

Slot 4(PCIEx16 x8 link): RAID card

Slot 5(PCIEx16 x8 link): RAID card
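
To put the x8 versus x16 question in numbers, the sketch below computes the theoretical one-direction peak of each link width. It assumes PCIe 2.0 signalling (5 GT/s per lane with 8b/10b encoding); the thread does not state which PCIe generation the slots negotiate, and the function name is hypothetical.

```python
# Assumption: PCIe 2.0 link (5 GT/s per lane, 8b/10b line encoding).
TRANSFERS_PER_S = 5_000_000_000   # raw line bits per second, per lane

def pcie2_peak_mb_per_s(lanes):
    """Theoretical one-direction peak in MB/s (1 MB = 10**6 bytes)."""
    # 8b/10b carries 8 payload bits for every 10 bits on the wire.
    payload_bits = TRANSFERS_PER_S * lanes * 8 // 10
    return payload_bits // 8 / 1e6

for lanes in (8, 16):
    print(f"x{lanes}: {pcie2_peak_mb_per_s(lanes):.0f} MB/s")
# x8: 4000 MB/s
# x16: 8000 MB/s
```

Under these assumptions an x8 link halves the peak transfer rate. Protocol overhead keeps real transfers below these figures, and a narrower link would normally be expected to cost bandwidth rather than cause hangs.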

Tom_Milledge · February 28, 2010, 2:39am

I haven’t seen this specific problem, but check your BIOS IGP settings. I believe what you want (if you have the option) is to enable both GPUs, but default to the IGP. This is not guaranteed to fix the problem, but it’s something to try.

andrew_cooke · March 11, 2010, 7:20pm

Hi, thanks for the comment. There’s nothing in the BIOS, just a jumper on the board to enable the IGP (which we have enabled).

andrew_cooke · March 11, 2010, 7:33pm

We have now moved to software RAID for one set of disks and removed the card from Slot 5 (see previous comment). That means the slot containing the Tesla is now PCIe x16.

With the 3.0 Beta drivers installed:

oclDeviceQuery takes ~5s to start, but appears OK.
oclBandwidthTest gives a -5 (CL_OUT_OF_RESOURCES) error after the host-to-device test.
oclVectorAdd hangs.

We have put a huge amount of work into this (moving from hardware to software RAID is not fun). We have a customer waiting for results. This is extremely frustrating…

Andrew

./oclDeviceQuery
oclDeviceQuery.exe Starting…

OpenCL SW Info:

CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.0 CUDA 3.0.1
OpenCL SDK Version: 4954966

OpenCL Device Info:

1 devices found supporting OpenCL:

Device Tesla C1060

CL_DEVICE_NAME: Tesla C1060
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 195.17
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 30
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1296 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1023 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 4095 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_IMAGE 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64

CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.3
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 16384
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_FALSE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_ CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1

oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.0.1, SDK Version = 4954966, NumDevs = 1, Device = Tesla C1060

System Info:

Local Time/Date = 14:17:53, 03/11/2010
CPU Name: Intel® Xeon® CPU L5520 @ 2.27GHz

# of CPU processors: 16

Linux version 2.6.18-164.11.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Wed Jan 20 07:32:21 EST 2010

TEST PASSED

Press <ENTER> to Quit…

./oclBandwidthTest
./oclBandwidthTest Starting…

Running on…

Device Tesla C1060
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3874.7

!!! Error # -5 (CL_OUT_OF_RESOURCES) at line 617 , in file oclBandwidthTest.cpp !!!

Exiting…

./oclVectorAdd
./oclVectorAdd Starting…

# of float elements per Array = 11444777

Global Work Size = 11444992
Local Work Size = 256

# of Work Groups = 44707

Allocate and Init Host Mem…
clGetPlatformID…
clGetDeviceIDs…
clCreateContext…
clCreateCommandQueue…
clCreateBuffer…
oclLoadProgSource (VectorAdd.cl)…
clCreateProgramWithSource…
clBuildProgram…
clCreateKernel (VectorAdd)…
clSetKernelArg 0 - 3…

clEnqueueWriteBuffer (SrcA and SrcB)…
clEnqueueNDRangeKernel (VectorAdd)…

[hung at this point]
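
The work sizes in the oclVectorAdd log follow from rounding the element count up to a multiple of the local work size, since OpenCL 1.0 requires the global size to be evenly divisible by the local size when one is given. The Python sketch below reproduces that arithmetic; the helper name `round_up` is hypothetical and not taken from the SDK source.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

num_elements = 11444777      # "# of float elements per Array" in the log
local_work_size = 256        # "Local Work Size" in the log

global_work_size = round_up(num_elements, local_work_size)
num_work_groups = global_work_size // local_work_size

print(global_work_size)      # 11444992, matches "Global Work Size"
print(num_work_groups)       # 44707, matches the work group count
```

The 215 padding work items (11444992 - 11444777) are normally skipped by a bounds check inside the kernel.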

andrew_cooke · March 11, 2010, 8:16pm

Also, to complete the catalogue of frustration and fail, exactly the same results with the old CUDA libs mentioned in the first post:

./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: “Tesla C1060”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…

./bandwidthTest
Running on…
device 0:Tesla C1060
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3866.8

Quick Mode
Device to Host Bandwidth for Pageable memory

[hangs at this point]
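
As a sanity check, the deviceQuery figures above can be compared with the oclDeviceQuery output posted earlier. The Python sketch below only redoes the arithmetic on numbers copied from the two logs; that oclDeviceQuery truncates to whole MBytes of 2**20 bytes is an assumption.

```python
MIB = 1 << 20

total_bytes = 4294705152     # "Total amount of global memory" (deviceQuery)
multiprocessors = 30         # "Number of multiprocessors"
cores = 240                  # "Number of cores"

print(total_bytes / MIB)             # 4095.75
print(total_bytes // MIB)            # 4095, as CL_DEVICE_GLOBAL_MEM_SIZE
print(total_bytes // 4 // MIB)       # 1023, as CL_DEVICE_MAX_MEM_ALLOC_SIZE
print(cores // multiprocessors)      # 8 cores per multiprocessor
```

Both stacks therefore report the same device properties, which is consistent with enumeration working while transfers and kernel launches fail.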

Tom_Milledge · March 12, 2010, 8:35pm

I am using the following driver/toolkit/SDK combination on CentOS 5.4, supporting GTX 275 and Tesla C1060 GPUs:

cudadriver_2.3_linux_64_190.18.run
cudatoolkit_2.3_linux_64_rhel5.3.run
cudasdk_2.3_linux.run

When I install the driver, I select NO when asked whether to modify the X config. If I were you, I would try reinstalling the driver first (the version above if you are superstitious) and run bandwidthTest again. If you get the same result, I would try installing another CUDA-compatible GPU to see whether the problem is your C1060.

andrew_cooke · March 12, 2010, 9:40pm

Hmmm. We haven’t tried that exact driver (I have been reinstalling drivers frequently, but have stayed with the two OpenCL compatible releases I know of - the first stable and the recent CUDA 3 beta). We have used another GPU with that hardware (a 9800) and it works just fine. Unfortunately that card is the one now installed, since it works, so I can’t try the particular driver you have used until Monday when we can swap the Tesla back (I don’t have physical access to the machine). And I guess that’s pre-OpenCL? Still, we can at least try it.

We’ve also got a ticket raised via our supplier that we’re hoping will get us a second card to check against (or just our money back - we need 64bit for a client, so are considering “other options”)

andrew_cooke · March 24, 2010, 11:55pm

To follow up on this - we FINALLY got a replacement card, which appears to work just fine.

The support from NVidia, via Silicon Mechanics, has been appalling.

And it turns out that the Tesla C1060 is only about 3x as fast as a $200 9800 GT card from newegg.

Thankfully I’m using OpenCL, so hopefully we can switch to AMD/ATI in the future.

Andrew
