C1060/C1070 GPUBench results for I/O ops

Has anyone posted GPUBench results for the C1060/C1070? I couldn't find them on GPUbench.org or with a quick search of the forums.

I'm also wondering how many I/O ops they can handle, both practically and theoretically.


“I/O ops?” You mean global memory bandwidth?

The theoretical peaks are listed here:


Note that they multiplied the per-card bandwidth by 4 to get 408 GB/s; it is really only 102 GB/s if you run on one card.

All the Teslas I’ve got access to are HPC systems without any windowing systems installed so I cannot run a graphical GPUBench on them. If it is bandwidth you are interested in, though, here are the results of my bandwidth test script: (http://forums.nvidia.com/index.php?showtopic=52806&hl=bw_test)

On a single GPU inside a Tesla S1070:

copy_gmem<float> - Bandwidth:	71.592189 GiB/s

copy_gmem<float2> - Bandwidth:	76.486863 GiB/s

copy_gmem<float4> - Bandwidth:	74.633183 GiB/s

copy_tex<float> - Bandwidth:	70.903607 GiB/s

copy_tex<float2> - Bandwidth:	76.418907 GiB/s

copy_tex<float4> - Bandwidth:	77.371516 GiB/s

write_only<float> - Bandwidth:	69.588005 GiB/s

write_only<float2> - Bandwidth:	71.199130 GiB/s

write_only<float4> - Bandwidth:	70.569901 GiB/s

read_only_gmem<float> - Bandwidth:	65.726154 GiB/s

read_only_gmem<float2> - Bandwidth:	83.037063 GiB/s

read_only_gmem<float4> - Bandwidth:	46.565674 GiB/s

read_only_tex<float> - Bandwidth:	65.170775 GiB/s

read_only_tex<float2> - Bandwidth:	71.472940 GiB/s

read_only_tex<float4> - Bandwidth:	70.653347 GiB/s

Compare this to a GTX 285 (theoretical peak ~160 GB/s):

copy_gmem<float> - Bandwidth:	121.494258 GiB/s

copy_gmem<float2> - Bandwidth:	126.038586 GiB/s

copy_gmem<float4> - Bandwidth:	104.040466 GiB/s

copy_tex<float> - Bandwidth:	124.938593 GiB/s

copy_tex<float2> - Bandwidth:	129.273315 GiB/s

copy_tex<float4> - Bandwidth:	130.567617 GiB/s

write_only<char> - Bandwidth:	18.899835 GiB/s

write_only<float> - Bandwidth:	73.363346 GiB/s

write_only<float2> - Bandwidth:	75.512689 GiB/s

write_only<float4> - Bandwidth:	73.769699 GiB/s

read_only_gmem<float> - Bandwidth:	69.645715 GiB/s

read_only_gmem<float2> - Bandwidth:	97.900049 GiB/s

read_only_gmem<float4> - Bandwidth:	52.964178 GiB/s

read_only_tex<float> - Bandwidth:	69.641488 GiB/s

read_only_tex<float2> - Bandwidth:	109.820544 GiB/s

read_only_tex<float4> - Bandwidth:	105.827061 GiB/s


Why is read_only_tex<float2> so much faster than read_only_tex<float>? Is it just an artifact, or is it actually better to read float2 from textures than plain floats?

read_only_tex<float> - Bandwidth:	69.641488 GiB/s

read_only_tex<float2> - Bandwidth:	109.820544 GiB/s

read_only_tex<float4> - Bandwidth:	105.827061 GiB/s



In terms of bandwidth, the more bytes you can read in a single texture read, the better. There are two limitations that come into play: 1) bandwidth, and 2) the hardware can only serve so many texture reads per second (regardless of their size). For texture reads of only 4-byte quantities (floats), the second can become the bottleneck.

In non-toy kernels, the extra instructions needed to calculate several texture addresses can also add overhead when you compare multiple float texture reads against a single float2/float4 texture read.

edit: answered before I posted

A bit off topic: MisterAnderson42, are there no windowing systems installed on those HPC servers, or do you just not have access to them via ssh (or whatever you're using to communicate with them)? I'm asking because I thought the Tesla drivers could only be installed with the graphics drivers, and those require a windowing system.

There is a script in the Linux release notes that shows how to load the driver without starting X.
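For reference, that script looks approximately like this (reproduced from memory, so check your copy of the release notes for the authoritative version):

```shell
#!/bin/bash
# Load the NVIDIA kernel module and create the /dev nodes by hand,
# without starting X. Must run as root.
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count NVIDIA devices (VGA and 3D controllers, e.g. Teslas)
  N3D=$(lspci | grep -i nvidia | grep -ci "3d controller")
  NVGA=$(lspci | grep -i nvidia | grep -ci "vga compatible controller")
  N=$(expr $N3D + $NVGA - 1)
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
fi
```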

Some systems do have X-windows installed, but not running (no reason to waste cycles on compute nodes). Regardless, these systems are only accessible by logging into a head node and then submitting batch jobs via a queue, so I can’t exactly plug in a monitor and start X running, especially given that the Tesla system I posted a benchmark from is 3 states away :)

Linux isn't silly about this like Windows is. To Linux, a driver is a driver: just a piece of code that loads into the kernel, and you can access it whether you are sitting at the physical machine or logged in remotely. I've set up several systems without even installing X-windows, and the NVIDIA drivers install and load happily. You do need to run the dev node creation script that mfatica mentioned, though.