GTX 690 low (~77%) bandwidth

Hi, all,
I have 4 GTX 690s on a TYAN motherboard with two Opteron 6272s, running Debian Linux with a 3.2.0 kernel. I use CUDA 5.0. In the bandwidth test I see the following:

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX 690
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5608.6

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6523.4

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 149631.2

I suppose ~150 GB/s is too low for a single die. I expected 384/2 GB/s according to the documentation.
What do you guys think about it? I did not tune the system at all, just installed everything.
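If my arithmetic is right, the per-GPU peak works out from the 256-bit bus at 6008 MHz effective: 6008 × 256 / 8 = 192,256 MB/s ≈ 192.3 GB/s, so 149.6 GB/s is roughly 78% of that.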

I think 78% bandwidth utilization is quite common for the bandwidth test… It really doesn’t push things to a maximum.

You can try this reduction sum benchmark:

It should reach somewhere between 84–93% utilization. For example, I reached 245 GB/s on my GTX Titan, which is roughly ~85%.
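(For reference, the GTX Titan's peak is a 384-bit bus at 6008 MHz effective, i.e. 6008 × 384 / 8 ≈ 288.4 GB/s, and 245 / 288.4 ≈ 85%.)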

  • Make sure you download the updated version for Kepler though!!!

Thanks Jimmy! I'm looking through the code you suggested but I can't compile it as is. There is an error with the time measuring function.

[code]reduction_main.cu(73): error: declaration is incompatible with previous "get_clock" (70): here
reduction_main.cu(74): error: expected a ";"[/code]

I can't understand what els_per_x_group is in your code. I'm trying to find out the value of iters (line 175).
I noticed you use 64 threads; why not 32? When you use registers for the reduction I see you use only 32 threads. I also noticed you use 256 floats of shared memory; is it occupancy-limited?

Isn't there a version of get_clock() for both Windows and Linux in there? I will look into it.

64 threads is for occupancy improvement on post-GT200 hardware.
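Roughly speaking, the limit is the number of resident blocks per SM: Fermi allows at most 8 blocks per SM (1536 threads), so 32-thread blocks cap you at 8 × 32 = 256 resident threads while 64-thread blocks allow 512; on Kepler the limits are 16 blocks and 2048 threads per SMX, so 32-thread blocks top out at 25% occupancy versus 50% with 64-thread blocks.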

Yes, the Kepler shuffle instruction only works on a per-warp basis, hence 32 threads.
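For what it's worth, the warp-level step typically looks something like this (a minimal sketch using the shuffle intrinsic, not the exact code from the benchmark):

// Warp-level sum reduction with the Kepler __shfl_down intrinsic
// (sm_30+, pre-CUDA 9 syntax; newer toolkits use __shfl_down_sync).
// Each of the 32 lanes contributes one value; after the loop,
// lane 0 holds the sum of all 32 values.
__device__ float warp_reduce_sum(float val)
{
	for (int offset = 16; offset > 0; offset >>= 1)
		val += __shfl_down(val, offset);   // add the value held by lane (laneId + offset)
	return val;
}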

Increasing the occupancy shouldn't give any significant further increase, i.e. I already tried that.

There was a typo in the Linux version of the get_clock() definition.

It should be:

#ifdef _WIN32
#include <windows.h>

double get_clock()
{
	LARGE_INTEGER ticksPerSecond;
	LARGE_INTEGER timeStamp;

	QueryPerformanceFrequency(&ticksPerSecond);
	QueryPerformanceCounter(&timeStamp);

	double sec = double(timeStamp.QuadPart)/(double)ticksPerSecond.QuadPart;

	return sec; // returns timestamp in seconds

}
#else
#include <sys/time.h>

unsigned long long int get_clock()
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (unsigned long long int)tv.tv_usec + 1000000*tv.tv_sec; // returns timestamp in microseconds

}
#endif

Hope this compiles better.
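For what it's worth, this is roughly how such a timer would wrap a kernel launch to produce a bandwidth figure (an illustrative sketch only, not the benchmark's actual code; the kernel, buffers and byte count are placeholders). Note the units: the Windows branch above returns seconds while the Linux branch returns microseconds, so the caller has to scale accordingly.

// Illustrative sketch: time a (placeholder) kernel with get_clock() and
// convert the bytes it touches into GB/s.
double t0 = (double)get_clock();
some_kernel<<<grid, block>>>(d_in, d_out, n);   // placeholder launch
cudaDeviceSynchronize();                        // make sure the kernel has finished
double t1 = (double)get_clock();

double seconds = t1 - t0;                       // Windows branch: already in seconds
// double seconds = (t1 - t0) * 1.0e-6;         // Linux branch: microseconds -> seconds
double gbps = bytes_moved / seconds / 1.0e9;    // bytes read + written by the kernel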

It seems that we’re unable to upload files to the forums these days? Hence I cannot rectify the original post.

Jimmy, that is exactly what I did when I found the error I posted,
but the result is the following:

GeForce GTX 690 @ 192.256 GB/s

N [GB/s] [perc] [usec] test
1048576 0.00 0.00 % 38780000.0 Pass
2097152 0.00 0.00 % 66769996.0 Pass
4194304 0.00 0.00 % 121600000.0 Pass
8388608 0.00 0.00 % 231950000.0 Pass
16777216 0.00 0.00 % 453100000.0 Pass
33554432 0.00 0.00 % 896650048.0 Pass
67108864 0.00 0.00 % 1776770048.0 Pass
134217728 0.00 0.00 % 3538520064.0 Pass

You had something like tens of microseconds or ~100 usec… I think there is still a problem in the gettime function. Also my compiler rejects void pause and fgetchar. I patched it as follows:
int pause()
{

#ifdef _WIN32
	system("pause");   /* needs <stdlib.h> */
#else
	getchar();         /* needs <stdio.h> */
#endif
	return 0;

}

btw how are you pasting the code in that white box?

And one more question… does 64 threads really improve occupancy, and why? I can run 16 blocks of 32 threads on the same SM, right? Isn't that enough?
I've also experimented with this and found that the performance improvement for 64 threads vs 32 threads is very small.

I'm not on a Linux system today so I'm not gonna have time to debug it. Perhaps you could debug the timer function? It should be an easy function.

Yes, "fgetchar" only works under Windows, it seems.

To paste code, wrap it in [code] YOUR CODE HERE [/code] tags.

You could cross-check your bandwidth numbers with the simple ‘dcopy’ app I posted a few years ago:

[url]Quadro 4000 Bandwidth - CUDA Programming and Performance - NVIDIA Developer Forums[/url]
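Not njuffa's actual dcopy code, but the idea is roughly this (a minimal sketch; buffer size, block size and names are just examples): copy a large float array device-to-device with a simple kernel, time it with CUDA events, and count both the read and the write.

// Minimal dcopy-style device-to-device bandwidth check (sketch only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, size_t n)
{
	size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
	if (i < n) out[i] = in[i];
}

int main()
{
	const size_t n = 1 << 26;                    // 64M floats = 256 MB per buffer
	float *d_in, *d_out;
	cudaMalloc(&d_in,  n * sizeof(float));
	cudaMalloc(&d_out, n * sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	dim3 block(256);
	dim3 grid((unsigned)((n + block.x - 1) / block.x));

	cudaEventRecord(start);
	copy_kernel<<<grid, block>>>(d_in, d_out, n);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	// Each element is read once and written once -> 2 * n * sizeof(float) bytes.
	double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
	printf("effective bandwidth: %.1f GB/s\n", gbps);

	cudaFree(d_in);
	cudaFree(d_out);
	return 0;
}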