Hi all,
I have four GTX 690 cards on a TYAN motherboard with two Opteron 6272 CPUs, running Debian Linux with a 3.2.0 kernel and CUDA 5.0. The bandwidth test gives the following:
[CUDA Bandwidth Test] - Starting…
Running on…
Device 0: GeForce GTX 690
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5608.6
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6523.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 149631.2
I think ~150 GB/s is too low for a single die. According to the documentation I expected 384/2 = 192 GB/s.
What do you guys think? I have not tuned the system at all; I simply installed everything.
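For reference, the comparison above can be put into numbers. This is a minimal sketch, assuming the 384 GB/s documentation figure covers both GPUs on the card and that the tool's MB/s can be read as 10^6 bytes per second (if it actually uses 2^20-byte units, the fraction comes out a few percent higher). The function names are made up for illustration.

```cpp
// Hypothetical helpers, not part of the CUDA bandwidth test.

// Per-GPU peak, assuming the card-level figure is split evenly
// between the GPUs on the card (two on a GTX 690).
double per_gpu_peak_gb_s(double card_peak_gb_s, int gpus_per_card)
{
    return card_peak_gb_s / gpus_per_card;
}

// Fraction of the theoretical peak that a measured bandwidth represents.
// measured_mb_s : value printed by the bandwidth test, in MB/s
// peak_gb_s     : theoretical peak for ONE GPU, in GB/s
double fraction_of_peak(double measured_mb_s, double peak_gb_s)
{
    // Assumption: 1 GB = 1000 MB (decimal units).
    double measured_gb_s = measured_mb_s / 1000.0;
    return measured_gb_s / peak_gb_s;
}
```

With the numbers from this post, `fraction_of_peak(149631.2, per_gpu_peak_gb_s(384.0, 2))` comes to about 0.78, i.e. the measured device-to-device rate is roughly 78% of the per-GPU figure.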
Thanks, Jimmy! I’m looking through the code you suggested, but I can’t compile it as is. There is an error in the time-measuring function:
[b]reduction_main.cu(73): error: declaration is incompatible with previous “get_clock” (70): here
reduction_main.cu(74): error: expected a “;”[/b]
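The original `get_clock` is not shown here, so the cause of the conflict cannot be pinned down from the message alone. As a hedged alternative, a wall-clock timer on Linux can be sketched with POSIX `clock_gettime`; the name `wall_seconds` is hypothetical and chosen so that it does not collide with `get_clock`.

```cpp
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC, struct timespec

// Hypothetical replacement timer (not the code from the thread).
// Returns seconds counted from an arbitrary starting point;
// only the difference between two calls is meaningful.
double wall_seconds()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Typical use is `double t0 = wall_seconds();` before the work and `double dt = wall_seconds() - t0;` after it. When timing a kernel this way, call `cudaDeviceSynchronize()` before the second reading, since kernel launches return immediately. On older glibc versions, linking may need `-lrt`.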
I can’t understand what els_per_x_group is in your code. I’m trying to work out the value of iters (line 175).
I noticed you use 64 threads; why not 32? When you use registers for the reduction, I see you use only 32 threads. I also noticed you use 256 floats of shared memory; is that limited by occupancy?
You had something like 10-s or ~100 usec… I think there is still some problem in the gettime function. Also, my compiler rejects void pause and fgetchar. I patched it as follows:
int pause()
{
By the way, how are you pasting the code into that white box?
And one more question… does 64 really improve occupancy, and why? I can run 16 blocks of 32 threads on the same SM, right? Isn’t that enough?
I’ve also experimented with this and found that the performance improvement for 64 threads vs. 32 threads is very small.
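The 32-versus-64 question can be checked with a little arithmetic. This is a rough sketch, assuming the compute capability 3.0 limits that apply to the GTX 690 (16 resident blocks, 64 resident warps and 48 KB of shared memory per SM) and ignoring register pressure. `resident_warps` and `occupancy` are illustrative names, not CUDA API calls.

```cpp
// Assumed per-SM limits for compute capability 3.0 (GK104).
const int kWarpSize       = 32;
const int kMaxBlocksPerSM = 16;
const int kMaxWarpsPerSM  = 64;
const int kSmemPerSM      = 48 * 1024;  // bytes, assuming the 48 KB shared-memory configuration

static int min_int(int a, int b) { return a < b ? a : b; }

// Warps resident on one SM for a given block size and shared-memory use.
// Register usage is deliberately ignored in this sketch.
int resident_warps(int threads_per_block, int smem_bytes_per_block)
{
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;

    int blocks = kMaxBlocksPerSM;                                     // block-count limit
    blocks = min_int(blocks, kMaxWarpsPerSM / warps_per_block);       // warp-count limit
    if (smem_bytes_per_block > 0)
        blocks = min_int(blocks, kSmemPerSM / smem_bytes_per_block);  // shared-memory limit

    return blocks * warps_per_block;
}

// Occupancy as the fraction of the SM's warp slots that are filled.
double occupancy(int threads_per_block, int smem_bytes_per_block)
{
    return (double)resident_warps(threads_per_block, smem_bytes_per_block) / kMaxWarpsPerSM;
}
```

Under these assumptions, 16 blocks of 32 threads give 16 resident warps (25% occupancy), while 16 blocks of 64 threads give 32 warps (50%). 256 floats (1 KB) of shared memory per block would allow 48 blocks, so the 16-block limit binds first. Higher occupancy does not by itself guarantee higher speed, which would be consistent with the small difference measured.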