I am having a problem running a CUDA program on my school's Linux machine, which I connect to from my laptop over ssh. When I run a very simple program (see below), there is a delay of about 5 seconds (wall time) before the result is generated. When I run the same code on my desktop, there is no delay.
The school Linux machine has a Tesla K20m GPU, with CUDA driver version 9.0 and runtime version 8.0.
This is the result of running the squaring code:
The elapsed time in GPU was 1.111872 ms
CPU time: 0.844870 s
Wall time: 4.777729 s
My desktop environment is a Windows machine with a GeForce GTX 1050 Ti and CUDA toolkit 8.0 installed.
The result of running the same code on my desktop:
The elapsed time in GPU was 0.76 ms
CPU Time = 0.0000000000 s
Wall Time = 0.1014325496 s
Furthermore, we ran a bandwidth test on the school Linux machine with the following results:
Device 0: Tesla K20m
Host to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6153.9
Device to Host Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4553.6
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 145612.3
Result = PASS
What could be causing the big wall time on the Linux machine but not on my desktop?
Any advice is highly appreciated.
----- Update -----
After setting the GPUs to persistence mode, the wall time for running the squaring code on the school Linux machine dropped to 2.54 s. However, the same code on my desktop takes only 0.21 s. Does anyone have any thoughts on this problem? I sincerely appreciate your help.
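One way to narrow this down might be to time the first CUDA runtime call separately from the kernel itself. The sketch below is not the original squaring code: the `square` kernel, the `wall_since` helper, and the array size are placeholders chosen for illustration. It assumes the extra wall time is spent in CUDA context creation, which a call to `cudaFree(0)` is commonly used to trigger.

```cuda
#include <chrono>
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the squaring code (an assumption).
__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

using Clock = std::chrono::steady_clock;

// Wall-clock seconds elapsed since `start`.
static double wall_since(Clock::time_point start) {
    return std::chrono::duration<double>(Clock::now() - start).count();
}

int main() {
    const int n = 1 << 20;  // arbitrary problem size
    const Clock::time_point t0 = Clock::now();
    const std::clock_t c0 = std::clock();

    // Phase 1: the first runtime call forces driver and context initialization.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    const double init_s = wall_since(t0);

    // Phase 2: allocate, then time only the kernel with CUDA events.
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    square<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);

    // std::clock reports processor time on Linux; time spent blocked
    // inside the driver counts toward wall time but not toward this value.
    const double cpu_s = double(std::clock() - c0) / CLOCKS_PER_SEC;
    const double wall_s = wall_since(t0);

    std::printf("Context init (wall): %f s\n", init_s);
    std::printf("Kernel (GPU events): %f ms\n", gpu_ms);
    std::printf("CPU time: %f s\n", cpu_s);
    std::printf("Wall time: %f s\n", wall_s);
    return 0;
}
```

If the context init line accounts for most of the wall time, that would suggest the delay is in driver or context start-up rather than in the kernel or the memory transfers, which would be consistent with the GPU time staying around 1 ms.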