Big delay on output for a CUDA program on a Tesla K20m / Linux machine

I have a problem running a CUDA program on my school's Linux machine, which I connect to from my laptop over ssh. When I run a very simple squaring program there, it takes about 5 seconds of wall time to produce the result. If I run the same code on my desktop, there is no noticeable delay.

The school Linux machine has a Tesla K20m GPU, with CUDA driver version 9.0 and runtime version 8.0.
This is the result of running the squaring code there:
The elapsed time in GPU was 1.111872 ms
CPU time: 0.844870 s
Wall time: 4.777729 s

My desktop is a Windows machine with a GeForce GTX 1050 Ti and CUDA toolkit 8.0 installed.
This is the result of running the same code there:

The elapsed time in GPU was 0.76 ms
CPU Time = 0.0000000000 s
Wall Time = 0.1014325496 s

Furthermore, we ran a bandwidth test on the school Linux machine with the following results:
Device 0: Tesla K20m

Host to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6153.9

Device to Host Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4553.6

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 145612.3

Result = PASS

What could be causing the large wall time on the Linux machine but not on my desktop?

Any advice is highly appreciated.

Update:

After setting the GPUs to persistence mode, the wall time for the squaring code on the school Linux machine is down to 2.54 s. However, the same code on my desktop takes just 0.21 s. Does anyone have any thoughts on this problem? I sincerely appreciate your help.

BTW, the code is just a very simple squaring program, so a delay this long does not seem reasonable.

GPUs can have a long startup time. You should be able to mitigate some of that by placing the GPUs in persistence mode, which keeps the driver loaded between runs. The GPU in your desktop is effectively in persistence mode already.
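
To see how little of the wall time the kernel itself accounts for, the numbers posted in the question can be broken down. This is only a rough Python sketch: it assumes the three figures printed by each run describe the same execution, and the helper `breakdown` and its field names are made up for illustration.

```python
def breakdown(wall_s, cpu_s, gpu_ms):
    """Split the reported timings into kernel time and everything else.

    Assumes wall_s, cpu_s and gpu_ms were all reported by the same run.
    """
    gpu_s = gpu_ms / 1000.0  # the GPU time is reported in milliseconds
    return {
        "gpu_s": gpu_s,
        # Wall time spent anywhere other than in the timed GPU work.
        "outside_kernel_s": wall_s - gpu_s,
        # Share of the wall time taken by the timed GPU work, in percent.
        "kernel_share_pct": 100.0 * gpu_s / wall_s,
        # Wall time not covered by the reported CPU time.
        "wall_minus_cpu_s": wall_s - cpu_s,
    }


# Figures copied from the question.
linux = breakdown(wall_s=4.777729, cpu_s=0.844870, gpu_ms=1.111872)
windows = breakdown(wall_s=0.1014325496, cpu_s=0.0, gpu_ms=0.76)

if __name__ == "__main__":
    for name, result in (("Linux / K20m", linux), ("Windows / 1050 Ti", windows)):
        print(name)
        for key, value in result.items():
            print(f"  {key}: {value:.6f}")
```

On the Linux numbers the timed GPU work is well under 0.1% of the wall time, which is consistent with the delay coming from startup rather than from the computation.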

Thanks a lot for your reply. Could you please tell me how to put the GPU in persistence mode?
Update: I found out how to put the GPU in persistence mode. The wall time is now 2.55 s. Thank you so much!
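
For anyone finding this later: on Linux, persistence mode is normally switched on with `nvidia-smi -pm 1`, run as root. To check what it buys you, it helps to time several back-to-back launches before and after the change. Below is a minimal Python sketch of such a timer; the function name `time_command` is made up for illustration, and the demo times the Python interpreter itself as a stand-in, so swap in the path to your own compiled CUDA program.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Launch cmd `runs` times in a row and return each wall time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the program exits with a non-zero status.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # path to your own binary, e.g. ["./square"] (hypothetical name).
    cmd = [sys.executable, "-c", "pass"]
    timings = time_command(cmd)
    for i, t in enumerate(timings, start=1):
        print(f"run {i}: {t:.3f} s")
    print(f"median: {statistics.median(timings):.3f} s")
```

Running this once with persistence mode off and once with it on gives a like-for-like comparison of the whole-process wall time, including process startup, which is what the 4.78 s and 2.55 s figures above measure.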