When I create a simple application and test the difference between calling cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); and not calling it, what I observe at the point of cudaDeviceSynchronize() is this: when I don't use the flag, top reports about 100% CPU usage by the process. When I do use it, top reports approximately 0% CPU usage by the process. Here is the test case, first built without the flag and then with it:
$ cat t5.cu
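// GPU clock cycles the kernel busy-waits for
const unsigned long long delay = 30000000000ULL;

// Busy-wait on the GPU so the host has something long-running to wait on
__global__ void k(){
  unsigned long long start = clock64();
  while (clock64() < (start+delay));
}

int main(){
#ifdef USE_BLOCK
  // Ask the host thread to block rather than spin while it waits for the GPU
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
#endif
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}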
$ nvcc -o t5 t5.cu
$ ./t5 &
[1] 9970
$ top -p 9970
top - 10:02:37 up 297 days, 13:36, 2 users, load average: 0.27, 0.15, 0.09
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.8 us, 1.4 sy, 0.0 ni, 95.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13183105+total, 11599987+free, 1562592 used, 14268592 buff/cache
KiB Swap: 4194300 total, 3770732 free, 423568 used. 12943844+avail Mem
  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9970 user2    20   0  0.157t 131180 122196 R  99.7  0.1   0:09.94 t5
[1]+ Done ./t5
$ nvcc -o t5 t5.cu -DUSE_BLOCK
$ ./t5 &
[1] 10008
$ top -p 10008
top - 10:04:18 up 297 days, 13:38, 2 users, load average: 0.11, 0.14, 0.09
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13183105+total, 11600230+free, 1560092 used, 14268668 buff/cache
KiB Swap: 4194300 total, 3770732 free, 423568 used. 12944084+avail Mem
  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10008 user2    20   0  0.157t 131088 122108 S   0.0  0.1   0:00.29 t5
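
The same comparison can also be made from inside the program, without top, by measuring wall-clock time and process CPU time around the synchronize call. The following is only a sketch under some assumptions: the names spin and timed_sync, the "block" command-line argument, and the shorter cycle count are all made up for this illustration, std::clock() is assumed to report CPU time for the whole process (as it does on Linux), and cudaFree(0) is used as a common idiom to force context creation before the timed region. If the host thread spins, CPU time should come out close to wall time; if it blocks, CPU time should be close to zero.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <ctime>

// Busy-wait on the GPU for roughly `cycles` clock ticks (hypothetical helper kernel)
__global__ void spin(unsigned long long cycles) {
  unsigned long long start = clock64();
  while (clock64() - start < cycles);
}

// Launch the kernel and wait for it, reporting wall time and process CPU time in seconds
cudaError_t timed_sync(unsigned long long cycles, double *wall_s, double *cpu_s) {
  auto wall0 = std::chrono::steady_clock::now();
  std::clock_t cpu0 = std::clock();
  spin<<<1,1>>>(cycles);
  cudaError_t err = cudaDeviceSynchronize();
  std::clock_t cpu1 = std::clock();
  auto wall1 = std::chrono::steady_clock::now();
  *wall_s = std::chrono::duration<double>(wall1 - wall0).count();
  *cpu_s = double(cpu1 - cpu0) / CLOCKS_PER_SEC;
  return err;
}

int main(int argc, char **argv) {
  // Pass "block" as the first argument to request blocking sync (assumed convention)
  bool blocking = (argc > 1 && std::strcmp(argv[1], "block") == 0);
  if (blocking) {
    // Set the flag before anything else touches the device
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
      std::fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
      return 1;
    }
  }
  cudaFree(0);  // force context creation so it is not counted in the timings

  double wall_s = 0.0, cpu_s = 0.0;
  cudaError_t err = timed_sync(3000000000ULL, &wall_s, &cpu_s);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "kernel/sync failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("%s: wall %.2f s, cpu %.2f s\n",
              blocking ? "blocking" : "default", wall_s, cpu_s);
  return 0;
}
```

Built with nvcc in the same way as t5.cu, this can be run once with no argument and once with the "block" argument to compare the two printed CPU times. The exact numbers will depend on the GPU clock rate and on what else the driver does on the host.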