I am working on a multi-process test scenario on NVIDIA TX1/TX2.
I created a simple matrix calculation program like the following.
As you can see, the stream is defined as a global variable.
For example,
matrix_cal.c
cudaStream_t stream;
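int main()
{
    cudaStreamCreate(&stream);
    /* The stream is the fourth launch parameter; the third is the dynamic shared memory size. */
    foo<<<blocks, threads, 0, stream>>>();
}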
Then I ran the same binary multiple times on two cores with real-time priority, as shown below.
For example,
sudo taskset -c 3 schedtool -F -p 65 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 64 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 63 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 62 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 61 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 60 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 59 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 58 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 57 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 56 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 55 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 54 -e ./matrix_cal_sum
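The twelve launches above follow a simple pattern: the core alternates between 3 and 4, and the real-time priority counts down from 65 to 54. For anyone who wants to reproduce the setup, here is a minimal sketch that prints the same command lines. The function name gen_cmds is just a hypothetical helper for illustration, and the script only prints the commands (a dry run) rather than running them.

```shell
#!/bin/sh
# Sketch: print the twelve launch commands listed above (dry run only).
# gen_cmds is a hypothetical helper name, not part of any tool.
gen_cmds() {
    prio=65
    while [ "$prio" -ge 54 ]; do
        # Odd priorities were pinned to core 3, even priorities to core 4.
        if [ $((prio % 2)) -eq 1 ]; then core=3; else core=4; fi
        # Every launch except the last one is backgrounded with '&'.
        if [ "$prio" -gt 54 ]; then suffix=" &"; else suffix=""; fi
        echo "sudo taskset -c $core schedtool -F -p $prio -e ./matrix_cal_sum$suffix"
        prio=$((prio - 1))
    done
}

gen_cmds
```

To actually start the processes, the printed lines could be piped into sh, assuming taskset, schedtool, and the matrix_cal_sum binary are available in the current directory.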
In this test, I noticed that cudaStreamCreate took more than 60 seconds, as you can see in the nvprof result.
Is this normal behavior?
How can I avoid this long overhead when running multiple identical processes?
Thanks
