cudaStreamCreate overhead too long in multi-process test

I am working on a multi-process test scenario on NVIDIA TX1/TX2.

I created a simple matrix calculation program like the following.
As you can see, the stream is defined as a global variable.

For example,


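cudaStream_t stream;                        // global; created with cudaStreamCreate(&stream) before the launch
    foo<<<blocks, threads, 0, stream>>>();  // 3rd launch argument is the dynamic shared memory size, the stream is the 4th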

Then, I ran the same binary multiple times on two cores with real-time priorities, as shown below.

For example,

sudo taskset -c 3 schedtool -F -p 65 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 64 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 63 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 62 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 61 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 60 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 59 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 58 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 57 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 56 -e ./matrix_cal_sum &
sudo taskset -c 3 schedtool -F -p 55 -e ./matrix_cal_sum &
sudo taskset -c 4 schedtool -F -p 54 -e ./matrix_cal_sum

In this test, I noticed that cudaStreamCreate took more than 60 seconds, as you can see in the nvprof result.
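
One way to check whether that time is really spent in stream creation would be to time the first CUDA runtime call and cudaStreamCreate separately. The following is only a minimal sketch, not my actual program. It assumes the runtime creates the CUDA context lazily on the first call, so that cost could otherwise be attributed to cudaStreamCreate; I have not confirmed that this is the cause. The empty kernel foo, the 1x1 launch configuration, and the CHECK and ms_since helpers are placeholders for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel called foo above (assumption: the real one does the matrix work).
__global__ void foo() {}

// Print the error and exit main with status 1 if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Host wall-clock milliseconds elapsed since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    // cudaFree(0) does no real work; it is used here only as the first
    // runtime call, so any lazy context creation is timed on its own.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaFree(0));
    std::printf("first runtime call: %.3f ms\n", ms_since(t0));

    // Time stream creation separately, after the context already exists.
    cudaStream_t stream;
    t0 = std::chrono::steady_clock::now();
    CHECK(cudaStreamCreate(&stream));
    std::printf("cudaStreamCreate:   %.3f ms\n", ms_since(t0));

    // Launch on the stream: <<<grid, block, dynamic shared memory bytes, stream>>>.
    foo<<<1, 1, 0, stream>>>();
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

If the first line reports most of the time when the processes are started with the taskset/schedtool commands above, the delay would be in context setup rather than in cudaStreamCreate itself.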

Is this normal behavior?
How can I avoid this long overhead when running multiple identical processes?