CUDA streams not fully utilized

Can anyone help me figure out why my streams are not executing concurrently?
I have a job that finishes in 0.6 ms on a single stream.
When I divide this job into 16 smaller jobs, each launched on its own stream, the total time is 1.3 ms.
In the nvvp profiler, only two streams execute concurrently.
Within a single stream, the gap between one job and the next is very long.

My GPU is a P5000.

Here is the resource usage for each small job:
number of blocks: 22,
threads per block: 64,
registers per thread: 39,
shared memory per block: 4096 KB.
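
As a rough sanity check on whether these figures alone could limit concurrency, the helper below estimates how many such blocks fit on one SM. It is only a sketch under stated assumptions: the per-SM limits are those of compute capability 6.1 (assumed to apply to the P5000), allocation granularity for registers and shared memory is ignored, and the shared memory figure is taken as 4096 bytes, since 4096 KB would exceed the 48 KB per-block limit. The names `SmLimits` and `blocks_per_sm` are hypothetical, and the code is host-only C++ that should also build inside a .cu file.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for compute capability 6.1 (not queried from the device).
struct SmLimits {
    int max_threads = 2048;        // resident threads per SM
    int max_blocks = 32;           // resident blocks per SM
    int registers = 65536;         // 32-bit registers per SM
    int shared_bytes = 96 * 1024;  // shared memory per SM, in bytes
};

// Upper bound on resident blocks per SM for one kernel configuration.
// Ignores allocation granularity, so the real number may be slightly lower.
int blocks_per_sm(const SmLimits& sm, int threads_per_block,
                  int regs_per_thread, int shared_bytes_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_threads = sm.max_threads / threads_per_block;
    int by_regs = sm.registers / (threads_per_block * regs_per_thread);
    // A kernel that uses no shared memory is not limited by it.
    int by_shared = shared_bytes_per_block > 0
                        ? sm.shared_bytes / shared_bytes_per_block
                        : sm.max_blocks;
    return std::min({sm.max_blocks, by_threads, by_regs, by_shared});
}
```

With the figures above, `blocks_per_sm(SmLimits{}, 64, 39, 4096)` evaluates to 24, with shared memory as the tightest bound. Assuming 20 SMs, that is about 480 resident blocks device-wide, against 16 x 22 = 352 blocks for the 16 small jobs, so under these assumptions the resource usage by itself would not explain why only two streams overlap.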

Here is the relevant part of my code:

    for (j = 0; j < 16 * 3; j++) {
        cudaStreamCreateWithFlags(&cuda_parm.stream[j], cudaStreamNonBlocking);
        //cudaStreamCreate(&cuda_parm.stream[j]);
    }

    for (i = 0; i < 16; i++) {
        dim3 bb(blocknum[i], group[i].cw_num);
        log<<<bb, threadnum, 0, cuda_parm.stream[i]>>>(turbo_parm->sys_d,
            turbo_parm->sys_d,
            turbo_parm->sys2_d,
            turbo_parm->ypar1_d,
            turbo_parm->alpha_d,
            turbo_parm->beta_d,
            turbo_parm->alpha_pre_1,
            turbo_parm->beta_pre_1,
            turbo_parm->ext_d,
            turbo_parm->ext2_d,
            turbo_parm->decode_ext2,
            turbo_parm->stop_flag,
            turbo_parm->interleaver,
            turbo_parm->de_interleaver,
            num_per_block[i], iteration_cnt, 1, group[i].cw_len, i,
            sys_addr[i], alpha_beta_addr[i], pre_table_addr[i], inter_addr[i]);
    }
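
To separate the launch pattern from the kernel itself, a stripped-down test along the following lines may help. It is a minimal sketch, not the code above: `spin` is a hypothetical stand-in kernel that just busy-waits, the 22 x 64 launch shape is copied from the resource figures, and the cycle count of 100000 is an arbitrary choice. It times 16 launches issued to a single stream and then the same 16 launches spread over 16 non-blocking streams.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: busy-waits for roughly `cycles` GPU clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launches `num_kernels` copies of spin round-robin over the first
// `num_streams` streams and returns the host wall-clock time in ms.
static double run_ms(cudaStream_t* streams, int num_streams, int num_kernels) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < num_kernels; ++i) {
        spin<<<22, 64, 0, streams[i % num_streams]>>>(100000LL);
    }
    cudaDeviceSynchronize();  // wait for every stream to drain
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int kStreams = 16;
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        if (cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking) != cudaSuccess) {
            std::fprintf(stderr, "stream creation failed\n");
            return 1;
        }
    }

    run_ms(streams, kStreams, kStreams);                // warm-up, result discarded
    double one = run_ms(streams, 1, kStreams);          // all launches in one stream
    double many = run_ms(streams, kStreams, kStreams);  // one launch per stream
    std::printf("1 stream: %.3f ms, %d streams: %.3f ms\n", one, kStreams, many);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

If the multi-stream time here is clearly lower than the single-stream time, the streams themselves can overlap on this device, and the remaining suspects would be in the real kernel or in the host-side work between launches.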