Slow first cudaMalloc (K40 vs K20)

I understand that CUDA performs its initialization during the first API call, but the time it takes seems excessive.

The same program is built in both cases with CUDA 7.0 (compute_35) + Visual Studio 2012 + Nsight 4.5.
I already call cudaSetDevice before the first cudaMalloc.

On my PC (Win7 + Tesla K20), the first cudaMalloc takes 150 ms.
On my server (Win2012 + Tesla K40), it takes 1100 ms!
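
For anyone who wants to reproduce this, here is a minimal sketch of one way to separate the one-time initialization cost from the cost of cudaMalloc itself. It is not the exact program I measured. It assumes device index 0 and an arbitrary 1 MiB allocation, and the helpers elapsed_ms and check are made up for illustration. It also relies on my understanding that, in CUDA versions of this era, cudaSetDevice alone does not create the context, and that cudaFree(0) is a common idiom to force context creation up front.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: runs a callable and returns host wall-clock time in ms.
template <typename F>
double elapsed_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: aborts with a message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    void* p1 = nullptr;
    void* p2 = nullptr;
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for the test

    // Selecting the device (assumed to be device 0).
    double t_set = elapsed_ms([&] { check(cudaSetDevice(0), "cudaSetDevice"); });

    // Assumption: this no-op free forces the lazy context creation to happen here.
    double t_init = elapsed_ms([&] { check(cudaFree(0), "cudaFree(0)"); });

    // With the context already created, these should show the allocation cost only.
    double t_m1 = elapsed_ms([&] { check(cudaMalloc(&p1, bytes), "first cudaMalloc"); });
    double t_m2 = elapsed_ms([&] { check(cudaMalloc(&p2, bytes), "second cudaMalloc"); });

    std::printf("cudaSetDevice     : %.2f ms\n", t_set);
    std::printf("cudaFree(0) (init): %.2f ms\n", t_init);
    std::printf("first cudaMalloc  : %.2f ms\n", t_m1);
    std::printf("second cudaMalloc : %.2f ms\n", t_m2);

    check(cudaFree(p1), "cudaFree(p1)");
    check(cudaFree(p2), "cudaFree(p2)");
    return 0;
}
```

If the assumption holds, most of the 150 ms or 1100 ms should move to the cudaFree(0) line, which would confirm that the delay is initialization rather than the allocation itself.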

My questions are:
1. Why does the K40 take so much longer (1100 ms vs 150 ms) for the first cudaMalloc? The K40 is supposed to be the better card.
2. If the initialization is unavoidable, can process A keep its state (or context) on the GPU while process B is running on the same GPU? I understand that I should ideally run the GPU in "exclusive" mode, but can process A be "suspended" so that it doesn't need to initialize the GPU again later?

Thanks in advance

FYI, I got an answer here:

Cross-posted: