Hi, my code has two CPU threads and only one GPU (M4000). Each CPU thread does its GPU calculation in a different GPU memory region. An image enters the first CPU thread every 40 ms; once the first thread finishes processing it, the image moves on to the second CPU thread, while a new image enters the first thread…
As a test, I reduced both CPU threads to nothing but host-to-device cudaMemcpy calls. Most copies finish very quickly, in less than 1 ms, but the peak time is more than 800 ms. I also switched to pinned host memory to make cudaMemcpy faster, but the peaks still occur.
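For reference, here is a minimal sketch of the kind of test described above, not my exact code. The image size, frame count, and names such as worker, kImageBytes, and kFrames are assumptions made for illustration, and it uses C++11 threads and timers, so it assumes a toolchain where nvcc accepts -std=c++11 (on CUDA 6.0 these would likely need to be swapped for pthreads and a platform timer).

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <cuda_runtime.h>

// Assumed values for illustration only: one 1920x1080 8-bit image per frame.
static const size_t kImageBytes = 1920 * 1080;
static const int kFrames = 250;

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Each CPU thread owns its own pinned host buffer and device buffer,
// copies one image every 40 ms, and records the slowest copy it saw.
void worker(int id) {
    unsigned char *hostBuf = nullptr;
    unsigned char *devBuf = nullptr;

    CHECK(cudaSetDevice(0));
    CHECK(cudaMallocHost((void **)&hostBuf, kImageBytes));  // pinned memory
    CHECK(cudaMalloc((void **)&devBuf, kImageBytes));
    memset(hostBuf, 0, kImageBytes);

    double worstMs = 0.0;
    for (int i = 0; i < kFrames; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        // Synchronous host-to-device copy, as in the test described above.
        CHECK(cudaMemcpy(devBuf, hostBuf, kImageBytes, cudaMemcpyHostToDevice));
        auto t1 = std::chrono::steady_clock::now();

        double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > worstMs) worstMs = ms;

        // Simulate the 40 ms image interval.
        std::this_thread::sleep_for(std::chrono::milliseconds(40));
    }
    printf("thread %d: worst cudaMemcpy time = %.3f ms\n", id, worstMs);

    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
}

int main() {
    std::thread first(worker, 0);
    std::thread second(worker, 1);
    first.join();
    second.join();
    return 0;
}
```

Timing each copy on the host side like this includes any driver or scheduling delay, not just the PCIe transfer itself, which is the figure where the peak shows up.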
I want to know the reason.
My CUDA version is 6.0, and the PCIe link is 3.0 x16.