OpenCL program on two GPU cards: slower on two P100 cards than on one

Hi, all:

I have a program written with OpenCL 1.2 that runs on two GPU cards at the same time. On the CPU side, I use OpenMP or Pthreads to start two threads, and each thread drives one GPU card. I have two machines: one has two K40 cards, and the other has two P100 cards. The system environments of the two machines differ somewhat, but the compiler and runtime library are the same: both use the Intel compiler 2016.

That is the description of my program.

First, I ran my program on the two K40 cards. Everything was fine: with two cards it runs about 1.6 times as fast as with a single card.

Then I ran my program on the two P100 cards, and something interesting happened: with two cards it is slower than with a single card.

Here is an example.

When the data is on only one card, the time is 150 ms.
When the data is split across two cards, each card processes only half of the data:
When the two cards run in parallel, the total time is 220 ms; that is, each card runs for a little less than 220 ms.
When the two cards run sequentially, the total time is 160 ms; that is, each card runs for about 80 ms.
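
To put these numbers in the same terms as the K40 result, here is a minimal sketch of the arithmetic. The function names `speedup` and `print_speedups` are made up for illustration; the timings are the ones quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup relative to the single-card baseline:
 * a value above 1 means faster than one card, below 1 means slower. */
double speedup(double single_card_ms, double measured_ms) {
    assert(measured_ms > 0.0); /* guard against division by zero */
    return single_card_ms / measured_ms;
}

/* Prints the speedups for the P100 timings quoted in this post. */
void print_speedups(void) {
    const double single_ms = 150.0;     /* all data on one P100 */
    const double parallel_ms = 220.0;   /* two P100s, run at the same time */
    const double sequential_ms = 160.0; /* two P100s, run one after the other */

    printf("parallel:   %.2fx\n", speedup(single_ms, parallel_ms));
    printf("sequential: %.2fx\n", speedup(single_ms, sequential_ms));
}
```

So on the P100 machine the parallel run reaches only about 0.68x of the single-card speed, and even the sequential run is about 0.94x, compared with about 1.6x on the two K40 cards.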

(1) I double-checked the overhead of CPU thread initialization. It is less than 1 ms, so it can be ignored.
(2) I also confirmed that the problem is not the data transfer between CPU and GPU, because the pure kernel running time in parallel mode is also longer than in sequential mode.
(3) The problem happens in both multithreading environments (OpenMP and Pthreads).
(4) When I use the K40 cards, there is no such problem.
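
For reference, the parallel versus sequential comparison is structured roughly as below. This is only a minimal sketch with Pthreads, not my actual code: `run_on_device` is a hypothetical placeholder for enqueueing the OpenCL kernels on one card and waiting for them to finish, and the names `worker_t`, `now_ms`, and `run_both` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Per-thread record: which card the thread drives and how long its work took. */
typedef struct {
    int device_index;
    double elapsed_ms;
} worker_t;

/* Wall-clock time in milliseconds from a monotonic clock. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Hypothetical placeholder: the real program would enqueue its kernels on
 * the command queue of this card and block until they complete. */
static void run_on_device(int device_index) {
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        sink += (double)(i + device_index);
    }
}

/* Thread body: time only the work done for one card. */
static void *worker(void *arg) {
    worker_t *w = (worker_t *)arg;
    double start = now_ms();
    run_on_device(w->device_index);
    w->elapsed_ms = now_ms() - start;
    return NULL;
}

/* Runs both workers, either at the same time (parallel != 0) or one after
 * the other, and returns the total wall time in ms (-1.0 on thread error). */
double run_both(worker_t w[2], int parallel) {
    pthread_t t[2];
    double start = now_ms();
    if (parallel) {
        for (int i = 0; i < 2; ++i) {
            if (pthread_create(&t[i], NULL, worker, &w[i]) != 0) return -1.0;
        }
        for (int i = 0; i < 2; ++i) {
            pthread_join(t[i], NULL);
        }
    } else {
        for (int i = 0; i < 2; ++i) {
            if (pthread_create(&t[i], NULL, worker, &w[i]) != 0) return -1.0;
            pthread_join(t[i], NULL);
        }
    }
    return now_ms() - start;
}
```

Calling `run_both(w, 1)` and `run_both(w, 0)` with `w[0].device_index = 0` and `w[1].device_index = 1` gives the total time and the per-card `elapsed_ms` for each mode, which are the quantities compared above.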

Does anyone have any idea? Many thanks.