Behaviour when running two programs on a single GPU (Tesla K40m)?

Hi All,
I am running two instances of the same program simultaneously on a Tesla K40. The program launches one kernel with one block containing one thread. From the device properties, I found that the Tesla K40 has 15 SMX units, so I expected each program to take the same time as it does when running alone. But the output did not match my expectation: when the second instance starts executing, the time for the first instance almost doubles. Please find the program code and output below.

Can someone please help me to understand this behaviour?

#include <stdio.h>

#define BLOCK_CNT	1
#define THREAD_CNT	1

#define HEIGHT		640
#define WIDTH		480

//Kernel code for some calculation
__global__ void kernel (char *frameBuf, int height, int width)
{
	for (int ind = 0; ind < height; ind++)
		for (int ind1 = 0; ind1 < width; ind1++)
			frameBuf[ind1 + (ind * width)] = frameBuf[ind1 + (ind * width)] * 5;
}

//Below program is launched twice one after another
int main ()
{
	char *frameBufDev = NULL;
	cudaError_t	retVal = cudaSuccess;
	cudaEvent_t start, stop;
	float elapsedTime;
	int counter = 0;
	while (counter++ < 100)
	{
		retVal = cudaMalloc (&frameBufDev, HEIGHT * WIDTH);
		if (retVal != cudaSuccess)
		{
			printf ("ERROR : frameBufDev mem allocation failed : %s\n",cudaGetErrorString(retVal));
			return 1;
		}
		cudaMemset (frameBufDev, 1, HEIGHT * WIDTH);
		cudaEventCreate (&start);
		cudaEventCreate (&stop);
		cudaEventRecord (start, 0);
		//Launch the kernel with 1 block and 1 thread to do some processing
		kernel<<<BLOCK_CNT, THREAD_CNT>>> (frameBufDev, HEIGHT, WIDTH);
		retVal = cudaGetLastError();
		if (retVal != cudaSuccess)
			printf ("ERROR : kernel launch fail : %s\n",cudaGetErrorString(retVal));
		cudaEventRecord (stop, 0);
		//Wait until the stop event has completed before reading the elapsed time
		cudaEventSynchronize (stop);
		cudaEventElapsedTime(&elapsedTime,start, stop);
		printf ("Time for execution : %f\n", elapsedTime);

		cudaEventDestroy (start);
		cudaEventDestroy (stop);
		cudaFree (frameBufDev);
	}
	return 0;
}

The following is the output with two instances:

Instance one:
Time for execution : 56.887390 //running alone
Time for execution : 56.567455
Time for execution : 56.790207
Time for execution : 113.556671 //second instance started
Time for execution : 113.160355
Time for execution : 113.558113
Time for execution : 113.547363
Time for execution : 113.130402
Time for execution : 56.485153 //second instance stopped
Time for execution : 56.722816

Instance Two:
Time for execution : 56.642815
Time for execution : 56.447166
Time for execution : 56.636543
Time for execution : 56.874016
Time for execution : 56.644161

Thanks in advance.

CUDA operations from separate host processes are serialized. When you run two separate copies of a program, they run in separate host processes, each with its own CUDA context, and the GPU time-slices between contexts instead of running their kernels at the same time. If you want the kernels to run concurrently, you must launch them from the same host process, or else use CUDA MPS on Linux.
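As a rough illustration of the same-process approach, here is a minimal sketch that launches the kernel from the question twice, once in each of two non-default streams. The names `NUM_STREAMS` and `CHECK` are made up for this example, and the host-side timing with `std::chrono` is just one way to measure the total; it is not the original poster's code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define HEIGHT      640
#define WIDTH       480
#define NUM_STREAMS 2   // hypothetical: one stream per former "instance"

// Hypothetical helper: print the error and leave main() on failure
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            printf("ERROR : %s\n", cudaGetErrorString(err_));         \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Same single-thread workload as in the question
__global__ void kernel(char *frameBuf, int height, int width)
{
    for (int ind = 0; ind < height; ind++)
        for (int ind1 = 0; ind1 < width; ind1++)
            frameBuf[ind1 + (ind * width)] = frameBuf[ind1 + (ind * width)] * 5;
}

int main()
{
    char *buf[NUM_STREAMS];
    cudaStream_t stream[NUM_STREAMS];

    // Allocate one buffer and create one stream per kernel launch
    for (int i = 0; i < NUM_STREAMS; i++) {
        CHECK(cudaMalloc(&buf[i], HEIGHT * WIDTH));
        CHECK(cudaMemset(buf[i], 1, HEIGHT * WIDTH));
        CHECK(cudaStreamCreate(&stream[i]));
    }
    CHECK(cudaDeviceSynchronize());  // setup finished before timing starts

    auto t0 = std::chrono::steady_clock::now();
    // Kernels in different non-default streams are allowed to overlap
    for (int i = 0; i < NUM_STREAMS; i++) {
        kernel<<<1, 1, 0, stream[i]>>>(buf[i], HEIGHT, WIDTH);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Total time for %d kernels : %f ms\n", NUM_STREAMS, ms);

    for (int i = 0; i < NUM_STREAMS; i++) {
        CHECK(cudaStreamDestroy(stream[i]));
        CHECK(cudaFree(buf[i]));
    }
    return 0;
}
```

If the two kernels do overlap, the total should stay close to the single-kernel time instead of doubling, though whether they overlap depends on the device and on available resources.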

You may want to study the CUDA concurrent kernels sample code:

and the programming guide section on asynchronous concurrent execution:

Thanks txbob. These links are really useful.

Btw, I tried multiple threads using streams and it worked for me.
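For anyone finding this later, a minimal sketch of that kind of setup might look like the following. It assumes two host threads created with `std::thread`, each with its own stream and buffer; the `worker` function and the thread count are illustrative only and are not the exact code I ran.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

#define HEIGHT 640
#define WIDTH  480

// Same single-thread workload as in the question
__global__ void kernel(char *frameBuf, int height, int width)
{
    for (int ind = 0; ind < height; ind++)
        for (int ind1 = 0; ind1 < width; ind1++)
            frameBuf[ind1 + (ind * width)] = frameBuf[ind1 + (ind * width)] * 5;
}

// Hypothetical per-thread worker: owns its stream, buffer and timing events
void worker(int id)
{
    char *buf = NULL;
    cudaStream_t stream;
    cudaEvent_t start, stop;
    float elapsedTime = 0.0f;

    if (cudaMalloc(&buf, HEIGHT * WIDTH) != cudaSuccess) {
        printf("Thread %d: ERROR : mem allocation failed\n", id);
        return;
    }
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemsetAsync(buf, 1, HEIGHT * WIDTH, stream);

    // Record both events in this thread's stream so they bracket only its kernel
    cudaEventRecord(start, stream);
    kernel<<<1, 1, 0, stream>>>(buf, HEIGHT, WIDTH);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Thread %d: Time for execution : %f\n", id, elapsedTime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(buf);
}

int main()
{
    // Both threads live in one host process, so they share one CUDA context
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return 0;
}
```

Because both threads belong to the same process, their kernels are eligible to run concurrently, which is not the case for two separate processes without MPS.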