Jetson TK1 memory allocation/kernel launch performance compared to GTX 760

I’m comparing the cost of launching a null (empty) kernel on the Jetson TK1 and the GTX 760. Each test first allocates a given number of pages of managed memory and then launches the null kernel. The allocated memory is never touched on the CPU, so no pages are copied from the GPU to the CPU. The results are as follows:

#pages   Tegra TK1 (ms)   GTX 760 (ms)
1        0.0833           0.0179
2        0.08342          0.02276
4        0.08498          0.01727
8        0.08827          0.02194
16       0.09497          0.01986
32       0.1043           0.01987
64       0.12292          0.01806
128      0.15998          0.01889
256      0.23391          0.01703
512      0.38912          0.01997
1024     0.54158          0.01746
2048     0.84338          0.02071
4096     1.44875          0.02089

My question is the following:
Why does the kernel launch overhead on the TK1 increase as more memory pages are allocated? The launch cost on the GTX 760 appears to be constant and unrelated to the amount of allocated memory. Please note that the costs above are in milliseconds and cover the kernel launch only (they do not include the cudaMallocManaged cost).
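
For reference, a minimal sketch of the launch measurement described above. It rests on assumptions that are not stated in the numbers themselves: a page size of 4 KiB, host-side timing with std::chrono, and a timed region that covers the launch plus cudaDeviceSynchronize. The names null_kernel, check, time_null_launch_ms and kPageBytes are only illustrative.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumption: one "page" is 4 KiB; the page size is not stated in the post.
static const size_t kPageBytes = 4096;

// Empty kernel: any time measured around it is pure launch overhead.
__global__ void null_kernel() {}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Allocate `pages` pages of managed memory (never touched on the CPU), then
// time one null-kernel launch plus synchronization, in milliseconds.
// The allocation itself is outside the timed region.
static double time_null_launch_ms(size_t pages) {
    void *buf = nullptr;
    check(cudaMallocManaged(&buf, pages * kPageBytes), "cudaMallocManaged");

    auto t0 = std::chrono::high_resolution_clock::now();
    null_kernel<<<1, 1>>>();
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    auto t1 = std::chrono::high_resolution_clock::now();

    check(cudaFree(buf), "cudaFree");
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    check(cudaFree(0), "context creation");  // create the context up front
    time_null_launch_ms(1);                  // warm-up launch, result discarded

    std::printf("#pages\tlaunch (ms)\n");
    for (size_t pages = 1; pages <= 4096; pages *= 2)
        std::printf("%zu\t%.5f\n", pages, time_null_launch_ms(pages));
    return 0;
}
```

This should build with something like nvcc -std=c++11 bench.cu (the file name is arbitrary). Averaging several runs per size would reduce noise; the sketch takes a single sample per size to stay short.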

I’ve also profiled the allocation alone (just the cudaMallocManaged call). Here are the results:

#pages   TK1 (ms)   GTX 760 (ms)
1        0.22       0.13
2        0.25       0.13
4        0.22       0.13
8        0.24       0.13
16       0.22       0.13
32       0.24       0.13
64       0.22       0.13
128      0.25       0.13
256      0.22       0.13
512      0.38       0.13
1024     0.44       0.13
2048     0.77       0.13
4096     1.44       0.14

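Again, the pages are not touched on the CPU, and the numbers are in milliseconds. Allocation on the TK1 is more expensive at every size, and beyond 256 pages its cost grows with the allocation size, while on the GTX 760 it stays flat at about 0.13 ms. Could someone please explain why this is so?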
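
A similar sketch for the allocation measurement, timing only the cudaMallocManaged call. As before, the 4 KiB page size, the std::chrono host timer, and the names time_managed_alloc_ms and kPageBytes are assumptions made for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumption: one "page" is 4 KiB; the page size is not stated in the post.
static const size_t kPageBytes = 4096;

// Time a single cudaMallocManaged call for `pages` pages, in milliseconds.
// The buffer is never touched on the CPU and is freed outside the timed region.
static double time_managed_alloc_ms(size_t pages) {
    void *buf = nullptr;

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaError_t err = cudaMallocManaged(&buf, pages * kPageBytes);
    auto t1 = std::chrono::high_resolution_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
    cudaFree(buf);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaFree(0);               // create the context before any timing
    time_managed_alloc_ms(1);  // warm-up allocation, result discarded

    std::printf("#pages\talloc (ms)\n");
    for (size_t pages = 1; pages <= 4096; pages *= 2)
        std::printf("%zu\t%.2f\n", pages, time_managed_alloc_ms(pages));
    return 0;
}
```

Creating the context and doing one warm-up allocation first keeps one-time setup cost out of the first row.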

For the TK1, I’m running the latest Linux for Tegra release (19.3) with the GPU and memory frequencies maxed out and the “performance” CPU scaling governor. For the GTX 760, I’m running driver version 340.32 on Ubuntu 14.04 with CUDA toolkit 6.5.