Could you please post some sample code showing how multiple kernels run concurrently on Fermi?

Also, about the L1 and L2 caches on Fermi: are they configurable like shared memory?

Thanks in advance

OK, if not sample code, please tell me whether this is the right way to do it.

I am launching two kernels without calling cudaThreadSynchronize() in between:

scanfirst<<<1,n>>>(first_d,n,index1);

scansecond<<<1,n>>>(second_d,n,index2);

Is this the correct way to launch multiple kernels?

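Multiple kernels from the same context can run concurrently on Fermi only if they are launched in different streams. Kernels launched in the same stream, including the default stream used by your two launches, run one after the other.
Read up on streams in CUDA.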
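
Here is a minimal sketch of that idea. It is not taken from the SDK samples. The kernel addValue and the helper runTwoKernels are hypothetical stand-ins for your scanfirst/scansecond code; only the stream handling is the point. Whether the two kernels actually overlap depends on the device and on the resources each kernel uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for scanfirst/scansecond: adds delta to each element.
__global__ void addValue(int *data, int n, int delta)
{
    int i = threadIdx.x;
    if (i < n) data[i] += delta;
}

// Launches the same kernel twice, once per stream, and verifies both results.
// Assumes n fits in a single block (n <= 1024 on Fermi).
bool runTwoKernels(int n)
{
    int *first_d = NULL, *second_d = NULL;
    cudaMalloc((void **)&first_d, n * sizeof(int));
    cudaMalloc((void **)&second_d, n * sizeof(int));
    cudaMemset(first_d, 0, n * sizeof(int));
    cudaMemset(second_d, 0, n * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The fourth launch parameter selects the stream. Kernels in different
    // streams are allowed, but not guaranteed, to run at the same time.
    addValue<<<1, n, 0, s1>>>(first_d, n, 1);
    addValue<<<1, n, 0, s2>>>(second_d, n, 2);

    // Wait for each stream before reading the results.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    int *first_h = new int[n];
    int *second_h = new int[n];
    cudaMemcpy(first_h, first_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(second_h, second_d, n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (first_h[i] != 1 || second_h[i] != 2) ok = false;

    delete[] first_h;
    delete[] second_h;
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(first_d);
    cudaFree(second_d);
    return ok;
}

int main()
{
    printf("%s\n", runTwoKernels(256) ? "results correct" : "results wrong");
    return 0;
}
```

Applied to the code in the question, the only change would be the launch configuration, e.g. scanfirst<<<1,n,0,s1>>>(...) and scansecond<<<1,n,0,s2>>>(...), with s1 and s2 created by cudaStreamCreate.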

See the simpleStreams and concurrentKernels samples in the CUDA SDK 3.2.

The Programming Guide and the Best Practices Guide explain how the stream APIs are used to overlap kernel execution with memory copies.

The same concept applies to concurrent kernel execution. Look at the sample code for more detail.
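
For the copy/kernel overlap described in the guides, a rough sketch might look like the following. The kernel scale, the helper runChunkedPipeline, and the chunk sizes are made up for illustration. It assumes page-locked host memory (cudaMallocHost), which asynchronous copies need in order to overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits the work into one chunk per stream so that the copy for one chunk
// can overlap with the kernel for another. Returns true if results are correct.
bool runChunkedPipeline(int nStreams, int chunk)
{
    const int total = nStreams * chunk;
    float *host = NULL, *dev = NULL;
    cudaMallocHost((void **)&host, total * sizeof(float)); // pinned host memory
    cudaMalloc((void **)&dev, total * sizeof(float));
    for (int i = 0; i < total; ++i) host[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        // Work within one stream runs in order: copy in, kernel, copy out.
        cudaMemcpyAsync(dev + offset, host + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + offset, chunk, 2.0f);
        cudaMemcpyAsync(host + offset, dev + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for every stream to finish before touching the host buffer.
    for (int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);

    bool ok = true;
    for (int i = 0; i < total; ++i)
        if (host[i] != 2.0f) ok = false;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}

int main()
{
    printf("%s\n", runChunkedPipeline(4, 1 << 16) ? "results correct" : "results wrong");
    return 0;
}
```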