Streams in different compute capabilities

Let's take a look at this code:

for (int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own portion of data
    for (int i = 0; i < nstreams; i++)
        init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations, i);

    // asynchronously launch nstreams memcopies. Note that the memcopy in stream x will only
    //   commence executing when all previous CUDA calls in stream x have completed
    for (int i = 0; i < nstreams; i++)
        cudaMemcpyAsync(a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams,
                        cudaMemcpyDeviceToHost, streams[i]);
}
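To make the asynchrony in the snippet concrete, here is a hedged sketch (the sizes, the kernel body, and constants like `nstreams` and `niterations` are my own placeholder assumptions, not taken from the original program) that times the launch loops themselves versus the full execution. On a working setup, the "enqueue" time should be tiny compared with the total, showing that both host loops run to completion and merely place commands into each stream's queue:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder stand-in for the snippet's init_array kernel (assumed signature).
__global__ void init_array(int *a, int *c, int niterations, int tag)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < niterations; j++)
        a[idx] += c[0] + tag;   // some per-element work so the kernel takes time
}

int main()
{
    const int nstreams = 4, nreps = 10;          // assumed values
    const int n = 1 << 22;
    const int nbytes = n * (int)sizeof(int);
    const int threads = 256, blocks = (n / nstreams) / threads;
    const int niterations = 100;

    int *a, *d_a, *d_c;
    cudaMallocHost(&a, nbytes);                  // pinned host memory: needed for true async copies
    cudaMalloc(&d_a, nbytes);
    cudaMalloc(&d_c, sizeof(int));
    cudaMemset(d_c, 0, sizeof(int));

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&streams[i]);

    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < nreps; k++)
    {
        for (int i = 0; i < nstreams; i++)
            init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations, i);
        for (int i = 0; i < nstreams; i++)
            cudaMemcpyAsync(a + i * n / nstreams, d_a + i * n / nstreams,
                            nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]);
    }
    auto t1 = std::chrono::steady_clock::now();  // loops done: everything merely enqueued
    cudaDeviceSynchronize();                     // now actually wait for the GPU to drain the queues
    auto t2 = std::chrono::steady_clock::now();

    printf("enqueue time: %.3f ms, total time: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t0).count());

    for (int i = 0; i < nstreams; i++) cudaStreamDestroy(streams[i]);
    cudaFreeHost(a); cudaFree(d_a); cudaFree(d_c);
    return 0;
}
```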

For a person who wants to pass this knowledge on to someone else, the snippet by itself simply does not explain much. I can rely on rules of thumb, but that is not actually scientific.
The questions are:

How are kernels launched on a GPU without a queue, and how are they launched on a GPU with a queue? What is the scheduling? Does the first for loop run all the way through, storing up the different kernel launches, or is there a mixture between the two loops? Is the compiler clever enough to recognize that we are aiming for a kernel/memory-transfer overlap? What is really going on here, especially on Compute Capability 2.0, where many kernels can execute in parallel, increasing the speedup? And what happens with stream synchronization, since a memcopy carries a hidden synchronization inside the function?

Nvidia should put more effort into providing material, so that a theoretically minded person who wants to get seriously involved in this parallel architecture can understand it, teach it, and enhance it (the last is not possible, since it is proprietary). For me, CUDA is the only GPGPU platform I am willing to get involved with, since I seriously think that a heterogeneous API like OpenCL can't even touch a hardware-specific API, but I believe there is a need for more documentation.
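On the hidden-synchronization point: as far as I understand, nothing in the snippet itself ever waits, so an explicit synchronization is needed before the host reads `a`; also, `cudaMemcpyAsync` is only truly asynchronous when the host buffer is pinned (`cudaMallocHost`/`cudaHostAlloc`). A hedged sketch of the two usual ways to wait, reusing the snippet's identifiers (fragment, not a complete program):

```cuda
// Option 1: wait on one stream at a time -- the host can already consume
// stream i's chunk of `a` while streams i+1 .. nstreams-1 are still copying.
for (int i = 0; i < nstreams; i++)
{
    cudaStreamSynchronize(streams[i]);
    // a[i * n / nstreams .. (i + 1) * n / nstreams - 1] is now valid on the host
}

// Option 2: one barrier for the whole device -- simpler, but blocks the host
// until everything enqueued in every stream has finished.
cudaDeviceSynchronize();

// Caveat for the overlap question: if `a` is pageable (plain malloc) rather
// than pinned, the "async" copy silently degrades to a synchronous transfer,
// and no kernel/copy overlap happens at all.
```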