Possible cudaMalloc problem

I have some code where I first allocate device memory, then run a kernel that writes to that memory, and then run a second kernel that reads from that memory and writes the result. I use cudaLaunch instead of the normal <<< >>> syntax. Something like this:

allocate memory, a
run kernel 1 that writes to a
run kernel 2 that reads from a and writes to b
free memory a
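
The sequence above can be sketched roughly as below (this is an illustrative reconstruction using the <<< >>> syntax and placeholder kernels, not the original cudaLaunch-based code; n, kernel1, and kernel2 are made-up names):

```
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error-check helper; checking return codes usually
// pinpoints where a failure actually happens.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            return 1;                                             \
        }                                                         \
    } while (0)

__global__ void kernel1(float *a, int n) {  // writes to a
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (float)i;
}

__global__ void kernel2(const float *a, float *b, int n) {  // reads a, writes b
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    CHECK(cudaMalloc(&a, n * sizeof(float)));   // allocate memory, a
    CHECK(cudaMalloc(&b, n * sizeof(float)));

    kernel1<<<(n + 255) / 256, 256>>>(a, n);    // kernel 1 writes to a
    CHECK(cudaGetLastError());                  // catch launch failures
    kernel2<<<(n + 255) / 256, 256>>>(a, b, n); // kernel 2 reads a, writes b
    CHECK(cudaGetLastError());

    // cudaFree blocks until outstanding device work completes,
    // so freeing a here is safe even if kernel2 is still running.
    CHECK(cudaFree(a));                         // free memory a
    CHECK(cudaFree(b));
    return 0;
}
```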

If I run kernel 1 or kernel 2 by itself it runs fine, but when I run them one after the other I get segmentation faults. Could this be because memory a is freed while kernel 2 is still running (since control returns to the host as soon as kernel 2 has been launched)? If so, how do I avoid that?

Memory will not be freed while the GPU is still working; cudaFree waits for all outstanding work on the device to complete before releasing the allocation, so that is not the cause of your crash.