concurrent memory writes

Yes, well, sometimes you have to do this; I have to do it 300 times in my solver. And since the alternative is launching 300 kernels, for a total compute time of less than 8 ms, any alternative that would cost less would be welcome, even if it removes the thread interleaving at that specific point.