I have a question about writing results:
There is an array allocated in global memory. Each thread does some calculation and then writes a few numbers to that array, but the number of values each thread generates is not the same. So I thought the writes to memory should be serialized, with a global index indicating the next position to write to. How can I do that? Is it even possible to do that inside the kernel?
Or should each thread keep its results in a separate global memory buffer, exit the kernel, and then use a CUDA memory copy to store them in the destination array?
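To make the first idea concrete, here is a rough sketch of what I have in mind. It assumes that `atomicAdd` on a counter in global memory can be used to reserve a contiguous range of slots for each thread. The names `produce`, `write_results`, `run_write_results`, and `write_index`, as well as the workload of `tid % 4` values per thread, are made up for illustration only. I am not sure this is the recommended approach, and the order of the results in the output array would not be deterministic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: thread `tid` produces (tid % 4) values, at most 3.
__device__ int produce(int tid, int *local) {
    int n = tid % 4;
    for (int i = 0; i < n; ++i) local[i] = tid * 10 + i;
    return n;
}

__global__ void write_results(int *out, unsigned int *write_index, int n_threads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    int local[3];
    int n = produce(tid, local);
    if (n == 0) return;

    // Reserve n consecutive slots. atomicAdd returns the counter value
    // from before the addition, which becomes this thread's start offset.
    unsigned int base = atomicAdd(write_index, (unsigned int)n);
    for (int i = 0; i < n; ++i) out[base + i] = local[i];
}

// Host helper: launches the kernel and returns how many values were written.
unsigned int run_write_results(int n_threads) {
    const int capacity = n_threads * 3;  // worst case: 3 values per thread
    int *d_out = nullptr;
    unsigned int *d_index = nullptr;
    cudaMalloc(&d_out, capacity * sizeof(int));
    cudaMalloc(&d_index, sizeof(unsigned int));
    cudaMemset(d_index, 0, sizeof(unsigned int));

    const int block = 128;
    const int grid = (n_threads + block - 1) / block;
    write_results<<<grid, block>>>(d_out, d_index, n_threads);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    unsigned int total = 0;
    cudaMemcpy(&total, d_index, sizeof(total), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_index);
    return total;
}

int main() {
    printf("values written: %u\n", run_write_results(256));
    return 0;
}
```

In this sketch only the counter update is serialized; the writes themselves still happen in parallel, and the final counter value tells the host how many entries of the output array are valid.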
Thank you for your help.