Really? As a beginner it's really complicated to know whether our code is optimized or not; the information isn't all kept in the same place… What are these requirements and compilations? Because if it's sequential, I could have just written a for loop and summed sequentially?…
Yeah, I was surprised to see you calling your reduction from inside a kernel. I assumed you wanted to call thrust::reduce from host (CPU) code, where it will launch a properly configured kernel to give you the reduction you want.
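For reference, here is a minimal sketch of what calling thrust::reduce from the host can look like. The helper name `sum_on_device` and the input size are just placeholders I made up for illustration; only the Thrust calls themselves are real API.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Hypothetical helper: copies the input to the GPU and sums it there.
// thrust::reduce is called from host code, so Thrust picks the launch
// configuration and runs the parallel reduction on the device.
float sum_on_device(const std::vector<float>& host_values) {
    // Constructing a device_vector from host iterators does the host-to-device copy.
    thrust::device_vector<float> d_values(host_values.begin(), host_values.end());

    // Iterators over a device_vector make Thrust dispatch to its GPU backend.
    return thrust::reduce(d_values.begin(), d_values.end(),
                          0.0f, thrust::plus<float>());
}

int main() {
    std::vector<float> values(1 << 20, 1.0f);  // placeholder input
    float total = sum_on_device(values);
    std::printf("sum = %f\n", total);
    return 0;
}
```

Build it with nvcc (for example `nvcc reduce.cu -o reduce`). Only the final scalar comes back to the host; the data itself stays on the device for the whole reduction.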
I think you can use a primary kernel that runs only one thread and dispatches the other kernels using dynamic parallelism, which should let you avoid device-to-host-to-device copies between steps. I could be wrong on this though; it's just an idea.
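A rough sketch of that idea, in case it helps. All kernel names (`driver_kernel`, `scale_kernel`, `sum_kernel`) and the helper `run_pipeline` are made up for illustration, and the atomicAdd sum is only there to keep the example short, not because it is a fast reduction. It assumes a GPU with compute capability 3.5 or newer and relocatable device code enabled.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel 1: scale every element in place.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Child kernel 2: naive sum using atomics (simple, not optimized).
__global__ void sum_kernel(const float* data, int n, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, data[i]);
}

// Parent kernel, launched with a single thread. Both children go into the
// same (default) device-side stream from the same thread, so they run in
// order, and the intermediate data never leaves the GPU.
__global__ void driver_kernel(float* data, int n, float* result) {
    *result = 0.0f;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(data, n, 2.0f);
    sum_kernel<<<blocks, threads>>>(data, n, result);
}

// Hypothetical host helper: fills n elements with 1.0f, runs the pipeline,
// and copies back only the final scalar.
float run_pipeline(int n) {
    std::vector<float> host(n, 1.0f);
    float *d_data = nullptr, *d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    driver_kernel<<<1, 1>>>(d_data, n, d_result);
    cudaDeviceSynchronize();  // the parent grid finishes only after its children

    float result = 0.0f;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return result;
}

int main() {
    std::printf("result = %f\n", run_pipeline(1 << 16));
    return 0;
}
```

Device-side launches need relocatable device code, so compile with something like `nvcc -rdc=true pipeline.cu -o pipeline -lcudadevrt`. Whether this beats simply launching the kernels one after another from the host depends on your case; back-to-back host launches on the same stream also keep the data on the device.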