I have another question about how to approach this problem.
Suppose multiple threads within a block each calculate an independent outcome,
but those individual outcomes must be added together at the end.
Is the following approach the standard way?
- Each thread calculates its outcome and stores that value in its own location in shared memory.
- Run some scan algorithm (or rather a parallel reduction, since only the total is needed) to add the outputs together.
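To make the two steps above concrete, here is a minimal sketch of what I have in mind. It assumes a single block whose size is a power of two; `kBlockSize`, `computeOutcome`, `blockSum`, and `runBlockSum` are hypothetical names made up for illustration, and `computeOutcome` just returns 1.0f in place of the real per-thread work.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumption: block size is a power of two

// Hypothetical stand-in for the real per-thread computation.
__device__ float computeOutcome(int tid)
{
    (void)tid;
    return 1.0f;
}

__global__ void blockSum(float* blockResult)
{
    __shared__ float partial[kBlockSize];
    int tid = threadIdx.x;

    // Step 1: each thread stores its outcome in its own shared memory slot.
    partial[tid] = computeOutcome(tid);
    __syncthreads();

    // Step 2: tree reduction; the number of active threads halves each pass.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // reached by every thread, outside the if
    }

    // Thread 0 writes the block's total to global memory.
    if (tid == 0) {
        blockResult[blockIdx.x] = partial[0];
    }
}

// Host helper: launches one block and returns its sum (error checks omitted).
float runBlockSum()
{
    float* dResult = nullptr;
    float hResult = 0.0f;
    cudaMalloc(&dResult, sizeof(float));
    blockSum<<<1, kBlockSize>>>(dResult);
    cudaMemcpy(&hResult, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dResult);
    return hResult;
}
```

With the placeholder outcome of 1.0f per thread, `runBlockSum()` should return 256.0f on a working device. A full scan would also produce every prefix sum, which seems like more work than needed when only the total matters.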
I can’t just let each thread add its output directly to one location in global memory, right? Like the following:
globalmemory[some fixed location] += output
I need to store each thread’s output in a different shared memory location and then sum them, right?
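As far as I understand, the plain `+=` is a non-atomic read-modify-write, so when many threads hit the same address some updates can be lost. The only way I know to make that direct accumulation safe is `atomicAdd`, roughly as in the sketch below. The names `accumulate` and `runAccumulate` are hypothetical, the per-thread output is a placeholder 1.0f, and I assume a device that supports `atomicAdd` on `float` in global memory.

```cuda
#include <cuda_runtime.h>

__global__ void accumulate(float* total)
{
    float output = 1.0f;  // placeholder for the real per-thread outcome

    // Atomic read-modify-write: concurrent updates are serialized, not lost.
    atomicAdd(total, output);
}

// Host helper: zeroes the accumulator, launches the kernel, and returns the
// sum (error checks omitted for brevity).
float runAccumulate(int blocks, int threadsPerBlock)
{
    float* dTotal = nullptr;
    float hTotal = 0.0f;
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemset(dTotal, 0, sizeof(float));  // all-zero bytes represent 0.0f
    accumulate<<<blocks, threadsPerBlock>>>(dTotal);
    cudaMemcpy(&hTotal, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTotal);
    return hTotal;
}
```

For example, `runAccumulate(4, 128)` should return 512.0f. My guess is that this is correct but slower than reducing in shared memory first, because every thread contends for the same address; with real floating-point values the result may also vary slightly between runs since the order of the additions is not fixed.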