I would like to hear your opinion on optimising this code for CUDA.
What would happen in a kernel is:
Each thread would:
Allocate array1 of size 150
Allocate array2 of size 150
Fill array1 with 150 values taken from global memory
array2[i] = poisson_distribution(array1[i], seed)
*Note: poisson_distribution is from Numerical Recipes, and the seed would stay in global memory because it would lose accuracy if I put it in shared memory
for loop with 8-15 iterations:
search array2 for values greater than 1
a = generate a 3x3 array
do the addition: some3x3 = some3x3 + a
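To make the steps above concrete, here is a CPU sketch of what one thread would do, written in plain C++ so it can be run and checked on the host. The Poisson sampler is a hypothetical stand-in (Knuth's multiplication method driven by a simple LCG, not the actual Numerical Recipes routine), and the 3x3 generation step is a placeholder fill, since the real one isn't shown:

```cpp
#include <cmath>
#include <cstdint>

// Stand-in RNG: a simple LCG with explicit per-thread state. On the GPU,
// a per-thread cuRAND state would play this role rather than a shared seed.
static double lcg_uniform(uint64_t &state) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (state >> 11) * (1.0 / 9007199254740992.0); // in [0, 1)
}

// Knuth's multiplication method: multiply uniforms until the product
// drops below exp(-mean). Fine for small means; NOT the NR routine.
static int poisson_sample(double mean, uint64_t &state) {
    double limit = std::exp(-mean), L = 1.0;
    int k = -1;
    do { ++k; L *= lcg_uniform(state); } while (L > limit);
    return k;
}

// Model of one thread's work. array1 stands for the 150 values read from
// global memory; some3x3 is the thread's 3x3 accumulator.
void thread_work(const double array1[150], uint64_t seed,
                 int iterations, double some3x3[3][3]) {
    int array2[150];                              // per-thread scratch array
    for (int i = 0; i < 150; ++i)
        array2[i] = poisson_sample(array1[i], seed);

    for (int it = 0; it < iterations; ++it) {     // the 8-15 iteration loop
        for (int i = 0; i < 150; ++i) {
            if (array2[i] > 1) {                  // search array2 for > 1
                for (int r = 0; r < 3; ++r)       // a = generate a 3x3;
                    for (int c = 0; c < 3; ++c)   // placeholder: all ones
                        some3x3[r][c] += 1.0;     // some3x3 = some3x3 + a
            }
        }
    }
}
```

One thing this sketch makes visible: the two 150-element arrays are per-thread local arrays, so on the GPU they will almost certainly spill to local (off-chip) memory rather than registers.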
And the total number of threads is 2500, but the number of blocks can vary.
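On the block count: once a block size is fixed, the number of blocks is just the ceiling division of 2500 by it, with a bounds check in the kernel guarding the leftover threads. A small helper (the name blocks_for is mine) to illustrate the arithmetic; block sizes that are multiples of the 32-thread warp, such as 128 or 256, avoid partially filled warps:

```cpp
// blocks = ceil(totalThreads / threadsPerBlock), done in integer math.
int blocks_for(int totalThreads, int threadsPerBlock) {
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, blocks_for(2500, 128) gives 20 blocks (2560 threads launched, so the kernel needs an `if (tid < 2500)` guard for the extra 60), and blocks_for(2500, 256) gives 10.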
What do you think of it? Am I wasting a lot of cycles on memory allocation?