I would like to hear your opinions on optimising this code for CUDA.
What would happen in a kernel is:
Each thread would:
    Allocate array1 of size 150
    Allocate array2 of size 150
    Fill array1 with 150 values taken from global memory
    For loop over i:
        array2[i] = poisson_distribution(array1[i], seed)
    *Note: poisson_distribution is from Numerical Recipes, and seed would be in global memory because it would lose accuracy if I put it in shared memory
    For loop running 8 to 15 iterations:
        search array2 for a value greater than 1
        a = generate a 3x3 array
        do addition: some3x3 = some3x3 + a
    end
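The steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the actual run_loop kernel: the names run_loop_sketch, poisson_sample, input, seeds and out3x3 are hypothetical, the Poisson sampler is a simple stand-in (Knuth's multiplication method driven by a small LCG) rather than the Numerical Recipes routine, each thread is assumed to own one seed in global memory, and the 3x3 matrix is filled with placeholder values.

```cuda
#include <cuda_runtime.h>

#define N_VALUES 150   // per-thread array length, taken from the description

// Hypothetical stand-in for the Numerical Recipes routine: Knuth's
// multiplication method with an LCG. Only sensible for small means.
__device__ int poisson_sample(float mean, unsigned int *state)
{
    const float limit = expf(-mean);
    float p = 1.0f;
    int k = 0;
    do {
        *state = 1664525u * (*state) + 1013904223u;          // LCG step
        p *= ((*state >> 8) + 1) * (1.0f / 16777216.0f);     // uniform in (0, 1]
        ++k;
    } while (p > limit);
    return k - 1;
}

__global__ void run_loop_sketch(const float *input, unsigned int *seeds,
                                float *out3x3, int n_threads, int n_iter)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    // Per-thread arrays of this size are likely to be placed in local memory.
    float array1[N_VALUES];
    int   array2[N_VALUES];

    // Assumption: one seed per thread, read once and written back at the end.
    unsigned int state = seeds[tid];

    // Assumed layout: element i of thread tid sits at input[i * n_threads + tid],
    // so neighbouring threads read neighbouring addresses.
    for (int i = 0; i < N_VALUES; ++i)
        array1[i] = input[i * n_threads + tid];

    for (int i = 0; i < N_VALUES; ++i)
        array2[i] = poisson_sample(array1[i], &state);

    float sum[9] = {0.0f};                     // running some3x3
    for (int it = 0; it < n_iter; ++it) {      // n_iter would be 8 to 15
        int hit = -1;
        for (int i = 0; i < N_VALUES; ++i)
            if (array2[i] > 1) { hit = i; break; }
        if (hit < 0) continue;                 // nothing greater than 1 found
        for (int c = 0; c < 9; ++c)
            sum[c] += (float)(hit + c);        // placeholder for the real 3x3 matrix
    }

    for (int c = 0; c < 9; ++c)
        out3x3[tid * 9 + c] = sum[c];
    seeds[tid] = state;
}
```

In this sketch nothing is allocated at run time: the two arrays have a fixed size known at compile time, so the cost is in where the compiler places them rather than in an allocation call.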
The number of threads is 2500, but the number of blocks is not fixed yet and can vary.
What do you think of it? Would a lot of cycles be wasted on memory allocation?
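For the launch configuration, the block count follows from the thread count once a block size is chosen. The snippet below is a minimal sketch of that arithmetic; blocks_for and mark_threads are hypothetical names, and the block sizes in the comments are only examples.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Smallest number of blocks such that blocks * threads_per_block >= n_threads.
__host__ __device__ inline int blocks_for(int n_threads, int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n_threads + threads_per_block - 1) / threads_per_block;
}

// Toy kernel showing the bounds guard needed when the last block is only
// partly used (2500 is not a multiple of the usual block sizes).
__global__ void mark_threads(int *flags, int n_threads)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n_threads)
        flags[tid] = 1;
}

// Example launch, assuming flags points to n_threads ints on the device:
//   const int n_threads = 2500;
//   const int threads_per_block = 128;                       // 20 blocks
//   mark_threads<<<blocks_for(n_threads, threads_per_block),
//                  threads_per_block>>>(flags, n_threads);
```

With 2500 threads, a block size of 128 gives 20 blocks and a block size of 256 gives 10 blocks; in both cases the last block has idle threads, which is why the kernel checks tid against n_threads.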
ptxas info : Compiling entry function '__globfunc__Z8run_loopifiiiPfS_S_S_Pl'
ptxas info : Used 46 registers, 1236+1200 bytes lmem, 56+52 bytes smem, 84 bytes cmem[1]
loopKL.cu(158): warning: variable "lastlastposition" was declared but never referenced