Code -> CUDA assessment

Hey guys,

I would like to hear your opinions on optimising this code for CUDA.

What would happen in a kernel is:

Each thread would:
    Allocate array1 of size 150
    Allocate array2 of size 150

    Fill array1 with 150 values taken from global memory
    For loop over i:
        array2[i] = poisson_distribution(array1[i], seed)

*Note: poisson_distribution is from Numerical Recipes, and seed would be in global memory because it would lose accuracy if I put it in shared memory.

    For loop with a range of 8-15 iterations:
        search array2 for a value greater than 1
        a = generate a 3x3 array
        do addition: some3x3 = some3x3 + a
    end

The number of threads is 2500; the number of blocks is still open and can vary.
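Since the block count is left open, here is a minimal sketch of how it usually falls out of the thread count. It is written as plain C++ so it can be checked on the CPU. The names `blocks_for` and `thread_has_work` are made up for illustration, and any block size you plug in (256 in the usage note) is an assumption, not something from the post.

```cpp
// Number of blocks needed so that blocks * threads_per_block >= total_threads
// (integer ceiling division).
constexpr int blocks_for(int total_threads, int threads_per_block) {
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// Guard each thread would evaluate first. In a real kernel, gid would be
// blockIdx.x * blockDim.x + threadIdx.x; threads past the end do nothing.
constexpr bool thread_has_work(int gid, int total_threads) {
    return gid < total_threads;
}
```

With 2500 threads and an assumed block size of 256, `blocks_for(2500, 256)` gives 10 blocks, i.e. 2560 launched threads, so the last 60 need the guard.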

What do you think of it? A lot of wasted cycles on memory allocation?

Thanks
Zaki

Please, someone help me out because I’m new to CUDA.

You cannot allocate memory from within a CUDA kernel.

My mistake, what I meant was float array[150] within the kernel. I’m not allocating it dynamically.

What do you think will be the performance bottleneck of this code?

That float array[150] will be placed in local memory, which resides in off-chip device memory, so it will likely be your bottleneck.
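One way around the per-thread arrays, sketched under some assumptions: array1[i] is read once and used immediately, so each value can be sampled straight from the global-memory input without being stored. The sketch also assumes the later loop only needs to know how many samples exceed 1 and where the first one is; if it needs more than that, array2 cannot be dropped this simply. Everything here is hypothetical naming (`Rng`, `uniform01`, `poisson_sample`, `process_thread`, `ThreadResult`), the generator is a simple LCG stand-in, and the sampler is Knuth's multiplication method swapped in for the Numerical Recipes routine (it is only reasonable for small means). It is plain C++ so it can be checked on the CPU; in a .cu file the functions would be marked `__device__` and `process_thread` called once per thread.

```cpp
#include <cmath>
#include <cstdint>

// Per-thread generator state (an LCG, used here only as a stand-in).
struct Rng { std::uint32_t state; };

// Uniform float in [0, 1) built from the top 24 bits of the LCG state.
inline float uniform01(Rng& r) {
    r.state = 1664525u * r.state + 1013904223u;
    return (r.state >> 8) * (1.0f / 16777216.0f);
}

// Knuth's multiplication method: multiply uniforms until the product
// drops to exp(-mean) or below; the number of extra factors is the sample.
inline int poisson_sample(float mean, Rng& r) {
    const float limit = std::exp(-mean);
    int k = 0;
    float p = uniform01(r);
    while (p > limit) {
        ++k;
        p *= uniform01(r);
    }
    return k;
}

// What the thread keeps instead of array2 (an assumption about later needs).
struct ThreadResult {
    int count_gt1;  // how many samples were greater than 1
    int first_gt1;  // index of the first such sample, or -1 if none
};

// Work for one thread: `means` points at that thread's inputs in global memory.
inline ThreadResult process_thread(const float* means, int n, std::uint32_t seed) {
    Rng rng{seed};
    ThreadResult out{0, -1};
    for (int i = 0; i < n; ++i) {
        // Sample directly from the input; neither array1 nor array2 is stored.
        const int sample = poisson_sample(means[i], rng);
        if (sample > 1) {
            if (out.first_gt1 < 0) out.first_gt1 = i;
            ++out.count_gt1;
        }
    }
    return out;
}
```

Keeping the generator state in a per-thread struct also means each thread can carry its own seed in registers rather than sharing one through memory.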

Awful performance.

ptxas info : Compiling entry function ‘__globfunc__Z8run_loopifiiiPfS_S_S_Pl’
ptxas info : Used 46 registers, 1236+1200 bytes lmem, 56+52 bytes smem, 84 bytes cmem[1]
loopKL.cu(158): warning: variable “lastlastposition” was declared but never referenced
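A quick back-of-the-envelope check on the lmem figure, as a sketch only. The 150 elements, two arrays and 2500 threads come from the posts above; 4-byte floats are assumed. What each term of "1236+1200" means is not stated in the output, so the code only notes that one term happens to equal the footprint of two float[150] arrays, which is consistent with them having landed in local memory. The function names are made up for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kElems   = 150;   // elements per array, from the post
constexpr std::size_t kArrays  = 2;     // array1 and array2
constexpr std::size_t kThreads = 2500;  // thread count, from the post

// Footprint of the two per-thread float arrays.
constexpr std::size_t arrays_bytes_per_thread() {
    return kArrays * kElems * sizeof(float);
}

// Local memory per thread as reported by ptxas (1236+1200 bytes lmem).
constexpr std::size_t lmem_bytes_per_thread() {
    return 1236 + 1200;
}

// Local memory across all threads of the launch.
constexpr std::size_t lmem_bytes_total() {
    return kThreads * lmem_bytes_per_thread();
}
```

With 4-byte floats this works out to 1200 bytes for the arrays, 2436 bytes of local memory per thread, and 6,090,000 bytes (about 5.8 MiB) across 2500 threads.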

Great adventure though