Hi,
Usually we allocate the output memory before invoking the kernel function because its size is known in advance (for example, A(m,n) * B(n,p) = C(m,p): here we know the size of matrix C ahead of time).
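For concreteness, here is a minimal sketch of that usual pattern; matMulKernel, dA, and dB are placeholder names:

```cpp
#include <cuda_runtime.h>

// "Size known in advance" case: C is m x p, so it can be
// allocated on the host side before the kernel is launched.
void launchMatMul(const float *dA, const float *dB, int m, int n, int p)
{
    float *dC = NULL;
    cudaMalloc((void **)&dC, (size_t)m * p * sizeof(float));  // output size fixed up front

    // matMulKernel<<<grid, block>>>(dA, dB, dC, m, n, p);  // hypothetical kernel

    cudaFree(dC);
}
```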
My problem is that the output size is only known inside the kernel function. How can we allocate the output memory at runtime in that case?
Thanks in advance
You have two choices: either use a Fermi card (which supports dynamic memory allocation in the kernel), or allocate as much memory as you will ever need (or have available) up front and have each thread work within its own subspace of that allocation.
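For the first option, device-side malloc/free are available on compute capability 2.0 (Fermi) and later. A minimal sketch, with illustrative grid and buffer sizes:

```cpp
#include <cuda_runtime.h>

// Each thread allocates its own buffer from the device heap at runtime
// (requires compute capability 2.0+, i.e. Fermi or later).
__global__ void dynamicAllocKernel(int elemsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Device-side malloc draws from the per-device heap.
    int *buf = (int *)malloc(elemsPerThread * sizeof(int));
    if (buf == NULL) return;  // heap exhausted

    for (int i = 0; i < elemsPerThread; ++i)
        buf[i] = tid;

    // ... use buf ...

    free(buf);
}

int main()
{
    // The device heap defaults to 8 MB; enlarge it before launching
    // if the threads will allocate more than that in total.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    dynamicAllocKernel<<<4, 128>>>(16);
    cudaDeviceSynchronize();
    return 0;
}
```

Bear in mind that in-kernel malloc tends to be slow, so when a worst-case size is known, the second option (one big preallocation subdivided per thread) is usually preferable.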
Ok. Thanks a lot for the reply.
My colleague is working on this, and he needs each thread to work with 20 KB of data, but local memory only allows 16 KB. Do you know a way to solve this?
Thanks in advance
Use global memory, most probably. But I would really start questioning the algorithm design (or at least its amenability to a computing model like CUDA) when the smallest level of data granularity it permits is 20 KB.
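A minimal sketch of the global-memory fallback, assuming a fixed worst-case 20 KB scratch area per thread (the names and sizes are illustrative):

```cpp
#include <cuda_runtime.h>

#define ELEMS_PER_THREAD (20 * 1024 / sizeof(float))  // ~20 KB per thread

// Each thread works inside its own fixed-stride slice of one big
// preallocated global buffer.
__global__ void perThreadWorkspace(float *workspace, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;

    float *mySlice = workspace + (size_t)tid * ELEMS_PER_THREAD;

    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        mySlice[i] = 0.0f;  // ... per-thread scratch work ...
}

int main()
{
    const int nThreads = 1024;
    float *workspace = NULL;

    // One allocation sized for the worst case: every thread gets 20 KB.
    cudaMalloc((void **)&workspace,
               (size_t)nThreads * ELEMS_PER_THREAD * sizeof(float));

    perThreadWorkspace<<<(nThreads + 127) / 128, 128>>>(workspace, nThreads);
    cudaDeviceSynchronize();
    cudaFree(workspace);
    return 0;
}
```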