solutions to local memory limit

This is a question about the limitations of local memory and how to work around them.

Though it might not be relevant, I will briefly give the background to my particular application.
I’m in the process of porting a sequential C code to CUDA in order to parallelise it. The application
is basically a special case of particle filtering.
Loosely speaking, what I’m trying to achieve is this:
given a set of N samples (particles), each associated with a particular weight, do some processing on each particle to produce an updated
weight set, which is then used to generate the next set of samples.
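In (simplified) C, the sequential step looks roughly like this. The function names and the toy weight update are made up by me for illustration; process_particle() stands in for the real, much heavier per-particle computation:

```c
/* Toy sketch of one sequential filter step: one pass over N particles,
   recompute each weight, then normalise so the weights sum to 1. */

static double process_particle(double sample, double weight)
{
    /* Placeholder for the real, expensive per-particle update. */
    return weight * (1.0 + sample * sample);
}

void filter_step(double *samples, double *weights, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        weights[i] = process_particle(samples[i], weights[i]);
        total += weights[i];
    }
    for (int i = 0; i < n; i++)
        weights[i] /= total;   /* normalised weights drive resampling */
}
```

Each loop iteration is independent, which is why the per-particle mapping to threads seems natural.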

So I see a straightforward way to parallelise: process each particle in parallel.

However, processing each particle requires quite a bit of computation and memory.
When I try to compile the code I get the following error:
“ptxas info : Compiling entry function ‘Z5readSPdS_S_PiS0_S0_S0_S_S_S_S0_S_S_S_S_S_S0_S_S_S’ for ‘sm_13’
ptxas info : Used 117 registers, 29936+0 bytes lmem, 160+16 bytes smem, 144 bytes cmem[0], 220 bytes cmem[1], 20 bytes cmem[14]
ptxas error : Entry function ‘Z5readSPdS_S_PiS0_S0_S0_S_S_S_S0_S_S_S_S_S_S0_S_S_S’ uses too much local data (0x74f0 bytes, 0x4000 max)”

I think this is caused by some per-thread data structures exceeding the local memory limit: the kernel needs 0x74f0 (29936) bytes of lmem, but the maximum is 0x4000 (16 KB) per thread on sm_13.

So my first question simply boils down to: what are the potential solutions to this problem? (I don’t mind trading away a bit of efficiency.)

This forum thread suggests
allocating enough memory from the host and using an offset (block ID, etc.) to divide that memory among the parallel blocks.
That leads to my second question: how do I allocate memory from the host so that I can use the solution given in the thread?
Is it simply a matter of allocating with cudaMalloc in a host function and then passing the resulting pointers to the kernel?
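To make sure I understand the suggestion, is it essentially the following pattern? (A sketch with names I made up; PER_PARTICLE_DOUBLES would be whatever scratch space one particle actually needs, and error checking is omitted.)

```cuda
#include <cuda_runtime.h>

#define PER_PARTICLE_DOUBLES 512   /* made-up per-particle scratch size */

/* Each thread carves its own slice out of one big global-memory buffer,
   instead of declaring large local arrays inside the kernel. */
__global__ void processParticles(double *scratch, double *weights, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* This thread's private slice of the shared scratch buffer: */
    double *my = scratch + (size_t)i * PER_PARTICLE_DOUBLES;

    /* ...use my[0 .. PER_PARTICLE_DOUBLES-1] where the local arrays were... */
    my[0] = weights[i];          /* placeholder work */
    weights[i] = my[0] * 2.0;
}

void launch(double *d_weights, int n)
{
    double *d_scratch;
    cudaMalloc((void **)&d_scratch,
               (size_t)n * PER_PARTICLE_DOUBLES * sizeof(double));

    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    processParticles<<<blocks, threads>>>(d_scratch, d_weights, n);

    cudaFree(d_scratch);
}
```

i.e. one cudaMalloc on the host sized for all N particles, and the kernel computes each thread's offset from its block and thread indices. Is that the intended approach, and are there pitfalls (coalescing, alignment) I should watch for?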