Parallel processing with large arrays


I’m a newbie to CUDA. I went over the documentation and the examples, and I’m about to implement some code using multi-threading.

My multi-threaded function needs to receive 5 variables:
var1 - a sequential, write-only float** array that will contain the result of the computation. This variable can get really large: 300-400 megabytes.
var2 - a 2-3 megabyte float* vector, read-only and sequential.
var3 - a 1-2 megabyte float* vector, read-only and sequential.
var4 - a small 1000-2000 byte float* vector, read-only and sequential.
var5 - a small scalar, 8 bytes, read-only.

These variables need to be accessed by all the threads (I will need something on the order of 100-150 threads, in one block). All the threads write results into var1. Each thread will write into its own section of var1, so there is no problem with collisions or conflicts there.

Question #1:
I understand that shared memory is limited to 16 KB, therefore I will not be able to store vars1-3 in this memory. Is that correct? Will it be possible to store var4-5 in shared memory and vars1-3 in global memory? Would that be the appropriate implementation?
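To make the layout in the question concrete, here is a minimal sketch, not a definitive implementation. It assumes a flat float* output instead of float**, passes the scalar by value as a kernel argument, keeps the large arrays in global memory, and stages only the small vector in shared memory. The kernel name fill_sections, the helper run_demo, all sizes and the arithmetic are placeholders, not taken from the original post.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: names and arithmetic are placeholders.
__global__ void fill_sections(float *out, const float *in, int in_len,
                              const float *small, int small_len,
                              float scalar, int per_thread, int n_threads)
{
    extern __shared__ float small_sh[];  // sized at launch time

    // All threads of the block cooperate to stage the small vector.
    for (int i = threadIdx.x; i < small_len; i += blockDim.x)
        small_sh[i] = small[i];
    __syncthreads();

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_threads) return;  // padding threads do no work

    // Each thread writes only its own section of out (global memory).
    for (int i = 0; i < per_thread; ++i) {
        int idx = t * per_thread + i;
        out[idx] = scalar * in[idx % in_len] + small_sh[idx % small_len];
    }
}

// Runs the sketch on tiny inputs; returns the number of mismatches
// against a CPU reference, or -1 on a CUDA error.
int run_demo()
{
    const int n_threads = 150, block = 160, per_thread = 64;
    const int in_len = 1000, small_len = 300;
    const int out_len = n_threads * per_thread;
    const float scalar = 2.0f;

    std::vector<float> h_in(in_len), h_small(small_len), h_out(out_len);
    for (int i = 0; i < in_len; ++i) h_in[i] = 0.5f * i;
    for (int i = 0; i < small_len; ++i) h_small[i] = 1.0f + i;

    float *d_out = nullptr, *d_in = nullptr, *d_small = nullptr;
    if (cudaMalloc(&d_out, out_len * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, in_len * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_small, small_len * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_in, h_in.data(), in_len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_small, h_small.data(), small_len * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    fill_sections<<<1, block, small_len * sizeof(float)>>>(
        d_out, d_in, in_len, d_small, small_len, scalar, per_thread, n_threads);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, out_len * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out); cudaFree(d_in); cudaFree(d_small);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int idx = 0; idx < out_len; ++idx) {
        float expected = scalar * h_in[idx % in_len] + h_small[idx % small_len];
        if (fabsf(h_out[idx] - expected) > 1e-3f) ++bad;
    }
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_demo());
    return 0;
}
```

Note that a scalar passed by value does not need a shared-memory slot at all, and that the per-thread sections used here follow the description in the post rather than an access pattern tuned for memory throughput.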

That’s it for now… I’m sure I’ll have more questions once I understand the basics here… :mellow:


You can copy all the vars to the device… After you’ve done this you can call the kernel and do your computations. Am I right that you only have 150 threads running?

That’s not a lot. I don’t think you will get a great speedup compared to the CPU, because the occupancy of the GPU will be very low, and the overhead of copying your data will be higher than the cost of your computations.

If you still want to do it on the GPU, try to make your number of threads a multiple of 32. I think 160 threads will work better than 150.
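The multiple-of-32 advice comes from the warp size: threads are scheduled in groups of 32. A small sketch of the rounding arithmetic follows; the helper name round_up_to_warp is made up for illustration.

```cuda
#include <cstdio>

#define WARP_SIZE 32  // warp size on NVIDIA GPUs

// Hypothetical helper: round n up to the next multiple of the warp size.
__host__ __device__ inline int round_up_to_warp(int n)
{
    return ((n + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}

int main()
{
    // 150 work items -> launch 160 threads (5 full warps).
    printf("%d\n", round_up_to_warp(150));
    return 0;
}
```

If the block is padded this way, the kernel needs a guard such as `if (tid < 150)` so the 10 extra threads do not write outside the data.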

But good luck with your project and welcome.

Thank you… I’ll look into that.

The issue here is that I need to run FFT for something like 150 different arrays with something in the order of 100,000 members in each array. Sequentially this can get somewhat slow (I tried a straightforward implementation in CUDA with their FFT library with no multi-threading and the improvement over non-CUDA implementation was very minimal), so I was thinking that I could benefit from multi-threading, having each thread compute one FFT for one array (hence, around 150 threads).

Am I wrong in my premise here? Should I rethink my entire strategy here?

Thank you for the warm welcome… :)


Ahh, that is something different: 100k members and 150 arrays. I don’t know anything about FFTs. But you need to rethink your implementation, because if you just take the code you are using for the “normal” CPU implementation it will almost never give a speedup. Do you want to create 150 host (CPU) threads? I don’t think that will work, because you need to copy your arrays to the GPU, and 150x300 MB is a lot more than the max of 1.5 GB :P

But take a good look at CUFFT; I think this is what you want, at least for most of it…

Why reinvent the wheel if it is already there?

Well then, I indeed want to use CUFFT. I just want each thread to execute CUFFT, thus repeating the process n (~150) times. Is this the correct train of thought? Would each thread be able to run an instance of CUFFT?



So if I’m right, you want to call the kernel ~150 times with an array (dataset/members) of 100k elements? I don’t think you want 150 host threads that all call the kernel, because this will not work. Only one kernel can be running at a time on one GPU. So if you want to do this you need 150 GPU nodes. Looking forward to seeing that :D


There may be something very basic that I’m missing here. I didn’t want 150 host threads; I wanted device threads, something like this (pseudocode):

tid = threadIdx.x;

result[tid] = cufft (inputarray[tid]);

Since cufft runs on the GPU, this entire thread can be a GPU thread, is that not the case? :ermm:


Ohh ok… Now I get it. You just want to have 150 threads… Like I said in my first post, I really don’t know anything about CUFFT, so maybe someone else can comment on this one.

I don’t think a single device thread will have the resources it needs to perform a whole FFT. What you really want is to perform a batched 1D FFT. I’ve never used cuFFT, but from what I find on the forums, it already has this feature: …7419&hl=batched …7758&hl=batched
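For reference, a batched 1D transform with cuFFT might look roughly like the sketch below. It relies on the batch argument of cufftPlan1d and assumes single-precision complex input stored back to back in one device buffer. The helper name run_batched_fft and the zero-filled placeholder input are illustrative only. The cuFFT calls here are issued from host code, not from inside a kernel; the library launches its own GPU work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: runs `batch` forward complex-to-complex FFTs of
// length `nx` with one cuFFT call. Returns 0 on success, -1 on error.
int run_batched_fft(int nx, int batch)
{
    cufftComplex *d_data = nullptr;
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * (size_t)batch;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;

    // Placeholder input: real code would cudaMemcpy the arrays here,
    // stored contiguously (array 0, then array 1, ...).
    cudaMemset(d_data, 0, bytes);

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        cudaFree(d_data);
        return -1;
    }

    int status = 0;
    // One host-side call transforms all `batch` arrays, in place.
    if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS)
        status = -1;
    if (cudaDeviceSynchronize() != cudaSuccess)
        status = -1;

    cufftDestroy(plan);
    cudaFree(d_data);
    return status;
}

int main()
{
    // Sizes from the thread: ~150 arrays of ~100,000 members each.
    int status = run_batched_fft(100000, 150);
    printf("batched FFT %s\n", status == 0 ? "ok" : "failed");
    return status == 0 ? 0 : 1;
}
```

This would be built with something like `nvcc batched_fft.cu -lcufft` (the file name is arbitrary). At these sizes the buffer is about 120 MB of single-precision complex data.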

Thanks a lot, I’ll look into that!