Newbie - Need to use shared mem?

Thanks for the replies. If I were to use shared memory, would I be able to copy it over just once? And then have each block that goes through that multiprocessor read from the data already stored there by the previous block? I’m not quite sure how it works.
My program has 2 for loops that go through the array; calculations are done on each piece, all the results from the calculations are summed up, and the result is stored back in global memory.

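No. Shared memory only lives as long as the block that uses it, so each new block on a multiprocessor has to load its own data again. But maybe you could benefit from constant memory.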

I’ll keep that in mind, thanks. Right now it’s premature for us to outsource, but soon we could be in a position to consider it.

I read that you only get 64 KB of constant memory, and my 512x512 array of floats (1 MB) won’t fit in that. Should I instead try texture memory or optimizing it for global memory? Or just stay with constant memory, but break up my calculations into smaller tiles? Thanks!
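To make the size arithmetic above concrete, here is a minimal sketch that compares the array size with the constant memory the device reports. The helper names fitsInConstant and chunksNeeded are hypothetical, made up for this illustration; cudaGetDeviceProperties and the totalConstMem field are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: does a buffer of `bytes` fit in `constBytes` of constant memory?
inline bool fitsInConstant(size_t bytes, size_t constBytes) {
    return bytes <= constBytes;
}

// Hypothetical helper: number of constant-memory-sized pieces needed (rounded up).
inline size_t chunksNeeded(size_t bytes, size_t constBytes) {
    return (bytes + constBytes - 1) / constBytes;
}

int main() {
    // 512 * 512 floats * 4 bytes each = 1,048,576 bytes (1 MB)
    const size_t arrayBytes = 512 * 512 * sizeof(float);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }

    printf("array: %zu bytes, constant memory: %zu bytes\n",
           arrayBytes, prop.totalConstMem);
    printf("fits: %s, pieces needed: %zu\n",
           fitsInConstant(arrayBytes, prop.totalConstMem) ? "yes" : "no",
           chunksNeeded(arrayBytes, prop.totalConstMem));
    return 0;
}
```

With a 64 KB limit the array would have to be split into 16 pieces, which is why tiling comes up in the first place.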

I think I’m back to the beginning again. Constant memory won’t fit my 512x512 array (I think >.>) and you said texture memory is rarely useful (any particular reason why? I read in the programming manual that both are cached. Is there any other significant property of texture memory?). So, if I tile my calculations, I think shared memory is best. Each thread needs to perform many calculations on each piece of the 512x512 array, and if I use shared memory, I can just load each new tile over the previous tile at each iteration in the thread. I’m not very experienced with CUDA, but is this a good idea compared to the alternatives? I still don’t get to avoid all that memory copying, but it should work better than global memory I suppose.

Yes, tiling with shared memory is generally the best solution. Other solutions may be slightly better in specific circumstances, but shared memory is a reliably good choice. If you do many calculations per element (check the specifics for your case), the extra copying will hardly matter. (Make sure your copies are coalesced.)
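A minimal sketch of the tiling pattern described above, under some loud assumptions: the real per-element calculation is not given in the thread, so contribution() is a placeholder (a plain product), and the names tiledSum, runTiledSum and TILE are made up for illustration. Each block loads one tile into shared memory with a coalesced read, every thread sums over that tile, and the next tile then overwrites the previous one.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // assumed block size; each thread loads one tile element

// Placeholder for the real calculation, which the thread does not specify.
__device__ float contribution(float mine, float other) { return mine * other; }

// Thread i computes out[i] = sum over all j of contribution(data[i], data[j]).
// Must be launched with blockDim.x == TILE.
__global__ void tiledSum(const float* data, float* out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float mine = (i < n) ? data[i] : 0.0f;
    float sum = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // Consecutive threads read consecutive addresses, so this load is coalesced.
        tile[threadIdx.x] = (j < n) ? data[j] : 0.0f;
        __syncthreads();  // the whole tile must be loaded before anyone reads it

        int count = (n - base < TILE) ? (n - base) : TILE;
        for (int k = 0; k < count; ++k)
            sum += contribution(mine, tile[k]);
        __syncthreads();  // everyone must finish before the tile is overwritten
    }
    if (i < n) out[i] = sum;
}

// Runs the kernel on test data and checks the result against the host.
bool runTiledSum(int n) {
    std::vector<float> h(n), result(n);
    for (int i = 0; i < n; ++i) h[i] = 0.25f * (i % 4);  // values chosen to sum exactly

    size_t bytes = n * sizeof(float);
    float *dData = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dData, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dData); return false; }
    cudaMemcpy(dData, h.data(), bytes, cudaMemcpyHostToDevice);

    tiledSum<<<(n + TILE - 1) / TILE, TILE>>>(dData, dOut, n);
    // This copy waits for the kernel and also reports any launch error.
    bool ok = cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dData);
    cudaFree(dOut);

    double total = 0.0;
    for (float v : h) total += v;
    for (int i = 0; ok && i < n; ++i)
        ok = (result[i] == h[i] * static_cast<float>(total));
    return ok;
}

int main() {
    bool ok = runTiledSum(512 * 512);
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The two __syncthreads() calls are what make reusing the same shared buffer safe. Each element is read from global memory once per block rather than once per thread, which is where the saving comes from.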

This is probably a stupid question, but what happens to variables you don’t manually copy to the GPU? In your kernel invocation, you pass the variables you already copied over to global mem, constant mem etc, but what about the parameters like matrix dimensions?

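You can pass ordinary values (i.e. not pointers) and these get copied automatically. In fact, the value of a pointer gets copied the same way; only the data it points to has to be copied by hand. You can also pass structs by value (again copied automatically), up to 256 bytes of parameters in total on compute capability 1.x devices (later architectures raise this limit to 4 KB).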
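As an illustration of passing parameters by value, here is a small sketch. The struct MatrixInfo, the kernel scaleMatrix and the helper runScale are hypothetical names invented for this example. Only the array itself goes through cudaMalloc and cudaMemcpy; the dimensions and scale factor travel with the launch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameter struct, passed to the kernel by value.
struct MatrixInfo {
    int rows;
    int cols;
    float scale;
};

// `m` is a device pointer (its value is copied like any other argument);
// `info` is copied to the device automatically at launch.
__global__ void scaleMatrix(float* m, MatrixInfo info) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < info.rows && col < info.cols)
        m[row * info.cols + col] *= info.scale;
}

// Fills a rows x cols matrix with 1.0f, scales it on the GPU, checks the result.
bool runScale(int rows, int cols, float scale) {
    std::vector<float> h(static_cast<size_t>(rows) * cols, 1.0f);
    size_t bytes = h.size() * sizeof(float);

    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);  // manual copy: array data only

    MatrixInfo info = {rows, cols, scale};  // no cudaMemcpy needed for this
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    scaleMatrix<<<grid, block>>>(d, info);

    bool ok = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    for (size_t i = 0; ok && i < h.size(); ++i)
        ok = (h[i] == scale);
    return ok;
}

int main() {
    bool ok = runScale(512, 512, 2.0f);
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Grouping the dimensions into one small struct keeps the kernel signature short, and the struct stays far below the parameter size limit.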