Is that ~500 bytes per thread or per block? If it's per thread, it won't fit in shared memory unless you use fewer than about 32 threads per block (16 kB / ~500 B ≈ 32). You have 16 kB of shared memory per block (minus kernel parameters, which are implicitly copied to smem). You might be forced to fall back on local memory, but it's a lot slower than shared: think of smem as a managed cache, while local memory is just a per-thread slice of global memory, i.e. device RAM.
Since I am not using shared memory currently, I will move some of my objects there, starting with the most frequently accessed ones. If my understanding is correct, accessing them in shared memory will be quicker. I’ll make sure I don’t go over 16KB.
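For the "move the most frequently accessed objects to smem" idea, the usual pattern is to have the threads of a block cooperatively copy the data from global memory once, then synchronize before using it. A hypothetical sketch (the lookup table `g_lut` and its size are made up for illustration):

```cuda
// Stage a frequently-read table (256 floats = 1 kB, well under 16 kB)
// from global into shared memory once per block, then read it from smem.
#define LUT_SIZE 256

__global__ void myKernel(const float *g_lut, const float *in, float *out, int n)
{
    __shared__ float s_lut[LUT_SIZE];       // one copy per block

    // Cooperative load: each thread copies a strided slice of the table.
    for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x)
        s_lut[i] = g_lut[i];
    __syncthreads();                        // wait until the table is complete

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] * s_lut[tid % LUT_SIZE];  // fast smem reads
}
```

Don't forget the __syncthreads() after the load — without it some threads may read entries that haven't been written yet.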
Simple variables (int, float, char etc., and the built-in vector types float2, int4 etc.) go to registers by default (unless you run out of hardware registers, in which case they 'spill' to local memory). Registers are separate from shared memory and are per-thread (smem is as fast as registers as long as there are no bank conflicts that need to be serialized).
I believe small arrays, like float c[4], will also get translated to four consecutive registers. But for arrays as big as 128 elements, the compiler will spill them to local memory.
So, it’s like this:
float c = a*b; //compiles to register operations, unless you run out of resources
float c[4]; //compiles to registers (I think)
float c[128]; //too big for registers, compiles to local memory
__shared__ float c[128]; //compiles to shared memory, allocated per block: every thread in the block sees the same array!
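The four cases above, put together in one (hypothetical) kernel so the scopes are clear — where each declaration ends up is the compiler's decision, so take the comments as the expected outcome, not a guarantee:

```cuda
__global__ void whereThingsLive(const float *g_in, float *g_out)
{
    float a = g_in[threadIdx.x];
    float c = a * 2.0f;            // scalar: a register, per thread

    float small[4];                // small array: likely promoted to registers
    small[0] = c;

    float big[128];                // too big: spilled to local memory (slow)
    big[threadIdx.x % 128] = c;

    __shared__ float shared[128];  // one array per block, shared by all threads
    shared[threadIdx.x % 128] = small[0] + big[threadIdx.x % 128];
    __syncthreads();               // make the writes visible block-wide

    g_out[threadIdx.x] = shared[threadIdx.x % 128];
}
```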
Oh, and about constructors: C++ is not supported in device code, only C (with some extensions like templates, but it's not really object-oriented programming). Read through the programming guide; all of your questions so far are answered there.