If a CUDA kernel has 4 integer variables, how is memory allocated for them inside CUDA? Does it vary with the number of threads used?
I.e., the memory used in the above kernel is 4*4 bytes = 16 bytes. If the CUDA kernel is called with 10 threads, is the total memory allocated 10 times the 16 bytes? Or is it something different? …
How exactly does it allocate the memory for threads/blocks?
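Each thread stores its variables in registers, so yes, usage scales with the number of threads: every thread gets its own private copy of the 4 integers. Each multiprocessor has a fixed register file (taken here as 1024 32-bit registers; the actual size depends on the GPU's compute capability), and those registers are divided among the blocks resident on that multiprocessor. So if you have 10 threads per block, and the compiler keeps each int in its own register, one block will use 10 x 4 = 40 registers (160 bytes) altogether. If there are enough registers left on a multiprocessor, more blocks can run on it at the same time. If you increase the number of threads per block, each block needs more registers. If you keep the number of threads per block the same and launch more blocks instead, the register use per block does not change; you just have more blocks to schedule.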
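To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). The helper names `registers_per_block` and `blocks_by_registers` are hypothetical, not part of the CUDA API, and the model is deliberately simplified: it assumes one 32-bit register per integer variable and looks at registers only. On real hardware the compiler decides how many registers a kernel uses, registers are handed out at a coarser granularity than single threads, and other limits (shared memory, maximum resident blocks) can cap the number of blocks first.

```cpp
#include <cstdint>

// Hypothetical helper: registers needed by one block if every thread
// holds `regs_per_thread` 32-bit registers (simplified model).
inline std::uint32_t registers_per_block(std::uint32_t threads_per_block,
                                         std::uint32_t regs_per_thread) {
    return threads_per_block * regs_per_thread;
}

// Hypothetical helper: how many blocks fit on one multiprocessor when
// registers are the only limit. `regs_per_multiprocessor` is an input
// because the register file size differs between GPU generations.
inline std::uint32_t blocks_by_registers(std::uint32_t regs_per_multiprocessor,
                                         std::uint32_t threads_per_block,
                                         std::uint32_t regs_per_thread) {
    const std::uint32_t per_block =
        registers_per_block(threads_per_block, regs_per_thread);
    if (per_block == 0) {
        return 0;  // nothing to schedule; avoid dividing by zero
    }
    // Integer division: a block that only partly fits cannot be resident.
    return regs_per_multiprocessor / per_block;
}
```

With the numbers used in the answer, `registers_per_block(10, 4)` gives 40, and `blocks_by_registers(1024, 10, 4)` gives 25, since 1024 / 40 rounds down to 25. Doubling the threads per block to 20 doubles the registers per block to 80 and drops the count to 12.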