Ok, this is probably in the documentation somewhere, but either I'm not understanding what I'm reading or I just haven't found it yet:
Is it possible to get a chunk of memory that is private to each thread, and that I won't have to pass around from device function to device function as an argument? That is, I have my kernel, which calls other device functions, and I want the equivalent of global variables accessible to all of them, but with a separate instance for each thread. My impression is that global, constant and shared declarations are instantiated once per kernel call rather than once per thread, and I want my memory chunked out once per thread. The way I fear I'll have to do this is to cudaMalloc from the host with enough space for all my threads, pass the pointer to the kernel, and make each thread figure out which chunk of the memory belongs to it. That seems like a lot of hassle.
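To make the workaround concrete, here is a minimal sketch of what I think it would look like. The names `ThreadState`, `helper` and `kernel`, the struct layout, and the launch dimensions are all made up for illustration; the only real API calls are the standard CUDA runtime ones (`cudaMalloc`, `cudaMemcpy`, `cudaDeviceSynchronize`, `cudaFree`). It assumes a 1D grid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-thread state; the real contents would be whatever
// "globals" each thread needs.
struct ThreadState {
    int   counter;
    float scratch[4];
};

// A device function that needs the per-thread state. Note that the
// pointer still has to be passed in, which is the hassle in question.
__device__ void helper(ThreadState* s) {
    s->counter += 1;
}

__global__ void kernel(ThreadState* pool, int n) {
    // Each thread works out which slot of the pool belongs to it.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    ThreadState* s = &pool[tid];
    s->counter = tid;
    helper(s);
}

int main() {
    const int threadsPerBlock = 128;  // assumed launch configuration
    const int blocks = 4;
    const int n = threadsPerBlock * blocks;

    // One slot per thread, allocated once from the host.
    ThreadState* pool = nullptr;
    if (cudaMalloc(&pool, n * sizeof(ThreadState)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    kernel<<<blocks, threadsPerBlock>>>(pool, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel failed\n");
        cudaFree(pool);
        return 1;
    }

    // Copy back the last thread's slot just to check the indexing.
    ThreadState last;
    cudaMemcpy(&last, pool + (n - 1), sizeof(ThreadState),
               cudaMemcpyDeviceToHost);
    printf("counter in last slot: %d\n", last.counter);

    cudaFree(pool);
    return 0;
}
```

This works as far as I can tell, but every device function still takes the `ThreadState*` argument, which is exactly what I was hoping to avoid.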