Thread-local arrays and other large structures end up in local memory, which is a slice of global memory dedicated to each thread. It will be slow, but it should work as is. Your code allocates about 3 MB in total (for the entire grid), so there shouldn’t be a problem. If you launched more threads or used a bigger array, so that the total local storage exceeded the card’s global memory size, I’d guess you would get a kernel launch failure at runtime (either “unspecified error” or “out of resources”), but I’ve never tested it.
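If you want to check this on your own card rather than take my word for it, here is a minimal sketch. The kernel, the 256-float array, and the 12 x 256 launch are hypothetical stand-ins for your code, chosen so the simple estimate comes out at 3 MB. It asks the runtime how much local memory the compiler assigned per thread, compares a per-thread-times-thread-count estimate against the card's global memory, and reports whatever error the launch returns. Whether the array really lands in local memory is the compiler's decision, so treat the reported numbers as what to trust.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-thread scratch size: 256 floats = 1024 bytes.
constexpr int kScratchLen = 256;

// Each thread owns a private array. It is indexed with a value known only at
// run time, which typically makes the compiler place it in local memory
// instead of registers (the final placement is up to the compiler).
__global__ void scratchKernel(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float scratch[kScratchLen];
    for (int i = 0; i < kScratchLen; ++i)
        scratch[i] = static_cast<float>(tid + i);

    // Data-dependent index; the mask keeps it inside the array.
    out[tid] = scratch[idx[tid] & (kScratchLen - 1)];
}

// Simple estimate used in the post: per-thread bytes times thread count.
constexpr size_t totalLocalBytes(size_t bytesPerThread, size_t threads)
{
    return bytesPerThread * threads;
}

// Returns 0 if the launch succeeded, 1 otherwise.
int runDemo()
{
    const int threadsPerBlock = 256, blocks = 12;  // hypothetical: 3072 threads
    const int n = threadsPerBlock * blocks;

    // Per-thread local memory as reported for the compiled kernel.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scratchKernel) != cudaSuccess) return 1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    printf("local bytes per thread: %zu\n", attr.localSizeBytes);
    printf("estimated grid total:   %zu\n",
           totalLocalBytes(attr.localSizeBytes, static_cast<size_t>(n)));
    printf("global memory on card:  %zu\n", prop.totalGlobalMem);

    int *idx = nullptr;
    float *out = nullptr;
    if (cudaMalloc(&idx, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(idx);
        return 1;
    }
    cudaMemset(idx, 0, n * sizeof(int));

    scratchKernel<<<blocks, threadsPerBlock>>>(idx, out, n);

    // Launch-time errors show up in cudaGetLastError, execution errors
    // after synchronizing.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("launch result: %s\n", cudaGetErrorString(err));

    cudaFree(idx);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Call `runDemo()` from your `main` and build with `nvcc`. To probe the failure case, raise `kScratchLen` or `blocks` until the estimate approaches the card's global memory and see which error string comes back.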