Where is the GPU code located during execution, in host memory or in device memory?

According to the CUDA programming guide, a CUDA program, after compilation by nvcc, is separated into two kinds of code: roughly speaking, CPU code and GPU code. During execution, the former is surely located in system memory and executed by the CPU. There is no doubt that the GPU code is executed by the GPU, but my question is: where is it located, in system memory or in device memory? Could you please offer me an in-depth discussion of this question? Thanks.
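
To make concrete what I mean by the two kinds of code, here is a minimal example (the kernel name and sizes are arbitrary, just for illustration): everything marked `__global__` is compiled by nvcc into GPU machine code and embedded in the executable, while `main` and the rest are ordinary host code.

```
// minimal.cu -- toy example; kernel name and sizes are arbitrary
#include <cstdio>

// Device (GPU) code: nvcc compiles this to GPU machine code and
// embeds it in the executable as part of a fat binary.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host (CPU) code: compiled by the host compiler, lives in system
// memory and runs on the CPU like any other program.
int main() {
    const int n = 256;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    add_one<<<1, n>>>(d, n);   // where does the code for add_one live at this point?
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```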

I think it’s in device memory. GPU code will be executed by the GPU, whose main memory is device memory, just as system memory is the main memory of the CPU.

In more detail, I think that when CUDA code needs to be executed, the CPU will launch a DMA request, the DMA controller will copy the CUDA code into device memory (is this the same device memory we use in CUDA, or a separate memory just for storing CUDA instructions? I suspect the former), and then signal the GPU with the start address of the CUDA code.
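
You don’t have to guess entirely from the runtime API: the driver API makes the loading step explicit. Below is a rough sketch (error handling omitted; "kernels.cubin" and the kernel name are placeholders) that loads a precompiled module and compares free device memory before and after. On many setups the free-memory delta reflects the module becoming resident on the device, which is consistent with the code ending up in ordinary device memory, managed by the driver rather than by a DMA request you issue yourself. Note that depending on the driver, and whether lazy module loading is enabled, the actual copy may be deferred to the first launch, so the delta can be zero until the kernel is used.

```
// load_module.cpp -- sketch only; link with -lcuda
#include <cstdio>
#include <cuda.h>   // CUDA driver API

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cuMemGetInfo(&freeBefore, &total);

    // The driver reads the device code image (sitting in host memory /
    // on disk) and makes it available to the GPU -- this is the step
    // where the kernel image is transferred to the device.
    CUmodule mod;
    cuModuleLoad(&mod, "kernels.cubin");   // placeholder file name

    cuMemGetInfo(&freeAfter, &total);
    printf("free device memory before: %zu, after: %zu (delta %zu bytes)\n",
           freeBefore, freeAfter, freeBefore - freeAfter);

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "add_one");  // placeholder kernel name

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```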

When you launch the kernel, it is copied into device memory. There is no information as to where exactly, but the hints are that it’s dedicated code memory (hence the limit of 2,000,000 instructions per kernel, and no instruction load latency).

There is definitely an instruction load latency, and instructions are almost certainly stored in DRAM. This paper, http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf, shows that there is a three-level cache hierarchy for instructions, where the upper levels are shared with the constant cache. Why have a cache if the instructions are not stored in DRAM? Besides, 2,000,000 instructions would take at least 4–16 MB of storage (at 16–64 bits per instruction); there is no way to hold that much code anywhere other than DRAM.
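
One crude way to see instruction fetch effects without a simulator is to run the same dependent integer work twice: once as a tight loop (small code footprint) and once fully unrolled into straight-line code (large code footprint), and time both. The kernel names, step count, and launch configuration below are arbitrary, and results depend heavily on the GPU and compiler (the footprint may need scaling up on parts with larger instruction caches), but if the straight-line version is measurably slower per operation, that is consistent with instruction fetches missing the i-cache and going out to further cache levels and ultimately DRAM.

```
// icache_probe.cu -- rough sketch; sizes are arbitrary, results are GPU-dependent
#include <cstdio>
#include <cuda_runtime.h>

// One dependent, non-foldable integer step.
#define STEP  x = x * x + 12345u;
#define S16   STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP STEP
#define S256  S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16 S16
#define S4K   S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256 S256

// Small code footprint: ~4096 steps executed from a tight loop.
__global__ void looped(unsigned *out) {
    unsigned x = threadIdx.x + 1;
    #pragma unroll 1
    for (int i = 0; i < 4096; ++i) STEP
    out[threadIdx.x] = x;   // keep the result live so the work is not optimized away
}

// Large code footprint: the same ~4096 steps as straight-line code.
__global__ void unrolled(unsigned *out) {
    unsigned x = threadIdx.x + 1;
    S4K
    out[threadIdx.x] = x;
}

static float timeKernel(void (*launch)(unsigned *), unsigned *d) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    launch(d);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    unsigned *d; cudaMalloc(&d, 1024 * sizeof(unsigned));
    auto launchLooped   = [](unsigned *p) { looped<<<64, 256>>>(p); };
    auto launchUnrolled = [](unsigned *p) { unrolled<<<64, 256>>>(p); };
    // warm-up launches so module load / JIT cost is not part of the measurement
    launchLooped(d); launchUnrolled(d); cudaDeviceSynchronize();
    printf("looped:   %.3f ms\n", timeKernel(launchLooped, d));
    printf("unrolled: %.3f ms\n", timeKernel(launchUnrolled, d));
    cudaFree(d);
    return 0;
}
```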