My application uses three kernels, running serially in stream 1, with run times of 5 µs, 1.5 ms and 2.4 ms respectively.
Kernel 1 writes 1024 bytes to global memory. When it finishes, an event triggers a cudaMemcpyToSymbolAsync in stream 2, which copies this array to statically allocated constant memory. Kernel 3 then reads the constant array. This works well.
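For reference, the current arrangement looks roughly like the sketch below. The kernel bodies, the names (c_table, kernel1, kernel2, kernel3, k1Done, copyDone) and the launch dimensions are placeholders invented for illustration, not my real code. It also assumes the event is recorded in stream 1 after kernel 1 and waited on by stream 2, and that a second event makes kernel 3 wait for the copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

constexpr size_t kTableBytes = 1024;  // size stated in the post
constexpr size_t kTableWords = kTableBytes / sizeof(unsigned int);

// Statically allocated constant memory (hypothetical name).
__constant__ unsigned int c_table[kTableWords];

// Placeholder for kernel 1: writes the table to global memory.
__global__ void kernel1(unsigned int* table) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < kTableWords) table[i] = 2u * i;
}

// Placeholder for kernel 2: does not touch the table.
__global__ void kernel2() {}

// Placeholder for kernel 3: reads the table through constant memory.
__global__ void kernel3(unsigned int* out) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < kTableWords) out[i] = c_table[i] + 1u;
}

int main() {
    cudaStream_t s1, s2;
    cudaEvent_t k1Done, copyDone;
    unsigned int *d_table = nullptr, *d_out = nullptr;
    static unsigned int h_out[kTableWords];

    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));
    CHECK(cudaEventCreateWithFlags(&k1Done, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming));
    CHECK(cudaMalloc(&d_table, kTableBytes));
    CHECK(cudaMalloc(&d_out, kTableBytes));

    // Stream 1: kernel 1, then mark the point where the table is ready.
    kernel1<<<1, kTableWords, 0, s1>>>(d_table);
    CHECK(cudaEventRecord(k1Done, s1));

    // Stream 2: wait for kernel 1, then copy global -> constant memory.
    CHECK(cudaStreamWaitEvent(s2, k1Done, 0));
    CHECK(cudaMemcpyToSymbolAsync(c_table, d_table, kTableBytes, 0,
                                  cudaMemcpyDeviceToDevice, s2));
    CHECK(cudaEventRecord(copyDone, s2));

    // Stream 1: kernel 2 can overlap the copy; kernel 3 must wait for it.
    kernel2<<<1, 1, 0, s1>>>();
    CHECK(cudaStreamWaitEvent(s1, copyDone, 0));
    kernel3<<<1, kTableWords, 0, s1>>>(d_out);
    CHECK(cudaGetLastError());

    CHECK(cudaStreamSynchronize(s1));
    CHECK(cudaMemcpy(h_out, d_out, kTableBytes, cudaMemcpyDeviceToHost));
    printf("h_out[5] = %u\n", h_out[5]);

    CHECK(cudaFree(d_table));
    CHECK(cudaFree(d_out));
    CHECK(cudaEventDestroy(k1Done));
    CHECK(cudaEventDestroy(copyDone));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    return 0;
}
```

The point of the second stream is that the copy runs alongside kernel 2 instead of delaying it.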
I now want to enlarge the copied array to 64kB. Doing so triggers the expected error of exceeding the permitted 64kB constant memory limit, because a small amount of constant memory is already used elsewhere.
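As a sanity check on the limit, the amount of constant memory the runtime reports can be queried as in the minimal sketch below. It assumes device 0, and kWantedBytes is just the 64kB figure from above. It only reports the user-visible limit and says nothing about the additional banks mentioned further down.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t kWantedBytes = 64 * 1024;  // size of the enlarged array

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // totalConstMem is the constant memory available to __constant__ data.
    printf("totalConstMem: %zu bytes\n", prop.totalConstMem);

    if (kWantedBytes > prop.totalConstMem) {
        printf("A %zu-byte array does not fit at all.\n", kWantedBytes);
    } else {
        // Any other __constant__ data in the program must fit in this headroom.
        printf("Headroom left for other constant data: %zu bytes\n",
               prop.totalConstMem - kWantedBytes);
    }
    return 0;
}
```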
A note in the margin of the “Technical Specification” table on the CUDA Wiki page, in the “Constant memory size” field, states:
“Constant memory size accessible by CUDA C/C++(1 bank, PTX can access 11 banks, SASS can access 18 banks)”.
Looking in the PTX ISA here and here makes me wonder: could I write a small device PTX function that uses a whole 64kB bank exclusively for the copy outlined above?
A comment in the PTX ISA:
"Constant buffers allocated by the driver are initialized by the host, and pointers to such buffers are passed to the kernel as parameters."
Two things I’m unsure of, assuming this can be done:
- How would I initialise this on the host? With cudaMalloc?
- Would I actually have to rewrite kernels 1 and 3 completely in PTX in order to use the Kernel Parameter Attribute .ptr?