Where can I find information about the caching strategy for CUDA kernels?
For example, I have 500 kernels that each access one byte of a 500-byte array in their first instruction. Should the first kernel start on the last byte or on the first byte?
My second question: suppose every kernel reads a second byte in its second instruction. Should I use a second 500-byte array, or is it better to use one 1000-byte array in which every kernel works on consecutive bytes (e.g. the first kernel reads byte 0 in its first instruction and byte 1 in its second)? Or is it faster if kernel one uses bytes 0 and 512?
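To make the two layouts concrete, here is a minimal sketch of what I mean. It assumes that what I call "kernels" above are the 500 threads of a single kernel launch, and that the second block of bytes starts at offset 512 as in my example. The names `split_layout`, `interleaved_layout`, `run_thread1`, `N` and `STRIDE` are made up for illustration, and the sketch only shows the two access patterns; it does not measure which one is faster.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 500;       // number of threads, as in the question
constexpr int STRIDE = 512;  // assumed offset of the second byte block

// Layout A: thread i reads byte i, then byte STRIDE + i.
// Neighbouring threads touch neighbouring bytes in each instruction.
__global__ void split_layout(const unsigned char* data, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = data[i] + data[STRIDE + i];
}

// Layout B: thread i reads bytes 2*i and 2*i + 1.
// Neighbouring threads are two bytes apart in each instruction.
__global__ void interleaved_layout(const unsigned char* data, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = data[2 * i] + data[2 * i + 1];
}

// Host helper: fills a 1024-byte buffer with k % 256, runs one of the
// two kernels, and returns the value computed by thread 1.
int run_thread1(bool interleaved) {
    static unsigned char h_data[2 * STRIDE];
    static int h_out[N];
    for (int k = 0; k < 2 * STRIDE; ++k) h_data[k] = (unsigned char)(k % 256);

    unsigned char* d_data = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (N + block - 1) / block;
    if (interleaved)
        interleaved_layout<<<grid, block>>>(d_data, d_out);
    else
        split_layout<<<grid, block>>>(d_data, d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return h_out[1];
}

int main() {
    printf("split layout, thread 1:       %d\n", run_thread1(false));
    printf("interleaved layout, thread 1: %d\n", run_thread1(true));
    return 0;
}
```

So the question is which of the two kernels gives the better memory access pattern, and whether the direction in which the threads walk through the array matters at all.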