This is surely the most noobish CUDA question ever, so please bear with me.
I’ve been programming CUDA for about a year now. A few weeks ago I decided to pause development and read a handful of books on CUDA programming in order to get a handle on some things I’ve never quite understood. Three weeks on, I’m realizing that some of the stuff I’ve been doing may in fact be hurting performance. Take, for example, the following type of kernel you’d currently see in my code base:
__global__ void myKernel( const short* __restrict__ src, short* __restrict__ dst )
{
    dst[ threadIdx.x ] = src[ threadIdx.x ] + 42;
}
The problems with this kernel as I understand them are as follows:
- The use of shorts. From all I've read, it seems that CUDA does not particularly "like" data types smaller than 32 bits, and nvcc will in this case generate additional instructions to convert short local variables to and from 32 bits, because the granularity of the register file is 32 bits. Question: is this still true of the latest CUDA architectures? I've come across some PTX instructions that operate on 16-bit integers. Is there any penalty associated with such instructions? (A vectorized short2 rewrite I've been toying with is sketched after this list.)
- If I'm not mistaken, using const together with __restrict__ means memory reads can be serviced through the read-only cache, which has a granularity of 32 bytes, whereas if both the L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. My thinking here was confused: I had conflated __restrict__ with its C equivalent, restrict, and clearly they have different meanings. Question: is there a mechanism for informing nvcc that two pointers do not alias each other? (An explicit __ldg() variant is also sketched below.)
- I'm not certain about the use of a const pointer without __restrict__. Is there a penalty associated with the use of const? For example, would the read be serviced through the constant cache? (My mental model of the constant cache is the last sketch below.)
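To make the first point concrete, here is the kind of rewrite I've been toying with: a minimal sketch (the name myKernelVec2 is mine, not from my code base) that processes two shorts per thread through the built-in short2 type, so every load and store is a full 32-bit transaction:

__global__ void myKernelVec2( const short2* __restrict__ src, short2* __restrict__ dst )
{
    short2 v = src[ threadIdx.x ];   // one 32-bit load covers two shorts
    v.x += 42;
    v.y += 42;
    dst[ threadIdx.x ] = v;          // one 32-bit store
}

The launch would then use half as many threads as the original; I haven't measured whether this actually buys anything, which is partly why I'm asking.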
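For the second point, this is the explicit form I keep running into while reading: requesting the load through the read-only data cache with the __ldg() intrinsic (sm_35 or later) rather than relying on the const + __restrict__ qualifiers. Again just a sketch, not something I've benchmarked:

__global__ void myKernelLdg( const short* src, short* dst )
{
    // __ldg() asks for the load via the read-only data cache explicitly
    dst[ threadIdx.x ] = __ldg( &src[ threadIdx.x ] ) + 42;
}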
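And for the last point, this is the contrast I have in my head: as far as I understand, it is __constant__ memory, not a plain const pointer argument, that is actually serviced through the constant cache. The offset variable below is purely illustrative:

__constant__ short offset;   // lives in constant memory, read through the constant cache

__global__ void myKernelConstMem( const short* src, short* dst )
{
    dst[ threadIdx.x ] = src[ threadIdx.x ] + offset;   // src is just a const pointer
}

with the host setting it via cudaMemcpyToSymbol( offset, &h_offset, sizeof( short ) ). Please correct me if any of that is off.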