Calling cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared) while allocating about 38 KB of shared memory in the kernel produces the following compile error:
ptxas error : Entry function '_Z12MyKernelPlPfS0_S0_iiilli' uses too much shared data (0x88f8 bytes + 0x10 bytes system, 0x4000 max)
According to this error, the maximum shared memory allowed is 16 KB (0x4000 bytes), rather than the 48 KB requested with cudaFuncSetCacheConfig. The default setting also allocates 48 KB to shared memory, so the 16 KB limit is confusing.
Is the problem related to the amount of L1 cache required by the kernel? I understand that the system will allocate the necessary memory to L1 regardless of the specified preference. I suppose one way to verify this is to simplify the kernel logic and try the cache setting again. Is there any other way to determine what's causing this problem?
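For reference, here is a minimal sketch of a check that could help narrow this down. It decodes the hex values from the ptxas message and compares them with what the runtime reports for the installed GPU. The helper name fitsInSharedMemory and the use of device 0 are assumptions made for illustration. If the device reports more than 0x4000 bytes per block, one possible explanation (not confirmed here) is that ptxas is checking against the limit of the architecture being compiled for, which is set at build time (for example through nvcc's -arch option), and not against the runtime cache configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the kernel's shared-memory footprint
// (user data plus system overhead) fits within the given limit.
bool fitsInSharedMemory(size_t userBytes, size_t systemBytes, size_t limitBytes)
{
    return userBytes + systemBytes <= limitBytes;
}

int main()
{
    // Values copied from the ptxas error message.
    const size_t userBytes   = 0x88f8;  // shared data used by the kernel
    const size_t systemBytes = 0x10;    // system overhead reported by ptxas
    const size_t ptxasLimit  = 0x4000;  // maximum reported by ptxas

    printf("kernel needs %zu bytes, ptxas limit is %zu bytes\n",
           userBytes + systemBytes, ptxasLimit);
    printf("fits under ptxas limit: %s\n",
           fitsInSharedMemory(userBytes, systemBytes, ptxasLimit) ? "yes" : "no");

    // Ask the runtime what the installed GPU supports (device 0 is assumed).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("shared memory per block on device: %zu bytes\n",
           prop.sharedMemPerBlock);
    printf("fits on device: %s\n",
           fitsInSharedMemory(userBytes, systemBytes, prop.sharedMemPerBlock)
               ? "yes" : "no");
    return 0;
}
```

If the two limits disagree, comparing the architecture flags on the nvcc command line against the compute capability printed above would be a reasonable next step.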
Thanks in advance, Joe.