Hi,
I wrote a simple program to check the effect of cache size:
__global__ void addTestLoop(double* tabA, double* tabB, double* tabC, int nbElt, int nbTest)
{
    for (long long int idxTest = 0; idxTest < nbTest; idxTest++)
        for (int i = 0; i < nbElt; i++)
            tabC[i] = tabA[i] + tabB[i];
}
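For anyone who wants to reproduce the measurement, here is a minimal host-side sketch around the kernel above. The launch configuration (`<<<1, 1>>>`), the helper names `workingSetBytes` and `nsPerElement`, the size range, and the event-based timing are my assumptions; the original post does not show how the kernel was launched or timed.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Kernel as posted: each launched thread walks the whole array nbTest times.
__global__ void addTestLoop(double* tabA, double* tabB, double* tabC, int nbElt, int nbTest)
{
    for (long long int idxTest = 0; idxTest < nbTest; idxTest++)
        for (int i = 0; i < nbElt; i++)
            tabC[i] = tabA[i] + tabB[i];
}

// Bytes touched by one pass: two input arrays plus one output array.
size_t workingSetBytes(int nbElt)
{
    return 3 * static_cast<size_t>(nbElt) * sizeof(double);
}

// Hypothetical helper: nanoseconds per element, or a negative value on error.
double nsPerElement(int nbElt, int nbTest)
{
    double *tabA = nullptr, *tabB = nullptr, *tabC = nullptr;
    size_t bytes = static_cast<size_t>(nbElt) * sizeof(double);
    if (cudaMalloc(&tabA, bytes) != cudaSuccess ||
        cudaMalloc(&tabB, bytes) != cudaSuccess ||
        cudaMalloc(&tabC, bytes) != cudaSuccess) {
        cudaFree(tabA); cudaFree(tabB); cudaFree(tabC);
        return -1.0;
    }
    cudaMemset(tabA, 0, bytes);
    cudaMemset(tabB, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Assumed launch: a single thread, so only one SM and its caches are exercised.
    addTestLoop<<<1, 1>>>(tabA, tabB, tabC, nbElt, 1);       // warm-up pass
    cudaEventRecord(start);
    addTestLoop<<<1, 1>>>(tabA, tabB, tabC, nbElt, nbTest);  // timed passes
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(tabA); cudaFree(tabB); cudaFree(tabC);

    double elements = static_cast<double>(nbElt) * nbTest;
    return ok ? (ms * 1.0e6) / elements : -1.0;
}

int main()
{
    const int totalElts = 1 << 20;  // keep the work per measurement roughly constant
    for (int log2Bytes = 5; log2Bytes <= 22; ++log2Bytes) {
        int nbElt = (1 << log2Bytes) / static_cast<int>(sizeof(double));
        int nbTest = totalElts / nbElt > 0 ? totalElts / nbElt : 1;
        std::printf("array 2^%d bytes, working set %zu bytes: %.2f ns/element\n",
                    log2Bytes, workingSetBytes(nbElt), nsPerElement(nbElt, nbTest));
    }
    return 0;
}
```

Note that the working set is three arrays, so a step in the curve should be compared against three times the per-array size rather than the array size alone.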
I get sensible results for some cards:
speed is a function of data size.
I found the cache sizes on the TechPowerUp web site.
But for some cards the results are not consistent with the cache size:
I do not understand why.
Is the L1 cache size for the GTX 1050 49152 bytes and the L2 size 1048576 bytes?
Or maybe something is wrong in my understanding of cache effects on CUDA cards?
The L2 cache sizes for cards that are most important for compute work are generally documented in the architecture whitepapers. However, many lower-end GPUs don’t have that level of documentation available. Nevertheless, the GTX 1050 appears to have 1024 KB, i.e. 1 MB, of L2 cache.
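When no whitepaper is available, the runtime can also be asked directly. The sketch below prints the cache-related fields of `cudaDeviceProp` for every visible device; `toKiB` is a small helper I introduced for the printout. As far as I know the struct reports the L2 size and whether L1 caching of globals is supported, but not an L1 size as such.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Helper introduced for this example: bytes to KiB.
double toKiB(size_t bytes)
{
    return static_cast<double>(bytes) / 1024.0;
}

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
        std::printf("  L2 cache size:               %.0f KiB\n",
                    toKiB(static_cast<size_t>(prop.l2CacheSize)));
        std::printf("  shared memory per SM:        %.0f KiB\n",
                    toKiB(prop.sharedMemPerMultiprocessor));
        std::printf("  global L1 caching supported: %s\n",
                    prop.globalL1CacheSupported ? "yes" : "no");
        std::printf("  local L1 caching supported:  %s\n",
                    prop.localL1CacheSupported ? "yes" : "no");
    }
    return 0;
}
```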
For L1 cache, the L1 is generally part of the SM design, and so should be the same across designs of the same compute capability. GTX 1050 is a Pascal device, and so details of the L1 cache behavior are available in the Pascal tuning guide. The GTX 1050 will be similar in behavior to the references there to GP104. Note the mention there of when global loads are cached in L1.
Thanks
I made some changes to my programs and installed Ubuntu 20 or 24 to work on Linux. The results now look good for the RTX 3090 and the RTX Ada 500.
For the old GT 730 card the L2 cache shows up, but where is the L1 cache (16 KB or 48 KB)?
For the GTX 1050 on Windows something is wrong: no L1 cache shows up, the L2 cache size looks wrong, and I cannot find out why. Maybe Windows is using the card, so the L1 and L2 caches cannot be detected?
Why is there an improvement in processing speed between 2^5 and 2^10 bytes for all cards?
The relative overhead of running kernels is greatest for the smallest data size. Your formula is time / data. If time includes some small fixed contributions in addition to the data-dependent part, their share of the result grows as the data size shrinks.
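To make that concrete, here is a toy model of time / data with a fixed overhead term. The two constants are made-up illustrative values, not measurements from any card, and `timePerByte` is a name I chose for the example. It is plain host code, so it builds with nvcc or any C++ compiler.

```cuda
#include <cstdio>

// ASSUMED values for illustration only, not measured on any GPU.
const double kOverheadSec = 5.0e-6;   // fixed cost per kernel run
const double kBytesPerSec = 100.0e9;  // sustained data rate

// time / data when time = fixed overhead + bytes / rate.
double timePerByte(double bytes)
{
    return (kOverheadSec + bytes / kBytesPerSec) / bytes;
}

int main()
{
    for (int log2Bytes = 5; log2Bytes <= 10; ++log2Bytes) {
        double bytes = static_cast<double>(1 << log2Bytes);
        std::printf("2^%d bytes: %.3e s/byte\n", log2Bytes, timePerByte(bytes));
    }
    return 0;
}
```

With these numbers the overhead term dominates across the whole 2^5 to 2^10 range, so time per byte roughly halves each time the size doubles, whatever the caches are doing.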
As indicated in a link I already provided, some GPUs including the GP10x Pascal series do not cache global loads in L1, at least by default.
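As I read the Pascal tuning guide, there are two ways to opt in on those parts: route the loads through the read-only (LDG) path, or compile with `-Xptxas -dlcm=ca`. Below is a sketch of the first option applied to the kernel from the question. The name `addTestLoopLdg`, the check in `runLdgCheck`, and the `<<<1, 1>>>` launch are my assumptions. `__ldg` needs compute capability 3.5 or higher, so it will not build for older GT 730 variants. Whether an L1-sized step then appears in the timing curve is something to verify by measurement.

```cuda
#include <cstdio>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Variant of the posted kernel: inputs are read through the read-only data path.
__global__ void addTestLoopLdg(const double* __restrict__ tabA,
                               const double* __restrict__ tabB,
                               double* __restrict__ tabC,
                               int nbElt, int nbTest)
{
    for (long long int idxTest = 0; idxTest < nbTest; idxTest++)
        for (int i = 0; i < nbElt; i++)
            tabC[i] = __ldg(&tabA[i]) + __ldg(&tabB[i]);
}

// Hypothetical check: runs the kernel on 1.0 + 2.0 and verifies every result.
bool runLdgCheck(int nbElt)
{
    std::vector<double> hostA(nbElt, 1.0), hostB(nbElt, 2.0), hostC(nbElt, 0.0);
    size_t bytes = static_cast<size_t>(nbElt) * sizeof(double);

    double *tabA = nullptr, *tabB = nullptr, *tabC = nullptr;
    if (cudaMalloc(&tabA, bytes) != cudaSuccess ||
        cudaMalloc(&tabB, bytes) != cudaSuccess ||
        cudaMalloc(&tabC, bytes) != cudaSuccess) {
        cudaFree(tabA); cudaFree(tabB); cudaFree(tabC);
        return false;
    }
    cudaMemcpy(tabA, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(tabB, hostB.data(), bytes, cudaMemcpyHostToDevice);

    addTestLoopLdg<<<1, 1>>>(tabA, tabB, tabC, nbElt, 4);

    // This copy waits for the kernel and also reports a failed launch.
    bool ok = cudaMemcpy(hostC.data(), tabC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(tabA); cudaFree(tabB); cudaFree(tabC);

    for (int i = 0; ok && i < nbElt; ++i)
        ok = (hostC[i] == 3.0);
    return ok;
}

int main()
{
    std::printf("__ldg variant %s\n", runLdgCheck(1024) ? "ran correctly" : "failed");
    return 0;
}
```

For a GTX 1050 this would be built with something like `nvcc -arch=sm_61 file.cu`. The alternative is to leave the original kernel unchanged and add `-Xptxas -dlcm=ca` to the nvcc command line.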