Share Memory Bank Conflict - No conflict is slower than all conflict?

I have read about share memory bank conflict from Professional Cuda C programming book and one of the example is shown using Square Matrix.

This is the code that I have follow from the book https://drive.google.com/open?id=1YV-v_lPdASzeAPD5oO0Y5rlpqrlUSICj

However the result I have got is almost completely different from the result from the book here is the result

Showing conflict

D:\Programing\CudaTest\x64\Debug>nvprof --metrics shared_load_transactions_per_request --metrics shared_store_transactions_per_request CudaTest
==15836== NVPROF is profiling process 15836, command: CudaTest
CudaTest at device 0: GeForce GTX 1080 with Bank Mode:4-Byte <<< grid (1,1) block (32,32)>>>
==15836== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==15836== Replaying kernel "setColReadCol(int*)" (done)
==15836== Replaying kernel "setRowReadRow(int*)" (done)
==15836== Replaying kernel "setColReadRow(int*)" (done)
Replaying kernel "setRowReadColPad(int*)" (2 of 2)... )
        2 internal eventsl "setRowReadColDyn(int*)" (done)
==15836== Replaying kernel "setRowReadColPad(int*)" (done)
==15836== Replaying kernel "setRowReadColDynPad(int*)" (done)
==15836== Profiling application: CudaTest
==15836== Profiling result:
==15836== Metric result:
Invocations                               Metric Name                             Metric Description         Min         Max         Avg
Device "GeForce GTX 1080 (0)"
    Kernel: setColReadRow(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request   32.000000   32.000000   32.000000
    Kernel: setRowReadColDynPad(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
    Kernel: setRowReadCol(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request   32.000000   32.000000   32.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
    Kernel: setRowReadColDyn(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request   32.000000   32.000000   32.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
    Kernel: setRowReadRow(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000
    Kernel: setColReadCol(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request   32.000000   32.000000   32.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request   32.000000   32.000000   32.000000
    Kernel: setRowReadColPad(int*)
          1      shared_load_transactions_per_request    Shared Memory Load Transactions Per Request    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request   Shared Memory Store Transactions Per Request    1.000000    1.000000    1.000000

Showing execution time

D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==14580== NVPROF is profiling process 14580, command: CudaTest
CudaTest at device 0: GeForce GTX 1080 with Bank Mode:4-Byte <<< grid (1,1) block (32,32)>>>
==14580== Profiling application: CudaTest
==14580== Warning: Found 58 invalid records in the result.
==14580== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==14580== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   27.76%  13.504us         1  13.504us  13.504us  13.504us  setRowReadColPad(int*)
                   11.58%  5.6320us         7     804ns     672ns  1.6000us  [CUDA memset]
                   10.66%  5.1840us         1  5.1840us  5.1840us  5.1840us  setColReadCol(int*)
                    9.80%  4.7680us         1  4.7680us  4.7680us  4.7680us  setRowReadColDynPad(int*)
                    8.62%  4.1920us         1  4.1920us  4.1920us  4.1920us  setRowReadColDyn(int*)
                    8.29%  4.0320us         1  4.0320us  4.0320us  4.0320us  setColReadRow(int*)
                    8.09%  3.9360us         1  3.9360us  3.9360us  3.9360us  setRowReadCol(int*)
                    8.03%  3.9040us         7     557ns     480ns     736ns  [CUDA memcpy DtoH]
                    7.17%  3.4880us         1  3.4880us  3.4880us  3.4880us  setRowReadRow(int*)
      API calls:   78.34%  126.92ms         1  126.92ms  126.92ms  126.92ms  cudaDeviceGetSharedMemConfig
                   20.54%  33.279ms         1  33.279ms  33.279ms  33.279ms  cudaDeviceReset
                    0.26%  425.64us         7  60.806us  45.477us  111.39us  cudaMemcpy
                    0.21%  345.17us         1  345.17us  345.17us  345.17us  cudaMalloc
                    0.21%  341.84us        38  8.9950us     255ns  161.21us  cuDeviceGetAttribute
                    0.20%  319.87us         1  319.87us  319.87us  319.87us  cudaGetDeviceProperties
                    0.06%  91.720us         7  13.102us  7.1540us  43.689us  cudaLaunch
                    0.06%  90.187us         1  90.187us  90.187us  90.187us  cudaFree
                    0.05%  87.377us         1  87.377us  87.377us  87.377us  cuDeviceGetName
                    0.03%  46.755us         1  46.755us  46.755us  46.755us  cudaSetDevice
                    0.02%  38.325us         7  5.4750us  2.8110us  18.907us  cudaMemset
                    0.00%  5.6200us         1  5.6200us  5.6200us  5.6200us  cuDeviceTotalMem
                    0.00%  5.1100us         6     851ns     255ns  3.8320us  cudaConfigureCall
                    0.00%  2.3010us         3     767ns     256ns  1.7890us  cuDeviceGetCount
                    0.00%  1.7880us         6     298ns     255ns     511ns  cudaSetupArgument
                    0.00%  1.0210us         2     510ns     255ns     766ns  cuDeviceGet

As you can see setRowReadRow is the fastest one that is correct since no conflict with this type of access but setRowReadCol and setColReadRow are the next one instead of setRowReadColPad or setRowReadColDynPad which have no memory conflict I don’t know why this happen or because I use different GPU than in the book or because the version of Cuda is different and there is something changing about memory handling or because of my GPU is buggy now and measuring time is impossible because of that warning.

Thanks for you help.

I have the same problem.

1070ti W10 pro. SDK10.

I’m still investigating on nvvp and it sees that

on [32][32] —>shared memory bank conflict
on [32][33]—>>L2/cache memory. free bank conflicts shared memory.

obviously, Reading from L2 is slower than shared memory.

But… I dont Know why the driver/compiler do that.

I’ve just checked on a GT525M (W7 SDK8)and the result are ok (as expected):

248.39ms [32][32] bank conflicts (slower than padding)…
140.25ms [32][33] free bank conflicts.(Ok)

I dont know what happend with my GTX1070 Ti… A mystery…