I have read about share memory bank conflict from Professional Cuda C programming book and one of the example is shown using Square Matrix.
This is the code that I have follow from the book https://drive.google.com/open?id=1YV-v_lPdASzeAPD5oO0Y5rlpqrlUSICj
However the result I have got is almost completely different from the result from the book here is the result
Showing conflict
D:\Programing\CudaTest\x64\Debug>nvprof --metrics shared_load_transactions_per_request --metrics shared_store_transactions_per_request CudaTest
==15836== NVPROF is profiling process 15836, command: CudaTest
CudaTest at device 0: GeForce GTX 1080 with Bank Mode:4-Byte <<< grid (1,1) block (32,32)>>>
==15836== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==15836== Replaying kernel "setColReadCol(int*)" (done)
==15836== Replaying kernel "setRowReadRow(int*)" (done)
==15836== Replaying kernel "setColReadRow(int*)" (done)
Replaying kernel "setRowReadColPad(int*)" (2 of 2)... )
2 internal eventsl "setRowReadColDyn(int*)" (done)
==15836== Replaying kernel "setRowReadColPad(int*)" (done)
==15836== Replaying kernel "setRowReadColDynPad(int*)" (done)
==15836== Profiling application: CudaTest
==15836== Profiling result:
==15836== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 1080 (0)"
Kernel: setColReadRow(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.000000 1.000000 1.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 32.000000 32.000000 32.000000
Kernel: setRowReadColDynPad(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.000000 1.000000 1.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.000000 1.000000
Kernel: setRowReadCol(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 32.000000 32.000000 32.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.000000 1.000000
Kernel: setRowReadColDyn(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 32.000000 32.000000 32.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.000000 1.000000
Kernel: setRowReadRow(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.000000 1.000000 1.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.000000 1.000000
Kernel: setColReadCol(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 32.000000 32.000000 32.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 32.000000 32.000000 32.000000
Kernel: setRowReadColPad(int*)
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.000000 1.000000 1.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.000000 1.000000
Showing execution time
D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==14580== NVPROF is profiling process 14580, command: CudaTest
CudaTest at device 0: GeForce GTX 1080 with Bank Mode:4-Byte <<< grid (1,1) block (32,32)>>>
==14580== Profiling application: CudaTest
==14580== Warning: Found 58 invalid records in the result.
==14580== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==14580== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 27.76% 13.504us 1 13.504us 13.504us 13.504us setRowReadColPad(int*)
11.58% 5.6320us 7 804ns 672ns 1.6000us [CUDA memset]
10.66% 5.1840us 1 5.1840us 5.1840us 5.1840us setColReadCol(int*)
9.80% 4.7680us 1 4.7680us 4.7680us 4.7680us setRowReadColDynPad(int*)
8.62% 4.1920us 1 4.1920us 4.1920us 4.1920us setRowReadColDyn(int*)
8.29% 4.0320us 1 4.0320us 4.0320us 4.0320us setColReadRow(int*)
8.09% 3.9360us 1 3.9360us 3.9360us 3.9360us setRowReadCol(int*)
8.03% 3.9040us 7 557ns 480ns 736ns [CUDA memcpy DtoH]
7.17% 3.4880us 1 3.4880us 3.4880us 3.4880us setRowReadRow(int*)
API calls: 78.34% 126.92ms 1 126.92ms 126.92ms 126.92ms cudaDeviceGetSharedMemConfig
20.54% 33.279ms 1 33.279ms 33.279ms 33.279ms cudaDeviceReset
0.26% 425.64us 7 60.806us 45.477us 111.39us cudaMemcpy
0.21% 345.17us 1 345.17us 345.17us 345.17us cudaMalloc
0.21% 341.84us 38 8.9950us 255ns 161.21us cuDeviceGetAttribute
0.20% 319.87us 1 319.87us 319.87us 319.87us cudaGetDeviceProperties
0.06% 91.720us 7 13.102us 7.1540us 43.689us cudaLaunch
0.06% 90.187us 1 90.187us 90.187us 90.187us cudaFree
0.05% 87.377us 1 87.377us 87.377us 87.377us cuDeviceGetName
0.03% 46.755us 1 46.755us 46.755us 46.755us cudaSetDevice
0.02% 38.325us 7 5.4750us 2.8110us 18.907us cudaMemset
0.00% 5.6200us 1 5.6200us 5.6200us 5.6200us cuDeviceTotalMem
0.00% 5.1100us 6 851ns 255ns 3.8320us cudaConfigureCall
0.00% 2.3010us 3 767ns 256ns 1.7890us cuDeviceGetCount
0.00% 1.7880us 6 298ns 255ns 511ns cudaSetupArgument
0.00% 1.0210us 2 510ns 255ns 766ns cuDeviceGet
As you can see setRowReadRow is the fastest one that is correct since no conflict with this type of access but setRowReadCol and setColReadRow are the next one instead of setRowReadColPad or setRowReadColDynPad which have no memory conflict I don’t know why this happen or because I use different GPU than in the book or because the version of Cuda is different and there is something changing about memory handling or because of my GPU is buggy now and measuring time is impossible because of that warning.
Thanks for you help.