Coalescing global memory and avoiding shared bank conflicts Do I need to use this complex of indexin

I’ve been reading up on the patterns for coalescing global memory accesses and avoiding shared memory bank conflicts.

I’m working with an array of 19 floating point values for each of 2048 data points = 38912 floating point values. My current plan is to have 1 thread be responsible for 1 data point (19 attributes). This way I can fire off 2048 threads (64 threads, 32 blocks) to run in parallel. This means that the 64 threads in block 0 will need to access the first 64*19=1216 values, block 1 will need to access values 1216-2431, etc…

From what I gather, it’s safe (no global/shared penalties) to have each thread in a warp access sequential indices of the array.

I currently have the global->shared copy and shared access patterns working in this manner, but the index computations are more complex that I would like (modulo, division, multiplication multiple times).

I was hoping someone may be able to help me figure out a better way to handle this memory.

I’m currently running on a GTX 285 with 64 threads per block and 32 blocks but may be working on older devices, hence I would like to adhere to the compute capability 1.1/1.2 coalescing rules…

Copy global to shared:

[codebox]

    int sharedStartIndex = sharedIndex / numAttributes + THREADSPERBLOCK * (sharedIndex % numAttributes);

int sharedIncrement = THREADSPERBLOCK / numAttributes + THREADSPERBLOCK % numAttributes * THREADSPERBLOCK;

int numAttrPerBlock = numAttributes*THREADSPERBLOCK;

int globalStartIndex = blockIdx.x * numAttributes*THREADSPERBLOCK + sharedIndex;

for( i = 0;	i < numAttributes; i+= 1 )

{

	s_testAttr[sharedStartIndex] = d_testAttr[globalStartIndex];

	printf("t[%d]: storing global_attr[%d] into shared_attr[%d]\n", globalIndex, globalStartIndex, sharedStartIndex);

	sharedStartIndex = (sharedStartIndex + sharedIncrement);

	

	if( sharedStartIndex > THREADSPERBLOCK * numAttributes - 1 )

	{

		sharedStartIndex = sharedStartIndex % (numAttrPerBlock - 1);

	}

	globalStartIndex += THREADSPERBLOCK;

}

[/codebox]

Accessing shared:

[codebox]

const unsigned int sharedIndex = threadIdx.x;

const unsigned int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;

int i;

printf("t[%d] accessing shared_attr[", globalIndex);



for( i = 0; i < numAttributes; i++ )

{

	printf("%d, ", sharedIndex+i*THREADSPERBLOCK );

}

printf("]\n");

[/codebox]

This results in what I believe is the proper access patterns for reading global memory as seen in the output when run in emulator debug:

[codebox]t[0]: storing global_attr[0] into shared_attr[0]

t[1]: storing global_attr[1] into shared_attr[64]

t[2]: storing global_attr[2] into shared_attr[128]

t[3]: storing global_attr[3] into shared_attr[192]

t[4]: storing global_attr[4] into shared_attr[256]

t[5]: storing global_attr[5] into shared_attr[320]

t[6]: storing global_attr[6] into shared_attr[384]

t[7]: storing global_attr[7] into shared_attr[448]

t[8]: storing global_attr[8] into shared_attr[512]

t[9]: storing global_attr[9] into shared_attr[576]

t[10]: storing global_attr[10] into shared_attr[640]

t[11]: storing global_attr[11] into shared_attr[704]

t[12]: storing global_attr[12] into shared_attr[768]

t[13]: storing global_attr[13] into shared_attr[832]

t[14]: storing global_attr[14] into shared_attr[896]

t[15]: storing global_attr[15] into shared_attr[960]

t[16]: storing global_attr[16] into shared_attr[1024]

t[17]: storing global_attr[17] into shared_attr[1088]

t[18]: storing global_attr[18] into shared_attr[1152]

t[19]: storing global_attr[19] into shared_attr[1]

t[20]: storing global_attr[20] into shared_attr[65]

t[21]: storing global_attr[21] into shared_attr[129]

t[22]: storing global_attr[22] into shared_attr[193]

t[23]: storing global_attr[23] into shared_attr[257]

t[24]: storing global_attr[24] into shared_attr[321]

t[25]: storing global_attr[25] into shared_attr[385]

t[26]: storing global_attr[26] into shared_attr[449]

t[27]: storing global_attr[27] into shared_attr[513]

t[28]: storing global_attr[28] into shared_attr[577]

t[29]: storing global_attr[29] into shared_attr[641]

t[30]: storing global_attr[30] into shared_attr[705]

t[31]: storing global_attr[31] into shared_attr[769]

t[32]: storing global_attr[32] into shared_attr[833]

t[33]: storing global_attr[33] into shared_attr[897]

t[34]: storing global_attr[34] into shared_attr[961]

t[35]: storing global_attr[35] into shared_attr[1025]

t[36]: storing global_attr[36] into shared_attr[1089]

t[37]: storing global_attr[37] into shared_attr[1153]

t[38]: storing global_attr[38] into shared_attr[2]

t[39]: storing global_attr[39] into shared_attr[66]

t[40]: storing global_attr[40] into shared_attr[130]

t[41]: storing global_attr[41] into shared_attr[194]

t[42]: storing global_attr[42] into shared_attr[258]

t[43]: storing global_attr[43] into shared_attr[322]

t[44]: storing global_attr[44] into shared_attr[386]

t[45]: storing global_attr[45] into shared_attr[450]

t[46]: storing global_attr[46] into shared_attr[514]

t[47]: storing global_attr[47] into shared_attr[578]

t[48]: storing global_attr[48] into shared_attr[642]

t[49]: storing global_attr[49] into shared_attr[706]

t[50]: storing global_attr[50] into shared_attr[770]

t[51]: storing global_attr[51] into shared_attr[834]

t[52]: storing global_attr[52] into shared_attr[898]

t[53]: storing global_attr[53] into shared_attr[962]

t[54]: storing global_attr[54] into shared_attr[1026]

t[55]: storing global_attr[55] into shared_attr[1090]

t[56]: storing global_attr[56] into shared_attr[1154]

t[57]: storing global_attr[57] into shared_attr[3]

t[58]: storing global_attr[58] into shared_attr[67]

t[59]: storing global_attr[59] into shared_attr[131]

t[60]: storing global_attr[60] into shared_attr[195]

t[61]: storing global_attr[61] into shared_attr[259]

t[62]: storing global_attr[62] into shared_attr[323]

t[63]: storing global_attr[63] into shared_attr[387]

t[0]: storing global_attr[64] into shared_attr[451]

t[1]: storing global_attr[65] into shared_attr[515]

t[2]: storing global_attr[66] into shared_attr[579]

t[3]: storing global_attr[67] into shared_attr[643]

t[4]: storing global_attr[68] into shared_attr[707]

t[5]: storing global_attr[69] into shared_attr[771]

t[6]: storing global_attr[70] into shared_attr[835]

t[7]: storing global_attr[71] into shared_attr[899]

t[8]: storing global_attr[72] into shared_attr[963]

t[9]: storing global_attr[73] into shared_attr[1027]

t[10]: storing global_attr[74] into shared_attr[1091]

t[11]: storing global_attr[75] into shared_attr[1155]

t[12]: storing global_attr[76] into shared_attr[4]

t[13]: storing global_attr[77] into shared_attr[68]

t[14]: storing global_attr[78] into shared_attr[132]

t[15]: storing global_attr[79] into shared_attr[196]

t[16]: storing global_attr[80] into shared_attr[260]

t[17]: storing global_attr[81] into shared_attr[324]

t[18]: storing global_attr[82] into shared_attr[388]

t[19]: storing global_attr[83] into shared_attr[452]

t[20]: storing global_attr[84] into shared_attr[516]

t[21]: storing global_attr[85] into shared_attr[580]

t[22]: storing global_attr[86] into shared_attr[644]

t[23]: storing global_attr[87] into shared_attr[708]

t[24]: storing global_attr[88] into shared_attr[772]

t[25]: storing global_attr[89] into shared_attr[836]

t[26]: storing global_attr[90] into shared_attr[900]

t[27]: storing global_attr[91] into shared_attr[964]

t[28]: storing global_attr[92] into shared_attr[1028]

t[29]: storing global_attr[93] into shared_attr[1092]

t[30]: storing global_attr[94] into shared_attr[1156]

t[31]: storing global_attr[95] into shared_attr[5]

t[32]: storing global_attr[96] into shared_attr[69]

t[33]: storing global_attr[97] into shared_attr[133]

t[34]: storing global_attr[98] into shared_attr[197]

t[35]: storing global_attr[99] into shared_attr[261]

t[36]: storing global_attr[100] into shared_attr[325]

t[37]: storing global_attr[101] into shared_attr[389]

t[38]: storing global_attr[102] into shared_attr[453]

t[39]: storing global_attr[103] into shared_attr[517]

t[40]: storing global_attr[104] into shared_attr[581]

t[41]: storing global_attr[105] into shared_attr[645]

t[42]: storing global_attr[106] into shared_attr[709]

t[43]: storing global_attr[107] into shared_attr[773]

t[44]: storing global_attr[108] into shared_attr[837]

t[45]: storing global_attr[109] into shared_attr[901]

t[46]: storing global_attr[110] into shared_attr[965]

t[47]: storing global_attr[111] into shared_attr[1029]

t[48]: storing global_attr[112] into shared_attr[1093]

t[49]: storing global_attr[113] into shared_attr[1157]

t[50]: storing global_attr[114] into shared_attr[6]

t[51]: storing global_attr[115] into shared_attr[70]

t[52]: storing global_attr[116] into shared_attr[134]

t[53]: storing global_attr[117] into shared_attr[198]

t[54]: storing global_attr[118] into shared_attr[262]

t[55]: storing global_attr[119] into shared_attr[326]

t[56]: storing global_attr[120] into shared_attr[390]

t[57]: storing global_attr[121] into shared_attr[454]

t[58]: storing global_attr[122] into shared_attr[518]

t[59]: storing global_attr[123] into shared_attr[582]

t[60]: storing global_attr[124] into shared_attr[646]

t[61]: storing global_attr[125] into shared_attr[710]

t[62]: storing global_attr[126] into shared_attr[774]

t[63]: storing global_attr[127] into shared_attr[838]

[/codebox]

It also appears to avoid shared memory bank conflicts because each thread is accessing a float with a 1 byte stride…

[codebox]

t[0] accessing shared_attr[0, 64, 128, 192, 256, 320, 384, 448, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152]

t[1] accessing shared_attr[1, 65, 129, 193, 257, 321, 385, 449, 513, 577, 641, 705, 769, 833, 897, 961, 1025, 1089, 1153]

t[2] accessing shared_attr[2, 66, 130, 194, 258, 322, 386, 450, 514, 578, 642, 706, 770, 834, 898, 962, 1026, 1090, 1154]

t[3] accessing shared_attr[3, 67, 131, 195, 259, 323, 387, 451, 515, 579, 643, 707, 771, 835, 899, 963, 1027, 1091, 1155]

t[4] accessing shared_attr[4, 68, 132, 196, 260, 324, 388, 452, 516, 580, 644, 708, 772, 836, 900, 964, 1028, 1092, 1156]

t[5] accessing shared_attr[5, 69, 133, 197, 261, 325, 389, 453, 517, 581, 645, 709, 773, 837, 901, 965, 1029, 1093, 1157]

t[6] accessing shared_attr[6, 70, 134, 198, 262, 326, 390, 454, 518, 582, 646, 710, 774, 838, 902, 966, 1030, 1094, 1158]

t[7] accessing shared_attr[7, 71, 135, 199, 263, 327, 391, 455, 519, 583, 647, 711, 775, 839, 903, 967, 1031, 1095, 1159]

t[8] accessing shared_attr[8, 72, 136, 200, 264, 328, 392, 456, 520, 584, 648, 712, 776, 840, 904, 968, 1032, 1096, 1160]

t[9] accessing shared_attr[9, 73, 137, 201, 265, 329, 393, 457, 521, 585, 649, 713, 777, 841, 905, 969, 1033, 1097, 1161]

t[10] accessing shared_attr[10, 74, 138, 202, 266, 330, 394, 458, 522, 586, 650, 714, 778, 842, 906, 970, 1034, 1098, 1162]

t[11] accessing shared_attr[11, 75, 139, 203, 267, 331, 395, 459, 523, 587, 651, 715, 779, 843, 907, 971, 1035, 1099, 1163]

t[12] accessing shared_attr[12, 76, 140, 204, 268, 332, 396, 460, 524, 588, 652, 716, 780, 844, 908, 972, 1036, 1100, 1164]

t[13] accessing shared_attr[13, 77, 141, 205, 269, 333, 397, 461, 525, 589, 653, 717, 781, 845, 909, 973, 1037, 1101, 1165]

t[14] accessing shared_attr[14, 78, 142, 206, 270, 334, 398, 462, 526, 590, 654, 718, 782, 846, 910, 974, 1038, 1102, 1166]

t[15] accessing shared_attr[15, 79, 143, 207, 271, 335, 399, 463, 527, 591, 655, 719, 783, 847, 911, 975, 1039, 1103, 1167]

t[16] accessing shared_attr[16, 80, 144, 208, 272, 336, 400, 464, 528, 592, 656, 720, 784, 848, 912, 976, 1040, 1104, 1168]

t[17] accessing shared_attr[17, 81, 145, 209, 273, 337, 401, 465, 529, 593, 657, 721, 785, 849, 913, 977, 1041, 1105, 1169]

t[18] accessing shared_attr[18, 82, 146, 210, 274, 338, 402, 466, 530, 594, 658, 722, 786, 850, 914, 978, 1042, 1106, 1170]

t[19] accessing shared_attr[19, 83, 147, 211, 275, 339, 403, 467, 531, 595, 659, 723, 787, 851, 915, 979, 1043, 1107, 1171]

t[20] accessing shared_attr[20, 84, 148, 212, 276, 340, 404, 468, 532, 596, 660, 724, 788, 852, 916, 980, 1044, 1108, 1172]

t[21] accessing shared_attr[21, 85, 149, 213, 277, 341, 405, 469, 533, 597, 661, 725, 789, 853, 917, 981, 1045, 1109, 1173]

t[22] accessing shared_attr[22, 86, 150, 214, 278, 342, 406, 470, 534, 598, 662, 726, 790, 854, 918, 982, 1046, 1110, 1174]

t[23] accessing shared_attr[23, 87, 151, 215, 279, 343, 407, 471, 535, 599, 663, 727, 791, 855, 919, 983, 1047, 1111, 1175]

t[24] accessing shared_attr[24, 88, 152, 216, 280, 344, 408, 472, 536, 600, 664, 728, 792, 856, 920, 984, 1048, 1112, 1176]

t[25] accessing shared_attr[25, 89, 153, 217, 281, 345, 409, 473, 537, 601, 665, 729, 793, 857, 921, 985, 1049, 1113, 1177]

t[26] accessing shared_attr[26, 90, 154, 218, 282, 346, 410, 474, 538, 602, 666, 730, 794, 858, 922, 986, 1050, 1114, 1178]

t[27] accessing shared_attr[27, 91, 155, 219, 283, 347, 411, 475, 539, 603, 667, 731, 795, 859, 923, 987, 1051, 1115, 1179]

t[28] accessing shared_attr[28, 92, 156, 220, 284, 348, 412, 476, 540, 604, 668, 732, 796, 860, 924, 988, 1052, 1116, 1180]

t[29] accessing shared_attr[29, 93, 157, 221, 285, 349, 413, 477, 541, 605, 669, 733, 797, 861, 925, 989, 1053, 1117, 1181]

t[30] accessing shared_attr[30, 94, 158, 222, 286, 350, 414, 478, 542, 606, 670, 734, 798, 862, 926, 990, 1054, 1118, 1182]

t[31] accessing shared_attr[31, 95, 159, 223, 287, 351, 415, 479, 543, 607, 671, 735, 799, 863, 927, 991, 1055, 1119, 1183]

t[32] accessing shared_attr[32, 96, 160, 224, 288, 352, 416, 480, 544, 608, 672, 736, 800, 864, 928, 992, 1056, 1120, 1184]

t[33] accessing shared_attr[33, 97, 161, 225, 289, 353, 417, 481, 545, 609, 673, 737, 801, 865, 929, 993, 1057, 1121, 1185]

t[34] accessing shared_attr[34, 98, 162, 226, 290, 354, 418, 482, 546, 610, 674, 738, 802, 866, 930, 994, 1058, 1122, 1186]

t[35] accessing shared_attr[35, 99, 163, 227, 291, 355, 419, 483, 547, 611, 675, 739, 803, 867, 931, 995, 1059, 1123, 1187]

t[36] accessing shared_attr[36, 100, 164, 228, 292, 356, 420, 484, 548, 612, 676, 740, 804, 868, 932, 996, 1060, 1124, 1188]

t[37] accessing shared_attr[37, 101, 165, 229, 293, 357, 421, 485, 549, 613, 677, 741, 805, 869, 933, 997, 1061, 1125, 1189]

t[38] accessing shared_attr[38, 102, 166, 230, 294, 358, 422, 486, 550, 614, 678, 742, 806, 870, 934, 998, 1062, 1126, 1190]

t[39] accessing shared_attr[39, 103, 167, 231, 295, 359, 423, 487, 551, 615, 679, 743, 807, 871, 935, 999, 1063, 1127, 1191]

t[40] accessing shared_attr[40, 104, 168, 232, 296, 360, 424, 488, 552, 616, 680, 744, 808, 872, 936, 1000, 1064, 1128, 1192]

t[41] accessing shared_attr[41, 105, 169, 233, 297, 361, 425, 489, 553, 617, 681, 745, 809, 873, 937, 1001, 1065, 1129, 1193]

t[42] accessing shared_attr[42, 106, 170, 234, 298, 362, 426, 490, 554, 618, 682, 746, 810, 874, 938, 1002, 1066, 1130, 1194]

t[43] accessing shared_attr[43, 107, 171, 235, 299, 363, 427, 491, 555, 619, 683, 747, 811, 875, 939, 1003, 1067, 1131, 1195]

t[44] accessing shared_attr[44, 108, 172, 236, 300, 364, 428, 492, 556, 620, 684, 748, 812, 876, 940, 1004, 1068, 1132, 1196]

t[45] accessing shared_attr[45, 109, 173, 237, 301, 365, 429, 493, 557, 621, 685, 749, 813, 877, 941, 1005, 1069, 1133, 1197]

t[46] accessing shared_attr[46, 110, 174, 238, 302, 366, 430, 494, 558, 622, 686, 750, 814, 878, 942, 1006, 1070, 1134, 1198]

t[47] accessing shared_attr[47, 111, 175, 239, 303, 367, 431, 495, 559, 623, 687, 751, 815, 879, 943, 1007, 1071, 1135, 1199]

t[48] accessing shared_attr[48, 112, 176, 240, 304, 368, 432, 496, 560, 624, 688, 752, 816, 880, 944, 1008, 1072, 1136, 1200]

t[49] accessing shared_attr[49, 113, 177, 241, 305, 369, 433, 497, 561, 625, 689, 753, 817, 881, 945, 1009, 1073, 1137, 1201]

t[50] accessing shared_attr[50, 114, 178, 242, 306, 370, 434, 498, 562, 626, 690, 754, 818, 882, 946, 1010, 1074, 1138, 1202]

t[51] accessing shared_attr[51, 115, 179, 243, 307, 371, 435, 499, 563, 627, 691, 755, 819, 883, 947, 1011, 1075, 1139, 1203]

t[52] accessing shared_attr[52, 116, 180, 244, 308, 372, 436, 500, 564, 628, 692, 756, 820, 884, 948, 1012, 1076, 1140, 1204]

t[53] accessing shared_attr[53, 117, 181, 245, 309, 373, 437, 501, 565, 629, 693, 757, 821, 885, 949, 1013, 1077, 1141, 1205]

t[54] accessing shared_attr[54, 118, 182, 246, 310, 374, 438, 502, 566, 630, 694, 758, 822, 886, 950, 1014, 1078, 1142, 1206]

t[55] accessing shared_attr[55, 119, 183, 247, 311, 375, 439, 503, 567, 631, 695, 759, 823, 887, 951, 1015, 1079, 1143, 1207]

t[56] accessing shared_attr[56, 120, 184, 248, 312, 376, 440, 504, 568, 632, 696, 760, 824, 888, 952, 1016, 1080, 1144, 1208]

t[57] accessing shared_attr[57, 121, 185, 249, 313, 377, 441, 505, 569, 633, 697, 761, 825, 889, 953, 1017, 1081, 1145, 1209]

t[58] accessing shared_attr[58, 122, 186, 250, 314, 378, 442, 506, 570, 634, 698, 762, 826, 890, 954, 1018, 1082, 1146, 1210]

t[59] accessing shared_attr[59, 123, 187, 251, 315, 379, 443, 507, 571, 635, 699, 763, 827, 891, 955, 1019, 1083, 1147, 1211]

t[60] accessing shared_attr[60, 124, 188, 252, 316, 380, 444, 508, 572, 636, 700, 764, 828, 892, 956, 1020, 1084, 1148, 1212]

t[61] accessing shared_attr[61, 125, 189, 253, 317, 381, 445, 509, 573, 637, 701, 765, 829, 893, 957, 1021, 1085, 1149, 1213]

t[62] accessing shared_attr[62, 126, 190, 254, 318, 382, 446, 510, 574, 638, 702, 766, 830, 894, 958, 1022, 1086, 1150, 1214]

t[63] accessing shared_attr[63, 127, 191, 255, 319, 383, 447, 511, 575, 639, 703, 767, 831, 895, 959, 1023, 1087, 1151, 1215]

[/codebox]

The visual profiler reports:

  • 203 warp serialize

  • 380 branches (76 of which are divergent)

  • 304 gld coalesced

– On the 285, I don’t have the option of profiling the gld/gst uncoalesced (what is the reason for this…other than it is compute capability 1.3?)

Thank you for the assistance!

One thing you may consider is storing the values transposed, in other words attribute 1 for all 2048 datapoints, then attribute 2 for all 2048 datapoints, and so forth. You might say datapoint-major order instead of attribute-major order. Then as long as the number of datapoints is a multiple of 16, the loads from global memory already have thread id coaligned with datapoint % 16.

For shared memory, consider a 2-D array, such as
shared float s_testAttr[THREADSPERBLOCK][NUMATTRS];

This will be free of bank conflicts as long as NUMATTRS is odd. In your case this is already true, but for example in the transpose example in the SDK, they add +1 to prevent bank conflicts. When this is the case, s_testAttr[0][0], s_testAttr[1][0], s_testAttr[2][0], … s_testAttr[15][0] will all lie in different banks.

Thanks for the advice Jamie, I really appreciate the assistance.

I’m currently passing the amount of shared memory into the kernel and creating pointers from that…

[codebox]

result* s_results = (result*)sharedMem;

unsigned int* s_classCount = (unsigned int*)&s_results[THREADSPERBLOCK*KNN];

[/codebox]

Can I make a 2-D array using this same method as seen with s_testAttr below?

[codebox]

result* s_results = (result*)sharedMem;

float* s_testAttr = (float*)&s_results[THREADSPERBLOCK*KNN];

unsigned int* s_classCount = (unsigned int*)&s_testAttr[THREADSPERBLOCK][NUMATTRS];

[/codebox]

Or would I need to treat it as a flat array and just index into it using threadID*NUMATTRS as the start of the attributes for each data point?

Thank you.