I’ve been reading up on the patterns for coalescing global memory accesses and avoiding shared memory bank conflicts.
I’m working with an array of 19 floating point values for each of 2048 data points = 38912 floating point values. My current plan is to have 1 thread be responsible for 1 data point (19 attributes). This way I can fire off 2048 threads (64 threads, 32 blocks) to run in parallel. This means that the 64 threads in block 0 will need to access the first 64*19=1216 values, block 1 will need to access values 1216-2431, etc…
From what I gather, it’s safe (no global/shared penalties) to have each thread in a warp access sequential indices of the array.
I currently have the global->shared copy and shared access patterns working in this manner, but the index computations are more complex that I would like (modulo, division, multiplication multiple times).
I was hoping someone may be able to help me figure out a better way to handle this memory.
I’m currently running on a GTX 285 with 64 threads per block and 32 blocks but may be working on older devices, hence I would like to adhere to the compute capability 1.1/1.2 coalescing rules…
Copy global to shared:
[codebox]
int sharedStartIndex = sharedIndex / numAttributes + THREADSPERBLOCK * (sharedIndex % numAttributes);
int sharedIncrement = THREADSPERBLOCK / numAttributes + THREADSPERBLOCK % numAttributes * THREADSPERBLOCK;
int numAttrPerBlock = numAttributes*THREADSPERBLOCK;
int globalStartIndex = blockIdx.x * numAttributes*THREADSPERBLOCK + sharedIndex;
for( i = 0; i < numAttributes; i+= 1 )
{
s_testAttr[sharedStartIndex] = d_testAttr[globalStartIndex];
printf("t[%d]: storing global_attr[%d] into shared_attr[%d]\n", globalIndex, globalStartIndex, sharedStartIndex);
sharedStartIndex = (sharedStartIndex + sharedIncrement);
if( sharedStartIndex > THREADSPERBLOCK * numAttributes - 1 )
{
sharedStartIndex = sharedStartIndex % (numAttrPerBlock - 1);
}
globalStartIndex += THREADSPERBLOCK;
}
[/codebox]
Accessing shared:
[codebox]
const unsigned int sharedIndex = threadIdx.x;
const unsigned int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;
int i;
printf("t[%d] accessing shared_attr[", globalIndex);
for( i = 0; i < numAttributes; i++ )
{
printf("%d, ", sharedIndex+i*THREADSPERBLOCK );
}
printf("]\n");
[/codebox]
This results in what I believe is the proper access patterns for reading global memory as seen in the output when run in emulator debug:
[codebox]t[0]: storing global_attr[0] into shared_attr[0]
t[1]: storing global_attr[1] into shared_attr[64]
t[2]: storing global_attr[2] into shared_attr[128]
t[3]: storing global_attr[3] into shared_attr[192]
t[4]: storing global_attr[4] into shared_attr[256]
t[5]: storing global_attr[5] into shared_attr[320]
t[6]: storing global_attr[6] into shared_attr[384]
t[7]: storing global_attr[7] into shared_attr[448]
t[8]: storing global_attr[8] into shared_attr[512]
t[9]: storing global_attr[9] into shared_attr[576]
t[10]: storing global_attr[10] into shared_attr[640]
t[11]: storing global_attr[11] into shared_attr[704]
t[12]: storing global_attr[12] into shared_attr[768]
t[13]: storing global_attr[13] into shared_attr[832]
t[14]: storing global_attr[14] into shared_attr[896]
t[15]: storing global_attr[15] into shared_attr[960]
t[16]: storing global_attr[16] into shared_attr[1024]
t[17]: storing global_attr[17] into shared_attr[1088]
t[18]: storing global_attr[18] into shared_attr[1152]
t[19]: storing global_attr[19] into shared_attr[1]
t[20]: storing global_attr[20] into shared_attr[65]
t[21]: storing global_attr[21] into shared_attr[129]
t[22]: storing global_attr[22] into shared_attr[193]
t[23]: storing global_attr[23] into shared_attr[257]
t[24]: storing global_attr[24] into shared_attr[321]
t[25]: storing global_attr[25] into shared_attr[385]
t[26]: storing global_attr[26] into shared_attr[449]
t[27]: storing global_attr[27] into shared_attr[513]
t[28]: storing global_attr[28] into shared_attr[577]
t[29]: storing global_attr[29] into shared_attr[641]
t[30]: storing global_attr[30] into shared_attr[705]
t[31]: storing global_attr[31] into shared_attr[769]
t[32]: storing global_attr[32] into shared_attr[833]
t[33]: storing global_attr[33] into shared_attr[897]
t[34]: storing global_attr[34] into shared_attr[961]
t[35]: storing global_attr[35] into shared_attr[1025]
t[36]: storing global_attr[36] into shared_attr[1089]
t[37]: storing global_attr[37] into shared_attr[1153]
t[38]: storing global_attr[38] into shared_attr[2]
t[39]: storing global_attr[39] into shared_attr[66]
t[40]: storing global_attr[40] into shared_attr[130]
t[41]: storing global_attr[41] into shared_attr[194]
t[42]: storing global_attr[42] into shared_attr[258]
t[43]: storing global_attr[43] into shared_attr[322]
t[44]: storing global_attr[44] into shared_attr[386]
t[45]: storing global_attr[45] into shared_attr[450]
t[46]: storing global_attr[46] into shared_attr[514]
t[47]: storing global_attr[47] into shared_attr[578]
t[48]: storing global_attr[48] into shared_attr[642]
t[49]: storing global_attr[49] into shared_attr[706]
t[50]: storing global_attr[50] into shared_attr[770]
t[51]: storing global_attr[51] into shared_attr[834]
t[52]: storing global_attr[52] into shared_attr[898]
t[53]: storing global_attr[53] into shared_attr[962]
t[54]: storing global_attr[54] into shared_attr[1026]
t[55]: storing global_attr[55] into shared_attr[1090]
t[56]: storing global_attr[56] into shared_attr[1154]
t[57]: storing global_attr[57] into shared_attr[3]
t[58]: storing global_attr[58] into shared_attr[67]
t[59]: storing global_attr[59] into shared_attr[131]
t[60]: storing global_attr[60] into shared_attr[195]
t[61]: storing global_attr[61] into shared_attr[259]
t[62]: storing global_attr[62] into shared_attr[323]
t[63]: storing global_attr[63] into shared_attr[387]
t[0]: storing global_attr[64] into shared_attr[451]
t[1]: storing global_attr[65] into shared_attr[515]
t[2]: storing global_attr[66] into shared_attr[579]
t[3]: storing global_attr[67] into shared_attr[643]
t[4]: storing global_attr[68] into shared_attr[707]
t[5]: storing global_attr[69] into shared_attr[771]
t[6]: storing global_attr[70] into shared_attr[835]
t[7]: storing global_attr[71] into shared_attr[899]
t[8]: storing global_attr[72] into shared_attr[963]
t[9]: storing global_attr[73] into shared_attr[1027]
t[10]: storing global_attr[74] into shared_attr[1091]
t[11]: storing global_attr[75] into shared_attr[1155]
t[12]: storing global_attr[76] into shared_attr[4]
t[13]: storing global_attr[77] into shared_attr[68]
t[14]: storing global_attr[78] into shared_attr[132]
t[15]: storing global_attr[79] into shared_attr[196]
t[16]: storing global_attr[80] into shared_attr[260]
t[17]: storing global_attr[81] into shared_attr[324]
t[18]: storing global_attr[82] into shared_attr[388]
t[19]: storing global_attr[83] into shared_attr[452]
t[20]: storing global_attr[84] into shared_attr[516]
t[21]: storing global_attr[85] into shared_attr[580]
t[22]: storing global_attr[86] into shared_attr[644]
t[23]: storing global_attr[87] into shared_attr[708]
t[24]: storing global_attr[88] into shared_attr[772]
t[25]: storing global_attr[89] into shared_attr[836]
t[26]: storing global_attr[90] into shared_attr[900]
t[27]: storing global_attr[91] into shared_attr[964]
t[28]: storing global_attr[92] into shared_attr[1028]
t[29]: storing global_attr[93] into shared_attr[1092]
t[30]: storing global_attr[94] into shared_attr[1156]
t[31]: storing global_attr[95] into shared_attr[5]
t[32]: storing global_attr[96] into shared_attr[69]
t[33]: storing global_attr[97] into shared_attr[133]
t[34]: storing global_attr[98] into shared_attr[197]
t[35]: storing global_attr[99] into shared_attr[261]
t[36]: storing global_attr[100] into shared_attr[325]
t[37]: storing global_attr[101] into shared_attr[389]
t[38]: storing global_attr[102] into shared_attr[453]
t[39]: storing global_attr[103] into shared_attr[517]
t[40]: storing global_attr[104] into shared_attr[581]
t[41]: storing global_attr[105] into shared_attr[645]
t[42]: storing global_attr[106] into shared_attr[709]
t[43]: storing global_attr[107] into shared_attr[773]
t[44]: storing global_attr[108] into shared_attr[837]
t[45]: storing global_attr[109] into shared_attr[901]
t[46]: storing global_attr[110] into shared_attr[965]
t[47]: storing global_attr[111] into shared_attr[1029]
t[48]: storing global_attr[112] into shared_attr[1093]
t[49]: storing global_attr[113] into shared_attr[1157]
t[50]: storing global_attr[114] into shared_attr[6]
t[51]: storing global_attr[115] into shared_attr[70]
t[52]: storing global_attr[116] into shared_attr[134]
t[53]: storing global_attr[117] into shared_attr[198]
t[54]: storing global_attr[118] into shared_attr[262]
t[55]: storing global_attr[119] into shared_attr[326]
t[56]: storing global_attr[120] into shared_attr[390]
t[57]: storing global_attr[121] into shared_attr[454]
t[58]: storing global_attr[122] into shared_attr[518]
t[59]: storing global_attr[123] into shared_attr[582]
t[60]: storing global_attr[124] into shared_attr[646]
t[61]: storing global_attr[125] into shared_attr[710]
t[62]: storing global_attr[126] into shared_attr[774]
t[63]: storing global_attr[127] into shared_attr[838]
…
[/codebox]
It also appears to avoid shared memory bank conflicts because each thread is accessing a float with a 1 byte stride…
[codebox]
t[0] accessing shared_attr[0, 64, 128, 192, 256, 320, 384, 448, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1088, 1152]
t[1] accessing shared_attr[1, 65, 129, 193, 257, 321, 385, 449, 513, 577, 641, 705, 769, 833, 897, 961, 1025, 1089, 1153]
t[2] accessing shared_attr[2, 66, 130, 194, 258, 322, 386, 450, 514, 578, 642, 706, 770, 834, 898, 962, 1026, 1090, 1154]
t[3] accessing shared_attr[3, 67, 131, 195, 259, 323, 387, 451, 515, 579, 643, 707, 771, 835, 899, 963, 1027, 1091, 1155]
t[4] accessing shared_attr[4, 68, 132, 196, 260, 324, 388, 452, 516, 580, 644, 708, 772, 836, 900, 964, 1028, 1092, 1156]
t[5] accessing shared_attr[5, 69, 133, 197, 261, 325, 389, 453, 517, 581, 645, 709, 773, 837, 901, 965, 1029, 1093, 1157]
t[6] accessing shared_attr[6, 70, 134, 198, 262, 326, 390, 454, 518, 582, 646, 710, 774, 838, 902, 966, 1030, 1094, 1158]
t[7] accessing shared_attr[7, 71, 135, 199, 263, 327, 391, 455, 519, 583, 647, 711, 775, 839, 903, 967, 1031, 1095, 1159]
t[8] accessing shared_attr[8, 72, 136, 200, 264, 328, 392, 456, 520, 584, 648, 712, 776, 840, 904, 968, 1032, 1096, 1160]
t[9] accessing shared_attr[9, 73, 137, 201, 265, 329, 393, 457, 521, 585, 649, 713, 777, 841, 905, 969, 1033, 1097, 1161]
t[10] accessing shared_attr[10, 74, 138, 202, 266, 330, 394, 458, 522, 586, 650, 714, 778, 842, 906, 970, 1034, 1098, 1162]
t[11] accessing shared_attr[11, 75, 139, 203, 267, 331, 395, 459, 523, 587, 651, 715, 779, 843, 907, 971, 1035, 1099, 1163]
t[12] accessing shared_attr[12, 76, 140, 204, 268, 332, 396, 460, 524, 588, 652, 716, 780, 844, 908, 972, 1036, 1100, 1164]
t[13] accessing shared_attr[13, 77, 141, 205, 269, 333, 397, 461, 525, 589, 653, 717, 781, 845, 909, 973, 1037, 1101, 1165]
t[14] accessing shared_attr[14, 78, 142, 206, 270, 334, 398, 462, 526, 590, 654, 718, 782, 846, 910, 974, 1038, 1102, 1166]
t[15] accessing shared_attr[15, 79, 143, 207, 271, 335, 399, 463, 527, 591, 655, 719, 783, 847, 911, 975, 1039, 1103, 1167]
t[16] accessing shared_attr[16, 80, 144, 208, 272, 336, 400, 464, 528, 592, 656, 720, 784, 848, 912, 976, 1040, 1104, 1168]
t[17] accessing shared_attr[17, 81, 145, 209, 273, 337, 401, 465, 529, 593, 657, 721, 785, 849, 913, 977, 1041, 1105, 1169]
t[18] accessing shared_attr[18, 82, 146, 210, 274, 338, 402, 466, 530, 594, 658, 722, 786, 850, 914, 978, 1042, 1106, 1170]
t[19] accessing shared_attr[19, 83, 147, 211, 275, 339, 403, 467, 531, 595, 659, 723, 787, 851, 915, 979, 1043, 1107, 1171]
t[20] accessing shared_attr[20, 84, 148, 212, 276, 340, 404, 468, 532, 596, 660, 724, 788, 852, 916, 980, 1044, 1108, 1172]
t[21] accessing shared_attr[21, 85, 149, 213, 277, 341, 405, 469, 533, 597, 661, 725, 789, 853, 917, 981, 1045, 1109, 1173]
t[22] accessing shared_attr[22, 86, 150, 214, 278, 342, 406, 470, 534, 598, 662, 726, 790, 854, 918, 982, 1046, 1110, 1174]
t[23] accessing shared_attr[23, 87, 151, 215, 279, 343, 407, 471, 535, 599, 663, 727, 791, 855, 919, 983, 1047, 1111, 1175]
t[24] accessing shared_attr[24, 88, 152, 216, 280, 344, 408, 472, 536, 600, 664, 728, 792, 856, 920, 984, 1048, 1112, 1176]
t[25] accessing shared_attr[25, 89, 153, 217, 281, 345, 409, 473, 537, 601, 665, 729, 793, 857, 921, 985, 1049, 1113, 1177]
t[26] accessing shared_attr[26, 90, 154, 218, 282, 346, 410, 474, 538, 602, 666, 730, 794, 858, 922, 986, 1050, 1114, 1178]
t[27] accessing shared_attr[27, 91, 155, 219, 283, 347, 411, 475, 539, 603, 667, 731, 795, 859, 923, 987, 1051, 1115, 1179]
t[28] accessing shared_attr[28, 92, 156, 220, 284, 348, 412, 476, 540, 604, 668, 732, 796, 860, 924, 988, 1052, 1116, 1180]
t[29] accessing shared_attr[29, 93, 157, 221, 285, 349, 413, 477, 541, 605, 669, 733, 797, 861, 925, 989, 1053, 1117, 1181]
t[30] accessing shared_attr[30, 94, 158, 222, 286, 350, 414, 478, 542, 606, 670, 734, 798, 862, 926, 990, 1054, 1118, 1182]
t[31] accessing shared_attr[31, 95, 159, 223, 287, 351, 415, 479, 543, 607, 671, 735, 799, 863, 927, 991, 1055, 1119, 1183]
t[32] accessing shared_attr[32, 96, 160, 224, 288, 352, 416, 480, 544, 608, 672, 736, 800, 864, 928, 992, 1056, 1120, 1184]
t[33] accessing shared_attr[33, 97, 161, 225, 289, 353, 417, 481, 545, 609, 673, 737, 801, 865, 929, 993, 1057, 1121, 1185]
t[34] accessing shared_attr[34, 98, 162, 226, 290, 354, 418, 482, 546, 610, 674, 738, 802, 866, 930, 994, 1058, 1122, 1186]
t[35] accessing shared_attr[35, 99, 163, 227, 291, 355, 419, 483, 547, 611, 675, 739, 803, 867, 931, 995, 1059, 1123, 1187]
t[36] accessing shared_attr[36, 100, 164, 228, 292, 356, 420, 484, 548, 612, 676, 740, 804, 868, 932, 996, 1060, 1124, 1188]
t[37] accessing shared_attr[37, 101, 165, 229, 293, 357, 421, 485, 549, 613, 677, 741, 805, 869, 933, 997, 1061, 1125, 1189]
t[38] accessing shared_attr[38, 102, 166, 230, 294, 358, 422, 486, 550, 614, 678, 742, 806, 870, 934, 998, 1062, 1126, 1190]
t[39] accessing shared_attr[39, 103, 167, 231, 295, 359, 423, 487, 551, 615, 679, 743, 807, 871, 935, 999, 1063, 1127, 1191]
t[40] accessing shared_attr[40, 104, 168, 232, 296, 360, 424, 488, 552, 616, 680, 744, 808, 872, 936, 1000, 1064, 1128, 1192]
t[41] accessing shared_attr[41, 105, 169, 233, 297, 361, 425, 489, 553, 617, 681, 745, 809, 873, 937, 1001, 1065, 1129, 1193]
t[42] accessing shared_attr[42, 106, 170, 234, 298, 362, 426, 490, 554, 618, 682, 746, 810, 874, 938, 1002, 1066, 1130, 1194]
t[43] accessing shared_attr[43, 107, 171, 235, 299, 363, 427, 491, 555, 619, 683, 747, 811, 875, 939, 1003, 1067, 1131, 1195]
t[44] accessing shared_attr[44, 108, 172, 236, 300, 364, 428, 492, 556, 620, 684, 748, 812, 876, 940, 1004, 1068, 1132, 1196]
t[45] accessing shared_attr[45, 109, 173, 237, 301, 365, 429, 493, 557, 621, 685, 749, 813, 877, 941, 1005, 1069, 1133, 1197]
t[46] accessing shared_attr[46, 110, 174, 238, 302, 366, 430, 494, 558, 622, 686, 750, 814, 878, 942, 1006, 1070, 1134, 1198]
t[47] accessing shared_attr[47, 111, 175, 239, 303, 367, 431, 495, 559, 623, 687, 751, 815, 879, 943, 1007, 1071, 1135, 1199]
t[48] accessing shared_attr[48, 112, 176, 240, 304, 368, 432, 496, 560, 624, 688, 752, 816, 880, 944, 1008, 1072, 1136, 1200]
t[49] accessing shared_attr[49, 113, 177, 241, 305, 369, 433, 497, 561, 625, 689, 753, 817, 881, 945, 1009, 1073, 1137, 1201]
t[50] accessing shared_attr[50, 114, 178, 242, 306, 370, 434, 498, 562, 626, 690, 754, 818, 882, 946, 1010, 1074, 1138, 1202]
t[51] accessing shared_attr[51, 115, 179, 243, 307, 371, 435, 499, 563, 627, 691, 755, 819, 883, 947, 1011, 1075, 1139, 1203]
t[52] accessing shared_attr[52, 116, 180, 244, 308, 372, 436, 500, 564, 628, 692, 756, 820, 884, 948, 1012, 1076, 1140, 1204]
t[53] accessing shared_attr[53, 117, 181, 245, 309, 373, 437, 501, 565, 629, 693, 757, 821, 885, 949, 1013, 1077, 1141, 1205]
t[54] accessing shared_attr[54, 118, 182, 246, 310, 374, 438, 502, 566, 630, 694, 758, 822, 886, 950, 1014, 1078, 1142, 1206]
t[55] accessing shared_attr[55, 119, 183, 247, 311, 375, 439, 503, 567, 631, 695, 759, 823, 887, 951, 1015, 1079, 1143, 1207]
t[56] accessing shared_attr[56, 120, 184, 248, 312, 376, 440, 504, 568, 632, 696, 760, 824, 888, 952, 1016, 1080, 1144, 1208]
t[57] accessing shared_attr[57, 121, 185, 249, 313, 377, 441, 505, 569, 633, 697, 761, 825, 889, 953, 1017, 1081, 1145, 1209]
t[58] accessing shared_attr[58, 122, 186, 250, 314, 378, 442, 506, 570, 634, 698, 762, 826, 890, 954, 1018, 1082, 1146, 1210]
t[59] accessing shared_attr[59, 123, 187, 251, 315, 379, 443, 507, 571, 635, 699, 763, 827, 891, 955, 1019, 1083, 1147, 1211]
t[60] accessing shared_attr[60, 124, 188, 252, 316, 380, 444, 508, 572, 636, 700, 764, 828, 892, 956, 1020, 1084, 1148, 1212]
t[61] accessing shared_attr[61, 125, 189, 253, 317, 381, 445, 509, 573, 637, 701, 765, 829, 893, 957, 1021, 1085, 1149, 1213]
t[62] accessing shared_attr[62, 126, 190, 254, 318, 382, 446, 510, 574, 638, 702, 766, 830, 894, 958, 1022, 1086, 1150, 1214]
t[63] accessing shared_attr[63, 127, 191, 255, 319, 383, 447, 511, 575, 639, 703, 767, 831, 895, 959, 1023, 1087, 1151, 1215]
[/codebox]
The visual profiler reports:
-
203 warp serialize
-
380 branches (76 of which are divergent)
-
304 gld coalesced
– On the 285, I don’t have the option of profiling the gld/gst uncoalesced (what is the reason for this…other than it is compute capability 1.3?)
Thank you for the assistance!