A faster DASUM program than cublasDasum in CUBLAS 3.2 on Tesla C2070

Hi, everyone!

As a by-product of my project, I ended up writing a DASUM routine (m-dasum) that runs faster than cublasDasum from CUBLAS 3.2 on a Tesla C2070.
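
For readers who have not used the routine: DASUM is the BLAS level-1 operation that returns the sum of the absolute values of a double-precision vector. A minimal CPU reference in C, useful for checking a GPU result, might look like the sketch below. The name `ref_dasum` is hypothetical; it is not part of CUBLAS or of the attached m-dasum code.

```c
#include <stddef.h>

/* Hypothetical CPU reference for DASUM: sum of |x[i * incx]| for i in [0, n).
 * Follows the reference BLAS convention of returning 0 when n <= 0 or incx <= 0. */
double ref_dasum(int n, const double *x, int incx)
{
    double sum = 0.0;

    if (n <= 0 || incx <= 0)
        return 0.0;

    for (int i = 0; i < n; ++i) {
        double v = x[(size_t)i * (size_t)incx];
        sum += (v < 0.0) ? -v : v;   /* absolute value, no libm needed */
    }
    return sum;
}
```

Note that a GPU reduction adds the terms in a different order, so a comparison against this reference should allow a small relative tolerance rather than demand bitwise equality.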

Performance:

  1. For vectors of 100 to 1000 elements, the speedup over cublasDasum is about 3.5x.

  2. For vectors of more than 10^6 elements, the speedup over cublasDasum is about 1.17x.

  3. For all other vector sizes, m-dasum and cublasDasum perform about the same.

(Attached image: 1.png)

I did not deliberately optimize the program. I only applied a few common rules of thumb for CUDA programs:

  1. 512 threads/block

  2. (512 + 16 + 16) doubles * 8 bytes = 4352 bytes of shared memory per block

  3. 8 blocks/SM. This needs 8 * 4352 = 34816 bytes (34 KB) of shared memory, which is less than 48 KB.

  4. To avoid bank conflicts, the elements in shared memory are arranged as shown below (X = padding slot; 16 banks for double precision):

bank 0 :    0   X  31  46  61  76  91 106 121 136 151 166 181 196 211 226 241   X 271 286 301 316 331 346 361 376 391 406 421 436 451 466 481 496

bank 1 :    1  16   X  47  62  77  92 107 122 137 152 167 182 197 212 227 242 256   X 287 302 317 332 347 362 377 392 407 422 437 452 467 482 497

bank 2 :    2  17  32   X  63  78  93 108 123 138 153 168 183 198 213 228 243 257 272   X 303 318 333 348 363 378 393 408 423 438 453 468 483 498

bank 3 :    3  18  33  48   X  79  94 109 124 139 154 169 184 199 214 229 244 258 273 288   X 319 334 349 364 379 394 409 424 439 454 469 484 499

bank 4 :    4  19  34  49  64   X  95 110 125 140 155 170 185 200 215 230 245 259 274 289 304   X 335 350 365 380 395 410 425 440 455 470 485 500

bank 5 :    5  20  35  50  65  80   X 111 126 141 156 171 186 201 216 231 246 260 275 290 305 320   X 351 366 381 396 411 426 441 456 471 486 501

bank 6 :    6  21  36  51  66  81  96   X 127 142 157 172 187 202 217 232 247 261 276 291 306 321 336   X 367 382 397 412 427 442 457 472 487 502

bank 7 :    7  22  37  52  67  82  97 112   X 143 158 173 188 203 218 233 248 262 277 292 307 322 337 352   X 383 398 413 428 443 458 473 488 503

bank 8 :    8  23  38  53  68  83  98 113 128   X 159 174 189 204 219 234 249 263 278 293 308 323 338 353 368   X 399 414 429 444 459 474 489 504

bank 9 :    9  24  39  54  69  84  99 114 129 144   X 175 190 205 220 235 250 264 279 294 309 324 339 354 369 384   X 415 430 445 460 475 490 505

bank 10:   10  25  40  55  70  85 100 115 130 145 160   X 191 206 221 236 251 265 280 295 310 325 340 355 370 385 400   X 431 446 461 476 491 506

bank 11:   11  26  41  56  71  86 101 116 131 146 161 176   X 207 222 237 252 266 281 296 311 326 341 356 371 386 401 416   X 447 462 477 492 507

bank 12:   12  27  42  57  72  87 102 117 132 147 162 177 192   X 223 238 253 267 282 297 312 327 342 357 372 387 402 417 432   X 463 478 493 508

bank 13:   13  28  43  58  73  88 103 118 133 148 163 178 193 208   X 239 254 268 283 298 313 328 343 358 373 388 403 418 433 448   X 479 494 509

bank 14:   14  29  44  59  74  89 104 119 134 149 164 179 194 209 224   X 255 269 284 299 314 329 344 359 374 389 404 419 434 449 464   X 495 510

bank 15:   15  30  45  60  75  90 105 120 135 150 165 180 195 210 225 240   X 270 285 300 315 330 345 360 375 390 405 420 435 450 465 480   X 511
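
The layout in the table can be expressed as an index mapping. The sketch below is reconstructed only from the table above, not taken from the attached code, and the names `padded_slot` and `shared_bytes_per_block` are hypothetical. It assumes 544 slots laid out as 34 rows of 16 banks: rows 0 and 33 hold 16 elements each, and rows 1 to 32 hold 15 elements each with one pad at bank (row - 1) mod 16.

```c
#include <stddef.h>

enum {
    NBANKS = 16,             /* banks, as counted in the post for doubles */
    NELEMS = 512,            /* one element per thread */
    NPADS  = 32,             /* the 16 + 16 padding slots */
    NSLOTS = NELEMS + NPADS  /* 544 slots in total */
};

/* Shared memory needed per block for the padded array, in bytes. */
size_t shared_bytes_per_block(void)
{
    return (size_t)NSLOTS * sizeof(double);
}

/* Hypothetical helper: map logical element i (0..511) to its padded slot.
 * slot / NBANKS is the row and slot % NBANKS is the bank in the table. */
int padded_slot(int i)
{
    if (i < NBANKS)                  /* row 0: elements 0..15, no pad */
        return i;
    if (i >= NELEMS - NBANKS)        /* row 33: elements 496..511, no pad */
        return i + NPADS;

    /* rows 1..32: 15 elements per row, one pad per row */
    int j        = i - NBANKS;
    int row      = 1 + j / (NBANKS - 1);
    int k        = j % (NBANKS - 1);
    int pad_bank = (row - 1) % NBANKS;
    int bank     = (k < pad_bank) ? k : k + 1;   /* step over the pad */

    return row * NBANKS + bank;
}
```

For example, `padded_slot(16)` gives slot 17 (row 1, bank 1), which matches the table: bank 0 of row 1 is a pad and element 16 sits in bank 1.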

NOTE:

  1. I don’t know why I cannot upload a *.tar.bz2 file, so I renamed m_dasum.tar.bz2 to m_dasum.txt to get around the restriction.

  2. The main program is in Fortran 90. Because my project is in Fortran, the Fortran code acts as the driver and calls the C code.
    m_dasum.txt (2.69 KB)

After an email exchange with Mark Harris, I realized that I had misunderstood the concept of bank conflicts. It is NOT necessary to arrange the vector elements in such a padded manner.
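
One way to picture the unpadded alternative is the usual tree reduction with plain linear indexing. The sketch below emulates a single 512-thread block on the host in C, so the indexing can be checked without a GPU. It is an illustration under my own assumptions, not the code from the attachment and not necessarily what was discussed in that email exchange; `block_asum_unpadded` and `BLOCK` are hypothetical names.

```c
#include <stddef.h>

#define BLOCK 512   /* threads per block, as in the post */

/* Host-side emulation of one thread block summing |x[i]| without padding.
 * buf plays the role of the shared-memory array; "thread" t owns buf[t]. */
double block_asum_unpadded(const double *x, size_t n)
{
    double buf[BLOCK];

    /* Phase 1: each "thread" accumulates a strided slice of the input. */
    for (size_t t = 0; t < BLOCK; ++t) {
        double acc = 0.0;
        for (size_t i = t; i < n; i += BLOCK) {
            double v = x[i];
            acc += (v < 0.0) ? -v : v;
        }
        buf[t] = acc;   /* plain linear index, no padding */
    }

    /* Phase 2: tree reduction; the active "threads" 0..s-1 read buf[t + s].
     * On a GPU a barrier would separate the iterations of the outer loop. */
    for (size_t s = BLOCK / 2; s > 0; s >>= 1) {
        for (size_t t = 0; t < s; ++t)
            buf[t] += buf[t + s];
    }
    return buf[0];
}
```

In each step, neighbouring threads touch neighbouring elements, so the array needs only 512 * 8 = 4096 bytes per block instead of 4352.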