Hi, everyone!
As a by-product of my project, I ended up writing a DASUM routine (m-dasum) that is faster than cublasDasum from CUBLAS 3.2 on a Tesla C2070.
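For readers who have not used it: DASUM returns the sum of the absolute values of a double-precision vector, read with a given stride. The function below is only a minimal CPU sketch for checking results. The name `ref_dasum` is made up for this illustration, and it is not the code in m-dasum or in CUBLAS.

```c
#include <math.h>

/* Hypothetical CPU reference, not taken from m-dasum or CUBLAS.
 * Returns |x[0]| + |x[incx]| + ... over n elements.
 * As in the reference BLAS, the result is 0 when n <= 0 or incx <= 0. */
double ref_dasum(int n, const double *x, int incx)
{
    double sum = 0.0;
    if (n <= 0 || incx <= 0) {
        return 0.0;
    }
    for (int i = 0; i < n; ++i) {
        sum += fabs(x[(long)i * incx]);
    }
    return sum;
}
```

With x = {1, -2, 3, -4}, calling `ref_dasum(4, x, 1)` adds 1 + 2 + 3 + 4, and `ref_dasum(2, x, 2)` adds only |x[0]| and |x[2]|.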
Performance:
- For vectors of 100 to 1000 elements, the speedup over cublasDasum is about 3.5x.
- For vectors of more than 10^6 elements, the speedup over cublasDasum is about 1.17x.
- For all other sizes, m-dasum and cublasDasum perform about the same.
I did not deliberately optimize the program. I only followed a few common rules for writing CUDA programs:
- 512 threads/block
- (512 + 16 + 16) * 8 bytes (double) = 4352 bytes of shared memory per block
- 8 blocks/SM. That is 8 * 4352 = 34816 bytes (about 35 KB) of shared memory, which is less than the 48 KB available.
- To avoid bank conflicts, the elements in shared memory are arranged as shown below (X = pad, 16 banks for double precision):
bank 0 : 0 X 31 46 61 76 91 106 121 136 151 166 181 196 211 226 241 X 271 286 301 316 331 346 361 376 391 406 421 436 451 466 481 496
bank 1 : 1 16 X 47 62 77 92 107 122 137 152 167 182 197 212 227 242 256 X 287 302 317 332 347 362 377 392 407 422 437 452 467 482 497
bank 2 : 2 17 32 X 63 78 93 108 123 138 153 168 183 198 213 228 243 257 272 X 303 318 333 348 363 378 393 408 423 438 453 468 483 498
bank 3 : 3 18 33 48 X 79 94 109 124 139 154 169 184 199 214 229 244 258 273 288 X 319 334 349 364 379 394 409 424 439 454 469 484 499
bank 4 : 4 19 34 49 64 X 95 110 125 140 155 170 185 200 215 230 245 259 274 289 304 X 335 350 365 380 395 410 425 440 455 470 485 500
bank 5 : 5 20 35 50 65 80 X 111 126 141 156 171 186 201 216 231 246 260 275 290 305 320 X 351 366 381 396 411 426 441 456 471 486 501
bank 6 : 6 21 36 51 66 81 96 X 127 142 157 172 187 202 217 232 247 261 276 291 306 321 336 X 367 382 397 412 427 442 457 472 487 502
bank 7 : 7 22 37 52 67 82 97 112 X 143 158 173 188 203 218 233 248 262 277 292 307 322 337 352 X 383 398 413 428 443 458 473 488 503
bank 8 : 8 23 38 53 68 83 98 113 128 X 159 174 189 204 219 234 249 263 278 293 308 323 338 353 368 X 399 414 429 444 459 474 489 504
bank 9 : 9 24 39 54 69 84 99 114 129 144 X 175 190 205 220 235 250 264 279 294 309 324 339 354 369 384 X 415 430 445 460 475 490 505
bank 10: 10 25 40 55 70 85 100 115 130 145 160 X 191 206 221 236 251 265 280 295 310 325 340 355 370 385 400 X 431 446 461 476 491 506
bank 11: 11 26 41 56 71 86 101 116 131 146 161 176 X 207 222 237 252 266 281 296 311 326 341 356 371 386 401 416 X 447 462 477 492 507
bank 12: 12 27 42 57 72 87 102 117 132 147 162 177 192 X 223 238 253 267 282 297 312 327 342 357 372 387 402 417 432 X 463 478 493 508
bank 13: 13 28 43 58 73 88 103 118 133 148 163 178 193 208 X 239 254 268 283 298 313 328 343 358 373 388 403 418 433 448 X 479 494 509
bank 14: 14 29 44 59 74 89 104 119 134 149 164 179 194 209 224 X 255 269 284 299 314 329 344 359 374 389 404 419 434 449 464 X 495 510
bank 15: 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 X 270 285 300 315 330 345 360 375 390 405 420 435 450 465 480 X 511
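The table above appears to follow a simple rule: one pad after every 16 elements, plus one more pad after every 256 elements. The C sketch below is inferred from the table only, not taken from m_dasum, and the names `padded_index`, `bank_of` and `shared_bytes_per_block` are made up for this illustration. It assumes the slots are laid out across 16 banks, row by row.

```c
#include <stddef.h>

enum { NUM_BANKS = 16 };   /* 16 double-wide banks, as stated in the post */

/* Assumed mapping from logical element i to its padded slot:
 * one pad per 16 elements and one extra pad per 256 elements. */
int padded_index(int i)
{
    return i + (i >> 4) + (i >> 8);
}

/* Bank that holds logical element i under the assumed layout. */
int bank_of(int i)
{
    return padded_index(i) % NUM_BANKS;
}

/* Shared memory needed by one block that stores `threads` doubles. */
size_t shared_bytes_per_block(int threads)
{
    size_t slots = (size_t)padded_index(threads - 1) + 1;
    return slots * sizeof(double);
}
```

Under this rule, `padded_index(256)` is 273, which puts element 256 in bank 1 at row 17, matching the table. For 512 threads the last slot is 543, so a block needs 544 slots, or 4352 bytes, which agrees with the figure quoted in the list.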
NOTE:
- I could not upload a *.tar.bz2 file for some reason, so I renamed m_dasum.tar.bz2 to m_dasum.txt to get it past the server. Rename it back to m_dasum.tar.bz2 before extracting.
- The main program is in Fortran 90 because my project is in Fortran. The Fortran code is the driver, and it calls the C code.
m_dasum.txt (2.69 KB)