Hello,
Since last weekend I have been working on porting applications to cuBLAS, in order to estimate how difficult it is to adapt CBLAS or FBLAS code to cuBLAS.
As my first exercise, I tried to adapt the Linpack clone HPL (HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers), which runs normally with the ATLAS libraries.
After hours of hard work (this being my first attempt…), I succeeded in running xhpl on CUDA-capable NVIDIA boards (an NVS 160 and an 8600 GT).
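For the record, each replacement follows roughly the pattern sketched below, using the legacy cuBLAS API (since I link against libcublas.so.3). This is only an illustrative sketch, not the actual HPL code; the vector names and the size are made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include "cublas.h"   /* legacy cuBLAS API (libcublas.so.3) */

    /* CBLAS version:  cblas_dcopy(n, x, 1, y, 1);
     * cuBLAS version: the data must first be moved to the device. */
    int main(void)
    {
        int n = 256;                 /* illustrative size */
        double *h_x, *h_y;           /* host vectors */
        double *d_x, *d_y;           /* device vectors */
        int i;

        h_x = (double *)malloc(n * sizeof(double));
        h_y = (double *)malloc(n * sizeof(double));
        for (i = 0; i < n; i++) h_x[i] = (double)i;

        cublasInit();                                        /* start cuBLAS */
        cublasAlloc(n, sizeof(double), (void **)&d_x);       /* device malloc */
        cublasAlloc(n, sizeof(double), (void **)&d_y);
        cublasSetVector(n, sizeof(double), h_x, 1, d_x, 1);  /* host -> device */

        cublasDcopy(n, d_x, 1, d_y, 1);                      /* the BLAS-1 call */

        cublasGetVector(n, sizeof(double), d_y, 1, h_y, 1);  /* device -> host */
        printf("y[n-1] = %g\n", h_y[n - 1]);

        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
        free(h_x);
        free(h_y);
        return 0;
    }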
First remark: I reach more than 200 Gflops (as HPL estimates it) on these two boards.
Second remark: when I port the code to a host that holds a Tesla 1060, my program crashes, not for small arrays, but for big ones!
For order 256:
====================
T/V             N   NB  P  Q   Time     Gflops
WR00L2L2      256  256  1  1   0.20  5.708e-02
WR00L2L4      256  256  1  1   0.06  1.809e-01
WR00L2C2      256  256  1  1   0.06  1.833e-01
WR00L2C4      256  256  1  1   0.06  1.834e-01
WR00L2R2      256  256  1  1   0.06  1.863e-01
WR00L2R4      256  256  1  1   0.06  1.879e-01
WR00C2L2      256  256  1  1   0.06  1.856e-01
WR00C2L4      256  256  1  1   0.06  1.875e-01
WR00C2C2      256  256  1  1   0.06  1.893e-01
For order 512:
[poseidon:17221] *** Process received signal ***
[poseidon:17221] Signal: Floating point exception (8)
[poseidon:17221] Signal code: Integer divide-by-zero (1)
[poseidon:17221] Failing at address: 0x7fdd3aec4b9b
[poseidon:17221] [ 0] /lib/libpthread.so.0 [0x7fdd3957ba80]
[poseidon:17221] [ 1] /usr/local/cuda/lib64/libcublas.so.3 [0x7fdd3aec4b9b]
[poseidon:17221] [ 2] /usr/local/cuda/lib64/libcublas.so.3(cublasDcopy+0x36c) [0x7fdd3af48f8c]
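To try to narrow this down, I am planning to wrap each ported call with an explicit status check, roughly like the sketch below. In the legacy API the BLAS-1 calls return void, so the status has to be fetched afterwards with cublasGetError(). The helper name checkedDcopy is mine, and the zero-increment guess for the "integer divide-by-zero" is only an assumption on my side, nothing I have confirmed:

    #include <stdio.h>
    #include <stdlib.h>
    #include "cublas.h"   /* legacy cuBLAS API (libcublas.so.3) */

    /* Hypothetical wrapper: run cublasDcopy, then poll the status. */
    static void checkedDcopy(int n, const double *d_x, int incx,
                             double *d_y, int incy)
    {
        cublasStatus stat;

        /* Assumption: a zero increment reaching the library could be a
         * candidate for an integer divide-by-zero inside cublasDcopy. */
        if (incx == 0 || incy == 0)
            fprintf(stderr, "suspicious increments: incx=%d incy=%d\n",
                    incx, incy);

        cublasDcopy(n, d_x, incx, d_y, incy);
        stat = cublasGetError();
        if (stat != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "cublasDcopy failed: n=%d incx=%d incy=%d status=%d\n",
                    n, incx, incy, (int)stat);
            exit(EXIT_FAILURE);
        }
    }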
The funniest part is that my two laptops with shared memory and 4 GB of RAM have no problems at all, while a Tesla board with 4 GB on an 8 GB host crashes…
Not very funny, in fact… Any ideas?