CuSolver Sparse Code 2 error

ceeely · February 28, 2017, 2:09am

I’m trying to run a 55.5k x 55.5k sparse matrix with 17.6 million non-zeros through the cuSolverSp_LinearSolver sample. The attached matrix allows the program to run fine.

When I run the program with my matrix, it stops at step 4 with CUDA error code = 2 (CUSOLVER_STATUS_ALLOC_FAILED) in “cuSolverSpDcsrlsvcholHost(…)”.

According to the documentation, this is due to:

"Resource allocation failed inside the cuSolver library. This is usually caused by a cudaMalloc() failure.

To correct: prior to the function call, deallocate previously allocated memory as much as possible."

http://docs.nvidia.com/cuda/cusolver/index.html#ixzz4ZwOk362V

For context: My workstation has a Quadro K620 and a Tesla K40C.

(Part 2) When I run the program, it seems (I may be wrong) that only the Quadro is detected, which leads to the following notification when I run the program under PGPROF:

“Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower.”

So the question has two parts:

1. What can I safely deallocate, or what should I modify, to carry out the computation?
2. Is there a way to verify that the program is using both GPUs, or to make it use both?

Thank you.

Robert_Crovella · March 4, 2017, 3:39pm

You want to use your K40c (only). Trying to use the K620 also will never provide any benefit, and is probably not doable in any case, with the cusolver library.

Run deviceQuery. Note the order of enumeration of your 2 GPUs (make sure they both show up in deviceQuery)

Now suppose that the K40c was enumerated as the 2nd device (i.e. it appeared 2nd in the listing of GPUs in deviceQuery).

Then run your code like this on linux:

CUDA_VISIBLE_DEVICES="1" ./my_executable

where you replace “my_executable” with the name or command line of the program you want to run.

If you are on windows, the above would have to be modified. If you want to follow that recipe, you will need to determine how to specify an environment variable in windows.
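
At a Windows command prompt, the variable can be set with `set CUDA_VISIBLE_DEVICES=1` before launching the program from the same window. Another option is to set it from inside the program before the first CUDA call. The following is a minimal sketch of that idea, not code from the sample: `select_visible_device` is a hypothetical helper name, and the approach assumes the CUDA runtime has not been initialized yet when it runs.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of CUDA): restrict the process to a single
// GPU by setting CUDA_VISIBLE_DEVICES. Assumption: this runs before the
// first CUDA runtime call, otherwise the setting is not picked up.
bool select_visible_device(int index) {
    if (index < 0) {
        return false;  // negative indices are rejected by this sketch
    }
    const std::string value = std::to_string(index);
#ifdef _WIN32
    // Windows CRT: updates the environment of the current process.
    return _putenv_s("CUDA_VISIBLE_DEVICES", value.c_str()) == 0;
#else
    // POSIX: the final argument 1 overwrites any existing value.
    return setenv("CUDA_VISIBLE_DEVICES", value.c_str(), 1) == 0;
#endif
}
```

Calling `select_visible_device(1)` as the first statement of `main` would hide every GPU except the one enumerated second. The remaining GPU is then seen by the program as device 0.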

ceeely · March 6, 2017, 2:27am

I’m running on windows.

When I run the vanilla “findCudaDevice”, it states:

  GPU Device 1: "Quadro K620" with compute capability 5.0

Perhaps you can guide me to your suggested solution? thank you.

Robert_Crovella · March 6, 2017, 2:31am

I don’t know what “the vanilla findCudaDevice” is.

ceeely · March 6, 2017, 2:35am

I mean to say that the code

findCudaDevice(argc, (const char **)argv);

is in the cuSolverSp_LinearSolver sample code.

Robert_Crovella · March 6, 2017, 2:37am

Run the deviceQuery CUDA sample app on your system.

Paste the output verbatim into this thread.

The deviceQuery sample app is referred to in the windows installation guide for CUDA, and it is part of the proper verification of the CUDA install:

http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#compiling-examples

So if you followed those instructions, you should have run it at some point already.

ceeely · March 6, 2017, 3:02am

As requested

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\1_Utilities\deviceQuery\../../bin/win64/Debug/deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40c"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11423 MBytes (11978276864 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  CUDA Device Driver Mode (TCC or WDDM):         TCC (Tesla Compute Cluster Driver)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro K620"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1124 MHz (1.12 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K40c, Device1 = Quadro K620
Result = PASS
Press any key to continue . . .

Thank you.

Robert_Crovella · March 6, 2017, 3:14am

So your K40c is device 0.

When I run the cuSolverSp_LinearSolver sample code, by itself, I get output like this:

$ ./cuSolverSp_LinearSolver
GPU Device 0: "Tesla K20Xm" with compute capability 3.5

Using default input file [./lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.818989E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - A*x|/(|A|*|x|) = 3.026248E-16
timing chol: CPU =   0.067052 sec , GPU =   0.144262 sec
$

Can you paste in the output of what you get? Please just run it as-is, from the command line, don’t add any command line switches/options.

ceeely · March 6, 2017, 3:29am

Here it is:

GPU Device 1: "Quadro K620" with compute capability 5.0

sdkFindFilePath <lap2D_5pt_n100.mtx> in ./
Using default input file [./lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b)
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-012
(CPU) |A| = 8.000000E+000
(CPU) |x| = 7.513384E+002
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-016
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.818989E-012
(GPU) |A| = 8.000000E+000
(GPU) |x| = 7.513384E+002
(GPU) |b - A*x|/(|A|*|x|) = 3.026248E-016
timing chol: CPU =   0.223782 sec , GPU =   0.169357 sec
Press any key to continue . . .

Robert_Crovella · March 6, 2017, 3:41am

Interesting.

can you add the following switch

-device=0

to the command line, and see if the output is different? (i.e. the GPU Device 1… line should be replaced with a different printout on that line)

ceeely · March 6, 2017, 3:45am

Which command line? C/C++ , CUDA C++ , or Linker ?

either way, the build has this:

1>------ Build started: Project: cuSolverSp_LinearSolver, Configuration: Debug x64 ------
1>cl : Command line warning D9002: ignoring unknown option '-device=0'
1>  Skipping... (no relevant changes detected)
1>  mmio_wrapper.cpp
1>  cuSolverSp_LinearSolver.cpp
1>cl : Command line warning D9002: ignoring unknown option '-device=0'
1>  mmio.c
1>LINK : warning LNK4044: unrecognized option '/device=0'; ignored
1>  cuSolverSp_LinearSolver_vs2013.vcxproj -> C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\7_CUDALibraries\cuSolverSp_LinearSolver\../../bin/win64/Debug/cuSolverSp_LinearSolver.exe
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

And the output:

GPU Device 1: "Quadro K620" with compute capability 5.0

sdkFindFilePath <lap2D_5pt_n100.mtx> in ./
Using default input file [./lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b)
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-012
(CPU) |A| = 8.000000E+000
(CPU) |x| = 7.513384E+002
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-016
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.818989E-012
(GPU) |A| = 8.000000E+000
(GPU) |x| = 7.513384E+002
(GPU) |b - A*x|/(|A|*|x|) = 3.026248E-016
timing chol: CPU =   0.163375 sec , GPU =   0.167745 sec
Press any key to continue . . .

Robert_Crovella · March 6, 2017, 3:51am

I asked you to run the program from the command line. The windows command line.

That means I don’t want you to run it from within visual studio.

Open a windows command prompt.
change directory to here:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Debug\

There should be a cuSolverSp_LinearSolver.exe in that directory. Run it from the command line in that command prompt window like this:

cuSolverSp_LinearSolver -device=0

ceeely · March 6, 2017, 4:10am

Oh sorry, my mistake.

Here it is:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Debug>cuSolverSp_LinearSolver -device=0
gpuDeviceInit() CUDA Device [0]: "Tesla K40c
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./cuSolverSp_LinearSolver_data_files/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./common/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./common/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./src/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./src/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./inc/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./0_Simple/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./1_Utilities/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./2_Graphics/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./3_Imaging/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./4_Finance/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./5_Simulations/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./6_Advanced/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./7_CUDALibraries/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./8_Android/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./samples/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./0_Simple/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./1_Utilities/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./2_Graphics/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./3_Imaging/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./4_Finance/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./5_Simulations/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./6_Advanced/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./7_CUDALibraries/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ./7_CUDALibraries/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../common/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../common/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../src/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../inc/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../0_Simple/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../1_Utilities/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../2_Graphics/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../3_Imaging/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../4_Finance/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../5_Simulations/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../6_Advanced/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../7_CUDALibraries/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../8_Android/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../samples/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../common/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../common/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../src/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../inc/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../sandbox/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../0_Simple/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../1_Utilities/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../2_Graphics/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../3_Imaging/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../4_Finance/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../5_Simulations/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../6_Advanced/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../7_CUDALibraries/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../8_Android/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../samples/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../src/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../src/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../src/cuSolverSp_LinearSolver/src/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../src/cuSolverSp_LinearSolver/inc/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../sandbox/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../sandbox/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../sandbox/cuSolverSp_LinearSolver/src/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../sandbox/cuSolverSp_LinearSolver/inc/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../0_Simple/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../1_Utilities/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../2_Graphics/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../3_Imaging/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../4_Finance/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../5_Simulations/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../6_Advanced/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../7_CUDALibraries/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../8_Android/cuSolverSp_LinearSolver/data/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../0_Simple/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../1_Utilities/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../2_Graphics/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../3_Imaging/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../4_Finance/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../5_Simulations/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../6_Advanced/cuSolverSp_LinearSolver/
sdkFindFilePath <lap2D_5pt_n100.mtx> in ../../../7_CUDALibraries/cuSolverSp_LinearSolver/
Using default input file [../../../7_CUDALibraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b)
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-012
(CPU) |A| = 8.000000E+000
(CPU) |x| = 7.513384E+002
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-016
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.818989E-012
(GPU) |A| = 8.000000E+000
(GPU) |x| = 7.513384E+002
(GPU) |b - A*x|/(|A|*|x|) = 3.026248E-016
timing chol: CPU =   0.170508 sec , GPU =   0.136387 sec

There is the line gpuDeviceInit() CUDA Device [0]: “Tesla K40c”, so yes, there’s a difference.

Robert_Crovella · March 6, 2017, 4:14am

That, then, is one possible method to run this code on the K40c in your machine instead of the Quadro K620.

ceeely · March 6, 2017, 4:18am

I see, thank you. Does this mean that VS cannot run on the Tesla? Can I not isolate the display card from the computing card? Thanks in advance.

Robert_Crovella · March 6, 2017, 4:21am

No, you can certainly run on the Tesla card from within VS.

I don’t use VS very often, but I am sure there is a method to specify command-line arguments for any program you are running. It’s not by specifying compile or link switches, but actual command line arguments passed to the executable when it is run from within VS.

If you do some googling around I’m sure you can figure it out, such as this:

“c - Passing command line arguments in Visual Studio 2010?” on Stack Overflow

If you use a method like that to specify -device=0 from within VS for the executable, then it should run on the K40c just like you did from the command line.
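
To illustrate why `-device=0` is a run-time argument rather than a compiler or linker switch: the executable reads it from `argv` when it starts. The sketch below shows one way such a flag could be parsed. It is not the sample's `findCudaDevice` implementation, and `parse_device_flag` is a hypothetical name used only for illustration.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical parser (illustration only): returns N for an argument of the
// form "-device=N" or "--device=N", or -1 if the flag is absent or malformed.
int parse_device_flag(int argc, const char **argv) {
    const char key[] = "device=";
    const std::size_t key_len = sizeof(key) - 1;  // length without the '\0'
    for (int i = 1; i < argc; ++i) {              // argv[0] is the program name
        const char *arg = argv[i];
        if (*arg != '-') {
            continue;                             // flags must start with a dash
        }
        while (*arg == '-') {
            ++arg;                                // accept one or more dashes
        }
        if (std::strncmp(arg, key, key_len) != 0) {
            continue;                             // some other flag
        }
        const char *digits = arg + key_len;
        char *end = nullptr;
        const long value = std::strtol(digits, &end, 10);
        // Require at least one digit, nothing after it, and a non-negative index.
        if (end == digits || *end != '\0' || value < 0) {
            return -1;
        }
        return static_cast<int>(value);
    }
    return -1;
}
```

In Visual Studio, the argument string is normally entered under Project Properties, Configuration Properties, Debugging, Command Arguments, so that it reaches `argv` in the same way as when typed at the command prompt.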

ceeely · March 6, 2017, 4:33am

Brilliant. It addresses the Tesla calling problem.

The main problem of CUDA error 2 still persists, though… I guess the matrix was simply too big. Do you think the batched sparse solver would overcome that problem?

Robert_Crovella · March 6, 2017, 4:45am

not sure which batched sparse solver you are referring to.

a batched solver is usually used for solving groups of problems. Not sure why you think that would help with an out-of-memory issue.

FWIW a 55kx55k sparse system with ~17 million NZ elements does not sound particularly large to me, but I haven’t tried it.

ceeely · March 6, 2017, 6:43am

The thing is, when I run that, it causes error code 2, which suggests that there’s a memory allocation error. Perhaps you can suggest how I can isolate the problem so that I can overcome this issue.

Thank you.
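
One way to start isolating an allocation failure like this is a back-of-the-envelope memory estimate. The sketch below assumes double-precision values and 32-bit indices in CSR format, and it bounds the Cholesky factor by a fully dense lower triangle, which is a worst case that a good reordering usually improves on. The function names are hypothetical, and the figures say nothing about how cuSolver actually allocates memory internally.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store an n x n matrix with nnz entries in CSR format,
// assuming 8-byte double values and 4-byte integer indices.
std::uint64_t csr_bytes(std::uint64_t n, std::uint64_t nnz) {
    const std::uint64_t values  = nnz * sizeof(double);            // value array
    const std::uint64_t col_idx = nnz * sizeof(std::int32_t);      // column indices
    const std::uint64_t row_ptr = (n + 1) * sizeof(std::int32_t);  // row offsets
    return values + col_idx + row_ptr;
}

// Worst-case bytes for the values of a Cholesky factor: every entry of the
// lower triangle (including the diagonal) filled in. Real fill-in depends on
// the sparsity pattern and the reordering, so this is only an upper bound.
std::uint64_t dense_lower_factor_bytes(std::uint64_t n) {
    assert(n < (std::uint64_t{1} << 30));  // keeps n * (n + 1) * 8 within 64 bits
    return n * (n + 1) / 2 * sizeof(double);
}
```

For n = 55,500 and 17.6 million non-zeros, `csr_bytes` gives about 211 MB for the input, while `dense_lower_factor_bytes` gives about 12.3 GB for the worst-case factor. Since step 4 of the sample is the CPU solve, a failure there would more likely point at host memory than at GPU memory, so comparing these numbers with the available RAM, and trying the sample's `-P=symrcm` or `-P=symamd` reordering options, may be a reasonable first check.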