HPL

Massimiliano Fatica, do you plan on ever releasing the source code that goes along with this paper: http://portal.acm.org/citation.cfm?id=1513…FTOKEN=21112885

It would be very useful for our project, as we use a modified HPL in our application.

Also, have you considered trying out a mixed precision approach to HPL like found here: http://icl.cs.utk.edu/projectsfiles/iter-r…t_cs_06_580.pdf ?

We aren’t looking for a general-purpose library. We’d be glad to put together a hack job until a formal GPU Linpack library comes out.

Ben

There is no current plan to formally release the software.

I am making the software available to universities and research centers. Send me an email from your official email address and I will send you the source code.

Massimiliano

One should keep in mind that, while something like this might work for your application, using mixed precision violates the rules for an official HPL run.
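For context, the mixed-precision approach in the linked report is iterative refinement: factor and solve cheaply in single precision, then correct the solution using double-precision residuals. Here is a minimal pure-Python sketch of the idea (single precision is simulated by rounding every operation with `struct`; this is an illustration, not the paper’s implementation):

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE-754 single, simulating a 32-bit FPU."""
    return struct.unpack('f', struct.pack('f', x))[0]

def solve_f32(A, b):
    """Gaussian elimination with partial pivoting, every operation rounded to single."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented working copy
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivoting
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(f * M[k][j]))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution, still in "single"
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def refine(A, b, sweeps=3):
    """Mixed-precision iterative refinement: low-precision solves,
    double-precision residuals driving the correction."""
    x = solve_f32(A, b)
    for _ in range(sweeps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))  # residual in double
             for row, bi in zip(A, b)]
        d = solve_f32(A, r)                                 # correction in single
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For well-conditioned systems a few sweeps recover full double-precision accuracy even though all the O(n³) work is done in single precision, which is exactly why the technique is attractive on single-precision-only GPUs, and also why it does not count as an official HPL run.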

Well, I tried to send you a message, but I received this error: “This message can not be sent because the recipient has their personal messenger disabled or they are in a member group not allowed to use the personal messenger.”

So, please, could you send me the code from your article?

Regards

Fabien

Hello Mr Fatica

I’m a passionate amateur who runs many HPL tests on my computers, trying to optimize my machines as best I can. I currently run it with GotoBLAS on Debian GNU/Linux amd64 (Intel Quad Core CPU). I have two Nvidia 8800GT cards and would like to test them.

It would be very nice if you could send me the source code to “accelerate Linpack with CUDA”.

Best regards.

Even if you had the source, it would do you no good: HPL is a double precision benchmark and your cards don’t support double precision.

Heck, I had not thought about that. I had only checked whether my cards supported CUDA. Never mind, thanks.

Dear Dr. Fatica,

First of all, thank you for sharing that wonderful paper about HPL, I really enjoyed reading it.

We currently have a 24-node cluster with Tesla S1070s to evaluate running our Parallel Oil-Water-Gas Reservoir Simulator (POWERS) on GPUs.

At the same time we wanted to benchmark the system using your HPL version of the benchmark.

We set up your CUDA_pinned version of HPL for the S1070 (GT200) on our system; we are using the exact software versions recommended in CUDA_LINPACK_README.txt:

• Openmpi 1.2.5

• Intel Compiler 10.1.015

• Intel MKL 10.0.3

• CUDA 2.2 (cudatoolkit_2.2_linux_64_rhel4.7.run & NVIDIA-Linux-x86_64-185.18.36-pkg2.run)

Everything compiles fine.

The nodes are dual socket Quad Core Nehalem with 12 GB RAM, each node sees 2 GPUs coming from an S1070.

Runs use 80% of node memory.
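As a side note on the 80% figure: the HPL matrix is N² doubles, so a workable N for a given memory budget can be estimated as sqrt(fraction × memory / 8), usually rounded to keep things friendly to the block size. A quick sketch (`hpl_n` is a made-up helper name; the run below actually uses a somewhat smaller N, leaving headroom for the OS and pinned buffers):

```python
def hpl_n(bytes_per_node, nodes=1, fraction=0.8, nb=1152):
    """Largest N whose N*N*8-byte double-precision matrix fits in the given
    fraction of aggregate memory, rounded down to a multiple of NB."""
    n = int((fraction * bytes_per_node * nodes / 8) ** 0.5)
    return (n // nb) * nb

# One 12 GB node at 80% gives an upper bound around 35k:
print(hpl_n(12 * 1024**3))
```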

When running on a node with 8 CPU cores and 2 GPUs (OMP_NUM_THREADS=4):

[chewbacca@superbeast078 CUDA_pinned]$ mpirun -np 2 -hostfile nodes ./run_linpack.CUDA_pinned

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   32032 

NB     :    1152 

PMAP   : Row-major process mapping

P      :       1 

Q      :       2 

PFACT  :    Left 

NBMIN  :       4 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 128)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

Assigning device 0  to process on node superbeast078 rank 0 

Assigning device 1  to process on node superbeast078 rank 1 

DTRSM split from environment variable 0.630000 

DGEMM split from environment variable 0.660000 

DTRSM split from environment variable 0.630000 

DGEMM split from environment variable 0.660000 

================================================================================

T/V                N    NB     P     Q               Time                 Gflops

--------------------------------------------------------------------------------

WR00L2L4       32032  1152     1     2             122.00              1.796e+02

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0044480 ...... PASSED

================================================================================

Finished      1 tests with the following results:

              1 tests completed and passed residual checks,

              0 tests completed and failed residual checks,

              0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

================================================================================

Rpeak (theoretical) = 93 GFLOPS (8 CPU cores) + 154 GFLOPS (2 GPUs) = 247 GFLOPS

Rmax (actual) = 179.6 GFLOPS (73% efficiency)
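On the DGEMM/DTRSM split lines printed in the log above: the environment variable apparently sets the fraction of each update offloaded to the GPU, with the remainder handled by the host BLAS. The idea can be sketched by splitting a matrix product column-wise (a plain-Python stand-in for illustration; the actual partitioning in the CUDA code may differ):

```python
def matmul(A, B):
    """Naive reference GEMM on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_gemm(A, B, split=0.66):
    """Compute C = A*B by giving the first `split` fraction of B's columns to
    one worker (the "GPU" share) and the rest to another (the "CPU" share),
    then stitching the partial results back together."""
    cut = int(split * len(B[0]))
    B_gpu = [row[:cut] for row in B]   # would be sent to the GPU BLAS
    B_cpu = [row[cut:] for row in B]   # would stay with the host BLAS
    C_gpu = matmul(A, B_gpu)
    C_cpu = matmul(A, B_cpu)
    return [g + c for g, c in zip(C_gpu, C_cpu)]
```

With the two partial products running concurrently, tuning the split to match the relative CPU and GPU DGEMM speeds is what balances the hybrid run.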

So far the above tests run fine; however, as soon as we try running on multiple GPUs across multiple nodes, we get the errors below.

We still get the error even if we set the split as low as 0.50.

We also tried reducing N from 32032 to as low as 5000, and we still get the same error.

We have 24 nodes in the cluster and tried the run below on multiple pairs of nodes, just to make sure it is not a hardware problem, but we get the same error regardless of the pair used.

Have you encountered this before, and do you have any idea how to address it?

Your feedback is highly appreciated.

Thank you for your help.

Sincerely,

Mohamad Sindi

Saudi Aramco

EXPEC Computer Center

High Performance Computing Group

Error:

[sindimo@superbeast078 CUDA_pinned]$ mpirun -np 4 -hostfile nodes ./run_linpack.CUDA_pinned

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    5000 

NB     :     960 

PMAP   : Row-major process mapping

P      :       2 

Q      :       2 

PFACT  :    Left 

NBMIN  :       4 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 128)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

Assigning device 0  to process on node superbeast078 rank 0 

Assigning device 1  to process on node superbeast078 rank 2 

Assigning device 0  to process on node superbeast079 rank 1 

Assigning device 1  to process on node superbeast079 rank 3 

DTRSM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

!!!! device access error (write B) 11

!!!! device access error (write C) 11

 ** On entry to DGEMM  parameter number 10 had an illegal value

!!!! device access error (write B) 11

!!!! device access error (write C) 11

 ** On entry to DGEMM  parameter number 10 had an illegal value

[superbeast078:27713] *** Process received signal ***

[superbeast078:27712] *** Process received signal ***

[superbeast078:27713] Signal: Segmentation fault (11)

[superbeast078:27713] Signal code: Invalid permissions (2)

[superbeast078:27713] Failing at address: 0x2aa95c3608

[superbeast079:27114] *** Process received signal ***

[superbeast079:27114] Signal: Segmentation fault (11)

[superbeast079:27114] Signal code: Invalid permissions (2)

[superbeast079:27114] Failing at address: 0x2aaa6307d0

[superbeast079:27115] *** Process received signal ***

[superbeast079:27115] Signal: Segmentation fault (11)

[superbeast079:27115] Signal code: Invalid permissions (2)

[superbeast079:27115] Failing at address: 0x2aa9b5b7d0

./run_linpack.CUDA_pinned: line 14: 27712 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27713 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27115 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27114 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl
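Incidentally, the “Assigning device” lines in both runs are consistent with each MPI rank picking a device by its local index among the ranks on its own node. A small sketch that reproduces the mapping from the log above (my reconstruction for illustration, not the actual code):

```python
def assign_devices(ranks):
    """Map each (rank, hostname) pair to a device index equal to the rank's
    local position among the ranks on the same host."""
    per_host = {}
    devices = {}
    for rank, host in sorted(ranks):
        devices[rank] = per_host.get(host, 0)   # next free device on this host
        per_host[host] = devices[rank] + 1
    return devices

# The 4-rank run above: ranks 0 and 2 on superbeast078, ranks 1 and 3 on superbeast079.
layout = [(0, "superbeast078"), (1, "superbeast079"),
          (2, "superbeast078"), (3, "superbeast079")]
print(assign_devices(layout))  # {0: 0, 1: 0, 2: 1, 3: 1}
```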

I have a similar issue, but even worse: I cannot even run on 1 GPU.

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25000 

NB     :     384      512      640      768      896      960     1024     1152 

PMAP   : Row-major process mapping

P      :       1 

Q      :       1 

PFACT  :    Left 

NBMIN  :       2 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       1 

SWAP   : Mix (threshold = 192)

L1     : no-transposed form

U      : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words


- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

[node001:14923] *** Process received signal ***

./run_linpack: line 17: 14923 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl


mpirun has exited due to process rank 0 with PID 14922 on

node node001 exiting without calling "finalize". This may

have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).


Could anyone help with this?

Thanks in advance.

I have a problem when running HPL on a GPU cluster.

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   15360 

NB     :     512 

PMAP   : Column-major process mapping

P      :       1 

Q      :       1 

PFACT  :    Left    Crout    Right 

NBMIN  :       2        4 

NDIV   :       2 

RFACT  :    Left    Crout    Right 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 64)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

================================================================================

T/V                N    NB     P     Q               Time                 Gflops

--------------------------------------------------------------------------------

WC00L2L2       15360   512     1     1               3.21              7.534e+02

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

>>> Error code returned by solve is 1, skip**** <<<

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

>>> Error code returned by solve is 1, skip**** <<<

How can I solve this problem?

Hi all, I documented the procedure I used to get NVIDIA’s HPL working on both the S1070 and the S2050. I am getting efficiencies similar to what NVIDIA published at their GTC 2010 conference for single-node runs (73% for the C1060 and 63% for the M2050). Here’s the link to the HOWTO; I hope you find it useful:

HOWTO - HPL on GPU

Thank you

Mohamad Sindi

Saudi Aramco

EXPEC Computer Center

High Performance Computing Group

Hi Mohamad, how did you obtain the code from Nvidia? We would also like to benchmark our cluster with HPL/CUDA (some of our nodes carry GPUs), but I am having a very hard time finding out whom to ask at Nvidia.

Regards

Michael