HPLinpack for CUDA: any interest?

bump

Friendly reminder: I’m still interested in seeing the code.

Bump. Any news on this topic (code, performance)?

Hi everyone,

Happy Holidays to you and yours…

Season's greetings to you and your family…

I am really in need of this HPLinpack CUDA benchmark.

Is there anyone who could give me the source files, or help me get the source? Any kind of help would be really appreciated.

Thank you very much.

I wish everyone a happy holiday season and a healthy and prosperous 2011!

Regards,

Yaser

There is a simplified, bare-bones version of the code base, capable of using a single GPU, available here: git://github.com/avidday/hpl-cuda.git

It should be treated as a starting point for an optimized benchmark, not the optimized benchmark itself.

AvidDay, how long do you think it would take to start from scratch on HPL-2.00 and graft in the GPU-enabled code to replicate the ideas in Dr. Fatica’s paper? The target would be a cluster of modern multicore SMP nodes, each equipped with 1 or 2 Tesla GPUs. The objective is to have the CPU BLAS and the GPU CUBLAS work together on the matrix computations rather than, say, just offloading everything to the GPU.

You mentioned that you have been hand-tuning or ‘auto-tuning’ the code. Have you tried this process on different GPU-enabled targets? The question is: if you have hand-tuned it for, say, a GPU with capability X, how much additional effort was needed to tune it for a GPU with higher capabilities?

I am asking since you went through this exercise …

–michael

The source code is now available, see http://forums.nvidia.com/index.php?showtopic=207574 for details

Dear Dr Fatica,

Thank you for the reply. I am familiar with your CUDA-enabled HPL 2.00 code.

I was wondering how long you think it would take to start from scratch on HPL-2.00 and graft in the GPU-enabled code to replicate the ideas in your paper. The target would be a cluster of modern multicore SMP nodes, each equipped with 1 or 2 Tesla GPUs. The objective is to have the CPU BLAS and the GPU CUBLAS work together on the matrix computations rather than, say, just offloading everything to the GPU.
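To make the "work together" idea concrete, here is a minimal sketch of one way to divide the columns of a DGEMM update between the GPU and the CPU so that both finish at roughly the same time. It is not taken from the HPL code discussed in this thread. The function name `split_columns`, the throughput figures, and the `align` tile width are assumptions made purely for illustration.

```python
def split_columns(n_cols, gpu_gflops, cpu_gflops, align=64):
    """Return (gpu_cols, cpu_cols) for a column-wise DGEMM split.

    The GPU share is proportional to the GPU's fraction of the combined
    sustained throughput, rounded down to a multiple of `align`
    (a hypothetical tile width). The CPU takes the remaining columns.
    """
    if n_cols < 0 or gpu_gflops < 0 or cpu_gflops < 0 or align <= 0:
        raise ValueError("sizes and rates must be non-negative, align positive")
    total = gpu_gflops + cpu_gflops
    if total == 0:
        raise ValueError("at least one device must have non-zero throughput")

    # Ideal (unaligned) number of columns for the GPU.
    ideal = n_cols * gpu_gflops / total

    # Round down to a whole number of tiles; the CPU absorbs the remainder.
    gpu_cols = min(int(ideal // align) * align, n_cols)
    return gpu_cols, n_cols - gpu_cols
```

For example, with assumed rates of 300 GFLOP/s on the GPU and 100 GFLOP/s on the CPU, `split_columns(10000, 300.0, 100.0)` returns `(7488, 2512)`. In a real code the GPU part would go to CUBLAS and the CPU part to the host BLAS, and the rates would have to include PCIe transfer time.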

On another note, does NVIDIA plan to provide a more generic software layer to help with writing code that uses BOTH the CPU and the GPU together? For GPUs attached to powerful SMP nodes, this approach seems to make good sense. In my opinion, manually splitting the computation into chunks of the right granularity, managing multiple communication streams, and scheduling transfers and their corresponding kernels is very cumbersome for the developer.

Ideally, there would be something like a “super-BLAS” library that, based on the matrix/vector sizes, automatically splits the workload into a CPU part and a GPU part. It would then maintain multiple streams and schedule kernel execution based on the performance characteristics of the current platform. More importantly, when the performance characteristics of a platform change, the code could “auto-tune” to them and relieve the developer of the manual, error-prone (and frustrating) process.

For instance, when a new I/O interface, host DRAM, GDRAM, GPU platform, etc. is introduced, it should readjust the computation for that particular platform.
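As a rough illustration of what such "auto-tuning" could look like, the sketch below adjusts the GPU's share of the work from the timings of the previous call. This is only one possible feedback rule, not something any existing library is known to do. The function name `retune_fraction` and the `damping` parameter are hypothetical.

```python
def retune_fraction(frac, t_gpu, t_cpu, damping=0.5):
    """Return an updated GPU work fraction from the last call's timings.

    frac    -- fraction of the work the GPU received last time (0 < frac < 1)
    t_gpu   -- seconds the GPU part took (including transfers)
    t_cpu   -- seconds the CPU part took
    damping -- assumed smoothing factor; 1.0 jumps straight to the estimate
    """
    if not 0.0 < frac < 1.0:
        raise ValueError("frac must be strictly between 0 and 1")
    if t_gpu <= 0 or t_cpu <= 0:
        raise ValueError("timings must be positive")

    # Observed rates in "fraction of work per second" for each device.
    rate_gpu = frac / t_gpu
    rate_cpu = (1.0 - frac) / t_cpu

    # Fraction at which both devices would have finished simultaneously.
    target = rate_gpu / (rate_gpu + rate_cpu)

    # Move only part of the way there to avoid oscillating on noisy timings.
    return frac + damping * (target - frac)
```

If both parts took the same time, the fraction stays where it is. If the CPU part took longer, the fraction moves toward the GPU. A library built on this idea would call it after each large update, so a change of hardware shows up in the timings and the split follows without the developer retuning by hand.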

I think this is exactly what you showcased with your CUDA-enabled HPL-2.00 code. I say this from my own limited experience trying to replicate the results of your work myself, from scratch.

Best regards,

Michael