I have implemented a linear solver with the GotoBLAS2 library and I am now moving to CUBLAS (just changing each BLAS call to the corresponding CUBLAS call) for a very large system, around dimension 7000.
I was surprised that GotoBLAS2 is still much faster than CUBLAS… Is that reasonable?
Depends on which BLAS call you are using. If it is SGEMM or a similar BLAS3 call, I would be very surprised. If it is SAXPY or another BLAS1 call, not so much.
cublasSnrm2 is normally slower than any CPU version because it has to do a device-to-host copy to return the result. This incurs a lot of overhead, making it slower than the CPU. In general, the level 1 BLAS functions don’t benefit much from CUDA because they perform very few floating point operations per memory access.
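For example, with the legacy CUBLAS API the scalar result comes back by value, so every call pays a device-to-host transfer and a synchronisation on top of the kernel launch cost (a minimal sketch, not your code; d_x is an assumed device pointer):
[codebox]#include <cublas.h>

/* Sketch only: norm of a device vector d_x of length n. The scalar result
   is returned to the host, so the fixed transfer and launch costs dominate
   for small n. */
float device_norm(const float *d_x, int n)
{
    return cublasSnrm2(n, d_x, 1);
}[/codebox]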
A reduction on 7000 entries is really a couple of orders of magnitude too small to get much speed-up on the GPU; the total execution time becomes dominated by fixed latencies like the PCI-e bus transfer. The most computationally expensive call in your code is probably the ssbmv call, but even that really resolves to a few saxpy operations on the diagonals.
Your best hope is to fuse some of the operations together to reduce a lot of the fixed costs. Doing this as a sequence of separate calls and copies is very inefficient compared to using a single kernel launch for each, which fuses the computation and copy/assignment operations. In both cases it requires only two minimal kernels with a handful of lines of code each. Krylov subspace methods can be done efficiently with good speed-up on GPUs if the problem is large, you have the matrix in an efficient sparse format, and the “housekeeping” calculations can be done with minimal overhead. I think this code probably fails on all three counts.
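As a purely illustrative sketch (the vector names are assumptions, not your code), two updates that would otherwise be two separate cublasSaxpy launches plus their overheads can be collapsed into one trivial kernel:
[codebox]/* Sketch: fuse two saxpy-style updates into a single launch instead of
   paying the launch and copy overhead twice. */
__global__ void fused_axpy2(float *x, float *r, const float *p, const float *q,
                            float alpha, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        x[i] += alpha * p[i];   /* x = x + alpha*p */
        r[i] -= alpha * q[i];   /* r = r - alpha*q */
    }
}[/codebox]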
[codebox]/* GPU_BandMinRes(float *ACD, int N, int HBand, float *b, float *xMR, float tol, int n_max)
ACD = Given a band symmetric matrix As with dimension N and half band HBand, the equivalent
compact band matrix ACD in CUBLAS format has dimension (2*HBand+1)xN as described in the
CUBLAS reference guide for the function cublasDgbmv();
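Packing the symmetric band matrix into this (2*HBand+1) x N column-major layout could be done with something like the following host-side sketch, assuming the standard BLAS banded convention where element As(i,j) goes to row HBand+i-j of column j (the helper and the dense input are illustrative assumptions, not part of my solver):
[codebox]/* Sketch: pack a dense symmetric band matrix As (N x N, half band HBand,
   column-major) into the compact (2*HBand+1) x N band array ACD. */
void pack_band(const float *As, float *ACD, int N, int HBand)
{
    int lda = 2 * HBand + 1;
    for (int j = 0; j < N; ++j) {
        int ilo = (j - HBand > 0)     ? j - HBand : 0;
        int ihi = (j + HBand < N - 1) ? j + HBand : N - 1;
        for (int i = ilo; i <= ihi; ++i) {
            ACD[(HBand + i - j) + j * lda] = As[i + j * N];
        }
    }
}[/codebox]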
The only way to be sure is to profile it and see where it is slow. That said, all of your own kernels look like they could be improved in a number of ways. Whether it is worth doing depends on where your code is spending most of its time.
Looking at the Lanczos update kernel as an example:
[codebox]__global__ void lanczos_norm(float* Av, float* v_hat, float* v, float alpha, float beta, float beta_old, int N) {
    __shared__ float svhat[LANCZOS_THREAD_PER_BLOCK];
    __shared__ float svold[LANCZOS_THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) {
        svhat[tid] = v_hat[i];
        svold[tid] = v[i];
        v[i] = svhat[tid] / beta;
        svhat[tid] = Av[i] - alpha * v[i] - beta * svold[tid];
        v_hat[i] = svhat[tid];
        svhat[tid] = svhat[tid] * svhat[tid];
    } else {
        svhat[tid] = 0;
    }
    __syncthreads();

    if (LANCZOS_THREAD_PER_BLOCK >= 512) {
        if (tid < 256) { svhat[tid] += svhat[tid + 256]; }
        __syncthreads();
    }
    if (LANCZOS_THREAD_PER_BLOCK >= 256) {
        if (tid < 128) { svhat[tid] += svhat[tid + 128]; }
        __syncthreads();
    }
    if (LANCZOS_THREAD_PER_BLOCK >= 128) {
        if (tid < 64) { svhat[tid] += svhat[tid + 64]; }
        __syncthreads();
    }
    if (tid < 32) {
        if (LANCZOS_THREAD_PER_BLOCK >= 64) svhat[tid] += svhat[tid + 32];
        if (LANCZOS_THREAD_PER_BLOCK >= 32) svhat[tid] += svhat[tid + 16];
        if (LANCZOS_THREAD_PER_BLOCK >= 16) svhat[tid] += svhat[tid + 8];
        if (LANCZOS_THREAD_PER_BLOCK >= 8)  svhat[tid] += svhat[tid + 4];
        if (LANCZOS_THREAD_PER_BLOCK >= 4)  svhat[tid] += svhat[tid + 2];
        if (LANCZOS_THREAD_PER_BLOCK >= 2)  svhat[tid] += svhat[tid + 1];
    }

    // write result for this block to global mem
    if (tid == 0) Av[blockIdx.x] = svhat[0];
}[/codebox]
This code computes one norm value per thread and then performs a block-wise sum reduction on it. The arithmetic intensity of the kernel is very low and the “reduction ratio” is rather poor. Why not have every thread compute multiple norm values, sum them, and then reduce them? By doing so you will need fewer blocks (so fewer total shared memory reductions, which is faster), need to copy fewer partial sums back from the device (which will be faster), and need fewer operations on the host to complete the norm calculation (which is also faster).
The norm calculation itself doesn’t need to be done in shared memory; values only need to be loaded into shared memory for the reduction. Doing the calculations in shared memory is slower than doing them in registers, and svold is a completely unnecessary use of shared memory which will hurt occupancy. The reduction code itself could probably be improved a bit as well (if you are running on a Fermi card it may not always work correctly). My current code for doing something like this (an Linfty norm in this case) looks something like:
[codebox]typedef float ValueType;

__global__ void linfty_norm(unsigned int num_rows, const ValueType * x0, ValueType * norms)
{
    extern __shared__ ValueType vbuffer[];
    volatile int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    volatile int stride = gridDim.x * blockDim.x;

    // Loop through vector to find local maximum
    ValueType lumax = 0;
    for( unsigned int i = tidx; i < num_rows; i += stride ) {
        lumax = fmax( lumax, fabs(x0[i]) );
    }

    // Load into shared memory
    vbuffer[threadIdx.x] = lumax; __syncthreads();

    // First warp performs shared memory reduction
    // The volatile in register makes the code "Fermi safe"
    if (threadIdx.x < warpSize) {
        volatile ValueType *a = &vbuffer[threadIdx.x];
        #pragma unroll
        for(int i = warpSize; i < blockDim.x; i += warpSize) {
            *a = fmax( *a, *(a+i) );
        }
        *a = fmax( *a, *(a+16) );
        *a = fmax( *a, *(a+8) );
        *a = fmax( *a, *(a+4) );
        *a = fmax( *a, *(a+2) );

        // Finalise and write out the results to global memory
        if (threadIdx.x == 0) {
            norms[blockIdx.x] = fmax( *a, *(a+1) );
        }
    }
}[/codebox]
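To finish the norm, the host only has to reduce the single partial maximum each block wrote into norms. A sketch (gridsize and d_norms are assumed names, not part of the code above):
[codebox]#include <cuda_runtime.h>
#include <vector>

/* Sketch: reduce the per-block partial maxima written by linfty_norm. */
float finish_linfty(const float *d_norms, int gridsize)
{
    std::vector<float> h(gridsize);
    cudaMemcpy(h.data(), d_norms, gridsize * sizeof(float), cudaMemcpyDeviceToHost);

    float result = 0.0f;
    for (int b = 0; b < gridsize; ++b) {
        if (h[b] > result) result = h[b];
    }
    return result;
}[/codebox]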
[codebox]Method           #calls   GPU usec   %GPU time   ins. throughput
ssbmvu_main        3596   391299        74.63       0.330816
sdot_gld_main      3595    44165.4       8.42       0.560313
memcpyDtoH         7191    32451.2       6.18
lanczos_norm       3595    30190.1       5.75       0.165724
update             3595    26139.4       4.98       0.0709608
memcpyHtoD            3       33.664     0
initialize            1        8.672     0           0.211152[/codebox]
The execution time is dominated by ssbmvu_main (the matrix-vector multiplication), which is very slow… I cannot understand why!
About improving my function… I have followed your suggestions, but I am not sure how to compute multiple norm values per thread without using shared memory.
Moreover, my lanczos_norm kernel computes not only the norm but also the vector v_hat, which will be used later; this means that the number of blocks depends on the vector length.
In a typical sparse/banded implementation on the GPU, you would launch one thread per row to compute the dot product, which is completely memory-bandwidth bound. That probably means 700 threads for your 700x700 case, which is very small. The sort of sparse systems I solve using Krylov subspace methods on GPUs have several million rows. I suspect your problems are just far too small for the GPU to perform well.
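Purely as an illustration (this is not your ssbmvu_main, and it assumes the (2*HBand+1) x N compact band layout described earlier), a one-thread-per-row banded matrix-vector kernel looks something like:
[codebox]/* Sketch: y = A*x for a banded matrix stored in the compact
   (2*HBand+1) x N column-major band format; one thread per row. */
__global__ void banded_spmv(const float *ACD, const float *x, float *y,
                            int N, int HBand)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    int lda = 2 * HBand + 1;
    int jlo = (i - HBand > 0)     ? i - HBand : 0;
    int jhi = (i + HBand < N - 1) ? i + HBand : N - 1;

    float sum = 0.0f;
    for (int j = jlo; j <= jhi; ++j) {
        /* A(i,j) lives at row HBand+i-j of column j in the compact array */
        sum += ACD[(HBand + i - j) + j * lda] * x[j];
    }
    y[i] = sum;
}[/codebox]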
You need to add the shared memory size to the kernel launch statement. So a kernel launch looks something like this (using the Linfty norm example again), where shmsize is the dynamic shared memory per block:
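A minimal sketch, with illustrative block and grid sizes:
[codebox]const int blocksize = 128;
const int gridsize  = 112;                            /* e.g. a few blocks per multiprocessor */
size_t    shmsize   = blocksize * sizeof(ValueType);  /* one shared memory entry per thread   */

linfty_norm<<<gridsize, blocksize, shmsize>>>(num_rows, x0, norms);[/codebox]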