The problem may lie elsewhere. A beta printout of zero could be explained, for example, by alpha being zero, and you haven't shown the value of alpha. A rho printout of zero could be explained by the actual vectors you pass to the dot product, which you haven't shown. They might also be "zero" because of the limited precision you have specified in your printf format. I'm not suggesting those are the actual issues, but simply saying "alpha is not zero", for example, won't be enough for others to help you. I also agree that a complete example is needed.
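For reference on the printf point: a small but nonzero double prints as zero with only 6 digits of fixed precision. The value below is made up for illustration, not taken from your program:

double val = 8.0e-11;          // small, but not zero
printf("%.6f\n", val);         // prints 0.000000
printf("%.12f\n", val);        // prints 0.000000000080
printf("%.6e\n", val);         // prints 8.000000e-11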
A few other comments:
The only way your code could make sense is if either rho is a managed variable, or rho is a host variable and the cuBLAS pointer mode is host (which is the default).
At least in the case of rho being a host variable, the cudaStreamSynchronize() should not be necessary at all. This is because there is an implicit synchronization for cublasDdot when the result is returned to a host variable (i.e. when the pointer mode is host).
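A minimal sketch of what that implies, assuming the default host pointer mode and device vectors d_r0 and d_r of length n (those names are mine, for illustration only):

double rho = 0.0;                                        // host variable
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);  // this is already the default
// With the host pointer mode, cublasDdot blocks until the result has been
// written to the host variable, so no extra cudaStreamSynchronize() is needed
// before reading rho.
cublasDdot(handle, n, d_r0, 1, d_r, 1, &rho);
printf("rho = %.15e\n", rho);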
There is nothing wrong with the dot product calculation. First, if you uncomment the cudaDeviceSynchronize() that you have in your code, it doesn’t change the result. So clearly your original theory:
is not correct.
Next, if you copy the r0 and r vectors back to the host, and do a dot-product on them in host code, you will observe that the result is 0.000000000079987. So that explains why your print-out of rho on the second iteration appears to be zero (you are only printing 6 digits). You'll need to debug backwards to find out why the dot product of those two vectors is approximately zero. The issue is not with the Ddot calculation, nor is it with the behavior of cudaStreamSynchronize(). The actual calculation at that point, on the actual supplied vectors, is approximately zero.
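A sketch of that host-side check, assuming d_r0 and d_r are the device vectors of length n that you feed to cublasDdot (the host names are mine):

double *h_r0 = (double *)malloc(n * sizeof(double));
double *h_r  = (double *)malloc(n * sizeof(double));
cudaMemcpy(h_r0, d_r0, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(h_r,  d_r,  n * sizeof(double), cudaMemcpyDeviceToHost);
double host_rho = 0.0;
for (int i = 0; i < n; i++)
    host_rho += h_r0[i] * h_r[i];          // reference dot product computed on the host
printf("host dot product = %.15f\n", host_rho);
free(h_r0);
free(h_r);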
Since you already have a CPU and a GPU realization, debugging could proceed by printing out the results of each calculation, as well as its inputs, at each step, in both the host and device versions. Sure, it's tedious. Debugging often is. That should quickly identify the point at which the CPU and GPU calculations diverge, and focus your attention.
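One way to do that, sketched here and not part of your code: a small helper that copies the first few elements of a device vector back and prints them with full precision, so the CPU and GPU traces can be compared side by side after each step.

void dump_device_vector(const char *label, const double *d_vec, int n, int count)
{
    double h_tmp[8];
    if (count > 8) count = 8;
    if (count > n) count = n;
    // copy the first 'count' elements back to the host
    cudaMemcpy(h_tmp, d_vec, count * sizeof(double), cudaMemcpyDeviceToHost);
    printf("%s:", label);
    for (int i = 0; i < count; i++)
        printf(" %.15e", h_tmp[i]);
    printf("\n");
}

Call it after each kernel or cuBLAS step, e.g. dump_device_vector("r after update", r, n, 4), and print the matching values in the CPU version.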
Thank you for your deep analysis of the code. If you compare the output of the CPU and GPU versions of the bicgstab functions, you will see that there seems to be a problem with the cublasDdot(…) function in the second iteration.
I have now added more printf(…) statements to compare the steps of CPU and GPU versions. The updated code is attached.
Moving forward, please do not post code as attachments. Your zip file just contains a single file; post that as code directly in your posting, using the forum tools. This makes your code easier to discuss as well as to search.
I also don’t see any output:
Here’s what I did before making my last posting. I copied the vectors that are fed into the dot-product from the GPU back to the host, and did the dot-product in host code. That’s how I came up with this:
So suggesting that the GPU dot product (cublasDdot) itself is the reason your print-out is showing 0.00000 is not sensible to me, and I won't be able to help any further with that claim.
I think there is some issue with the forum. I replied via email, assuming that it would create the post in the forum appropriately.
Here is the code and the output again:
Output:
r0 = 0.840188, r = 0.840188
rho = 333307.870866, rho_prev = 1.000000
beta = 333307.870866
prev err = 0.000000, err = 217.722910
r0 = 0.840188, r = 0.260499
rho = 44063.788360, rho_prev = 333307.870866 (This should be the value of rho in the GPU version as well; there can be a slight variation in the values, but not like the current ones.)
beta = 0.283511
prev err = 217.722910, err = 235.933494
hepta iterations = 2, err = 235.933494
CPU Execution time: 106.732000 ms
r0 = 0.840188, r = 0.840188
rho = 333307.870865977835, rho_prev = 1.000000000000
beta = 333307.870865977835
prev err = 0.000000, err = 217.722910
r0 = 0.840188, r = 0.229024 (Here the value of r is slightly different from the CPU execution, but rho is still supposed to be the dot product of the r0 and r vectors, which should be greater than 0.840188 * 0.229024 = 0.192423.)
rho = 0.000000000081, rho_prev = 333307.870865977835
beta = 0.000000000000
prev err = 217.722910, err = 227.133623
cudastreams: cuda hepta iterations = 2, err = 227.133623
GPU Streams Execution time: 33.312386 ms
x[0] = 0.371959, y[0] = 0.363693
void mat_vec_hepta(TYPE *A, TYPE *x, TYPE *y, int n){
int i,j,k;
for(i=0; i<n; i++){
for(j=0; j<7; j++){
k = i+3-j;
if(k<0 || k>=n) continue;
y[k] += A[i*7+j] * x[i];
}
}
}
void bicgstab_hepta(TYPE *A, TYPE *b, TYPE *x, int n) {
TYPE *r, *r0, *p, *s, *t, *v;
TYPE alpha, beta, omega, rho, rho_prev, err=0.0, err_prev;
int i, k;
void cublasCheck(cublasStatus_t e, int line=__LINE__) {
if (e != CUBLAS_STATUS_SUCCESS) {
printf("cuBLAS failure %s:%d: '%d': %s\n", __FILE__, line, e, cublasGetStatusString(e));
exit(0);
}
}
void cudaCheck(cudaError_t e, int line=__LINE__) {
if (e != cudaSuccess) {
printf("CUDA failure %s:%d: '%s'\n", __FILE__, line, cudaGetErrorString(e));
exit(0);
}
}

__global__ void mat_vec_hepta_cuda_kernel(TYPE *A, TYPE *x, TYPE *y, int n){
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j, k;
if(i<n){
for(j=0; j<7; j++){
k = i+3-j;
if(k<0 || k>=n) continue;
atomicAdd(&y[k], A[i*7+j] * x[i]);
}
}
}
void mat_vec_hepta_cuda(TYPE *A, TYPE *x, TYPE *y, int n, cudaStream_t st=NULL){
// int i,j,k;
// for(i=0; i<n; i++){
// for(j=0; j<7; j++){
// k = i+3-j;
// if(k<0 || k>=n) continue;
// y[k] += A[i*7+j] * x[i];
// }
// }
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
if (st != NULL) {
mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock, 0, st>>>(A, x, y, n);
} else {
mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, x, y, n);
}
// mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, x, y, n);
cudaCheck(cudaDeviceSynchronize());
}
#ifdef DOUBLE
__global__ void cuda_init(double *x, double *r, double *r0, double *p, double *s, double *t, double *v, double *b, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
x[i] = p[i] = s[i] = t[i] = v[i] = 0.0;
r[i] = r0[i] = b[i];
}
}
nvtxRangePushA("cudastreams: cuda memory allocation and deallocation");
cudaCheck(cudaMalloc(&r, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&r0, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&p, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&s, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&t, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&v, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&tmp, n * sizeof(double)), __LINE__);
nvtxRangePop();
nvtxRangePushA("cudastreams: cuda init");
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
cuda_init<<<blocksPerGrid, threadsPerBlock>>>(x, r, r0, p, s, t, v, b, n);
cudaCheck(cudaDeviceSynchronize());
nvtxRangePop();
// Step 4
rho = 1.0;
alpha = 1.0;
omega = 1.0;
nvtxRangePushA("cudastreams: cublas");
cublasHandle_t handle;
cudaStream_t s1, s2, s3, s4, s5;
cudaEvent_t es17, es16;
cublasCheck(cublasCreate(&handle), __LINE__);
cudaCheck(cudaStreamCreate(&s1), __LINE__);
cudaCheck(cudaStreamCreate(&s2), __LINE__);
cudaCheck(cudaStreamCreate(&s3), __LINE__);
cudaCheck(cudaStreamCreate(&s4), __LINE__);
cudaCheck(cudaStreamCreate(&s5), __LINE__);
cudaCheck(cudaEventCreate(&es17), __LINE__);
cudaCheck(cudaEventCreate(&es16), __LINE__);
nvtxRangePop();
int k;
for (k = 0; k < MAX_ITER; k++) {
// printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);
//T1:
rho_prev = rho;
//T2:
rho = 0.0;
nvtxRangePushA("cudastreams: cublas");
// A = 6, B = 4, C = 2, D = 1, E = 1, F = 1, G = 1, H = 2, I = 1, J = 2, K = 2, L = 1, M = 2, N = 2, O = 1, P = 1
//Start of PP1 Region
// Step 7
//T3:
double tr0, tr;
cudaCheck(cudaMemcpy(&tr0, r0, sizeof(double), cudaMemcpyDeviceToHost));
cudaCheck(cudaMemcpy(&tr, r, sizeof(double), cudaMemcpyDeviceToHost));
printf("r0 = %.6f, r = %.6f\n", tr0, tr);
cublasCheck(cublasSetStream(handle, s1), __LINE__);
cublasCheck(cublasDdot(handle, n, r0, 1, r, 1, &rho), __LINE__); //rho = r0 * r, A
cudaCheck(cudaStreamSynchronize(s1));
// cudaCheck(cudaDeviceSynchronize());
//T4:
printf("rho = %.12f, rho_prev = %.12f\n", rho, rho_prev);
beta = (rho / rho_prev) * (alpha / omega); // Step 8, B CURRENT ISSUE IS HERE
printf("beta = %.12f\n", beta);
// Step 9
//T5:
double nomega = -omega; // A
//T6:
cublasCheck(cublasSetStream(handle, s2), __LINE__);
cublasCheck(cublasDaxpy(handle, n, &nomega, v, 1, p, 1), __LINE__); // p = p - omega * v, B
//T7:
cublasCheck(cublasSetStream(handle, s4), __LINE__);
cublasCheck(cublasDcopy(handle, n, r, 1, tmp, 1), __LINE__); // tmp = r, A
//End of PP1 Region