Incorrect Behaviour of cudaStreamSynchronize()

It seems that cudaStreamSynchronize() is not blocking the host thread from continuing execution, which is the required behaviour.

In the following code, I should get the updated (non-zero) value of rho, but it prints zero.
Let me know if I am missing anything here.

    double tr0, tr;
    cudaCheck(cudaMemcpy(&tr0, r0, sizeof(double), cudaMemcpyDeviceToHost));
    cudaCheck(cudaMemcpy(&tr, r, sizeof(double), cudaMemcpyDeviceToHost));
    printf("r0 = %.6f, r = %.6f\n", tr0, tr);
    cublasCheck(cublasSetStream(handle, s1), __LINE__);
    cublasCheck(cublasDdot(handle, n, r0, 1, r, 1, &rho), __LINE__); //rho = r0 * r, A
    cudaCheck(cudaStreamSynchronize(s1));

printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);
    beta = (rho / rho_prev) * (alpha / omega); // Step 8, B CURRENT ISSUE IS HERE
    printf("beta = %.6f\n", beta);

Output:
r0 = 0.840188, r = 0.229024
rho = 0.000000, rho_prev = 333307.870866
beta = 0.000000

Please show a full minimal reproducer that can be compiled and run.

I don’t have a problem with the code you have shown:

# cat t271.cu
#include <cublas_v2.h>
#include <iostream>
#include <cassert>

#define cudaCheck(x) x
#define cublasCheck(x,y) assert(x==CUBLAS_STATUS_SUCCESS)



int main(){
    double *r0, *r;
    cudaMalloc(&r0, sizeof(double));
    cudaMalloc(&r, sizeof(double));
    double h_r0 = 0.840188;
    double h_r = 0.229024;
    double rho_prev = 333307.870866;
    double rho;
    double alpha = 100;
    double omega = 1;
    double beta;
    cudaMemcpy(r0, &h_r0, sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(r, &h_r, sizeof(double), cudaMemcpyHostToDevice);
    cublasHandle_t handle;
    cudaStream_t s1;
    cublasCreate(&handle);
    cudaStreamCreate(&s1);
    const int n = 1;

    double tr0, tr;
    cudaCheck(cudaMemcpy(&tr0, r0, sizeof(double), cudaMemcpyDeviceToHost));
    cudaCheck(cudaMemcpy(&tr, r, sizeof(double), cudaMemcpyDeviceToHost));
    printf("r0 = %.6f, r = %.6f\n", tr0, tr);
    cublasCheck(cublasSetStream(handle, s1), __LINE__);
    cublasCheck(cublasDdot(handle, n, r0, 1, r, 1, &rho), __LINE__); //rho = r0 * r, A
    cudaCheck(cudaStreamSynchronize(s1));

    printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);
    beta = (rho / rho_prev) * (alpha / omega); // Step 8, B CURRENT ISSUE IS HERE
    printf("beta = %.6f\n", beta);
}
# nvcc -o t271 t271.cu -lcublas
# compute-sanitizer ./t271
========= COMPUTE-SANITIZER
r0 = 0.840188, r = 0.229024
rho = 0.192423, rho_prev = 333307.870866
beta = 0.000058
========= ERROR SUMMARY: 0 errors
#

The problem may lie elsewhere. A beta printout of zero could be explained by, e.g., alpha being zero, and you haven’t shown the value of alpha. A rho printout of zero could be explained by the actual vectors you pass to the dot product, which you haven’t shown. The values might also appear to be “zero” because of the limited precision you have specified in your printf format. I’m not suggesting those are the actual issues, but simply replying “alpha is not zero”, for example, won’t give others enough to help you. I also agree that a complete example is needed.
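
To illustrate the printf point, here is a small standalone snippet (my own illustration, not part of your code) showing how a tiny but non-zero value still prints as zero with a %.6f format:

    #include <cstdio>

    int main(){
        double rho = 7.9987e-11;        // tiny, but not zero
        printf("rho = %.6f\n", rho);    // prints: rho = 0.000000
        printf("rho = %.15e\n", rho);   // prints: rho = 7.998700000000000e-11
        return 0;
    }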

A few other comments:

  • The only way your code could make sense is if either rho is a managed variable, or rho is a host variable and the cuBLAS pointer mode is host (which is the default).
  • At least in the case of rho being a host variable, the cudaStreamSynchronize() should not be necessary at all, because there is an implicit synchronization for cublasDdot when the result is returned to a host variable (i.e. when the pointer mode is host); see the sketch below.
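
Here is a minimal sketch of the two pointer modes (my own example, not your code). With CUBLAS_POINTER_MODE_HOST the cublasDdot call blocks until the scalar result has been written to the host variable; with CUBLAS_POINTER_MODE_DEVICE the result lands in device memory asynchronously and you do need a synchronization (or a blocking copy) before reading it on the host:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(){
        const int n = 3;
        double h_x[n] = {1.0, 2.0, 3.0};
        double *d_x;
        cudaMalloc(&d_x, n * sizeof(double));
        cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Host pointer mode (the default): the call blocks until 'dot' is valid.
        double dot = 0.0;
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
        cublasDdot(handle, n, d_x, 1, d_x, 1, &dot);
        printf("host mode:   dot = %.6f\n", dot);   // 14.000000

        // Device pointer mode: the result is written to device memory asynchronously,
        // so a blocking copy (or an explicit synchronization) is needed before using it.
        double *d_dot;
        cudaMalloc(&d_dot, sizeof(double));
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasDdot(handle, n, d_x, 1, d_x, 1, d_dot);
        cudaMemcpy(&dot, d_dot, sizeof(double), cudaMemcpyDeviceToHost);
        printf("device mode: dot = %.6f\n", dot);   // 14.000000

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFree(d_dot);
        return 0;
    }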

Please find attached the complete code.
Please note that it runs correctly in the first iteration of the bicgstab loop but it gives an error in the second iteration.

Output:

beta = 333307.870866
prev err = 0.000000, err = 217.722910
beta = 0.283511
prev err = 217.722910, err = 235.933494
hepta iterations = 2, err = 235.933494
CPU Execution time: 85.456000 ms
r0 = 0.840188, r = 0.840188
rho = 333307.870866, rho_prev = 1.000000
beta = 333307.870866
prev err = 0.000000, err = 217.722910
r0 = 0.840188, r = 0.229024
rho = 0.000000, rho_prev = 333307.870866
beta = 0.000000
prev err = 217.722910, err = 227.133623
cudastreams: cuda hepta iterations = 2, err = 227.133623
GPU Streams Execution time: 32.776287 ms
x[0] = 0.371959, y[0] = 0.363693
bicgstab_minimal.zip (3.5 KB)

There is nothing wrong with the dot product calculation. First, if you uncomment the cudaDeviceSynchronize() that you have in your code, it doesn’t change the result. So your original theory (that cudaStreamSynchronize() is not blocking the host thread) is clearly not correct.

Next, if you copy the r0 and r vectors back to the host and do a dot-product on them in host code, you will observe that the result is 0.000000000079987. That explains why your print-out of rho on the second iteration appears to be zero (you are only printing 6 digits). You’ll need to debug backwards to find out why the dot product of those two vectors is approximately zero. The issue is not with the Ddot calculation, nor is it with the behavior of cudaStreamSynchronize(). The actual calculation at that point, on the actual supplied vectors, is approximately zero.
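
Here is a sketch of that check (my own code, meant to be dropped in right before the beta computation; r0, r, and n are the device vectors and length from your program):

    // Copy the device vectors feeding the dot product back to the host and
    // redo the reduction in host code, then compare against the cublasDdot result.
    double *h_r0 = (double*) malloc(n * sizeof(double));
    double *h_r  = (double*) malloc(n * sizeof(double));
    cudaMemcpy(h_r0, r0, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_r,  r,  n * sizeof(double), cudaMemcpyDeviceToHost);

    double host_rho = 0.0;
    for (int i = 0; i < n; i++) host_rho += h_r0[i] * h_r[i];
    printf("host check: rho = %.15e\n", host_rho);  // print with enough digits to see tiny values

    free(h_r0);
    free(h_r);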

Since you already have a CPU and a GPU realization, debugging could proceed by printing out the results of each calculation, as well as its inputs, at each step, in both the host and device versions. Sure, it’s tedious. Debugging often is. That should quickly identify the point at which the CPU and GPU calculations diverge, and focus your attention.
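
For example (a hypothetical helper, not part of your code), a small routine that prints the first few entries of a device vector makes it easy to compare against the corresponding CPU printout at each step:

    // Hypothetical debugging helper: print the first 'count' entries of a device vector.
    void dump_device_vec(const char *label, const double *d_vec, int count) {
        double *h = (double*) malloc(count * sizeof(double));
        cudaMemcpy(h, d_vec, count * sizeof(double), cudaMemcpyDeviceToHost);
        printf("%s:", label);
        for (int i = 0; i < count; i++) printf(" %.15e", h[i]);
        printf("\n");
        free(h);
    }

    // Usage after each step, e.g. after the p update (Step 9):
    // dump_device_vec("GPU p", p, 4);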

Dear Robert,

Thank you for your deep analysis of the code. If you compare the output of the CPU and GPU versions of the bicgstab functions, you will see that there seems to be a problem with the cublasDdot(…) function in the second iteration.

I have now added more printf(…) statements to compare the steps of CPU and GPU versions. The updated code is attached.

Output:

(Attachment bicgstab_minimal.cu is missing)

  1. I don’t see it.

  2. Moving forward, please do not post code as attachments. Your zip file contains just a single file; post it as code directly in your posting using the forum tools. That makes your code easier to discuss and to search.

  3. I also don’t see any output:

Here’s what I did before making my last posting. I copied the vectors that are fed into the dot-product from the GPU back to the host, and did the dot-product in host code. That’s how I came up with this:

So suggesting that the GPU Dot product (cublasDdot) itself is the issue for why your print-out is showing 0.00000 is not sensible to me, and I won’t be able to help any further with that claim.

I think there is some issue with the forum. I replied via email, assuming that it would create the post in the forum appropriately.

Here is the code and the output again:

Output:

r0 = 0.840188, r = 0.840188
rho = 333307.870866, rho_prev = 1.000000
beta = 333307.870866
prev err = 0.000000, err = 217.722910
r0 = 0.840188, r = 0.260499
rho = 44063.788360, rho_prev = 333307.870866 (This should be the value of rho in the GPU version as well; there can be a slight variation in the values, but not like the current ones)
beta = 0.283511
prev err = 217.722910, err = 235.933494
hepta iterations = 2, err = 235.933494
CPU Execution time: 106.732000 ms

r0 = 0.840188, r = 0.840188
rho = 333307.870865977835, rho_prev = 1.000000000000
beta = 333307.870865977835
prev err = 0.000000, err = 217.722910
r0 = 0.840188, r = 0.229024 (Here the results of r are slightly different from the CPU execution, but rho is still supposed to be the dot product of the r0 and r vectors, which should be greater than 0.840188 * 0.229024 = 0.192423).
rho = 0.000000000081, rho_prev = 333307.870865977835
beta = 0.000000000000
prev err = 217.722910, err = 227.133623
cudastreams: cuda hepta iterations = 2, err = 227.133623
GPU Streams Execution time: 33.312386 ms
x[0] = 0.371959, y[0] = 0.363693

Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <math.h>
#include <nvToolsExt.h>
#include <sys/time.h>

#define PREFETCHING
#define CUDA_STREAMS

#define ITER 10
#define MAX_ITER 2
#define TOL 1e-6
#define TOL_CUDA 1e-6

#define TYPE double
#define DOUBLE

//Initialization Constants
#define GPU_INIT 0
#define CPU_INIT 1
#define BOTH_INIT 2

void mat_vec_hepta(TYPE *A, TYPE *x, TYPE *y, int n){
    int i,j,k;
    for(i=0; i<n; i++){
        for(j=0; j<7; j++){
            k = i+3-j;
            if(k<0 || k>=n) continue;
            y[k] += A[i*7+j] * x[i];
        }
    }
}

void bicgstab_hepta(TYPE *A, TYPE *b, TYPE *x, int n) {
TYPE *r, *r0, *p, *s, *t, *v;
TYPE alpha, beta, omega, rho, rho_prev, err=0.0, err_prev;
int i, k;

r = (TYPE*) malloc(n * sizeof(TYPE));
r0 = (TYPE*) malloc(n * sizeof(TYPE));
p = (TYPE*) malloc(n * sizeof(TYPE));
s = (TYPE*) malloc(n * sizeof(TYPE));
t = (TYPE*) malloc(n * sizeof(TYPE));
v = (TYPE*) malloc(n * sizeof(TYPE));

for (i = 0; i < n; i++) {
    x[i] = 0.0; // x initial guess, Step 1
    r[i] = b[i]; // Step 2
    r0[i] = b[i]; // Step 3
    // Step 5
    p[i] = 0.0;
    s[i] = 0.0;
    t[i] = 0.0;
    v[i] = 0.0;
}

// Step 4
rho = 1.0;
alpha = 1.0;
omega = 1.0;

for (k = 0; k < MAX_ITER; k++) { // Step 6
// printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);
    rho_prev = rho;
    rho = 0.0;

    printf("r0 = %.6f, r = %.6f\n", r0[0], r[0]);

    // Step 7
    for(i=0; i<n; i++){
        rho += r0[i] * r[i];
    }

printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);
    beta = (rho / rho_prev) * (alpha / omega); // Step 8, B CURRENT ISSUE IS HERE
    printf("beta = %.6f\n", beta);
    // beta = (rho / rho_prev) * (alpha / omega); // Step 8
    // printf("beta = %.6f\n", beta);

    // Step 9
    for (i = 0; i < n; i++) {
        p[i] = r[i] + beta * (p[i] - omega * v[i]);
    }

    nvtxRangePushA("mat_vec_hepta");
    // Step 10
    mat_vec_hepta(A, p, v, n);
    nvtxRangePop();

    // Step 11
    alpha = 0.0;
    for(i=0; i<n; i++){
        alpha += r0[i] * v[i];
    }
    alpha = rho / alpha;

    // Step 12
    for(i=0; i<n; i++){
        s[i] = r[i] - alpha * v[i];
    }

    nvtxRangePushA("mat_vec_hepta");
    // Step 13
    mat_vec_hepta(A, s, t, n);
    nvtxRangePop();

    // Step 14
    omega = 0.0;
    for(i=0; i<n; i++){
        omega += t[i] * s[i];
    }
    TYPE temp = 0.0;
    for(i=0; i<n; i++){
        temp += t[i] * t[i];
    }
    omega = omega / temp;

    // Step 15
    for(i=0; i<n; i++){
        x[i] += alpha * p[i] + omega * s[i];
        r[i] = s[i] - omega * t[i];
    }

    // Step 16
    err_prev = err;
    err = 0.0;
    for(i=0; i<n; i++){
        err += x[i] * x[i];
    }
    err = sqrt(err);
    printf("prev err = %.6f, err = %.6f\n", err_prev, err);
    if (fabs(err-err_prev) < TOL) {
        break;
    }
}

printf("hepta iterations = %d, err = %.6f\n", k, err);
// printf("alpha = %.6f, beta = %.6f, omega = %.6f\n", alpha, beta, omega);
// printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);

free(r);
free(r0);
free(p);
free(s);
free(t);
free(v);

}

void cublasCheck(cublasStatus_t e, int line=__LINE__) {
    if (e != CUBLAS_STATUS_SUCCESS) {
        printf("cuBLAS failure %s:%d: '%d': %s\n", __FILE__, line, e, cublasGetStatusString(e));
        exit(0);
    }
}

void cudaCheck(cudaError_t e, int line=__LINE__) {
    if (e != cudaSuccess) {
        printf("CUDA failure %s:%d: '%s'\n", __FILE__, line, cudaGetErrorString(e));
        exit(0);
    }
}
__global__ void mat_vec_hepta_cuda_kernel(TYPE *A, TYPE *x, TYPE *y, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j, k;
    if(i<n){
        for(j=0; j<7; j++){
            k = i+3-j;
            if(k<0 || k>=n) continue;
            atomicAdd(&y[k], A[i*7+j] * x[i]);
        }
    }
}

void mat_vec_hepta_cuda(TYPE *A, TYPE *x, TYPE *y, int n, cudaStream_t st=NULL){
    // int i,j,k;
    // for(i=0; i<n; i++){
    //     for(j=0; j<7; j++){
    //         k = i+3-j;
    //         if(k<0 || k>=n) continue;
    //         y[k] += A[i*7+j] * x[i];
    //     }
    // }
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (st != NULL) {
        mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock, 0, st>>>(A, x, y, n);
    } else {
        mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, x, y, n);
    }
    // mat_vec_hepta_cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(A, x, y, n);
    cudaCheck(cudaDeviceSynchronize());

}

#ifdef DOUBLE
__global__ void cuda_init(double *x, double *r, double *r0, double *p, double *s, double *t, double *v, double *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = p[i] = s[i] = t[i] = v[i] = 0.0;
        r[i] = r0[i] = b[i];
    }
}

void bicgstab_hepta_cudastreams(double *A, double *b, double *x, int n) {
double *r, *r0, *p, *s, *t, *v, *tmp;
double alpha, beta, omega, rho, rho_prev, err=0.0, err_prev;

nvtxRangePushA("cudastreams: cuda memory allocation and deallocation");

cudaCheck(cudaMalloc(&r, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&r0, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&p, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&s, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&t, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&v, n * sizeof(double)), __LINE__);
cudaCheck(cudaMalloc(&tmp, n * sizeof(double)), __LINE__);

nvtxRangePop();

nvtxRangePushA("cudastreams: cuda init");

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
cuda_init<<<blocksPerGrid, threadsPerBlock>>>(x, r, r0, p, s, t, v, b, n);
cudaCheck(cudaDeviceSynchronize());

nvtxRangePop();

// Step 4
rho = 1.0;
alpha = 1.0;
omega = 1.0;

nvtxRangePushA("cudastreams: cublas");
cublasHandle_t handle;
cudaStream_t s1, s2, s3, s4, s5;
cudaEvent_t es17, es16;
cublasCheck(cublasCreate(&handle), __LINE__);
cudaCheck(cudaStreamCreate(&s1), __LINE__);
cudaCheck(cudaStreamCreate(&s2), __LINE__);
cudaCheck(cudaStreamCreate(&s3), __LINE__);
cudaCheck(cudaStreamCreate(&s4), __LINE__);
cudaCheck(cudaStreamCreate(&s5), __LINE__);
cudaCheck(cudaEventCreate(&es17), __LINE__);
cudaCheck(cudaEventCreate(&es16), __LINE__);
nvtxRangePop();

int k;
for (k = 0; k < MAX_ITER; k++) {
// printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);

//T1:
rho_prev = rho;
//T2:
rho = 0.0;

    nvtxRangePushA("cudastreams: cublas");
    // A = 6, B = 4, C = 2, D = 1, E = 1, F = 1, G = 1, H = 2, I = 1, J = 2, K = 2, L = 1, M = 2, N = 2, O = 1, P = 1

//Start of PP1 Region
// Step 7
//T3:
double tr0, tr;
cudaCheck(cudaMemcpy(&tr0, r0, sizeof(double), cudaMemcpyDeviceToHost));
cudaCheck(cudaMemcpy(&tr, r, sizeof(double), cudaMemcpyDeviceToHost));
printf("r0 = %.6f, r = %.6f\n", tr0, tr);
cublasCheck(cublasSetStream(handle, s1), __LINE__);
cublasCheck(cublasDdot(handle, n, r0, 1, r, 1, &rho), __LINE__); //rho = r0 * r, A
cudaCheck(cudaStreamSynchronize(s1));
// cudaCheck(cudaDeviceSynchronize());

//T4:
printf("rho = %.12f, rho_prev = %.12f\n", rho, rho_prev);
beta = (rho / rho_prev) * (alpha / omega); // Step 8, B CURRENT ISSUE IS HERE
printf("beta = %.12f\n", beta);

    // Step 9

//T5:
double nomega = -omega; // A
//T6:
cublasCheck(cublasSetStream(handle, s2), __LINE__);
cublasCheck(cublasDaxpy(handle, n, &nomega, v, 1, p, 1), __LINE__); // p = p - omega * v, B
//T7:
cublasCheck(cublasSetStream(handle, s4), __LINE__);
cublasCheck(cublasDcopy(handle, n, r, 1, tmp, 1), __LINE__); // tmp = r, A
//End of PP1 Region

//T8:
cublasCheck(cublasSetStream(handle, s2), __LINE__);
cublasCheck(cublasDaxpy(handle, n, &beta, p, 1, tmp, 1), __LINE__); // tmp = tmp + beta * p, C

//T9:
cublasCheck(cublasDcopy(handle, n, tmp, 1, p, 1), __LINE__); // p = tmp, D
// cudaCheck(cudaDeviceSynchronize());
nvtxRangePop();

    nvtxRangePushA("cudastreams: mat_vec_hepta_cuda");
    // Step 10

//T10:
mat_vec_hepta_cuda(A, p, v, n, s2); // v = A * p, E
// printf("p[0] = %.6f\n", p[0]);
nvtxRangePop();

    nvtxRangePushA("cudastreams: cublas");
    // Step 11

//T11:
alpha = 0.0; // C
//T12:
cublasCheck(cublasDdot(handle, n, r0, 1, v, 1, &alpha), __LINE__); // alpha = r0 * v, F
cudaCheck(cudaStreamSynchronize(s2));
//T13:
alpha = rho / alpha; // G

    // Step 12

//T14:
cublasCheck(cublasSetStream(handle, s5), __LINE__);
cublasCheck(cublasDcopy(handle, n, r, 1, s, 1), __LINE__); // s = r, A
//T15:
double nalpha = -alpha; // H
//T16:
cublasCheck(cublasSetStream(handle, s2), __LINE__);
cublasCheck(cublasDaxpy(handle, n, &nalpha, v, 1, s, 1), __LINE__); // s = s - alpha * v, I
cudaCheck(cudaEventRecord(es16, s2));
// cudaCheck(cudaDeviceSynchronize());
nvtxRangePop();

    nvtxRangePushA("cudastreams: mat_vec_hepta_cuda");
    // Step 13

//T17:
mat_vec_hepta_cuda(A, s, t, n, s2); // t = A * s, J
cudaCheck(cudaEventRecord(es17, s2));
nvtxRangePop();

    nvtxRangePushA("cudastreams: cublas");
    // Step 14

//T18:
omega = 0.0; // B
//T19:
cublasCheck(cublasDdot(handle, n, t, 1, s, 1, &omega), __LINE__); // omega = t * s, K
//T20:
double temp = 0.0; // A
//T21:
cublasCheck(cublasSetStream(handle, s3), __LINE__);
cudaCheck(cudaStreamWaitEvent(s3, es17));
cublasCheck(cublasDdot(handle, n, t, 1, t, 1, &temp), __LINE__); // temp = t * t, K
cudaCheck(cudaStreamSynchronize(s3));
//T22:
omega = omega / temp; // L

    // Step 15

//T23:
cublasCheck(cublasDaxpy(handle, n, &alpha, p, 1, x, 1), __LINE__); // x = x + alpha * p, H
//T24:
cublasCheck(cublasDaxpy(handle, n, &omega, s, 1, x, 1), __LINE__); // x = x + omega * s, M
//T25:
nomega = -omega; // M
//T26:
cudaCheck(cudaStreamWaitEvent(s3, es16));
cublasCheck(cublasDcopy(handle, n, s, 1, r, 1), __LINE__); // r = s, J
//T27:
cublasCheck(cublasSetStream(handle, s4), __LINE__);
cublasCheck(cublasDaxpy(handle, n, &nomega, t, 1, r, 1), __LINE__); // r = r - omega * t, N

    // Step 16

//T28:
err_prev = err; // A
//T29:
err = 0.0; // B
//T30:
cublasCheck(cublasSetStream(handle, s3), __LINE__);
cublasCheck(cublasDdot(handle, n, x, 1, x, 1, &err), __LINE__); // err = x * x, N
cudaCheck(cudaStreamSynchronize(s3));
//T31:
err = sqrt(err); // O
printf("prev err = %.6f, err = %.6f\n", err_prev, err);
//T32:
if (fabs(err - err_prev) < TOL_CUDA) { // P
break;
}
nvtxRangePop();
// cudaCheck(cudaDeviceSynchronize());
}

cudaCheck(cudaEventDestroy(es17));
cudaCheck(cudaEventDestroy(es16));
cudaCheck(cudaStreamDestroy(s1));
cudaCheck(cudaStreamDestroy(s2));
cudaCheck(cudaStreamDestroy(s3));
cudaCheck(cudaStreamDestroy(s4));
cudaCheck(cudaStreamDestroy(s5));

printf("cudastreams: cuda hepta iterations = %d, err = %.6f\n", k, err);
// printf("alpha = %.6f, beta = %.6f, omega = %.6f\n", alpha, beta, omega);
// printf("rho = %.6f, rho_prev = %.6f\n", rho, rho_prev);

nvtxRangePushA("cudastreams: cuda memory allocation and deallocation");
cudaCheck(cudaFree(r));
cudaCheck(cudaFree(r0));
cudaCheck(cudaFree(p));
cudaCheck(cudaFree(s));
cudaCheck(cudaFree(t));
cudaCheck(cudaFree(v));
cudaCheck(cudaFree(tmp));
nvtxRangePop();

cublasCheck(cublasDestroy(handle), __LINE__);

}
#endif

void check_result(TYPE *x, TYPE *y, int n){
int i;
for(i=0; i<n; i++){
if(fabs(x[i]-y[i]) > TOL){
printf("x[%d] = %.6f, y[%d] = %.6f\n", i, x[i], i, y[i]);
return;
}
}
printf("Correct!\n");
}

void init(TYPE *A_hepta, TYPE *b, TYPE *x_hepta, TYPE *x_hepta_cuda, int n, int device_init = BOTH_INIT) {
int i, j, hi;

for (i = 0; i < n; i++) {
    b[i] = rand() / (TYPE) RAND_MAX;
    if (device_init == GPU_INIT) {
        x_hepta_cuda[i] = 0.0;
    } else if (device_init == CPU_INIT) {
        x_hepta[i] = 0.0;
    } else {
        x_hepta[i] = 0.0;
        x_hepta_cuda[i] = 0.0;
    }
    for (j = 0; j < 7; j++) {
        A_hepta[i*7+j] = 0.0;
    }
}

for (i = 0; i < n; i++) {
    for (j = 0; j < 7; j++) {
        hi = i + 3 - j;
        if (hi < 0 || hi >= n) continue;
        A_hepta[i*7+j] = rand() / (TYPE) RAND_MAX;
    }
}

}

int main(int argc, char **argv) {

int n = 1000000;
if(argc > 1){
    n = atoi(argv[1]);
}

TYPE *b, *A_hepta, *x_hepta, *x_hepta_cuda;

cudaCheck(cudaMallocManaged(&A_hepta, n * 7 * sizeof(TYPE)), __LINE__);
cudaCheck(cudaMallocManaged(&b, n * sizeof(TYPE)), __LINE__);
x_hepta = (TYPE*) malloc(n * sizeof(TYPE));
cudaCheck(cudaMallocManaged(&x_hepta_cuda, n * sizeof(TYPE)), __LINE__);

init(A_hepta, b, x_hepta, x_hepta_cuda, n);

int deviceID;
cudaGetDevice(&deviceID);

#ifdef PREFETCHING
cudaCheck(cudaMemPrefetchAsync(A_hepta, n * 7 * sizeof(TYPE), deviceID));
cudaCheck(cudaMemPrefetchAsync(b, n * sizeof(TYPE), deviceID));
cudaCheck(cudaMemPrefetchAsync(x_hepta_cuda, n * sizeof(TYPE), deviceID));
#endif

struct timeval start, end;
nvtxRangePushA("bicgstab_hepta");
gettimeofday(&start, NULL);
bicgstab_hepta(A_hepta, b, x_hepta, n);
gettimeofday(&end, NULL);
double elapsedTime = (end.tv_sec - start.tv_sec) * 1000.0; // sec to ms
elapsedTime += (end.tv_usec - start.tv_usec) / 1000.0; // us to ms
nvtxRangePop();
printf("CPU Execution time: %.6f ms\n\n", elapsedTime);

cudaEvent_t cuda_start, cuda_stop;
cudaEventCreate(&cuda_start);
cudaEventCreate(&cuda_stop);
float milliseconds = 0;

nvtxRangePushA("bicgstab_hepta_cudastreams");
cudaEventRecord(cuda_start);
bicgstab_hepta_cudastreams(A_hepta, b, x_hepta_cuda, n);
cudaCheck(cudaDeviceSynchronize(), __LINE__);
cudaEventRecord(cuda_stop);
cudaEventSynchronize(cuda_stop);
milliseconds = 0;
cudaEventElapsedTime(&milliseconds, cuda_start, cuda_stop);
nvtxRangePop();
printf("GPU Streams Execution time: %.6f ms\n", milliseconds);

check_result(x_hepta, x_hepta_cuda, n);

cudaEventDestroy(cuda_start);
cudaEventDestroy(cuda_stop);


free(x_hepta);

cudaFree(A_hepta);
cudaFree(b);
cudaFree(x_hepta_cuda);

return 0;

}

Code file also attached.
bicgstab_minimal.zip (3.5 KB)

Please format code correctly when posting here. A simple approach:

  1. Edit your post by clicking on the pencil icon below the post.
  2. Select the code
  3. click the </> button at the top of the post edit window
  4. save your changes

Please do that now. I’ll have more to say after you clean up your posting.