CUDA indexing issues

Hi all,

I want to start by saying that I am new to computing with CUDA, so if I am overlooking something, I apologize.

I am working on a personal project that is computation heavy, and I’m finding that CPU-generated results and GPU-generated results can disagree to a certain extent. Specifically, element-wise matrix multiplication has been problematic for me. In my tests, all the intermediate results up to the element-wise matrix multiplication were identical. I even switched to cuPy to make sure it wasn’t my mistake, but the cuPy implementation actually yields the same results and issues.

Just speculating, I imagine it has to do with stressing the GPU’s resources or some odd indexing error, but I’d really like to understand this better because it is hindering my (and likely many others’) progress.

If there are any other details I can provide to help, please let me know!

It’s fairly common for the GPU to not generate bitwise identical calculation results compared to the CPU (or another platform). You can find numerous questions of this type, and there is no single reason that explains them all.

It’s generally not based on “stress” or indexing errors. If you have an indexing error, that is most probably a bug in your code.


If this is floating-point computation in which small numerical differences are observed between multiple platforms, I would suggest reading NVIDIA’s whitepaper, followed by Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd edition, SIAM 2002.
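To illustrate the kind of effect the whitepaper covers (a minimal NumPy-only sketch with made-up data, not your computation): the same floating-point reduction evaluated in two different orders usually agrees to within tolerance but need not be bitwise identical, which is exactly the situation when comparing CPU and GPU results.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.sum(a)                                    # one accumulation order
s2 = np.sum(a.reshape(1000, 1000), axis=0).sum()  # a different accumulation order

print(s1 == s2)            # frequently False: not bitwise identical
print(np.isclose(s1, s2))  # True: equal to within a small tolerance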

Unless the indexing is based on floating-point computation, the source of indexing errors is highly likely to be the code’s author, with off-by-one errors a very common scenario.
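As a concrete illustration of that scenario (hypothetical code, not yours), an off-by-one in the bounds guard of a Numba kernel lets one extra thread per dimension touch memory past the end of the array:

from numba import cuda

@cuda.jit
def scale_in_place(a, n):
    i = cuda.grid(1)
    if i < n:        # correct guard: strictly less than the element count
        a[i] *= 2.0
    # A common off-by-one bug is writing `if i <= n:` instead, which lets
    # the thread with i == n read and write out of bounds.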

I evaluated both methods by checking all intermediate results with NumPy’s allclose() function to confirm that they were similar. Only the element-wise product is causing issues for me - I mean the result of doing:

mat3 = mat1 * mat2

differs radically between cuPy and NumPy, even though the inputs are identical:

mat1_cuPy, mat2_cuPy = mat1_NumPy, mat2_NumPy
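As a sanity check (a minimal sketch with illustrative random data, not my actual arrays), the element-wise product itself should agree between NumPy and cuPy whenever the inputs really are identical:

import numpy as np
import cupy as cp

mat1_NumPy = np.random.default_rng(0).standard_normal((4, 3, 2))
mat2_NumPy = np.random.default_rng(1).standard_normal((4, 3, 2))

mat1_cuPy = cp.asarray(mat1_NumPy)   # identical inputs copied to the device
mat2_cuPy = cp.asarray(mat2_NumPy)

mat3_NumPy = mat1_NumPy * mat2_NumPy
mat3_cuPy = mat1_cuPy * mat2_cuPy

print(np.allclose(mat3_NumPy, cp.asnumpy(mat3_cuPy)))  # expected: True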

This isn’t really the place to discuss numpy, and isn’t really a focus area for cupy either.

If you have a CUDA C++ complete example that demonstrates the issue you are having, I think there is a good chance someone would be able to offer advice.

Does it have to be CUDA C++? I can give a detailed explanation of my implementation in Numba, which I believe is built on CUDA.

My problem isn’t the floating-point computation - that part is actually performed correctly. My problem is that during the computation, the wrong value is being accessed.

For context, my code was originally something like:

@cuda.jit
def elementwise_matrix_multiplication_3d(A_d,B_d,C_d, N,M,K):
   thidx = ...
   thidy = ...
   thidz = ...
   if thidx<N and thidy<M and thidz < K:
      C_d[thidx,thidy,thidz] = A_d[thidx,thidy,thidz]*B_d[thidx,thidy,thidz]
   cuda.syncthreads()
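The elided index computations are the usual global-thread-index pattern (blockIdx * blockDim + threadIdx in each dimension). For completeness, here is a self-contained sketch of this kind of kernel, with illustrative names and sizes (not my actual code or data) and a launch configuration computed to cover all three extents:

import numpy as np
from numba import cuda

@cuda.jit
def elementwise_product_3d(A, B, C, N, M, K):
    # Global thread indices; equivalent to blockIdx * blockDim + threadIdx per axis.
    i, j, k = cuda.grid(3)
    if i < N and j < M and k < K:
        C[i, j, k] = A[i, j, k] * B[i, j, k]

N, M, K = 64, 16, 8                                    # illustrative sizes
A = np.arange(N * M * K, dtype=np.float64).reshape(N, M, K)
B = 2.0 * A + 1.0
C = np.zeros_like(A)

threads = (8, 8, 4)
blocks = tuple((d + t - 1) // t for d, t in zip((N, M, K), threads))
elementwise_product_3d[blocks, threads](A, B, C, N, M, K)  # implicit host<->device copies
print(np.allclose(C, A * B))                               # expected: True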

To debug my kernel, I just added print statements, e.g.:

...
print( ... , 'B_d(', thidx, thidy, thidz, '): ', B_d[thidx,thidy,thidz], ...)

It actually prints the wrong values for the second matrix, but it prints the correct values for the first matrix.

The only reason I mention cuPy is that I expected it, as a reliable, well-tested API, to produce the correct output - but it actually generates the same output as my Numba implementation.

If you can provide a complete example using numba, that may help others to help you.

Your example should be something others can run without adding anything or changing anything. Your example should make evident the numbers you are comparing to see the difference between CPU and GPU results. That would normally mean a complete example that shows the calculation on the CPU, the calculation on the GPU, and a printout of the results you are comparing and expecting to be equal.


(1) Passed the wrong matrix?
(2) Passed a sub-matrix incorrectly?
(3) Unintentional aliasing of matrices?
(4) Inadvertent overwriting of indexing information?
(5) Matrix allocation failed (i.e., function operating on uninitialized data)?
(6) Incorrect dimension(s) used in matrix allocation, leading to out-of-bounds access?

To debug this, I would make the matrices very small (e.g. 3x3), and print the entire matrix directly prior to calling the function and as first action inside the function. Obviously they should be identical. If that works, gradually increase the size of the matrices until failures are observed.
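A rough sketch of that approach (illustrative code, not yours): print the tiny matrix on the host immediately before the launch, then echo what each thread actually sees as the first action inside the kernel.

import numpy as np
from numba import cuda

@cuda.jit
def echo_and_multiply(A, B, C, n):
    i, j = cuda.grid(2)
    if i < n and j < n:
        print('device B[', i, j, '] =', B[i, j])   # first action: echo the input
        C[i, j] = A[i, j] * B[i, j]

n = 3                                    # start very small, e.g. 3x3
A = np.arange(n * n, dtype=np.float64).reshape(n, n)
B = A + 10.0
C = np.zeros_like(A)

print('host B:')
print(B)                                 # what the kernel is expected to see
echo_and_multiply[(1, 1), (n, n)](A, B, C, n)
cuda.synchronize()                       # flush device-side print output
print(np.allclose(C, A * B))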

The fact that you are observing the same behavior with Numba and cuPy indicates that the problem is an issue in your code.


The following is a script I put together for anyone to follow. The dependencies are NumPy, cuPy and Numba.

import numpy as np
import cupy as cp
from numba import cuda
import math
@cuda.jit
def elementwise_matrix_multiplication_3D(A_d,B_d,C_d,N,M,K):
    '''
    INPUTS:
        A_d : np array, shape (N,M,K) matrix.
        B_d : np array, shape (N,M,K) matrix.
        N : int, number of rows.
        M : int, number of columns.
        K : int, size of the depth (third) dimension.
    OUTPUTS:
        C_d : np array, shape (N,M,K) matrix.
    DETAILS:
        GPU Accelerated elementwise matrix multiplication.
    '''
    thidx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    thidy = cuda.blockDim.y*cuda.blockIdx.y + cuda.threadIdx.y
    thidz = cuda.blockDim.z*cuda.blockIdx.z + cuda.threadIdx.z
    if thidx < N and thidy < M and thidz < K:
        prod = A_d[thidx,thidy,thidz]*B_d[thidx,thidy,thidz]
        #print('A:(',thidx,thidy,thidz,'):',A_d[thidx,thidy,thidz],'B:(',thidx,thidy,thidz,'):',B_d[thidx,thidy,thidz],'Prod:(',thidx,thidy,thidz,'):',prod)
        #cuda.atomic.add(C_d,(thidx,thidy,thidz),prod)
        C_d[thidx,thidy,thidz] = prod
    cuda.syncthreads()
@cuda.jit
def FWF_FeatureMap_DEBUG(X_d,X_vec_d,X_e,X_p,X_n,x_d,N,L,Order,sigma,factorial_list_d):
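    # For each thread (i, j, k) with i < N, j < L, k < Order, this kernel
    # evaluates exp(-x[i,j]**2 / (2*sigma**2)) * x[i,j]**k / (sigma**k * sqrt(k!))
    # and stores it in X_d (shape (N,L,Order)) and X_vec_d (shape (N, L*Order)),
    # along with the individual factors in X_e, X_p and X_n for debugging.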
    thidx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    thidy = cuda.blockDim.y*cuda.blockIdx.y + cuda.threadIdx.y
    thidz = cuda.blockDim.z*cuda.blockIdx.z + cuda.threadIdx.z
    if thidx < N and thidy < L and thidz < Order:
        exp_part = math.exp((-x_d[thidx, thidy]**2)/(2*sigma**2))
        pow_part = x_d[thidx, thidy]**thidz
        norm_part = (sigma**thidz) * math.sqrt(factorial_list_d[thidz])
        X_e[thidx,thidy] = exp_part
        X_p[thidx,thidy,thidz] = pow_part
        X_n[thidz] = norm_part
        X_d[thidx,thidy,thidz] = exp_part * pow_part / norm_part
        X_vec_d[thidx,Order*thidy+thidz] = exp_part * pow_part / norm_part
    cuda.syncthreads()
def FWF_ACC_CPUOnly_DEBUG(x,d,xtest,Order,sigma,rcond=1e-15):
    '''
    x: np array, shape (N,L), input time-series data.
    d: np array, shape (N,), output time-series data.
    xtest: np array, shape (Ntest,L), input test time-series data.
    dtest: np array, shape (Ntest,), output test time-series data.
    Order: int, maximum order considered.
    sigma: float, kernel spread.
    rcond: float, regularization constant.

    '''
    #region Define the constants
    N,L = x.shape
    Ntest = xtest.shape[0]
    O = np.arange(Order)
    sigma_pow = np.power(sigma,O)
    factorial_sqrt = np.sqrt(np.array([math.factorial(o) for o in O],dtype=float))
    #endregion
    #region Vectorized feature map
    #print(x.shape)
    x_e = x[:,:,np.newaxis]
    #print(x_e.shape)
    xtest_e = xtest[:,:,np.newaxis]
    X_e = np.exp(-x_e**2/(2*sigma**2))
    X_p = x_e**O
    X_n = sigma_pow*factorial_sqrt
    X = (X_e * X_p/X_n).reshape(N,L,Order)
    X_vec = (np.exp(-x_e**2/(2*sigma**2)) * (x_e**O)/(sigma_pow*factorial_sqrt)).reshape(N,-1)
    Xtest = (np.exp(-xtest_e**2/(2*sigma**2)) * (xtest_e**O)/(sigma_pow*factorial_sqrt)).reshape(Ntest,L,Order)
    Xtest_vec = (np.exp(-xtest_e**2/(2*sigma**2)) * (xtest_e**O)/(sigma_pow*factorial_sqrt)).reshape(Ntest,-1)
    D = np.repeat(d[:, np.newaxis], Order, axis=1)
    D_e = np.repeat(D[:, np.newaxis, :], L, axis=1)
    #endregion
    #region Calculate predictions
    V = (X_vec.T@X_vec)/N
    Vinv = np.linalg.pinv(V,rcond=rcond)
    Ptemp = X*D_e
    P = np.mean(Ptemp,axis=0).flatten()
    W = Vinv@P
    Y = X_vec@W
    Ytest = Xtest_vec@W
    #endregion
    return Y,Ytest,X,X_vec,X_e,X_p,X_n,D_e,V,Ptemp,P,W
def FWF_ACC_DEBUG(x,d,xtest,Order,sigma,rcond=1e-15):
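    # GPU counterpart of FWF_ACC_CPUOnly_DEBUG: builds the feature map with the
    # FWF_FeatureMap_DEBUG Numba kernel and does the linear algebra with cuPy,
    # returning the same set of intermediate results for comparison.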
    #region Define the constants
    N = x.shape[0]
    L = x.shape[1]
    Nt = xtest.shape[0]
    OL = Order*L
    #endregion
    #region Allocate the host memory
    f_list = np.array([math.factorial(i) for i in range(Order)],dtype=np.float64)
    #endregion
    #region Allocate the device memory
    X_d = cp.zeros((N,L,Order),dtype=cp.float64)
    X_vec_d = cp.zeros((N,OL),dtype=cp.float64)
    X_e_d = cp.zeros((N,L),dtype=cp.float64)
    X_p_d = cp.zeros((N,L,Order),dtype=cp.float64)
    X_n_d = cp.zeros((Order,),dtype=cp.float64)
    Xt_d = cp.zeros((Nt,L,Order),dtype=cp.float64)
    Xt_vec_d = cp.zeros((Nt,OL),dtype=cp.float64)
    Xt_e_d = cp.zeros((N,L),dtype=cp.float64)
    Xt_p_d = cp.zeros((N,L,Order),dtype=cp.float64)
    Xt_n_d = cp.zeros((Order,),dtype=cp.float64)
    #endregion
    #region Copy the data to the device
    x_d = cp.asarray(x)
    xt_d = cp.asarray(xtest)
    d_d = cp.asarray(d)
    f_list_d = cp.asarray(f_list)
    #endregion
    #region Initialize the device memory
    D_d = cp.repeat(d_d[:, cp.newaxis], Order, axis=1)
    D_e_d = cp.repeat(D_d[:, cp.newaxis, :], L, axis=1)
    D = cp.asnumpy(D_e_d)
    #endregion
    #region Create the feature map for the training data
    FWF_FeatureMap_DEBUG[(N//32+1,L//8+1,Order//4+1),(32,8,4)](X_d,X_vec_d,X_e_d,X_p_d,X_n_d,x_d,N,L,Order,sigma,f_list_d)
    FWF_FeatureMap_DEBUG[(Nt//32+1,L//8+1,Order//4+1),(32,8,4)](Xt_d,Xt_vec_d,Xt_e_d,Xt_p_d,Xt_n_d,xt_d,Nt,L,Order,sigma,f_list_d)
    X = cp.asnumpy(X_d)
    X_vec = cp.asnumpy(X_vec_d)
    X_e = cp.asnumpy(X_e_d)
    X_p = cp.asnumpy(X_p_d)
    X_n = cp.asnumpy(X_n_d)
    #endregion
    #region Calculate the covariance matrix
    V_d = X_vec_d.T@X_vec_d/N
    #endregion
    #region Get the inverse of the covariance matrix
    V = cp.asnumpy(V_d)
    Vinv_d = cp.linalg.pinv(V_d,rcond=rcond)
    #endregion
    #region Calculate the projection matrix
    #Calculate the projection matrix
    Ptemp_d = X_d*D_e_d
    Ptemp = cp.asnumpy(Ptemp_d)
    #Calculate the column mean
    P_d = cp.mean(Ptemp_d,axis=0).flatten()
    P = cp.asnumpy(P_d)
    #endregion
    #region Calculate the weights
    W_d = Vinv_d@P_d
    W = cp.asnumpy(W_d)  # copy weights back to the host, like the other outputs
    #endregion
    #region Calculate the predictions
    Y_d = X_vec_d@W_d
    Yt_d = Xt_vec_d@W_d
    Y = cp.asnumpy(Y_d)
    Yt = cp.asnumpy(Yt_d)
    #endregion
    #region Free the memory
    del X_d,X_vec_d,Xt_d,Xt_vec_d,V_d,Vinv_d,D_d,Ptemp_d,P_d,W_d,Y_d,Yt_d,f_list_d
    #endregion
    return Y,Yt,X,X_vec,X_e,X_p,X_n,D,V,Ptemp,P,W
if __name__ == '__main__':

    
    #Load the x's & d's
    x = np.loadtxt('x.txt')
    d = np.loadtxt('d.txt')
    xtest = np.loadtxt('xtest.txt')
    test_N,test_L = x.shape
    test_Order = 8
    test_sigma = 0.125
    test_rcond = 2**(-25)

    pTrain_FWF,pTest_FWF,X_FWF,X_vec_FWF,X_e_FWF,X_p_FWF,X_n_FWF,D_FWF,V_FWF,Ptemp_FWF,P_FWF,W_FWF=FWF_ACC_CPUOnly_DEBUG(x,d,xtest,test_Order,test_sigma,test_rcond)
    pTrain_FWFE,pTest_FWFE,X_FWFE,X_vec_FWFE,X_e_FWFE,X_p_FWFE,X_n_FWFE,D_FWFE,V_FWFE,Ptemp_FWFE,P_FWFE,W_FWFE=FWF_ACC_DEBUG(x,d,xtest,test_Order,test_sigma,test_rcond)

    #region checking elementwise matrix multiplication
    
    test_mat = np.arange(test_N*test_L*test_Order).reshape(test_N,test_L,test_Order)
    test_mat2 = (np.arange(test_N*test_L*test_Order)*2+1).reshape(test_N,test_L,test_Order)
    test_mult = test_mat*test_mat2

    test_mat_d = cuda.to_device(test_mat)
    test_mat2_d = cuda.to_device(test_mat2)

    test_mult_d = cuda.device_array((test_N,test_L,test_Order),dtype=np.float64)
    elementwise_matrix_multiplication_3D[(2,1,1),(16,8,8)](test_mat_d,test_mat2_d,test_mult_d,test_N,test_L,test_Order)
    test_mult_device = test_mult_d.copy_to_host()

    print('Synthetic Test Matrix Multiplication')
    print(test_mult.shape)
    print(np.allclose(test_mult,test_mult_device))

    test_Ptemp_d = cp.zeros((test_N,test_L,test_Order),dtype=cp.float64)
    test_X_d = cp.array(X_FWFE)
    test_D_d = cp.array(D_FWFE)
    elementwise_matrix_multiplication_3D[(2,1,1),(16,8,8)](test_X_d,test_D_d,test_Ptemp_d,test_N,test_L,test_Order)
    test_Ptemp = cp.asnumpy(test_Ptemp_d)
    print('Compare with Function Output')
    print(np.allclose(Ptemp_FWF,Ptemp_FWFE))
    print('Compare with Externally Computed Output')
    print(np.allclose(Ptemp_FWF,test_Ptemp))
    #endregion


Below are the relevant data files needed to actually run the test.
xtest.txt (59.5 KB)
d.txt (812 Bytes)
x.txt (6.3 KB)

Previously, you had said this:

My suggestion would be to pare down your test case until it just does that. Just the calculation that you think is incorrect. And I would concur with njuffa: try demonstrating the problem e.g. with a 3x3 matrix.

I edited my code above. In my new test I actually separated the tasks a bit:

FWF_ACC_CPUOnly_DEBUG() is a function that generates all the intermediate outputs on the CPU sequentially.

FWF_ACC_DEBUG() is a function that generates all the intermediate outputs on the GPU sequentially.

In the second part of the test script, after generating the intermediate outputs, I used outputs that matched before applying my elementwise_matrix_multiplication_3d() function, applied the function to them, and got a different (correct) result.

As a matter of fact, I separated the initialization and computation tasks entirely - I changed the code so that all memory initialization was handled only by the CPU and the computations were handled by the GPU. Once I did this, my code works.

Why would this happen?