CUBLAS + Kernels + double precision + "for loop" = strange behavior

JeremiahPalmer · August 10, 2009, 8:54pm

Hello!

I have been working on some code and have run into strange behavior. I have whittled it down to the fairly basic program below, which still demonstrates the problem. The algorithm is simple: first, allocate space on the CPU (v_h) and GPU (v_d) and set both double-precision arrays to zero. Next, set the values of v_d to a non-zero value (123.456) and copy v_d to v_h. Finally, check that every value in v_h has been set to the correct non-zero value. What I find is that when I run this algorithm inside a “for loop” that increases the length of the arrays, sometimes not all of the values in v_h are set to the non-zero value.

The error occurs at random iterations, and at random entries within those iterations.

I am using cublasDscal to zero out the v_d entries. When I use my own kernel to zero out the v_d entries, I have no problem. Also, strangely, when I use float instead of double (and use cublasSscal) I also have no problem. Here is my code:

[codebox]#define MAIN

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <cublas.h>

#define START_SIZE 10
#define STOP_SIZE 40000
#define BLOCK_SIZE 4
#define my_type double

// Scale the first len entries of x by alpha (one thread per entry).
__global__ void myscal(int len, my_type alpha, my_type *x)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < len){ x[j] = x[j]*alpha; }
  __syncthreads();
}

// Overwrite the first len entries of a with a known non-zero value.
__global__ void make_different(my_type *a, int len)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < len){ a[j] = 123.456f; }
  __syncthreads();
}

// Abort with a message if a CUDA runtime call failed.
void cuda_safe(cudaError_t cuda_err, const char *message){
  if(cuda_err != cudaSuccess) {
    printf("ERROR: %s : %s\n",message,cudaGetErrorString(cuda_err));
    exit(EXIT_FAILURE);
  }
}

// Abort with a message if a CUBLAS call failed.
void cublas_safe(cublasStatus cublas_err, const char *message){
  if(cublas_err != CUBLAS_STATUS_SUCCESS) {
    printf("ERROR: %s : %i\n",message,cublas_err);
    exit(EXIT_FAILURE);
  }
}

// Report every entry of v that does not hold the expected value.
void check_vec(my_type *v, int len, bool &success)
{
  for (int i=0; i<len; i++){
    if (v[i] != 123.456f){
      success = false;
      printf("len=%d, FAIL at i=%d, value=%f (should be 123.456)\n",len,i,v[i]);
    }
  }
}

int main(void)
{
  int i, len, n_blocks;
  bool use_myscal, success = true;
  size_t size_v;
  my_type *v_d, *v_h;

  //------------------------------------------------------------------------//
  // Switch the following bool false <-> true to observe problem
  use_myscal = false;
  //------------------------------------------------------------------------//

  cublas_safe(cublasInit(), "cublasInit");
  cuda_safe(cudaMalloc((void **) &v_d, STOP_SIZE*sizeof(my_type)), "cudaMalloc");
  v_h = (my_type *)malloc(STOP_SIZE*sizeof(my_type));

  for (len = START_SIZE; len <= STOP_SIZE; len++){
    n_blocks = len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0:1);
    size_v = len * sizeof(my_type);

    // Set v_d to zero using either cublas's or my dscal
    if (use_myscal){
      myscal<<<n_blocks,BLOCK_SIZE>>>(len, 0.0, v_d);
    }else{
      cublasDscal(len, 0.0, v_d, 1);
      cublas_safe(cublasGetError(), "cublasDscal");
    }

    // Set v_h to zero
    for (i=0; i<len; i++){ v_h[i] = 0.0; }

    // Set the values of v_d to something non-zero
    make_different<<<n_blocks,BLOCK_SIZE>>>(v_d, len);

    // Copy from device to host: v_d -> v_h
    cuda_safe(cudaMemcpy(v_h, v_d, size_v,cudaMemcpyDeviceToHost),"cudaMemcpy");

    // Check values of v_h on host
    check_vec(v_h, len, success);
  }

  free(v_h);
  cuda_safe(cudaFree(v_d), "cudaFree");
  cublas_safe(cublasShutdown(), "cublasShutdown");

  if (success){ printf("The program completed successfully.\n"); }
  else{ printf("The program completed unsuccessfully.\n"); }
}
[/codebox]

Note that if use_myscal is true, the code will use my kernel to zero out the entries in v_d, otherwise it will use cublasDscal.
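For readers following the launch configuration: the n_blocks expression in the loop is a ceiling division, so the last block may be only partly used and the `j < len` guard in each kernel covers the remainder. The host-only sketch below restates that arithmetic. The names `blocks_for`, `block_of`, and `kBlockSize` are hypothetical helpers introduced here for illustration and are not part of the posted program.

```cpp
#include <cassert>

// Same value as BLOCK_SIZE in the posted listing (assumption: 4 threads per block).
constexpr int kBlockSize = 4;

// Hypothetical helper: smallest block count with n_blocks * block_size >= len.
// Equivalent to len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0 : 1) for len >= 0.
int blocks_for(int len, int block_size = kBlockSize) {
    assert(len >= 0 && block_size > 0);
    return (len + block_size - 1) / block_size;
}

// Hypothetical helper: index of the block whose threads handle global index i.
int block_of(int i, int block_size = kBlockSize) {
    assert(i >= 0 && block_size > 0);
    return i / block_size;
}
```

Calling `blocks_for(len)` and `block_of(i)` on a logged failure shows where in the grid it sits. For instance, with len=12042 and i=12039 this gives block 3009 out of 3011 launched blocks, which may help when comparing failures across runs.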

Here is an example of the output I get:

[codebox]
len=10303, FAIL at i=10295, value=0.000000 (should be 123.456)
len=12039, FAIL at i=12035, value=0.000000 (should be 123.456)
len=12042, FAIL at i=12038, value=0.000000 (should be 123.456)
len=12042, FAIL at i=12039, value=0.000000 (should be 123.456)
len=16126, FAIL at i=16122, value=0.000000 (should be 123.456)
len=16126, FAIL at i=16123, value=0.000000 (should be 123.456)
len=21585, FAIL at i=21578, value=0.000000 (should be 123.456)
len=21585, FAIL at i=21579, value=0.000000 (should be 123.456)
len=22266, FAIL at i=22258, value=0.000000 (should be 123.456)
len=22266, FAIL at i=22259, value=0.000000 (should be 123.456)
len=25927, FAIL at i=25923, value=0.000000 (should be 123.456)
len=26149, FAIL at i=26146, value=0.000000 (should be 123.456)
len=26149, FAIL at i=26147, value=0.000000 (should be 123.456)
len=26853, FAIL at i=26850, value=0.000000 (should be 123.456)
len=26853, FAIL at i=26851, value=0.000000 (should be 123.456)
len=26855, FAIL at i=26851, value=0.000000 (should be 123.456)
len=26862, FAIL at i=26858, value=0.000000 (should be 123.456)
len=26862, FAIL at i=26859, value=0.000000 (should be 123.456)
len=27990, FAIL at i=27982, value=0.000000 (should be 123.456)
len=27990, FAIL at i=27983, value=0.000000 (should be 123.456)
len=32847, FAIL at i=32843, value=0.000000 (should be 123.456)
len=36434, FAIL at i=36418, value=0.000000 (should be 123.456)
len=36434, FAIL at i=36419, value=0.000000 (should be 123.456)
[/codebox]

Note that this output is random: if I run the code again, I get different output.

Does anyone have any ideas on what the issue is? It seems like it’s a synchronization issue. Could it be a problem within cublasDscal?

Thanks!

tmurray · August 10, 2009, 9:08pm

are you compiling with -arch sm_13?

JeremiahPalmer · August 10, 2009, 9:17pm

Yes, I am.

JeremiahPalmer · August 12, 2009, 2:31pm

Does anyone have any ideas on this bug? Over 99% of the time, all of the copies are performed correctly.

LSChien · August 12, 2009, 5:01pm

I copied your code and tested it on my machine. It runs fine, even with cublasDscal().

my platform: vc2005, cuda 2.3, driver: 190.38, GTX295

What version of the CUDA toolkit are you using?

JeremiahPalmer · August 12, 2009, 6:45pm

my platform: gcc4.1.2, cuda 2.2, quadro fx 5800, Linux2.6.18-128.el5

LSChien · August 13, 2009, 2:25pm

I tested your code on another machine.

platform: fedora 10 x86_64, gcc 4.3.2, cuda 2.2, driver 185.18.14, GTX260

The code is OK for both “float” and “double”

(of course, you must pass -arch sm_13 when compiling for “double”)

I cannot explain your error

JeremiahPalmer · August 13, 2009, 2:44pm

Thanks for checking. I don’t understand my error either. Perhaps I should submit this in a bug report?

janw · August 13, 2009, 6:13pm

I don’t currently have a card plugged in to test your code, but a cudaThreadSynchronize() after each kernel invocation could help…

JeremiahPalmer · August 13, 2009, 6:17pm

Thanks for the suggestion. I tried your idea and it didn’t fix the problem. Besides, I have __syncthreads at the end of my kernels.

tmurray · August 13, 2009, 7:01pm

does this bug repro with 2.3 and 190.xx?

JeremiahPalmer · August 13, 2009, 8:59pm

I installed cuda 2.3 and re-ran this problem - NO ERRORS!

Thanks for all of the input and ideas. I can finally move on with my project.

:D

JeremiahPalmer · September 23, 2009, 3:41pm

Well, apparently, the problem only goes away briefly after a reboot. I added a lot more error handling to my code to track down the issue, but I still can’t find it. Does anyone have any ideas? It seems like a synchronization issue, but I have cudaThreadSynchronize() all over the place. Here’s the modified code:

[codebox]#define MAIN

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <cublas.h>
#include "my_cublas.h"

#define START_SIZE 10
#define STOP_SIZE 40000
#define BLOCK_SIZE 4
#define my_type double

// Scale the first len entries of x by alpha (one thread per entry).
__global__ void myscal(int len, my_type alpha, my_type *x)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < len){ x[j] = x[j]*alpha; }
}

// Overwrite the first len entries of a with a known non-zero value.
__global__ void make_different(my_type *a, int len)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < len){ a[j] = 123.456; }
}

// Overwrite the first len entries of a with zero.
__global__ void make_zero(my_type *a, int len)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < len){ a[j] = 0.0; }
}

// Abort with a message if a CUDA runtime call failed.
void cuda_safe(cudaError_t cuda_err, const char *message){
  if(cuda_err != cudaSuccess) {
    printf("ERROR: %s : %s\n",message,cudaGetErrorString(cuda_err));
    exit(EXIT_FAILURE);
  }
}

// Abort with a message if a CUBLAS call failed.
void cublas_safe(cublasStatus cublas_err, const char *message){
  if(cublas_err != CUBLAS_STATUS_SUCCESS) {
    printf("ERROR: %s : %i\n",message,cublas_err);
    exit(EXIT_FAILURE);
  }
}

// Report every entry of v that does not hold the expected value.
void check_vec(my_type *v, int len, bool &success)
{
  int i;
  for (i=0; i<len; i++){
    if (v[i] != 123.456){
      success = false;
      printf("len=%d, FAIL at i=%d, value=%f (should be 123.456)\n",len,i,v[i]);
    }
  }
}

int main(void)
{
  int i, n_blocks;
  bool use_myscal, success = true;
  size_t size_v;
  my_type *v_d, *v_h;

  //------------------------------------------------------------------------//
  // Switch the following bool false <-> true to observe problem
  use_myscal = false;
  //------------------------------------------------------------------------//

  cublas_safe(cublasInit(), "cublasInit");
  cuda_safe(cudaMalloc((void **) &v_d, STOP_SIZE*sizeof(my_type)), "cudaMalloc");
  v_h = (my_type *)malloc(STOP_SIZE*sizeof(my_type));

  // Zero the whole device buffer once before the loop
  cuda_safe(cudaGetLastError(), "error before make_zero");
  make_zero<<<STOP_SIZE, 1>>>(v_d, STOP_SIZE);
  cuda_safe(cudaThreadSynchronize(), "error with make_zero, sync");
  cuda_safe(cudaGetLastError(), "error after make_zero");

  for (int len = START_SIZE; len <= STOP_SIZE; len++){
    n_blocks = len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0:1);
    size_v = len * sizeof(my_type);

    // Set v_d to zero using either cublas's or my dscal
    cuda_safe(cudaThreadSynchronize(), "sync error 1");
    if (use_myscal){
      cuda_safe(cudaGetLastError(), "error before myscal");
      myscal<<<n_blocks,BLOCK_SIZE>>>(len, 0.0, v_d);
      cuda_safe(cudaThreadSynchronize(), "error with myscal, sync");
      cuda_safe(cudaGetLastError(), "error after myscal");
    }else{
      //cublasSscal(len, 0.0, v_d, 1);
      //cublas_safe(cublasGetError(), "cublasSscal");
      cublasDscal(len, 0.0, v_d, 1);
      cublas_safe(cublasGetError(), "cublasDscal");
    }

    // Set v_h to zero
    for (i=0; i<len; i++){ v_h[i] = 0.0; }
    cuda_safe(cudaThreadSynchronize(), "sync error 2");

    // Set the values of v_d to something non-zero
    cuda_safe(cudaGetLastError(), "error before make_different");
    make_different<<<n_blocks,BLOCK_SIZE>>>(v_d, len);
    cuda_safe(cudaThreadSynchronize(), "error with make_different, sync");
    cuda_safe(cudaGetLastError(), "error after make_different");

    // Copy from device to host: v_d -> v_h
    cuda_safe(cudaMemcpy(v_h, v_d, size_v,cudaMemcpyDeviceToHost),"cudaMemcpy");
    cuda_safe(cudaThreadSynchronize(), "sync error 3");

    // Check values of v_h on host
    check_vec(v_h, len, success);
    cuda_safe(cudaThreadSynchronize(), "sync error 4");
  }

  free(v_h);
  cuda_safe(cudaFree(v_d), "cudaFree");
  cublas_safe(cublasShutdown(), "cublasShutdown");

  if (success){ printf("The program completed successfully.\n"); }
  else{ printf("The program completed unsuccessfully.\n"); }
}
[/codebox]

Again, the make_different kernel seems to skip doing its job RANDOMLY, for RANDOM entries. Here is a collection of sample output from running the code several times in succession:

[codebox]lady 37% ./test_dsytrd_cuda
len=37695, FAIL at i=37691, value=0.000000 (should be 123.456)
The program completed unsuccessfully.

lady 38% ./test_dsytrd_cuda
The program completed successfully.

lady 39% ./test_dsytrd_cuda
len=34874, FAIL at i=34870, value=0.000000 (should be 123.456)
len=34874, FAIL at i=34871, value=0.000000 (should be 123.456)
The program completed unsuccessfully.

lady 40% ./test_dsytrd_cuda
len=25909, FAIL at i=25906, value=0.000000 (should be 123.456)
len=25909, FAIL at i=25907, value=0.000000 (should be 123.456)
The program completed unsuccessfully.

lady 41% ./test_dsytrd_cuda
len=37938, FAIL at i=37922, value=0.000000 (should be 123.456)
len=37938, FAIL at i=37923, value=0.000000 (should be 123.456)
The program completed unsuccessfully.
[/codebox]

“Features” of this bug - it goes away when using float instead of double; it goes away if I use my own kernel to scale the vector instead of cublasDscal.
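A side note for anyone toggling my_type between float and double to reproduce this: the literal used in the host-side check has to be rounded the same way as the stored value, otherwise every entry is reported as a failure for reasons unrelated to the bug. The host-only sketch below illustrates the difference. `matches_expected` and `matches_double_literal` are hypothetical stand-ins for the comparison inside check_vec, with the template parameter `T` playing the role of my_type.

```cpp
#include <cassert>

// Hypothetical stand-in for check_vec's test, made type-consistent:
// the literal is converted to T first, so both sides carry the same rounding.
template <typename T>
bool matches_expected(T stored) {
    return stored == static_cast<T>(123.456);
}

// The comparison against a bare double literal: a float value is promoted
// to double, and the float rounding of 123.456 differs from the double
// rounding, so this is false whenever T is float.
template <typename T>
bool matches_double_literal(T stored) {
    return stored == 123.456;
}
```

With `T = double` both helpers agree. With `T = float`, `matches_double_literal(123.456f)` is false while `matches_expected(123.456f)` is true, so a check written as `v[i] != (my_type)123.456` is one way to keep the test valid for either precision.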

Help me, Obi-Wans, you’re my only hope!!!

jam11 · September 23, 2009, 11:02pm

No error here if I remove:
#include "my_cublas.h"

nvcc -arch=sm_13 -cuda jeremiahplamer.cu -I(your_path)/cuda/include
gcc -o cubb jeremiahplamer.cu.cpp -L(your_path)/cuda/lib64 -lcudart -lcublas

for both double and float.

JeremiahPalmer · September 24, 2009, 3:06pm

Correct, I had forgotten to remove “my_cublas” - this is just a wrapper around cublas.

Thanks for checking my code, jam11. Usually, the problem doesn’t manifest itself until the code has been run many, many times. Could you try running it again, say, 1000 times?

Thanks,

Jeremiah
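Repeating the test many times can be automated with a small host-side driver. The sketch below is one way to do it and rests on an assumption: the test binary must return a non-zero exit status when a check fails, which the program as posted does not do (its main always falls through to a zero status), so it would need something like `return success ? EXIT_SUCCESS : EXIT_FAILURE;` at the end. `count_failed_runs` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: run `command` through the system shell `runs` times
// and count how many invocations report a non-zero exit status.
// Assumption: the command signals a failed check via its exit status.
int count_failed_runs(const std::string& command, int runs) {
    assert(runs >= 0);
    int failed = 0;
    for (int r = 0; r < runs; ++r) {
        if (std::system(command.c_str()) != 0) {
            ++failed;
        }
    }
    return failed;
}
```

For example, `count_failed_runs("./test_dsytrd_cuda > /dev/null", 1000)` would give the number of bad runs out of 1000 on a Unix-like shell, which makes it easier to compare failure rates between machines or driver versions.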

jam11 · September 25, 2009, 12:57pm

No problems for many runs. Your problem is possibly hardware related. Did you check the temperature while doing your run?

JeremiahPalmer · September 25, 2009, 2:04pm

Thanks for checking. I checked the temp - it’s around 70 degrees!

jam11 · September 25, 2009, 5:25pm

My card (GTX260) does not exceed 59C. I think this is your problem. I would not trust anything above 60C.
