CUBLAS + Kernels + double precision + "for loop" = strange behavior

Hello!

I have been working on a code and have encountered some strange behavior. I have whittled my code down to this somewhat basic code that also demonstrates the behavior. The algorithm is simple: First, allocate space on the CPU (v_h) and GPU (v_d) and set both double precision arrays to zero. Next, set the values of v_d to a non-zero (123.456) and copy v_d to v_h. Finally, the values in v_h are checked to see if they have been set to the correct non-zero value. What I find is that when I perform this algorithm using a “for loop” that increases the length of the arrays, sometimes not all of the values in v_h are set to the non-zero value.

The error occurs for random iterations for random entries within the iterations.

I am using cublasDscal to zero out the v_d entries. When I use my own kernel to zero out the v_d entries, I have no problem. Also, strangely, when I use float instead of double (and use cublasScal) I also have no problem. Here is my code:

[codebox]#define MAIN

#include <stdlib.h>

#include <stdio.h>

#include <cuda.h>

#include <cublas.h>

#define START_SIZE 10

#define STOP_SIZE 40000

#define BLOCK_SIZE 4

#define my_type double

global void myscal(int len, my_type alpha, my_type *x)

{

int j = blockIdx.x * blockDim.x + threadIdx.x;

if (j < len){ x[j] = x[j]*alpha; }

__syncthreads();

}

global void make_different(my_type *a, int len)

{

int j = blockIdx.x * blockDim.x + threadIdx.x;

if (j < len){ a[j] = 123.456f; }

__syncthreads();

}

void cuda_safe(cudaError_t cuda_err, char *message){

if(cuda_err != cudaSuccess) {

printf("ERROR: %s : %s\n",message,cudaGetErrorString(cuda_err));

exit(EXIT_FAILURE);

}

}

void cublas_safe(cublasStatus cublas_err, char *message){

if(cublas_err != CUBLAS_STATUS_SUCCESS) {

printf("ERROR: %s : %i\n",message,cublas_err);

exit(EXIT_FAILURE);

}

}

void check_vec(my_type *v, int len, bool &success)

{

for (int i=0; i<len; i++){

if (v[i] != 123.456f){

  success = false;

  printf("len=%d, FAIL at i=%d, value=%f (should be 123.456)\n",len,i,v[i]);

}

}

}

int main(void)

{

int i, len, n_blocks;

bool use_myscal, success = true;

size_t size_v;

my_type *v_d, *v_h;

//------------------------------------------------------------------------//

// Switch the following bool false <–> true to observe problem

use_myscal = false;

//------------------------------------------------------------------------//

cublas_safe(cublasInit(), “cublasInit”);

cuda_safe(cudaMalloc((void **) &v_d, STOP_SIZE*sizeof(my_type)), “cudaMalloc”);

v_h = (my_type )malloc(STOP_SIZEsizeof(my_type));

for (len = START_SIZE; len <= STOP_SIZE; len++){

n_blocks = len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0:1);

size_v = len * sizeof(my_type);

// Set v_d to zero using either cublas’s or my dscal

if (use_myscal){

  myscal<<<n_blocks,BLOCK_SIZE>>>(len, 0.0, v_d);

}else{

  cublasDscal(len, 0.0, v_d, 1);

  cublas_safe(cublasGetError(), "cublasDscal");

}

//  Set v_h to zero 

for (i=0; i<len; i++){ v_h[i] = 0.0; }

// Set the values of v_d to something non-zero

make_different<<<n_blocks,BLOCK_SIZE>>>(v_d, len);

// Copy from device to host: v_d -> v_h

cuda_safe(cudaMemcpy(v_h, v_d, size_v,cudaMemcpyDeviceToHost),"cudaMemcpy");

// Check values of v_h on host

check_vec(v_h, len, success);

}

free(v_h);

cuda_safe(cudaFree(v_d), “cudaFree”);

cublas_safe(cublasShutdown(), “cublasShutdown”);

if (success){ printf(“The program completed successfully.\n”); }

      else{ printf("The program completed unsuccessfully.\n"); }

}

[/codebox]

Note that if use_myscal is true, the code will use my kernel to zero out the entries in v_d, otherwise it will use cublasDscal.

Here is an example of the output I get:

[codebox]

len=10303, FAIL at i=10295, value=0.000000 (should be 123.456)

len=12039, FAIL at i=12035, value=0.000000 (should be 123.456)

len=12042, FAIL at i=12038, value=0.000000 (should be 123.456)

len=12042, FAIL at i=12039, value=0.000000 (should be 123.456)

len=16126, FAIL at i=16122, value=0.000000 (should be 123.456)

len=16126, FAIL at i=16123, value=0.000000 (should be 123.456)

len=21585, FAIL at i=21578, value=0.000000 (should be 123.456)

len=21585, FAIL at i=21579, value=0.000000 (should be 123.456)

len=22266, FAIL at i=22258, value=0.000000 (should be 123.456)

len=22266, FAIL at i=22259, value=0.000000 (should be 123.456)

len=25927, FAIL at i=25923, value=0.000000 (should be 123.456)

len=26149, FAIL at i=26146, value=0.000000 (should be 123.456)

len=26149, FAIL at i=26147, value=0.000000 (should be 123.456)

len=26853, FAIL at i=26850, value=0.000000 (should be 123.456)

len=26853, FAIL at i=26851, value=0.000000 (should be 123.456)

len=26855, FAIL at i=26851, value=0.000000 (should be 123.456)

len=26862, FAIL at i=26858, value=0.000000 (should be 123.456)

len=26862, FAIL at i=26859, value=0.000000 (should be 123.456)

len=27990, FAIL at i=27982, value=0.000000 (should be 123.456)

len=27990, FAIL at i=27983, value=0.000000 (should be 123.456)

len=32847, FAIL at i=32843, value=0.000000 (should be 123.456)

len=36434, FAIL at i=36418, value=0.000000 (should be 123.456)

len=36434, FAIL at i=36419, value=0.000000 (should be 123.456)

[/codebox]

Recall that this output is random, meaning that if I had run the code again, I’d get different output.

Does anyone have any ideas on what the issue is? It seems like it’s a synchronization issue. Could it be a problem within cublasDscal?

Thanks!

are you compiling with -arch sm_13?

Yes, I am.

Does anyone have any ideas on this bug? Over 99% of the time, all of the copies are performed correctly.

I copy your code to test in my machine. It is OK, even using cublasDscal()

my platform: vc2005, cuda 2.3, driver: 190.38, GTX295

what’s version of your cuda toolkit?

my platform: gcc4.1.2, cuda 2.2, quadro fx 5800, Linux2.6.18-128.el5

I use another machine to test your code.

platform: fedora 10 x86_64, gcc 4.3.2, cuda 2.2, driver 185.18.14, GTX260

the code is O.K for “float” and “double”

(of course, you must issue -arch sm_13 when compiling “double”)

I cannot understand your error

Thanks for checking. I don’t understand my error either. Perhaps I should submit this in a bug report?

Currently don’t have card plugged in to test your code, but a cudaThreadSynchronize() after kernel invocation could help…

Thanks for the suggestion. I tried your idea and it didn’t fix the problem. Besides, I have __syncthreads at the end of my kernels.

does this bug repro with 2.3 and 190.xx?

I installed cuda 2.3 and re-ran this problem - NO ERRORS!

Thanks for all of the input and ideas. I can finally move on with my project.

:D

Well, apparently, the problem only goes away briefly after a reboot. I added a lot more error handling to my code to find the issue, but I can’t find the problem. Does anyone have any ideas? It seems like a synchronization issue, but I have cudaThreadSynchronization() all over the place. Here’s the modified code:

[codebox]#define MAIN

#include <stdlib.h>

#include <stdio.h>

#include <cuda.h>

#include <cublas.h>

#include “my_cublas.h”

#define START_SIZE 10

#define STOP_SIZE 40000

#define BLOCK_SIZE 4

#define my_type double

global void myscal(int len, my_type alpha, my_type *x)

{

int j = blockIdx.x * blockDim.x + threadIdx.x;

if (j < len){ x[j] = x[j]*alpha; }

}

global void make_different(my_type *a, int len)

{

int j = blockIdx.x * blockDim.x + threadIdx.x;

if (j < len){ a[j] = 123.456; }

}

global void make_zero(my_type *a, int len)

{

int j = blockIdx.x * blockDim.x + threadIdx.x;

if (j < len){ a[j] = 0.0; }

}

void cuda_safe(cudaError_t cuda_err, char *message){

if(cuda_err != cudaSuccess) {

printf("ERROR: %s : %s\n",message,cudaGetErrorString(cuda_err));

exit(EXIT_FAILURE);

}

}

void cublas_safe(cublasStatus cublas_err, char *message){

if(cublas_err != CUBLAS_STATUS_SUCCESS) {

printf("ERROR: %s : %i\n",message,cublas_err);

exit(EXIT_FAILURE);

}

}

void check_vec(my_type *v, int len, bool &success)

{

int i;

for (i=0; i<len; i++){

if (v[i] != 123.456){

  success = false;

  printf("len=%d, FAIL at i=%d, value=%f (should be 123.456)\n",len,i,v[i]);

}

}

}

int main(void)

{

int i, n_blocks;

bool use_myscal, success = true;

size_t size_v;

my_type *v_d, *v_h;

//------------------------------------------------------------------------//

// Switch the following bool false <–> true to observe problem

use_myscal = false;

//------------------------------------------------------------------------//

cublas_safe(cublasInit(), “cublasInit”);

cuda_safe(cudaMalloc((void **) &v_d, STOP_SIZE*sizeof(my_type)), “cudaMalloc”);

v_h = (my_type )malloc(STOP_SIZEsizeof(my_type));

cuda_safe(cudaGetLastError(), “error before make_zero”);

make_zero<<<STOP_SIZE, 1>>>(v_d, STOP_SIZE);

cuda_safe(cudaThreadSynchronize(), “error with make_zero, sync”);

cuda_safe(cudaGetLastError(), “error after make_zero”);

for (int len = START_SIZE; len <= STOP_SIZE; len++){

n_blocks = len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0:1);

size_v = len * sizeof(my_type);

// Set v_d to zero using either cublas’s or my dscal

cuda_safe(cudaThreadSynchronize(), "sync error 1");

if (use_myscal){

  cuda_safe(cudaGetLastError(), "error before myscal");

  myscal<<<n_blocks,BLOCK_SIZE>>>(len, 0.0, v_d);

  cuda_safe(cudaThreadSynchronize(), "error with myscal, sync");

  cuda_safe(cudaGetLastError(), "error after myscal");

}else{

  //cublasSscal(len, 0.0, v_d, 1);

  //cublas_safe(cublasGetError(), "cublasSscal");

  cublasDscal(len, 0.0, v_d, 1);

  cublas_safe(cublasGetError(), "cublasDscal");

}

//  Set v_h to zero 

for (i=0; i<len; i++){ v_h[i] = 0.0; }

cuda_safe(cudaThreadSynchronize(), "sync error 2");

// Set the values of v_d to something non-zero

cuda_safe(cudaGetLastError(), "error before make_different");

make_different<<<n_blocks,BLOCK_SIZE>>>(v_d, len);

cuda_safe(cudaThreadSynchronize(), "error with make_different, sync");

cuda_safe(cudaGetLastError(), "error after make_different");

// Copy from device to host: v_d -> v_h

cuda_safe(cudaMemcpy(v_h, v_d, size_v,cudaMemcpyDeviceToHost),"cudaMemcpy");

cuda_safe(cudaThreadSynchronize(), "sync error 3");

// Check values of v_h on host

check_vec(v_h, len, success);

cuda_safe(cudaThreadSynchronize(), "sync error 4");

}

free(v_h);

cuda_safe(cudaFree(v_d), “cudaFree”);

cublas_safe(cublasShutdown(), “cublasShutdown”);

if (success){ printf(“The program completed successfully.\n”); }

      else{ printf("The program completed unsuccessfully.\n"); }

}

[/codebox]

Again, the make_different kernel seems to skip doing it’s job RANDOMLY, for RANDOM entries. Here is a collection of sample output of the code run several times in succession:

[codebox]lady 37% ./test_dsytrd_cuda

len=37695, FAIL at i=37691, value=0.000000 (should be 123.456)

The program completed unsuccessfully.

lady 38% ./test_dsytrd_cuda

The program completed successfully.

lady 39% ./test_dsytrd_cuda

len=34874, FAIL at i=34870, value=0.000000 (should be 123.456)

len=34874, FAIL at i=34871, value=0.000000 (should be 123.456)

The program completed unsuccessfully.

lady 40% ./test_dsytrd_cuda

len=25909, FAIL at i=25906, value=0.000000 (should be 123.456)

len=25909, FAIL at i=25907, value=0.000000 (should be 123.456)

The program completed unsuccessfully.

lady 41% ./test_dsytrd_cuda

len=37938, FAIL at i=37922, value=0.000000 (should be 123.456)

len=37938, FAIL at i=37923, value=0.000000 (should be 123.456)

The program completed unsuccessfully.

[/codebox]

“Features” of this bug - it goes away when using float instead of double; it goes away if I use my own kernel to scale the vector instead of cublasDscal.

Help me, Obi-Wans, you’re my only hope!!!

No error here if I remove :
#include “my_cublas.h”

nvcc -arch=sm_13 -cuda jeremiahplamer.cu -I(your_path)/cuda/include
gcc -o cubb jeremiahplamer.cu.cpp -L(your_path)/cuda/lib64 -lcudart -lcublas

for both double and float.

Correct, I had forgotten to remove “my_cublas” - this is just a wrapper around cublas.

Thanks for checking my code, jam1. Usually, the problem doesn’t manifest itself until it has been run many, many times. Could you re-try running the code, say, 1000 times?

Thanks,

Jeremiah

No problems for many runs. Your problem is possibly hardware related. Did you check the temperature while doing your run?

Thanks for checking. I checked the temp - it’s around 70 degrees!

My card does not exceed 59C. (GTX260) . I think this is your problem. I would not trust anything above 60C.