Hello!
I have been working on a code and have encountered some strange behavior. I have whittled my code down to this somewhat basic code that also demonstrates the behavior. The algorithm is simple: First, allocate space on the CPU (v_h) and GPU (v_d) and set both double precision arrays to zero. Next, set the values of v_d to a non-zero (123.456) and copy v_d to v_h. Finally, the values in v_h are checked to see if they have been set to the correct non-zero value. What I find is that when I perform this algorithm using a “for loop” that increases the length of the arrays, sometimes not all of the values in v_h are set to the non-zero value.
The error occurs for random iterations for random entries within the iterations.
I am using cublasDscal to zero out the v_d entries. When I use my own kernel to zero out the v_d entries, I have no problem. Also, strangely, when I use float instead of double (and use cublasScal) I also have no problem. Here is my code:
[codebox]#define MAIN
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <cublas.h>
#define START_SIZE 10
#define STOP_SIZE 40000
#define BLOCK_SIZE 4
#define my_type double
global void myscal(int len, my_type alpha, my_type *x)
{
int j = blockIdx.x * blockDim.x + threadIdx.x;
if (j < len){ x[j] = x[j]*alpha; }
__syncthreads();
}
global void make_different(my_type *a, int len)
{
int j = blockIdx.x * blockDim.x + threadIdx.x;
if (j < len){ a[j] = 123.456f; }
__syncthreads();
}
void cuda_safe(cudaError_t cuda_err, char *message){
if(cuda_err != cudaSuccess) {
printf("ERROR: %s : %s\n",message,cudaGetErrorString(cuda_err));
exit(EXIT_FAILURE);
}
}
void cublas_safe(cublasStatus cublas_err, char *message){
if(cublas_err != CUBLAS_STATUS_SUCCESS) {
printf("ERROR: %s : %i\n",message,cublas_err);
exit(EXIT_FAILURE);
}
}
void check_vec(my_type *v, int len, bool &success)
{
for (int i=0; i<len; i++){
if (v[i] != 123.456f){
success = false;
printf("len=%d, FAIL at i=%d, value=%f (should be 123.456)\n",len,i,v[i]);
}
}
}
int main(void)
{
int i, len, n_blocks;
bool use_myscal, success = true;
size_t size_v;
my_type *v_d, *v_h;
//------------------------------------------------------------------------//
// Switch the following bool false <–> true to observe problem
use_myscal = false;
//------------------------------------------------------------------------//
cublas_safe(cublasInit(), “cublasInit”);
cuda_safe(cudaMalloc((void **) &v_d, STOP_SIZE*sizeof(my_type)), “cudaMalloc”);
v_h = (my_type )malloc(STOP_SIZEsizeof(my_type));
for (len = START_SIZE; len <= STOP_SIZE; len++){
n_blocks = len/BLOCK_SIZE + (len%BLOCK_SIZE == 0 ? 0:1);
size_v = len * sizeof(my_type);
// Set v_d to zero using either cublas’s or my dscal
if (use_myscal){
myscal<<<n_blocks,BLOCK_SIZE>>>(len, 0.0, v_d);
}else{
cublasDscal(len, 0.0, v_d, 1);
cublas_safe(cublasGetError(), "cublasDscal");
}
// Set v_h to zero
for (i=0; i<len; i++){ v_h[i] = 0.0; }
// Set the values of v_d to something non-zero
make_different<<<n_blocks,BLOCK_SIZE>>>(v_d, len);
// Copy from device to host: v_d → v_h
cuda_safe(cudaMemcpy(v_h, v_d, size_v,cudaMemcpyDeviceToHost),"cudaMemcpy");
// Check values of v_h on host
check_vec(v_h, len, success);
}
free(v_h);
cuda_safe(cudaFree(v_d), “cudaFree”);
cublas_safe(cublasShutdown(), “cublasShutdown”);
if (success){ printf(“The program completed successfully.\n”); }
else{ printf("The program completed unsuccessfully.\n"); }
}
[/codebox]
Note that if use_myscal is true, the code will use my kernel to zero out the entries in v_d, otherwise it will use cublasDscal.
Here is an example of the output I get:
[codebox]
len=10303, FAIL at i=10295, value=0.000000 (should be 123.456)
len=12039, FAIL at i=12035, value=0.000000 (should be 123.456)
len=12042, FAIL at i=12038, value=0.000000 (should be 123.456)
len=12042, FAIL at i=12039, value=0.000000 (should be 123.456)
len=16126, FAIL at i=16122, value=0.000000 (should be 123.456)
len=16126, FAIL at i=16123, value=0.000000 (should be 123.456)
len=21585, FAIL at i=21578, value=0.000000 (should be 123.456)
len=21585, FAIL at i=21579, value=0.000000 (should be 123.456)
len=22266, FAIL at i=22258, value=0.000000 (should be 123.456)
len=22266, FAIL at i=22259, value=0.000000 (should be 123.456)
len=25927, FAIL at i=25923, value=0.000000 (should be 123.456)
len=26149, FAIL at i=26146, value=0.000000 (should be 123.456)
len=26149, FAIL at i=26147, value=0.000000 (should be 123.456)
len=26853, FAIL at i=26850, value=0.000000 (should be 123.456)
len=26853, FAIL at i=26851, value=0.000000 (should be 123.456)
len=26855, FAIL at i=26851, value=0.000000 (should be 123.456)
len=26862, FAIL at i=26858, value=0.000000 (should be 123.456)
len=26862, FAIL at i=26859, value=0.000000 (should be 123.456)
len=27990, FAIL at i=27982, value=0.000000 (should be 123.456)
len=27990, FAIL at i=27983, value=0.000000 (should be 123.456)
len=32847, FAIL at i=32843, value=0.000000 (should be 123.456)
len=36434, FAIL at i=36418, value=0.000000 (should be 123.456)
len=36434, FAIL at i=36419, value=0.000000 (should be 123.456)
[/codebox]
Recall that this output is random, meaning that if I had run the code again, I’d get different output.
Does anyone have any ideas on what the issue is? It seems like it’s a synchronization issue. Could it be a problem within cublasDscal?
Thanks!