Hi All,
I have been swamped at my CUDA program debugging for 2 days. I am writing a CUDA implementation for Maximum Likelyhood Estimation method.
I have fully accomplished C++ version and worked pretty well. Now, I am porting the code to CUDA but have memory access violation bug detected by Nsight 1.51. In my kernel function(global void Iter_process), I declared “struct” defined in the same file. This “struct” wraps up some device functions and shared variables and arrays. Basically, it works the same as class in C++. I am using CUDA Tool Kit 3.2 and GTX480. The latest CUDA programming guide declares C++ is supported in CUDA.
The error message is:
CUDA Memory Checker detected 64 threads caused an access violation:
Launch Parameters
CUcontext = 00ac44a8
CUstream = 00000000
CUmodule = 06433518
CUfunction = 0641a7b0
FunctionName = Z12Iter_processPfS_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS0_S_S_S
gridDim = {1,1,1}
blockDim = {8,8,1}
sharedSize = 1280
Parameters:
Parameters (raw):
0x05100000 0x05101800 0x05101a00 0x05101c00
0x05101e00 0x05102000 0x05100600 0x05100800
0x05100a00 0x05100c00 0x05100e00 0x05101000
0x05101200 0x05101400 0x05101600 0x05100400
0x05100200 0x05102200 0x05102400 0x05102600
GPU State:
Address Size Type Block Thread blockIdx threadIdx PC Source
00000050 4 adr ld 0 0 {0,0,0} {0,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 1 {0,0,0} {1,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 2 {0,0,0} {2,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 3 {0,0,0} {3,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 4 {0,0,0} {4,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 5 {0,0,0} {5,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 6 {0,0,0} {6,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000050 4 adr ld 0 7 {0,0,0} {7,0,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 8 {0,0,0} {0,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 9 {0,0,0} {1,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 10 {0,0,0} {2,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 11 {0,0,0} {3,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 12 {0,0,0} {4,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 13 {0,0,0} {5,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 14 {0,0,0} {6,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000054 4 adr ld 0 15 {0,0,0} {7,1,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 16 {0,0,0} {0,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 17 {0,0,0} {1,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 18 {0,0,0} {2,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 19 {0,0,0} {3,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 20 {0,0,0} {4,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 21 {0,0,0} {5,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 22 {0,0,0} {6,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000058 4 adr ld 0 23 {0,0,0} {7,2,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 24 {0,0,0} {0,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 25 {0,0,0} {1,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 26 {0,0,0} {2,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 27 {0,0,0} {3,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 28 {0,0,0} {4,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 29 {0,0,0} {5,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 30 {0,0,0} {6,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000005c 4 adr ld 0 31 {0,0,0} {7,3,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 32 {0,0,0} {0,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 33 {0,0,0} {1,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 34 {0,0,0} {2,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 35 {0,0,0} {3,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 36 {0,0,0} {4,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 37 {0,0,0} {5,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 38 {0,0,0} {6,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000060 4 adr ld 0 39 {0,0,0} {7,4,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 40 {0,0,0} {0,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 41 {0,0,0} {1,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 42 {0,0,0} {2,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 43 {0,0,0} {3,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 44 {0,0,0} {4,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 45 {0,0,0} {5,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 46 {0,0,0} {6,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000064 4 adr ld 0 47 {0,0,0} {7,5,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 48 {0,0,0} {0,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 49 {0,0,0} {1,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 50 {0,0,0} {2,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 51 {0,0,0} {3,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 52 {0,0,0} {4,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 53 {0,0,0} {5,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 54 {0,0,0} {6,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
00000068 4 adr ld 0 55 {0,0,0} {7,6,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 56 {0,0,0} {0,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 57 {0,0,0} {1,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 58 {0,0,0} {2,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 59 {0,0,0} {3,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 60 {0,0,0} {4,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 61 {0,0,0} {5,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 62 {0,0,0} {6,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
0000006c 4 adr ld 0 63 {0,0,0} {7,7,0} 020468 c:\users\rocketer\documents\visual studio 2010\projects\mle_cuda\mle.cu:61
Summary of access violations:
Parallel Nsight Debug
Memory Checker detected 64 access violations!
error = access violation on load
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x00000050
accessSize = 4
See Output view for details.
I also copied part of my source code here:
Kernel Function:
global void Iter_process(float *p_, float *thetax, float *thetay, float *thetaz, float *thetaI, float *thetabg, float *sigma0x, float *sigma0y, float *sigma, float *Ax, float *Ay, float *Bx, float *By, float *d, float *gamma, int *Npixels, int *Iternum, float *DEBUG, float *MIU, float *Inter_array){
__shared__ theta_package llf, llf_;
__shared__ float deltaX, deltaY, deltaZ, deltaI, deltaBg;
__shared__ float deltax[MAX_BUF];
__shared__ float deltay[MAX_BUF];
__shared__ float plus_expx[MAX_BUF];
__shared__ float minus_expx[MAX_BUF];
__shared__ float plus_expy[MAX_BUF];
__shared__ float minus_expy[MAX_BUF];
__shared__ float plus_expzx[MAX_BUF];
__shared__ float minus_expzx[MAX_BUF];
__shared__ float plus_expzy[MAX_BUF];
__shared__ float minus_expzy[MAX_BUF];
__shared__ float p[MAX_BUF][MAX_BUF];
for( int y = 0; y < *Npixels; y++)
for( int x = 0; x < *Npixels; x++)
p[y][x] = p_[y * (*Npixels) + x];
// Structure declaration
IMG_model IMG;
for( int iter = 0; iter < *Iternum; iter++){
IMG.thetas.x = *thetax;
IMG.thetas.y = *thetay;
IMG.thetas.z = *thetaz;
IMG.thetas.I = *thetaI;
IMG.thetas.bg = *thetabg;
IMG.paras.sigma0x = *sigma0x;
IMG.paras.sigma0y = *sigma0y;
IMG.paras.sigma = *sigma;
IMG.paras.Ax = *Ax;
IMG.paras.Bx = *Bx;
IMG.paras.Ay = *Ay;
IMG.paras.By = *By;
IMG.paras.d = *d;
IMG.paras.gamma = *gamma;
IMG.paras.square_sigma = powf(*sigma, 2);
IMG.paras.square_sigma0x = powf(*sigma0x, 2);
IMG.paras.square_sigma0y = powf(*sigma0y, 2);
IMG.paras.square_d = powf(*d, 2);
IMG.paras.sigmax = *sigma0x * sqrtf(1 + powf(*thetaz - *gamma, 2)/IMG.paras.square_d + *Ax * powf(*thetaz - *gamma, 3)/IMG.paras.square_d + *Bx * powf(*thetaz - *gamma, 4)/IMG.paras.square_d);
IMG.paras.sigmay = *sigma0y * sqrtf(1 + powf(*thetaz - *gamma, 2)/IMG.paras.square_d + *Ay * powf(*thetaz - *gamma, 3)/IMG.paras.square_d + *By * powf(*thetaz - *gamma, 4)/IMG.paras.square_d);
IMG.paras.square_sigmax = powf(IMG.paras.sigmax, 2);
IMG.paras.square_sigmay = powf(IMG.paras.sigmay, 2);
unsigned int i = threadIdx.x;
unsigned int j = threadIdx.y;
unsigned int index = i * (*Npixels) + j;
deltax[i] = IMG.delta(i, *thetax);
deltay[i] = IMG.delta(i, *thetay);
plus_expx[i] = expf(-powf((i - *thetax + 0.5), 2)/(2.0 * IMG.paras.square_sigma));
minus_expx[i] = expf(-powf(i - *thetax - 0.5, 2)/(2.0 * IMG.paras.square_sigma));
plus_expy[i] = expf(-powf(i - *thetay + 0.5, 2)/(2.0 * IMG.paras.square_sigma));
minus_expy[i] = expf(-powf(i - *thetay - 0.5, 2)/(2.0 * IMG.paras.square_sigma));
plus_expzx[i] = expf(-powf(i - *thetax + 0.5, 2)/(2.0 * IMG.paras.square_sigmax));
minus_expzx[i] = expf(-powf(i - *thetax - 0.5, 2)/(2.0 * IMG.paras.square_sigmax));
plus_expzy[i] = expf(-powf(i - *thetay + 0.5, 2)/(2.0 * IMG.paras.square_sigmay));
minus_expzy[i] = expf(-powf(i - *thetay - 0.5, 2)/(2.0 * IMG.paras.square_sigmay));
__syncthreads();
IMG.arrays.deltax = deltax;
IMG.arrays.deltay = deltay;
IMG.arrays.plus_expx = plus_expx;
IMG.arrays.minus_expx = minus_expx;
IMG.arrays.plus_expy = plus_expy;
IMG.arrays.minus_expy = minus_expy;
IMG.arrays.plus_expzx = plus_expzx;
IMG.arrays.minus_expzx = minus_expzx;
IMG.arrays.plus_expzy = plus_expzy;
IMG.arrays.minus_expzy = minus_expzy;
__syncthreads();
// if( i < *Npixels && j < *Npixels){
// MIU[index] = miu[index] = *thetaI * deltax[j] * deltay[i] + *thetabg;
// }
IMG.miu[j][i] = *thetaI * deltax[j] * deltay[i] + *thetabg;
__syncthreads();
//IMG.miu = miu;
llf.x = llf.y = llf.z = llf.I = llf.bg = llf_.x = llf_.y = llf_.z = llf_.I = llf_.bg = 0;
__syncthreads();
if( i < *Npixels && j < *Npixels){
llf.x += IMG.MIU_1st_der_thetax(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0);
llf.y += IMG.MIU_1st_der_thetay(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0);
llf.z += IMG.MIU_1st_der_thetaz(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0);
llf.I += IMG.MIU_1st_der_thetaI(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0);
llf.bg += IMG.MIU_1st_der_thetabg(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0);
}
if( i < *Npixels && j < *Npixels){
llf_.x += IMG.MIU_2nd_der_thetax(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0) - powf(IMG.MIU_1st_der_thetax(i, j), 2) * (p[j][i] / powf(IMG.miu[j][i], 2));
llf_.y += IMG.MIU_2nd_der_thetay(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0) - powf(IMG.MIU_1st_der_thetay(i, j), 2) * (p[j][i] / powf(IMG.miu[j][i], 2));
llf_.z += IMG.MIU_2nd_der_thetaz(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0) - powf(IMG.MIU_1st_der_thetaz(i, j), 2) * (p[j][i] / powf(IMG.miu[j][i], 2));
llf_.I += IMG.MIU_2nd_der_thetaI(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0) - powf(IMG.MIU_1st_der_thetaI(i, j), 2) * (p[j][i] / powf(IMG.miu[j][i], 2));
llf_.bg += IMG.MIU_2nd_der_thetabg(i, j) * (p[j][i] / IMG.miu[j][i] - 1.0) - powf(IMG.MIU_1st_der_thetabg(i, j), 2) * (p[j][i] / powf(IMG.miu[j][i], 2));
}
__syncthreads();
deltaX = llf.x / llf_.x;
deltaY = llf.y / llf_.y;
deltaZ = llf.z / llf_.z;
deltaI = llf.I / llf_.I;
deltaBg = llf.bg / llf_.bg;
__syncthreads();
*thetax -= deltaX;
*thetay -= deltaY;
*thetaz -= deltaZ;
*thetaI -= deltaI;
*thetabg -= deltaBg;
__syncthreads();
}
}
IMG_model Structure:
typedef struct IMG_model_t{
float miu[MAX_BUF][MAX_BUF];
theta_package thetas;
interpara_package paras;
interarray_package arrays;
int Npixels;
device float delta(int real_pos, float peak_pos){
return …;
}
device float prime(int degree){
return ...;
}
…
} IMG_model;
I have no experience using Nsight debugging CUDA program. I believe I have wrongly implemented some thread configuration but don’t know where it is. Actually, I am suspecting the miu and p which are used to save image pixels may be the problem.
If some of you guys can provide me any hints, I will be very very grateful!
Thanks,
Jason