Sorry for the confusing topic title; I really don't know how to summarize my problem. Anyway, I have the following kernel.

```
__global__ void myKernel(float *tss_in, int *seeds_in, int *pConfsSum_out, int *nConfsSum_out, int *pConfsAll, int *nConfsAll, double *distMtx,
    const int numSeeds, const int numTrain, const int numP, const int numPLabeled, const int tsLen, const int sLen, const int nextTsIdx){
    extern __shared__ float array[];
    float *ts1 = (float*)array;
    float *ts2 = (float*)&ts1[tsLen];
    int tid = threadIdx.x;
    int blockId = blockIdx.x;
    int base = blockId * tsLen;
    int blockSize = blockDim.x;
    int numIters = ceil((double)tsLen / blockSize);
    int *pConfs, *nConfs;
    double *distVec;
    bool isValid;
    int idx, numThis, start;
    int elms[3];
    double term1, term2, s1, s1_2, s2, s2_2, mu2, sigma2, corr,
        mu1[MAX_ITERS], sigma1[MAX_ITERS], dotPr[MAX_ITERS], nnCorr[MAX_ITERS];
    for (int i = 0; i < MAX_ITERS; i++)
        mu1[i] = sigma1[i] = dotPr[i] = nnCorr[i] = 0;
    for (int i = 0; i < 3; i++)
        elms[i] = 0;
    for (int i = 0; i < numIters; i++){
        numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
        if (tid < numThis){
            start = i * blockSize + tid;
            ts1[start] = tss_in[nextTsIdx * tsLen + base + start];
        }
    }
    __syncthreads();
    for (int i = 0; i < numIters; i++){
        start = i * blockSize + tid;
        isValid = start < tsLen - sLen + 1;
        if (isValid){
            s1 = s1_2 = 0;
            for (int k = 0; k < sLen; k++){
                term1 = ts1[start + k];
                s1 += term1;
                s1_2 += term1 * term1;
            }
            mu1[i] = s1 / sLen;
            sigma1[i] = s1_2 / sLen > mu1[i] * mu1[i] ? sqrt(s1_2 / sLen - mu1[i] * mu1[i]) : 1;
        }
    }
    for (int j = 0; j < numTrain; j++){
        for (int i = 0; i < numIters; i++){
            numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
            if (tid < numThis){
                start = i * blockSize + tid;
                ts2[start] = tss_in[j * tsLen + start];
            }
        }
        __syncthreads();
        for (int i = 0; i < numIters; i++){
            start = i * blockSize + tid;
            isValid = start < tsLen - sLen + 1;
            if (isValid){
                //initiation
                s2 = s2_2 = 0;
                for (int k = 0; k < sLen; k++){
                    /*term2 = ts2[k];
                    s2 += term2;
                    s2_2 += term2 * term2;*/
                }
            }
        }
        __syncthreads();
    }
    for (int i = 0; i < numSeeds; i++){
        numIters = ceil((double)tsLen / blockSize);
        for(int w = 0; w < numIters; w++){
            start = w * blockSize + tid;
            if (start < tsLen){
                idx = (blockId * tsLen + start) * numTrain;
                pConfs = &pConfsAll[idx];
                nConfs = &nConfsAll[idx];
                //for (k = 0; k < numTrain; k++){
                for(int a = 0; a < numTrain; a++){
                    pConfs[a] = 100;
                    nConfs[a] = 100;
                }
            }
        }
        __syncthreads();
        if (blockId + tid == 0 && i == 0){
            int cnt = 0;
            for (int w = 0; w < numTrain; w++){
                for (int c = 0; c < tsLen; c++){
                    if (pConfsAll[w * c * numTrain + 70] != 100){
                        printf("s = %d, c = %d, val = %d\n", w, c, pConfsAll[w * c * numTrain + 70]);
                        cnt++;
                    }
                    if (cnt == 100)
                        break;
                }
                if (cnt == 100)
                    break;
            }
        }
    }
}
```

The array pConfsAll is allocated with a size of numTrain * tsLen * numTrain * sizeof(int) bytes. In this particular test case the grid size is numTrain and the block size is tsLen (although both can be set to smaller values).
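To make the layout concrete, here is a minimal host-side sketch of the two index expressions that appear in the kernel. It is plain C++ (which nvcc also accepts), and the helper names writeIndex and testIndex are hypothetical: they only mirror the expressions `(blockId * tsLen + start) * numTrain + a` from the write loop and `w * c * numTrain + 70` from the test block.

```cpp
#include <cassert>
#include <cstddef>

// Flat index written by the kernel's fill loop.
// Mirrors `idx = (blockId * tsLen + start) * numTrain` plus the column `a`.
// Assumes 0 <= start < tsLen and 0 <= a < numTrain.
static std::size_t writeIndex(int blockId, int start, int a, int tsLen, int numTrain) {
    assert(start >= 0 && start < tsLen);
    assert(a >= 0 && a < numTrain);
    return (static_cast<std::size_t>(blockId) * tsLen + start) * numTrain + a;
}

// Flat index read by the in-kernel test block.
// Mirrors `w * c * numTrain + col` (col is 70 in the kernel).
static std::size_t testIndex(int w, int c, int col, int numTrain) {
    assert(col >= 0 && col < numTrain);
    return static_cast<std::size_t>(w) * c * numTrain + col;
}
```

For instance, with hypothetical sizes numTrain = 80 and tsLen = 16, writeIndex(79, 15, 79, 16, 80) is 102399, the last element of the 80 * 16 * 80 array, so the write loop covers the whole allocation when the grid size is numTrain. For the same (w, c) = (2, 3), writeIndex(2, 3, 70, 16, 80) is 2870 while testIndex(2, 3, 70, 80) is 550, so the test block appears to read row w * c rather than row w * tsLen + c. Both are inside the array, and every row is supposed to hold 100 anyway.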

The block guarded by if (blockId + tid == 0 && i == 0) near the end of the kernel is intended for testing. I expected nothing to be printed, since every value in pConfsAll is supposed to be set to 100. With the three statements in the innermost k loop of the j loop commented out (the ones that update term2, s2 and s2_2, as in the listing above), nothing is printed, as expected. What baffles me is that when those three statements are enabled, something does get printed, indicating that some values in pConfsAll are not 100 at the time of the check. This surprises me, since those statements seem to have nothing to do with the test block.
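One thing that might be worth ruling out is the timing of the check itself. __syncthreads() is a barrier only for the threads of one block, so the test, which runs in thread 0 of block 0, has no guarantee that the other blocks have already written their parts of pConfsAll. A minimal sketch of doing the same check on the host instead, assuming the array has been copied back into a std::vector<int> after the kernel has finished (for example with cudaMemcpy, not shown here). The helper name countMismatches is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the entries of a host copy of pConfsAll that differ from `expected`.
// Assumes the row-major layout used by the kernel's write loop:
// element (row, a) is stored at row * numTrain + a, with numRows rows in total
// (numRows would be gridSize * tsLen in the test case described above).
static std::size_t countMismatches(const std::vector<int>& hostConfs,
                                   std::size_t numRows,
                                   std::size_t numTrain,
                                   int expected) {
    assert(hostConfs.size() == numRows * numTrain);
    std::size_t mismatches = 0;
    for (std::size_t row = 0; row < numRows; ++row) {
        for (std::size_t a = 0; a < numTrain; ++a) {
            if (hostConfs[row * numTrain + a] != expected) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

If countMismatches(hostConfs, numRows, numTrain, 100) returns 0 on the host while the in-kernel printf still reports other values, that would suggest the values are written correctly and the in-kernel check is simply reading them too early.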

Could someone help me with this problem? Thank you!