Seems to be a problem with memory initialization; the output isn't always correct

Hi, guys

It seems that I've run into a problem with memory initialization (I guess). What happens to my program is the following:

  1. The first run of my program gets a totally wrong output, then everything is all right when I rerun it repeatedly. Matching a previous post ("first run" of the CUDA program isn't correct), the output of my program is also right when I rerun it repeatedly after running several other correct CUDA programs.

  2. I put this aside and continued extending my program, then another problem came up when I allocated a dynamic array and wrote to every element of the array in a kernel function:

     - the output is totally wrong if I initialize the array with 0 using cudaMemset() but write nothing to the array in the kernel;

     - the output is correct for the kernels preceding this one if I don't initialize the array using cudaMemset() and write nothing to the array in the kernel;

     - the output of the first run is correct for all kernels if I don't initialize the array using cudaMemset() and write to the array in the kernel.
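For reference, the usual allocate-then-zero pattern the cases above are varying looks roughly like this (a minimal sketch; the element count and type are placeholders, not taken from the actual program):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1024;   // placeholder element count
    int *d_arr = NULL;

    // cudaMalloc does NOT initialize the returned buffer: it holds
    // whatever happened to be in that device memory beforehand.
    cudaMalloc((void **)&d_arr, n * sizeof(int));

    // cudaMemset fills byte-by-byte, so it only produces a meaningful
    // per-element value for repeating byte patterns such as 0 or 0xFF.
    cudaMemset(d_arr, 0, n * sizeof(int));

    // ... launch kernels that read/write d_arr ...

    cudaFree(d_arr);
    return 0;
}
```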

I looked up some old posts in the forum and tried to figure it out as best I could, but the problem is still there, so I hope someone who knows why things happen like this, or who has encountered and solved such a problem, can give me some tips.

Thanks in advance

Hey Gimurk,

Can you show some code snippets, please?

What does "totally wrong" mean? When I get "totally wrong" values, they are often very big. Sometimes this behavior is caused by exceeding index limits.

Hi, QD4,

I mean that the results computed and returned by all the kernel functions are wrong. For example,

the execution of the first run is terminated and shows the message "NVRM: Xid(0001:00): 13,2 000000 …"

the second run of my program is correct according to my check; the output is as follows:

[codebox]Device :GeForce 8800 GT, compute capbility 1.1

parse is done

DASE_NUM_TRANS=9820, DBASE_MAXITEM=1000, DBASE_MAX_TRAN_LEN=25

blockDim.x<<1=256, DBASE_MAX_TRAN_LEN=25, blockDim.x<<1-DBASE_MAX_TRAN_LEN=231

time of generating frequent 1 itemsets: 14.506240

number of frequent 1 itemsets = 805

itemset_len=1, block_num=51, pitch=3264

time of connecting 1-itemsets: 1.401696

number of candidate 2-itemsets is 323610

time of compressing candidate 2-itemsets: 1.789952

cand_pitch=1294464

block_num=1265

time of counting candidate 2-itemsets: 3086.191650

number of frequent 2-itemsets: 1026

itemset_len+1=2, candidate_num=323610, cand_pitch>>2=323616, freq_pitch>>2=1040, gridDim.x=1265, blockDim.x=256

time of compresss_after_count is 1.055776

itemset_len=2, block_num=65, pitch=4160

time of connecting 2-itemsets: 2.064992

number of candidate 3-itemsets is 6253

time of compressing candidate 3-itemsets: 2.181760

cand_pitch=25024

block_num=49

total time of generating frequent itemsets spent on device is 3109.191895

total number of itemsets which exceed the threshold is 1831

time spent on the host is 3.281095 seconds.

Press ENTER to exit…[/codebox]

The wrong output, obtained by adding cudaMemset() after cudaMalloc() and running the program a second time, is as follows:

[codebox]Device :GeForce 8800 GT, compute capbility 1.1

parse is done

DASE_NUM_TRANS=9820, DBASE_MAXITEM=1000, DBASE_MAX_TRAN_LEN=25

blockDim.x<<1=256, DBASE_MAX_TRAN_LEN=25, blockDim.x<<1-DBASE_MAX_TRAN_LEN=231

time of generating frequent 1 itemsets: 14.496672

number of frequent 1 itemsets = 85

itemset_len=1, block_num=6, pitch=384

time of connecting 1-itemsets: 0.142688

number of candidate 2-itemsets is 3570

time of compressing candidate 2-itemsets: 0.080480

cand_pitch=14336

block_num=28

time of counting candidate 2-itemsets: 46.775745

number of frequent 2-itemsets: 5

itemset_len+1=2, candidate_num=3570, cand_pitch>>2=3584, freq_pitch>>2=16, gridDim.x=28, blockDim.x=128

time of compresss_after_count is 0.049504

itemset_len=2, block_num=1, pitch=64

time of connecting 2-itemsets: 0.031872

number of candidate 3-itemsets is 1

time of compressing candidate 3-itemsets: 0.026752

cand_pitch=64

block_num=1

get_kitemset_Count_stepback para: itemset_len+1=3, candidate_num=1, DBASE_NUM_TRANS=9820, (cand_pitch>>2)=16, threshold=20, (blockDim.x<<1)-DBASE_MAX_TRAN_LEN=231, blockDim.x=128

total time of generating frequent itemsets spent on device is 61.603714

total number of itemsets which exceed the threshold is 90

time spent on the host is 0.263325 seconds.

Press ENTER to exit…[/codebox]

I will post some code snippets in a few hours since I have to go out now.

Gimurk

There was an order dependency between blocks of the first kernel function, so it was only luck that the output was right after the first run. I thought this was the problem, so I modified the kernel. Now, in release mode, the first kernel always gets the right output, but the output of the second kernel is only partly correct; in emurelease mode, the output of the first kernel is always wrong, not correct even once. That is abnormal, which suggests there is still a problem with the first kernel function…

“NVRM: Xid(0001:00): 13,2 000000 …”

That means there is something really wrong. Do you check for errors after your kernels (e.g. with CUT_CHECK_ERROR, and in debug mode)?
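For completeness, here is a minimal check that doesn't need the cutil library (the CHECK_CUDA macro and the synchronize-before-check pattern are illustrative, not from the original program):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for CUT_CHECK_ERROR: report and abort on any CUDA error.
#define CHECK_CUDA(msg)                                                   \
    do {                                                                  \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Kernel launches are asynchronous, so synchronize first to make sure
// execution errors from the launch are actually visible, then check:
//   my_kernel<<<grid, block>>>(...);
//   cudaThreadSynchronize();   // cudaDeviceSynchronize() in newer toolkits
//   CHECK_CUDA("my_kernel");
```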

I didn't do any checks since this error message came out only once, at the very first run; it should result from the order dependency between blocks of the first kernel function in my program. I modified this kernel function to remove the dependency, and no error has emerged from the first kernel for now, but the output of the second kernel function is only partly correct.

According to my programming experience on the CPU, many memory errors cause odd behavior that is hard to explain and shows up far from where the error actually lies.

But a kernel whose blocks depend on each other's execution order should produce wrong output most of the time, so why is everything all right after just one wrong execution? It seems unreasonable…

Aha, it seems that I figured out what's wrong with my program: both the release-mode version and the emurelease-mode version now run without any abnormal info or errors and produce consistent output. In addition to the order dependency between block executions, the other problem I found with my program was that the first kernel lacked a __syncthreads() at the point where threads were going to read data from shared memory that had been written by other threads earlier, which resulted in a potential read-after-write, write-after-read, or write-after-write hazard. It is interesting that the release version worked well and got a correct output, while the emu version discarded half of the output.
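The missing-barrier bug described above follows the classic shared-memory pattern; a minimal sketch (the kernel, array names, and block size are made up for illustration):

```cuda
__global__ void example_kernel(const int *in, int *out)
{
    __shared__ int tile[256];   // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];        // each thread writes one element

    // Without this barrier, a thread may read tile[] entries that other
    // threads have not written yet (a read-after-write hazard). Whether
    // the bug is visible can differ between release and emulation builds.
    __syncthreads();

    // Read an element written by a *different* thread.
    out[gid] = tile[(tid + 1) % blockDim.x];
}
```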

Anyway, I got what I wanted, and I hope this is helpful for others.

Gimurk