Why does my code crash?

Hi, all.

I am learning CUDA programming from scratch. The following is a simple (and admittedly silly) program, but the compiled binary crashes every time it is executed. I suspect it reads beyond the bounds of allocated memory, but I can't pin down the exact problem. Can anyone help me? Thanks.

my program environment:

  1. Windows XP SP3, VC 9.0 (cl.exe version 15.00.21022.08), GeForce GT 9800, NVIDIA driver 266.58

  2. Windows XP SP3, VC 9.0 (cl.exe version 15.00.21022.08), GeForce GT 9800, NVIDIA driver 270.61

  3. Windows 7, VC 9.0 (cl.exe version 15.00.21022.08), C2050 + Quadro FX 3800, NVIDIA driver 270.61

My VC says sizeof(int) = 4 bytes, so I only allocated 3*NUM*sizeof(int) bytes in both host and CUDA memory. I don't think roughly 1.5 MB (0.8 MB when NUM is 65535) exceeds the available RAM.


Search for the numbers x where x*x can be split into 2 parts, head and tail, such that head + tail == x.

For example:

(20 + 25)*(20 + 25) = 45*45 = 2025

head = 20, tail = 25
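The same search can be sketched on the CPU first. Below is a minimal, host-only C version of the logic with the tail fixed at two digits (the helper name `splits` is just for illustration, it is not from the original program):

#include <stdio.h>

/* Return 1 if x*x, split into head (everything above the last two
 * digits) and tail (the last two digits), satisfies head + tail == x,
 * i.e. (head + tail)*(head + tail) == x*x. */
int splits(int x)
{
    int sq   = x * x;
    int head = sq / 100;
    int tail = sq % 100;
    return (head + tail) * (head + tail) == sq;
}

int main(void)
{
    for (int x = 2; x < 100; x++)   /* skip the trivial x = 0 and x = 1 */
        if (splits(x))
            printf("x = %2d, x*x = %4d\n", x, x * x);
    return 0;
}

For two-digit tails this finds 45, 55 and 99 (45*45 = 2025 with 20 + 25 = 45, and so on).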




#define NUM (65535*2) // if NUM is 65535, the program runs ok

__global__ void kernel(int *dev_num)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int head, tail, sq;

    while (tid < NUM)
    {
        sq   = tid * tid;
        head = sq / 100;
        tail = sq % 100;
        if ((head + tail) * (head + tail) == sq)
        {
            //printf("head=%d, tail=%d", head, tail);
            dev_num[3*tid]     = sq;
            dev_num[3*tid + 1] = head;
            dev_num[3*tid + 2] = tail;
        }
        tid += blockDim.x * gridDim.x;
    }
}


int main(void)
{
    cudaEvent_t start, stop;
    float elapsedTime;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    int num[3*NUM];
    int *dev_num;
    int i = 0;

    cudaMalloc((void**)&dev_num, 3*NUM*sizeof(int));
    memset(num, 0, 3*NUM*sizeof(int));
    cudaMemcpy(dev_num, num, 3*NUM*sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<128, 128>>>(dev_num);

    cudaMemcpy(num, dev_num, 3*NUM*sizeof(int), cudaMemcpyDeviceToHost);

    for (i = 0; i < NUM; i++)
    {
        if (num[3*i])
            printf("(%2d + %2d)^2 = %4d\n", num[3*i+1], num[3*i+2], num[3*i]);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Time: %f ms\n", elapsedTime);

    cudaFree(dev_num);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}


kernel<<<128, 128>>>(dev_num)

looks strange; usually you have to design the block/thread mapping yourself. I have no idea what 128, 128 means here: 128 blocks of 128 threads each? In that case threadIdx.y does not make sense at all.


Well, I can’t see anything wrong, and indeed the program works for me. Note that sq=tid*tid will overflow for tid>46340 though.
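To see where 46340 comes from (assuming a 32-bit signed int, as on all of the listed platforms), compare the two squares against INT_MAX using 64-bit arithmetic:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* 46340*46340 = 2147395600 still fits in a signed 32-bit int;
       46341*46341 = 2147488281 exceeds INT_MAX = 2147483647. */
    long long below = 46340LL * 46340LL;
    long long above = 46341LL * 46341LL;
    printf("46340^2 = %lld (fits: %d)\n", below, below <= INT_MAX);
    printf("46341^2 = %lld (fits: %d)\n", above, above <= INT_MAX);
    return 0;
}

So for NUM = 65535*2, sq = tid*tid overflows for most of the tid range; in practice the value wraps around, which silently breaks the head/tail test rather than crashing.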

If the program crashes for you, it would be a good idea to check return codes for errors (actually that is a good idea in any case).

It’s a perfectly valid construct for a one-dimensional grid. Indeed threadIdx.y is always zero in that case, but as it isn’t used in the kernel at all, I can’t see a problem with that.
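A quick host-side simulation of the <<<128, 128>>> launch (my own sketch, not the poster's code) shows the grid-stride loop visits every tid in [0, NUM) exactly once:

#include <stdio.h>
#include <string.h>

#define NUM (65535*2)

/* Simulate: 128 blocks of 128 threads, each running
   for (tid = threadIdx.x + blockIdx.x*blockDim.x; tid < NUM;
        tid += blockDim.x * gridDim.x) */
int covered_exactly_once(void)
{
    static unsigned char visits[NUM];
    const int blockDim = 128, gridDim = 128;
    int block, thread, tid, i;

    memset(visits, 0, sizeof visits);
    for (block = 0; block < gridDim; block++)
        for (thread = 0; thread < blockDim; thread++)
            for (tid = thread + block * blockDim; tid < NUM;
                 tid += blockDim * gridDim)
                visits[tid]++;

    for (i = 0; i < NUM; i++)
        if (visits[i] != 1)
            return 0;
    return 1;
}

int main(void)
{
    printf("each tid visited exactly once: %s\n",
           covered_exactly_once() ? "yes" : "no");
    return 0;
}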


Note that thread 0 from block (0,0) will write to dev_num[3*tid], which leads to:

#1 iteration: tid = 0;     dev_num[0]      // no problem

#2 iteration: tid = 16384; dev_num[49152]  // no problem

#3 iteration: tid = 32768; dev_num[98304]  // out of bounds!

You are probably wondering why it works with NUM = 65535, and why it works fine for tera even with 65535*2. This is related to the amount of RAM on your video card and to how the driver handles out-of-bounds indices. You need to restructure your code to take care of this issue.

I've not gone deeper into your code, but maybe this could help:

while (3*tid < NUM)  // check before all computations

or

if (3*tid < NUM) {   // check before the write





The code allocates space for 3*NUM integers so I don’t see how this would lead to an out-of-bounds access.
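The index arithmetic backs this up: with NUM = 65535*2 the highest element the kernel can touch is dev_num[3*(NUM-1)+2], which is still inside the 3*NUM-element allocation.

#include <stdio.h>

#define NUM (65535*2)

int main(void)
{
    long long allocated = 3LL * NUM;            /* 393210 elements */
    long long highest   = 3LL * (NUM - 1) + 2;  /* 393209, last index written */
    printf("allocated %lld elements, highest index written %lld\n",
           allocated, highest);
    printf("in bounds: %s\n", highest < allocated ? "yes" : "no");
    return 0;
}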

You are right, tera, I missed that. As I commented, I didn't go deeper and focused on the kernel part.

My apologies.

Any hints?
I've added 2 pictures to the first post, plus some more info.
"error.jpg" is shown whenever I run the compiled binary.
If I press "OK", I get "stackoverflow.jpg", so the problem is not a read beyond memory? But what is the stack, and what is its size limit? Given that limit, how can I fix my source?
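For what it's worth, the numbers are consistent with a stack overflow rather than an out-of-bounds read: int num[3*NUM] in main() is a local (stack) array of 3*131070*4 = 1,572,840 bytes, while an MSVC-built program reserves only 1 MB of stack by default. A sketch of the arithmetic and of the usual workaround (heap allocation), assuming sizeof(int) = 4 as the poster reports:

#include <stdio.h>
#include <stdlib.h>

#define NUM (65535*2)

int main(void)
{
    /* The local array in the original main() needs more than the
       1 MB (1,048,576 bytes) MSVC reserves for the stack by default. */
    size_t bytes = 3u * NUM * sizeof(int);
    printf("num[] needs %lu bytes\n", (unsigned long)bytes);

    /* Workaround: allocate on the heap instead of the stack. */
    int *num = (int *)malloc(bytes);
    if (num == NULL)
        return 1;
    /* ... use num exactly as before, then: */
    free(num);
    return 0;
}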

Sorry, can’t help you with these error messages. Any chance you can produce them in English?