the launch timed out and was terminated

Hi everybody,

I am new to CUDA, and I’d like to know why I get this error.

What I am trying to do is just to give values to an array. On the device:

while (tid < TAM) {
    /* the assignment to A[tid] goes here */
    tid += blockDim.x * gridDim.x;
}

If TAM is 10485760 (10 M elements) it works fine, but if I try with 100 M elements I get this error:

cudaSafeCall() Runtime API error in file xxx, line 155 : the launch timed out and was terminated.

A is one of the parameters I give to the function, so I thought it would be global memory, but the GPU has 4 GB of global memory…

Could anyone explain what is happening? Isn’t it global memory?

Another question. If I want to declare a variable on the device without passing it as an argument to the function, I declare it outside the device and host functions with __device__. For example:

__device__ float B[TAM];

Is that correct?

Thank you so much, and sorry for my poor English.

Which operating system is this under and is a display connected to the card?
The error message you get indicates that the watchdog timer triggered. As the display cannot be updated while a kernel is running, this timer is kind of a last resort if a kernel has an infinite loop or takes too long, where too long is about 2 to 5 seconds.
The solution is either to make the kernel do less work, or (under Linux) not to run X on the card.

I’m a bit surprised though that the card cannot write 400 Mbyte in under two seconds. What type is A and what card are you using?
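Both questions (which card, and whether the watchdog applies to it) can be answered from the device properties. The following is only a sketch: it assumes the CUDA runtime API is available and simply prints, for every device it finds, the name, the amount of global memory, and the `kernelExecTimeoutEnabled` field, which reports whether a run-time limit on kernels is active.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // kernelExecTimeoutEnabled is non-zero when the watchdog can kill long kernels.
        printf("device %d: %s, %.0f MiB global memory, run-time limit on kernels: %s\n",
               dev, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

If the limit is reported as active on the card you compute on, long kernels have to be split into shorter launches or moved to a card that does not drive a display.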

Declaring a device variable as
__device__ float B[TAM];
is correct as long as TAM is a compile-time constant.
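As a rough illustration of such a declaration in use, here is a minimal sketch. The value of TAM, the kernel name `fillB`, and the launch configuration are made up for the example; `cudaMemcpyFromSymbol` is the runtime call for copying a `__device__` variable back to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TAM 1024               // illustrative value; must be known at compile time

__device__ float B[TAM];       // statically allocated in device global memory

// Hypothetical kernel: every thread strides through B and writes a value.
__global__ void fillB(float value)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    while (tid < TAM) {
        B[tid] = value;
        tid += blockDim.x * gridDim.x;
    }
}

int main()
{
    fillB<<<4, 64>>>(2.5f);    // assumed launch configuration
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the device array back through its symbol, not through a pointer.
    static float hostB[TAM];
    err = cudaMemcpyFromSymbol(hostB, B, sizeof(hostB));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpyFromSymbol: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("B[0] = %f, B[TAM-1] = %f\n", hostB[0], hostB[TAM - 1]);
    return 0;
}
```

Note that the kernel takes no pointer argument for B: the array is visible to all device code in the same file.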

Grid and block size matter here; GPU utilization is probably low.

I am working with Linux, but I don’t know whether there is a display connected or not because the machine belongs to the university (I connect through ssh), but I’ll ask.

The GPU is an NVIDIA Quadro FX 5800.

As I am just testing, I am using only 8 threads, maybe it’s because of that.

Definitely at least use 32 threads per block to allow full coalescing of the memory accesses.

Better use enough threads to fully load the device. What’s the point in having most of the device lie idle during testing?

What’s the full execution configuration? I’d suggest 64 threads per block and something like at least 1000 blocks.
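To make that suggestion concrete, here is a sketch of a launch with 64 threads per block and 1024 blocks, timed with CUDA events. The kernel name `fillArray`, the value written, and the element count (100 Mi floats, about 400 MiB) are assumptions for the example, not the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread strides over the array by the total thread count.
__global__ void fillArray(float *A, size_t n, float value)
{
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    while (tid < n) {
        A[tid] = value;
        tid += stride;
    }
}

int main()
{
    const size_t n = (size_t)100 * 1024 * 1024;   // assumed size: 100 Mi floats
    const int threadsPerBlock = 64;
    const int blocks = 1024;

    float *dA = NULL;
    cudaError_t err = cudaMalloc(&dA, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fillArray<<<blocks, threadsPerBlock>>>(dA, n, 1.0f);
    cudaEventRecord(stop);

    // Waiting on the stop event also surfaces a watchdog timeout, if one occurred.
    err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        cudaFree(dA);
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d blocks x %d threads: %.3f ms\n", blocks, threadsPerBlock, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    return 0;
}
```

Running it once with `<<<1, 8>>>` and once with the configuration above should show how much the execution configuration alone changes the run time.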

What type is A?

I just wanted to know if there is any problem with using huge amounts of global memory, like 1 or 2 GB.

A is float.

There shouldn’t be, unless you are under Windows. Check return codes to see if the allocation was successful or not.
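Checking the return code could look roughly like the sketch below. The 1 GiB request size is an arbitrary choice for the example; `cudaMemGetInfo` is used only to show how much memory the runtime reports as free before the attempt.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("free: %zu MiB of %zu MiB\n", freeBytes >> 20, totalBytes >> 20);

    const size_t bytes = (size_t)1 << 30;   // assumed request: 1 GiB
    float *dA = NULL;
    err = cudaMalloc(&dA, bytes);
    if (err != cudaSuccess) {
        // A failed allocation is reported here, not at the later kernel launch.
        fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                bytes, cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes\n", bytes);

    cudaFree(dA);
    return 0;
}
```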

It would be helpful if you also answered questions and not just posted new ones. The execution configuration is not directly related to the amount of memory you allocate. You don’t have to allocate gigabytes of memory to run thousands of threads.

From what you write I’m under the impression that you are not aware of the difference between CPU threads and GPU threads. Check out the Programming Guide, particularly chapter 2, to learn about this.