5 seconds limitation? or a bug in my kernel?

Hi all,

I am using an 8800 GTX as both the graphics card and the CUDA coprocessor on Fedora Linux. After browsing some threads in the forum and the CUDA guide, I am puzzled by two things:

  1. Is the GPU clocked at 1.35 GHz or 575 MHz (or even, according to the v0.8 programming guide, 675 MHz)?

  2. Is there a maximum execution time for one kernel call if the GPU is used for graphics as well as CUDA computation? I have seen quite a few different answers: some say it is only a limitation on Windows; some say it holds for both Windows and Linux; some say the limit is 5 seconds; some say 5-7 seconds; and some say that if you don't run X on Linux (e.g. ssh into the GPU machine from another one), the limit is removed.

I wrote the simple code below to test it. Since I haven't figured out how to use the clock() function to measure running time, I set CUDA_PROFILE=1 to record the time. The code does 1.35e9 additions, with the operand in shared memory, in one block with only one thread. So if the GPU is clocked at 1.35 GHz, I expect the program to run for 3-7 seconds (depending on how many machine instructions are generated for the loop body). However, running this code raises more questions:
3. It never outputs the correct sum (1350000000), but always 987. I don't understand why.
4. If I decrease the number of loop iterations, say to 135000000, it outputs the correct result most of the time, but once in a while it still prints 987.
5. If I increase the number of loop iterations beyond the supposed 5-second limit, the machine NEVER hangs or needs to be rebooted (whether running under X or from a command-line ssh session on another machine).

I am very frustrated. Does anyone have an answer to any of the questions above? Many, many thanks!

-------------------------beginning of the code------------------------------------------
#include<stdio.h>
#include<stdlib.h>

__global__ void add(unsigned long *pt) {
    __shared__ unsigned long sum, i;
    sum = 0;
    for (i = 0; i < 1.35e9; i++)
        sum += 1;
    *pt = sum;
}

int main(void) {
    unsigned long *ptd;
    unsigned long *pt;

    pt = (unsigned long *) malloc(sizeof(long));
    *pt = 987;
    cudaMalloc((void **)&ptd, sizeof(long));
    add<<<1,1>>>(ptd);
    cudaThreadSynchronize();
    cudaMemcpy(pt, ptd, sizeof(long), cudaMemcpyDeviceToHost);
    cudaFree(ptd);
    printf("sum is %lu\n", *pt);
    return 0;
}

----------------------end of the code---------------------------------------------

PS: Sorry for making the post so long.

Cheers,
Timtimac

I'm not the best person to answer all your questions, but I may be able to help with a few. I modified your program a bit (shown below) to check error conditions. Indeed, when looping to 1.35e9 you are exceeding the time limit; reduce this by a few orders of magnitude and all is fine. The modified version also shows how to time the run using the cutil timer functions.

I hear your frustration with the time limit. I too have kernels that have to be broken up because of it. This limitation is present on both Windows and Linux. It may not be present when not running X on Linux… I don't know. The best solution seems to be to get a second card and do your computation on the card that is not being used for graphics (a minimal sketch of selecting the compute device follows the modified program below). If you go this route, I believe both cards must be NVIDIA cards, as the NVIDIA and ATI drivers apparently will not coexist peacefully.

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

#define CUDA_CHECK_ERROR() \
  { \
    cudaError_t ce = cudaGetLastError(); \
    if (ce != cudaSuccess) { \
      printf("%s\n", cudaGetErrorString(ce)); \
      exit(EXIT_FAILURE); \
    } \
  }

__global__ void add(unsigned long *pt) {
  //__shared__ unsigned long sum, i;
  unsigned long sum, i;
  sum = 0;
  //for (i = 0; i < 1.35e9; i++)
  for (i = 0; i < 1.35e6; i++)
    sum += 1;
  *pt = sum;
}

int main(void) {
  unsigned int timer;
  cutCreateTimer(&timer);
  cutStartTimer(timer);

  unsigned long *ptd;
  unsigned long *pt;

  pt = (unsigned long *) malloc(sizeof(long));
  *pt = 987;
  cudaMalloc((void **)&ptd, sizeof(long));
  CUDA_CHECK_ERROR();
  add<<<1,1>>>(ptd);
  CUDA_CHECK_ERROR();
  cudaThreadSynchronize();
  CUDA_CHECK_ERROR();
  cudaMemcpy(pt, ptd, sizeof(long), cudaMemcpyDeviceToHost);
  CUDA_CHECK_ERROR();
  cudaFree(ptd);
  CUDA_CHECK_ERROR();

  cutStopTimer(timer);
  float cuda_time = cutGetTimerValue(timer);
  cutDeleteTimer(timer);

  printf("sum is %lu\n", *pt);
  printf("CUDA time: %f\n", cuda_time);

  return 0;
}
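Regarding the two-card suggestion above, here is a minimal sketch of picking the compute card at startup. The device index is an assumption on my part: on your machine the non-display card may have a different index.

#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);

    // Assumption: device 0 drives the display and device 1 is free for
    // computation. Swap the index if your setup differs.
    if (count > 1)
        cudaSetDevice(1);

    // ... allocate memory and launch kernels as usual ...
    return 0;
}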

The multiprocessors (which do instruction decoding, warp management and other things) run at 575 MHz, but the 16 processors (ALUs, essentially) on each multiprocessor are clocked at 1.35 GHz. Since threads are executed in warps of 32, and instructions take 2 clocks to bubble through the pipeline, the multiprocessor has enough time to finish decoding the next instruction before the current warp finishes. (This is a rough description, as there isn't much more detailed information available on the hardware.)
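As a quick sanity check, the runtime reports a clock of its own: cudaDeviceProp::clockRate is given in kHz, and on an 8800 GTX it should correspond to the ~1.35 GHz ALU clock rather than the 575 MHz core clock. A small sketch, assuming device 0 is your GTX:

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: clockRate = %d kHz\n", prop.name, prop.clockRate);
    return 0;
}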

If you are using the GPU for both graphics and CUDA at the same time, there is a timeout for all platforms. There are reports that disabling X, but making sure the nvidia kernel module is loaded, will avoid the timeout. I find this locks up our system here, but I haven’t investigated it carefully since the 5 sec timeout is not a problem for us.
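Along the same lines as the property query above, later CUDA toolkits expose a kernelExecTimeoutEnabled field in cudaDeviceProp; if your version has it (an assumption about your toolkit), you can ask the runtime directly whether the watchdog applies to a device:

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // kernelExecTimeoutEnabled may not exist in very old toolkits --
    // check your toolkit's headers before relying on it.
    printf("%s: run-time limit on kernels is %s\n",
           prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}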

You should note that running a kernel with one thread in one block gives very poor performance, and performance does not scale linearly with the thread count: if 1 thread does the job in 5 seconds, 64 threads will usually be more than 64 times faster. Many of the speed-enhancing features of the GPU, such as instruction pipelining and block interleaving, can't be used when there is only one thread.
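For example, here is a rough sketch (my own, not tested against your setup) of spreading the same counting loop over a grid of threads: each thread does its share of the iterations and writes one partial sum, and the host adds up the partials afterwards. The 64x64 launch geometry is just illustrative.

#include <stdio.h>

// Each thread counts its own share and writes one partial sum.
__global__ void add_many(unsigned long *partial, unsigned long per_thread) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < per_thread; i++)
        sum += 1;
    partial[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main(void) {
    const int blocks = 64, threads = 64;
    const unsigned long total = 1350000000UL;
    unsigned long per_thread = total / (blocks * threads);

    unsigned long *partial_d;
    cudaMalloc((void **)&partial_d, blocks * threads * sizeof(unsigned long));

    add_many<<<blocks, threads>>>(partial_d, per_thread);
    cudaThreadSynchronize();

    unsigned long partial_h[blocks * threads];
    cudaMemcpy(partial_h, partial_d, sizeof(partial_h), cudaMemcpyDeviceToHost);
    cudaFree(partial_d);

    unsigned long sum = 0;
    for (int i = 0; i < blocks * threads; i++)
        sum += partial_h[i];

    // Note: total/(blocks*threads) truncates, so this sum can fall slightly
    // short of 1350000000.
    printf("sum is %lu\n", sum);
    return 0;
}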

I'm not sure why this code doesn't work (and both of our GTX cards are in use, so I can't test), but a minor comment: I don't think you need to declare your index i or your sum as shared. The __shared__ qualifier is only needed for variables or arrays that must be visible to all threads in the block. Since you only have one thread, there is nothing to share.
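For what it's worth, here is a toy sketch (hypothetical, unrelated to the timing question) of a case where __shared__ actually matters, i.e. where several threads in a block need to see the same data:

#include <stdio.h>

// Every thread in the block writes one slot of a shared array, then
// thread 0 reads what all the others wrote.
__global__ void share_demo(int *out) {
    __shared__ int scratch[64];            // visible to every thread in the block
    scratch[threadIdx.x] = threadIdx.x;    // each thread fills its own slot
    __syncthreads();                       // wait for all writes to finish
    if (threadIdx.x == 0) {
        int total = 0;
        for (int i = 0; i < 64; i++)
            total += scratch[i];           // 0 + 1 + ... + 63 = 2016
        *out = total;
    }
}

int main(void) {
    int *out_d, out_h = 0;
    cudaMalloc((void **)&out_d, sizeof(int));
    share_demo<<<1, 64>>>(out_d);          // one block of 64 threads
    cudaThreadSynchronize();
    cudaMemcpy(&out_h, out_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    printf("shared sum = %d\n", out_h);    // expect 2016
    return 0;
}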