Incorrect CUDA results on non-primary display


I am using CUDA under Windows XP. According to the following CUDA release notes:

I use a GeForce 6800 GT as the primary graphics card for display and an 8800 GTX for computation. Windows display driver version 97.73, the one for CUDA Toolkit version 0.8, is installed for both of them.

Here is the host code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil.h>
#include <>

#define BLOCKNUM 16
#define THREADNUM 32

void runTest( int argc, char** argv);


int main( int argc, char** argv)
{
    runTest( argc, argv);

    CUT_EXIT(argc, argv);
}


void runTest( int argc, char** argv)
{
    FILE* output = fopen("output", "w");

    int memsize_byte = sizeof(int) * BLOCKNUM * THREADNUM;
    int *d_output;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_output, memsize_byte ) );

    dim3  grid(BLOCKNUM, 1, 1);
    dim3  threads(THREADNUM, 1, 1);

    // Set timer
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    printf("Begin testing...\n");

    testKernel<<<grid, threads>>>(d_output);
    CUT_CHECK_ERROR("Kernel execution failed");

    printf("Computation completed.\n");

    CUT_SAFE_CALL( cutStopTimer( timer));
    printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));
    CUT_SAFE_CALL( cutDeleteTimer( timer));

    // allocate mem for the result on host side
    int* results = (int*) malloc(memsize_byte);

    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(results, d_output, memsize_byte, cudaMemcpyDeviceToHost) );

    for (int i = 0; i < BLOCKNUM * THREADNUM; ++i)
    {
        fprintf(output, "%d, %d\n", i, results[i]);
    }

    // release host and device resources
    fclose(output);
    free(results);
    CUDA_SAFE_CALL( cudaFree(d_output) );
}


And the kernel:

__global__ void
testKernel(int* output)
{
    // global index of this thread across all blocks
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int tempValue = 0;

    // busy loop; the loop bounds control how long the kernel runs
    for (int i = 0; i < 10000; ++i)
    {
        for (int j = 0; j < 10000; ++j)
        {
            tempValue = max( (i + j) % 4, (i + j) % 3 ) + 2;
        }
    }

    output[tid] = tempValue;
    tempValue = 0;
}


No compile-time or run-time errors are reported. However, the computed results are not always correct. Changing the upper limit of i or j in the kernel changes the run time. With the current values it is about 6 sec. When the upper limit of j is changed to, say, 5000, the run time is within 3 sec. My problem is that when the kernel runs for more than 5 sec, the results are all 0; when it finishes within 5 sec, the results are correct.

Any suggestions are appreciated.
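One way to tell whether the kernel really ran to completion, rather than being stopped part-way, is to check the runtime's error status explicitly after the launch and to pre-fill the output buffer with a sentinel. The following is only a minimal sketch under assumptions: it uses cudaDeviceSynchronize from later CUDA runtimes (older toolkits offered cudaThreadSynchronize instead; whether either is available in Toolkit 0.8 is not confirmed here), the helper name kernelSucceeded is hypothetical, and the kernel is a trivial stand-in for the one in the post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for the testKernel in the post (assumption).
__global__ void testKernel(int* output)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    output[tid] = tid;
}

// Hypothetical helper: returns true only if both the launch and the
// kernel execution finished without an error.
static bool kernelSucceeded(const char* label)
{
    // Errors from the launch itself (bad configuration, etc.).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Wait for the kernel to finish; execution errors surface here.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int blocks = 16, threads = 32;
    const size_t bytes = sizeof(int) * blocks * threads;

    int* d_output = 0;
    if (cudaMalloc((void**)&d_output, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Sentinel: every byte 0xFF, so each int reads as -1 until the
    // kernel overwrites it. Untouched elements are then easy to spot.
    cudaMemset(d_output, 0xFF, bytes);

    testKernel<<<blocks, threads>>>(d_output);
    const bool ok = kernelSucceeded("testKernel");

    cudaFree(d_output);
    return ok ? 0 : 1;
}
```

With the sentinel in place, a buffer that still reads -1 everywhere after the copy back would suggest the kernel never wrote its results, which is a different situation from the kernel computing wrong values.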

This might mean that your G80 is still “controlled” by Windows. Make sure that no monitor is connected to the G80 and that the Windows desktop is not extended to it. I run WinXP with the same drivers and dual G80s. My program ran successfully on the secondary G80 for over a minute, while the entire system locks up if it’s run on the primary one.
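To see which card a program would actually run on, and whether the driver applies a run-time limit to kernels on it, the devices can be listed and one selected explicitly. This is a hedged sketch, not something available in Toolkit 0.8 as far as is known here: it assumes a later CUDA runtime that provides cudaDeviceGetAttribute with cudaDevAttrKernelExecTimeout, and the policy of picking the first device without a limit is just an illustrative choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }

    int chosen = -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;
        }

        // Non-zero means the driver limits how long a kernel may run
        // on this device (typical when it also drives a display).
        int timeoutEnabled = 0;
        cudaDeviceGetAttribute(&timeoutEnabled,
                               cudaDevAttrKernelExecTimeout, dev);

        printf("Device %d: %s, kernel run-time limit: %s\n",
               dev, prop.name, timeoutEnabled ? "yes" : "no");

        // Illustrative policy: take the first device without a limit.
        if (chosen < 0 && !timeoutEnabled) {
            chosen = dev;
        }
    }

    if (chosen < 0) {
        printf("Every device has a run-time limit; falling back to device 0\n");
        chosen = 0;
    }

    if (cudaSetDevice(chosen) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", chosen);
        return 1;
    }
    printf("Using device %d\n", chosen);
    return 0;
}
```

If the 8800 GTX is reported with a run-time limit, that would be consistent with Windows still treating it as a display device.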


Thanks, Paulius. How did you install the driver for the primary and secondary cards? In my case, I first plugged the 6800 and 8800 cards into the board (the 6800 is in the primary PCI_E slot). Then I connected the monitor to the 6800. After turning on the machine and entering VGA mode, I installed the 97_73 driver. Is this the correct way to install the driver?

I have also tried installing different drivers for the primary 6800 (93_71) and the 8800 (97_73). The results are still incorrect when the run time is more than 5 sec.

Any suggestions are welcome.
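If the roughly 5 sec limit cannot be lifted, a possible workaround is to split the long loop across several shorter kernel launches so that no single launch runs that long. The sketch below is only an illustration under assumptions: the kernel name testKernelChunk, the iBegin/iEnd parameters, and the chunk size I_CHUNK are hypothetical, and it relies on cudaDeviceSynchronize from later CUDA runtimes. Each launch covers a slice of the outer i loop and carries tempValue over through the output buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCKNUM  16
#define THREADNUM 32
#define I_LIMIT   10000
#define J_LIMIT   10000
#define I_CHUNK   1000   // hypothetical slice size; tune for the hardware

// Same computation as the kernel in the post, restricted to the
// outer-loop range [iBegin, iEnd).
__global__ void testKernelChunk(int* output, int iBegin, int iEnd)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int tempValue = output[tid];   // value carried over from the last slice

    for (int i = iBegin; i < iEnd; ++i) {
        for (int j = 0; j < J_LIMIT; ++j) {
            tempValue = max((i + j) % 4, (i + j) % 3) + 2;
        }
    }
    output[tid] = tempValue;
}

int main()
{
    const size_t bytes = sizeof(int) * BLOCKNUM * THREADNUM;
    int* d_output = 0;
    if (cudaMalloc((void**)&d_output, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_output, 0, bytes);   // matches the initial tempValue = 0

    // One short launch per slice of the outer loop.
    for (int iBegin = 0; iBegin < I_LIMIT; iBegin += I_CHUNK) {
        int iEnd = iBegin + I_CHUNK;
        if (iEnd > I_LIMIT) {
            iEnd = I_LIMIT;
        }

        testKernelChunk<<<BLOCKNUM, THREADNUM>>>(d_output, iBegin, iEnd);

        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess) {
            err = cudaDeviceSynchronize();   // finish this slice first
        }
        if (err != cudaSuccess) {
            fprintf(stderr, "slice starting at i=%d failed: %s\n",
                    iBegin, cudaGetErrorString(err));
            cudaFree(d_output);
            return 1;
        }
    }

    int* results = (int*)malloc(bytes);
    if (results == NULL) {
        cudaFree(d_output);
        return 1;
    }
    cudaMemcpy(results, d_output, bytes, cudaMemcpyDeviceToHost);
    printf("results[0] = %d\n", results[0]);

    free(results);
    cudaFree(d_output);
    return 0;
}
```

This does not address why the 8800 GTX is subject to the limit in the first place; it only keeps each launch short enough to stay under it.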

I actually haven’t tried it with two different cards; mine are the same type. But my installation process was the same as yours: plug in both cards, connect the monitor to one, boot, install the drivers, make sure SLI is off. It did seem that the drivers got installed twice (I had to click the same boxes twice), but I’m not going to swear to that. Perhaps someone else has experience with two different cards and CUDA?