Speed problem on GTX 295 cards

Thanks again for your input. I wrote a very simple program to see whether it crashes as well, and to my surprise it did. I have pasted the code for this simple (and totally useless) program below. If I start a single run of this code, it finishes without any problem, but if I start a second run while the first one is still running, the first one crashes with an "unspecified launch failure". I would be very grateful if someone could try this code on their computer (if they have more than one device for calculation) and launch at least 2 runs of it at the same time. Perhaps I'm doing something wrong in this simple code as well? Otherwise the problem must lie somewhere other than the code.

The reason for the massive loop is just that I want the execution to stay on the GPU for a while.

[codebox]#include <iostream>
#include <assert.h>

using namespace std;

// Busy-loop kernel: every thread does the same work and writes its result to kD[0].
__global__ void VecAdd(float a, float b, float c, float* kD) {
    float d = 0;
    // Long loop so that execution stays on the GPU for a while.
    for(int i=0; i<100000000; i++)
        d += a+b+c+ kD[8]+i;
    kD[0]=d;
}

int main() {
    float minarray[20] = {114,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    float *keeperDevice;

    // Allocate the device buffer and copy the host array into it.
    cudaError_t t1 = cudaMalloc((void**)&keeperDevice, 20*sizeof(float) );
    assert(t1==cudaSuccess);
    t1 = cudaMemcpy(keeperDevice, minarray, 20*sizeof(float), cudaMemcpyHostToDevice);
    assert(t1==cudaSuccess);

    // Report which device this run uses and how many devices are visible.
    int cgd = 0;
    int cgdc = 0;
    t1 = cudaGetDevice(&cgd);
    assert(t1==cudaSuccess);
    t1 = cudaGetDeviceCount(&cgdc);
    assert(t1==cudaSuccess);
    cout << "cudaGetDevice: " << cgd << endl;
    cout << "cudaGetDeviceCount: " << cgdc << endl;

    VecAdd<<<128, 128>>>(5.0,6.0,7.0,keeperDevice);

    // This copy waits for the kernel to finish, so a launch failure shows up here.
    t1 = cudaMemcpy(minarray, keeperDevice, sizeof(float)*20, cudaMemcpyDeviceToHost);
    assert(t1==cudaSuccess);

    cout << "Done";
}[/codebox]
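
For anyone who wants to try this on a multi-GPU machine, here is a minimal sketch of a variant that puts each run on its own device and prints the CUDA error string instead of asserting. It is only a guess at a useful test, not a known fix: the program above never calls cudaSetDevice, so both runs will typically end up on the default device. The helper `pickDevice`, the macro `CUDA_CHECK`, and the convention of passing a run index as the first command-line argument are all my own assumptions for illustration, and the loop is shortened so the sketch finishes quickly. `cudaDeviceSynchronize` requires CUDA 4.0 or later.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: map a run index to a device ordinal, round robin.
// The double modulo keeps the result in [0, deviceCount) for negative input.
int pickDevice(int runIndex, int deviceCount) {
    return (runIndex % deviceCount + deviceCount) % deviceCount;
}

// Hypothetical macro: print the CUDA error string with its location and exit.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Same kernel as in the post, with a much shorter loop.
__global__ void VecAdd(float a, float b, float c, float* kD) {
    float d = 0;
    for (int i = 0; i < 1000; i++)
        d += a + b + c + kD[8] + i;
    kD[0] = d;
}

int main(int argc, char** argv) {
    // Assumed convention: the first argument is the run index (0, 1, ...).
    int runIndex = (argc > 1) ? atoi(argv[1]) : 0;

    int deviceCount = 0;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount == 0) {
        fprintf(stderr, "No CUDA device found\n");
        return EXIT_FAILURE;
    }

    // Select the device before any allocation so the buffer lands on it.
    int device = pickDevice(runIndex, deviceCount);
    CUDA_CHECK(cudaSetDevice(device));
    printf("Run %d using device %d of %d\n", runIndex, device, deviceCount);

    float host[20];
    for (int i = 0; i < 20; i++) host[i] = (float)(i + 1);

    float* dev = NULL;
    CUDA_CHECK(cudaMalloc((void**)&dev, sizeof(host)));
    CUDA_CHECK(cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice));

    VecAdd<<<128, 128>>>(5.0f, 6.0f, 7.0f, dev);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
    printf("Done\n");
    return 0;
}
```

Compiled with nvcc, this could be started as `./a.out 0` and `./a.out 1` in two terminals (the binary name is just the compiler default). If the two runs then finish cleanly on separate devices, that would at least suggest the crash is related to both runs sharing one device.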