CUDA and CUBLAS

I have an array and I want to retrieve its minimum element.
The array is filled inside a kernel, and that all works fine. Then, outside this kernel, I use the cublasIsamin function to get the index of the minimum element, and after that I call another kernel to do some more computations.
The problem is that the cublas function is quite slow (the array has about 10 000 elements), and before I execute the second kernel, precisely at the instruction

CUT_DEVICE_INIT();

I get a runtime error.
If I don’t execute the cublas function, the error goes away.
What could cause this error? Could the device be running out of memory?

Why are you initializing the device again after calling the cublas function? You should only initialize the device once when your thread starts.

I thought I had to reinitialise it every time…
But apparently that is not the problem. I still get a runtime error at the next CUDA function (CUDA_SAFE_CALL(cudaFree(…)) or CUDA_SAFE_CALL(cudaMalloc(…))) I execute after the cublasIsamin()…

Hmm, that is odd. I’ve never used cublas, but if I had to guess from the information you’ve given, maybe your call to cublasIsamin() is causing the problem (i.e. writing outside of a memory array or something). Have you double-checked the values of all arguments being passed to the cublas function? Could you post a short code example that demonstrates the problem (preferably an example that can be copied and pasted and then compiled with nvcc -o file file.cu)?

My project is quite big and I’m using the Visual Studio 2005 environment, so I tested a very simple piece of code and I get the same problems. (It’s not so clear to me how to write code that builds with nvcc… but I think the code I’m posting here should be simple to test.)

The odd things I noticed are:

   1. The cublasIsamin(..) function takes a considerable amount of time to do its job (here the vector has only 10 elements, but it’s the same for a vector of 100 elements).

   2. The cublasIsamin(..) function always returns the index 0, so it apparently does not work.

   3. No matter which CUDA instruction we use after it, we get a runtime error.

    float *dummy;
    float *er_vect = (float *)malloc(10*sizeof(float));

    for(int i=0; i<9; i++){
        er_vect[i]=15-i;
        printf("er_vect[%d]= %f\n",i,er_vect[i]);
    }

    int indmin = cublasIsamin(10, er_vect, 1);
    printf(" minimum element= %f (indmin= %d)\n",er_vect[indmin], indmin);

    CUDA_SAFE_CALL( cudaMalloc( (void**) &dummy, 10*sizeof(float)));
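
Two things about the snippet above may be worth checking. As far as I know, cublasIsamin() in this CUBLAS API works on a vector that lives in device memory and needs cublasInit() to have been called first, whereas er_vect here comes from malloc() on the host. It also reports a 1-based index (and 0 when n or incx is not positive), and it compares absolute values. To compare against what the GPU call should return, a plain C host-side reference can help. This is only a sketch: ref_isamin is a hypothetical helper name, not part of CUBLAS.

```c
#include <stddef.h>

/* Hypothetical host-side reference, NOT a CUBLAS function.
   Mirrors the documented isamin convention: returns the 1-based
   position of the first element with the smallest absolute value,
   or 0 when n <= 0 or incx <= 0. */
int ref_isamin(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0 || x == NULL)
        return 0;

    int best = 0;                              /* 0-based position of best element so far */
    float best_abs = x[0] < 0.0f ? -x[0] : x[0];

    for (int i = 1; i < n; i++) {
        float v = x[(size_t)i * (size_t)incx]; /* i-th element, stepping by incx */
        float a = v < 0.0f ? -v : v;           /* absolute value without libm */
        if (a < best_abs) {                    /* strict '<' keeps the first of equal minima */
            best_abs = a;
            best = i;
        }
    }
    return best + 1;                           /* shift to the 1-based convention */
}
```

For a ten-element vector holding 15, 14, ..., 6, ref_isamin(10, v, 1) gives 10, so the minimum would be read as v[indmin - 1] rather than v[indmin].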