cuBLAS not working for multi-GPU

Hi, I have written a program that runs on multiple GPUs. The code works fine on one GPU; the problem appears when it is run on two GPUs. The trouble is with an operation that takes the maximum of a large number of cosine values using the cublasIsamax function in cuBLAS. Can anyone help me with this?

Perhaps if you post a minimal example that reproduces your problem, someone could compile and run the code and try to help you.

I’ve just finished coding some multi-gpu stuff that involved a lot of reductions, so if you can show me what calls were being made, I’ll have a look at it. Note: I generally coded my own reductions as I was having trouble getting the dot product functions to work as I wanted them. Also, the cublas reductions aren’t the fastest.
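As a rough illustration of the host-side part of a hand-written multi-GPU max reduction like the one mentioned above: each GPU reports the best value in its own slice, and the host merges those partial results into one global index. This is only a sketch under assumptions; the `PartialMax` type and `merge_partial_max` function are hypothetical names invented for this example, not part of cuBLAS or of the code in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* One partial result per GPU (hypothetical type for illustration). */
typedef struct {
    float value;        /* largest value found in this GPU's slice        */
    int   local_index;  /* 0-based index of that value within the slice   */
    int   slice_offset; /* global index of the first element of the slice */
} PartialMax;

/* Merge per-device partial maxima on the host.
 * Returns the global 0-based index of the overall maximum and stores its
 * value in *out_value (if non-NULL). Returns -1 if count <= 0. */
int merge_partial_max(const PartialMax *parts, int count, float *out_value)
{
    if (count <= 0)
        return -1;
    assert(parts != NULL);

    int best = 0;
    for (int d = 1; d < count; d++) {
        if (parts[d].value > parts[best].value)
            best = d;
    }
    if (out_value != NULL)
        *out_value = parts[best].value;
    return parts[best].slice_offset + parts[best].local_index;
}
```

The per-device values would come from whatever reduction runs on each GPU; only the final merge is shown here, and ties are resolved in favour of the lower device number.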

Here is the function where the problem occurs:
void gpuAsyncSearchableCategorization( int M, int N, int numExamples, float* examples, float* results, float* alpha, float* beta, int deviceID ) {

cudaError_t error_id;
cublasStatus_t status;

gpuSetDevice(deviceID);

float maxCosine;
int maxIndex;	// store indices of max cosines


// compute dot product between example and searchable vectors
for(int i = 0; i < numExamples; i++) {

	// copy the next example to working memory
	error_id = cudaMemcpy(dev_working[deviceID], examples + M*i, M*sizeof(float), cudaMemcpyHostToDevice);

	if (cudaSuccess != error_id) 
	{
		printf ("gpuAsyncSearchableCategorization:		cudaMemcpy operation failed\n");
		gpuCublasDestroy();
		exit(EXIT_FAILURE);
	}

	printf("deviceID = %d, vectorID = %d\n", deviceID, i);

	// multiply transpose of MxN matrix A by M-dimensional vector x
	// placing results in results memory  
	cublasSgemv(handle, CUBLAS_OP_T, M, (deviceID == 0) ? N - numExamples : N, alpha,
                       (deviceID == 0) ? dev_data[deviceID] + numExamples*M : dev_data[deviceID], M,
                       dev_working[deviceID], 1,
                       beta, dev_results[deviceID], 1);

	error_id = cudaGetLastError(); 

	if (cudaSuccess != error_id) 
	{
		printf ("Sgemv operation failed\n");
		//if(status == CUBLAS_STATUS_NOT_INITIALIZED) printf("library not initialized\n");
		//if(status == CUBLAS_STATUS_INVALID_VALUE) printf("invalid parameter values\n");
		//if(status == CUBLAS_STATUS_EXECUTION_FAILED) printf("execution failed\n");
		//if(status == CUBLAS_STATUS_ARCH_MISMATCH) printf("architecture mismatch\n");
		gpuCublasDestroy();
		exit(EXIT_FAILURE);
	}

	//printf("completed Sgemv operation\n");

	// TODO: is the problem with maxIndex??
	// find the index of the maximum cosine value now in the results buffer
	//cublasIsamax(handle, (deviceID == 0) ? N - numExamples : N, dev_results[deviceID], 1, (deviceID == 0) ? &maxIndex : &maxIndex2);
	//status = cublasIsamax(handle, (deviceID == 0) ? N - numExamples : N, dev_results[deviceID], 1, &maxIndex);
	//vectorMaxKernel((deviceID == 0) ? N - numExamples : N, dev_results[deviceID], dev_results[deviceID] + N, block_size, streams[deviceID]);

	// cublasIsamax returns a 1-based index; convert it to 0-based
	if (maxIndex > 0)
		maxIndex -= 1;

	// copy the maximum cosine value of this set to the host
	error_id = cudaMemcpyAsync(&maxCosine, dev_results[deviceID] + maxIndex, sizeof(float), cudaMemcpyDeviceToHost, streams[deviceID]);

	if (cudaSuccess != error_id) {
		printf ("gpuAsyncSearchableCategorization:		cudaMemcpyAsync operation failed\n");
		gpuCublasDestroy();
		exit(EXIT_FAILURE);
	}

	// printf("maxCosine = %f\n", maxCosine);

	// save the max cosine value and its index within the searchable vectors to the host,
	results[2*i + 2*numExamples*deviceID] = (float)maxIndex;
	results[2*i + 2*numExamples*deviceID + 1] = maxCosine;
}

}
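One way to check what the commented-out cublasIsamax call should return is to copy the results buffer back to the host and compare against a CPU reference. The sketch below assumes the usual BLAS convention that cublasIsamax follows: it returns the 1-based index of the element with the largest absolute value, not the largest signed value. Since cosines can be negative, the two can differ. `ref_isamax` and `ref_argmax` are hypothetical helper names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without needing <math.h>. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* CPU reference for the BLAS "isamax" convention: 1-based index of the
 * first element with the largest absolute value, or 0 if n <= 0. */
int ref_isamax(int n, const float *x)
{
    if (n <= 0)
        return 0;
    assert(x != NULL);

    int best = 0;
    for (int i = 1; i < n; i++) {
        if (abs_f(x[i]) > abs_f(x[best]))
            best = i;
    }
    return best + 1; /* convert to 1-based, as BLAS does */
}

/* 0-based index of the first largest signed value, or -1 if n <= 0. */
int ref_argmax(int n, const float *x)
{
    if (n <= 0)
        return -1;
    assert(x != NULL);

    int best = 0;
    for (int i = 1; i < n; i++) {
        if (x[i] > x[best])
            best = i;
    }
    return best;
}
```

For the values {0.2, -0.9, 0.7}, the isamax convention picks the second element (the -0.9) while the signed maximum is the third element, so a mismatch between the two helpers on real data would suggest that the maximum-magnitude semantics, rather than the multi-GPU setup, explains an unexpected index.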

Yes, but to be useful (at least to me) you should provide complete, compilable, and executable code. You also haven't said what kind of error you receive… :-)