How to extract results from the device? cuBLAS and CUDA

Hello everyone!

I am trying to compute the dot product of two vectors using a cuBLAS function. My problem is that I don't know how to get the calculated result out of the device. Can someone tell me how to continue with this code, please?






#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;cublas.h&gt;

int main(int argc, char** argv) {

	float *h_vx, *d_vx, *h_vy, *d_vy;

	int i = 0;

	const int N = 10;

	cublasInit();

	// Allocate device memory
	size_t size = sizeof(float) * N;
	cudaMalloc((void **) &d_vx, size);
	cudaMalloc((void **) &d_vy, size);

	// Allocate host memory
	h_vx = (float *) malloc( size );
	h_vy = (float *) malloc( size );

	// Fill the vectors
	for(i = 0; i < N; i++) {
		h_vx[i] = (i+1)*2;
		h_vy[i] = (i+2)*2;
	}

	// Copy the elements to the device
	cudaMemcpy(d_vx, h_vx, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_vy, h_vy, size, cudaMemcpyHostToDevice);

	cublasSdot(N, d_vx, 1, d_vy, 1);   // how do I get the result from here?

	// Free memory
	free( h_vx );
	free( h_vy );
	cudaFree( d_vx );
	cudaFree( d_vy );

	cublasShutdown();

	return 0;
}




float result = cublasSdot(N, d_vx, 1, d_vy, 1); ?

Haven’t used it myself so I’m not sure if result is supposed to be host memory, just a guess…
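For completeness: the legacy cuBLAS API you are using does return the scalar directly to the host, but the newer v2 API (`cublas_v2.h`) instead takes a handle and writes the result through a pointer argument. An untested sketch of the v2 style (`device_dot` is a made-up helper name; `d_x` and `d_y` are device pointers like the `d_vx`/`d_vy` above):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Untested sketch: dot product with the cublas_v2 API, which takes a
 * handle and writes the scalar through the last pointer argument
 * instead of returning it. d_x and d_y must already be device pointers. */
float device_dot(int n, const float *d_x, const float *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float result = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);  /* result lands in host memory */

    cublasDestroy(handle);
    return result;
}
```

In both APIs the copy back to the host happens inside the library call, which is why no explicit cudaMemcpy is needed for the scalar.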


As simple as that :o. It worked, thank you.

One thing I don't really understand is how multithreading works with cuBLAS. How is cublasSdot executed in my program? I guess there is only one thread working on it, because cublasSdot is a host function and I can't put it in a kernel where I could work with thread operations. So how is multithreading realised with cuBLAS functions? (I hope I didn't skip this part when reading the cuBLAS manual.)


I seriously doubt that there's only a single thread working on it; my guess is that it's implemented as a reduction algorithm, but I'm not sure…


I think the CUBLAS function executes the operation on the device and then copies the result back to the host.

There is no indication of this in the documentation, however, and the source code is no longer available.

Try using the CUBLAS dot product function on two very large vectors and compare it with a host version. It should be much faster if CUBLAS is using multi-threading.
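If you do time it, CUDA events are a safer way to measure the device side than CPU timers, since the call may return before the GPU has finished. An untested sketch (`time_dot` is a made-up helper name; assumes `d_x` and `d_y` are already device pointers):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

/* Untested sketch: time a cuBLAS call with CUDA events so the measurement
 * covers the actual GPU work, not just the (possibly asynchronous) launch. */
void time_dot(int n, const float *d_x, const float *d_y) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    float result = cublasSdot(n, d_x, 1, d_y, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* wait until the GPU reaches 'stop' */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("dot = %f, took %f ms\n", result, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```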

Thank you for your hints. I used a 10000x10000 matrix and executed cublasSaxpy() a hundred times in a loop. The elapsed time using cuBLAS was 33 ms; a self-programmed function doing the same operations took around 170000 ms, so it seems to be multithreaded.

Thank you for your help.