How to extract results from the device? cuBLAS and CUDA

Hello everyone!

I am trying to compute the dot product of two vectors using a cuBLAS function. My problem is that I don't know how to get the calculated result out of the device. Can someone tell me how to continue with this code, please?






#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;cublas.h&gt;

int main(int argc, char** argv) {

	float *h_vx, *d_vx, *h_vy, *d_vy;

	int i = 0;

	const int N = 10;

	cublasInit();

	// Allocate device memory
	size_t size = sizeof(float) * N;
	cudaMalloc((void **) &d_vx, size);
	cudaMalloc((void **) &d_vy, size);

	// Allocate host memory
	h_vx = (float *) malloc( size );
	h_vy = (float *) malloc( size );

	// Fill the vectors
	for(i = 0; i < N; i++) {
		h_vx[i] = (i+1)*2;
		h_vy[i] = (i+2)*2;
	}

	// Copy the elements to the device
	cudaMemcpy(d_vx, h_vx, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_vy, h_vy, size, cudaMemcpyHostToDevice);

	cublasSdot(N, d_vx, 1, d_vy, 1);   // how do I get the result from here?

	// Free memory
	free( h_vx );
	free( h_vy );
	cudaFree( d_vx );
	cudaFree( d_vy );

	cublasShutdown();

	return 0;
}




float result = cublasSdot(N, d_vx, 1, d_vy, 1); ?

Haven’t used it myself so I’m not sure if result is supposed to be host memory, just a guess…
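For completeness: the legacy cuBLAS API you are using does return the scalar directly to the host, but the newer v2 API (`cublas_v2.h`) instead takes a handle and writes the result through a pointer argument. An untested sketch of the v2 style (`device_dot` is a made-up helper name; `d_x` and `d_y` are device pointers like the `d_vx`/`d_vy` above):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Untested sketch: dot product with the cublas_v2 API, which takes a
 * handle and writes the scalar through the last pointer argument
 * instead of returning it. d_x and d_y must already be device pointers. */
float device_dot(int n, const float *d_x, const float *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float result = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);  /* result lands in host memory */

    cublasDestroy(handle);
    return result;
}
```

In both APIs the copy back to the host happens inside the library call, which is why no explicit cudaMemcpy is needed for the scalar.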


As simple as that :o. It worked, thank you.

One thing I don't really understand is how multithreading works with cuBLAS. How is cublasSdot executed in my program? I guess there is only one thread working on it, because cublasSdot is a host function and I can't put it in a kernel where I could work with thread operations. So how is multithreading realised with cuBLAS functions? (I hope I didn't skip this part when reading the cuBLAS manual.)


I seriously doubt that there's only a single thread working on it; my guess is that it's implemented as a reduction algorithm, but I'm not sure…


I think the CUBLAS function executes the operation on the device and then copies the result back to the host.

There is no indication of this in the documentation, however, and the source code is no longer available.

Try using the CUBLAS dot product function on two very large vectors and compare it with a host version. It should be much faster if CUBLAS is using multi-threading.
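If you do time it, CUDA events are a safer way to measure the device side than CPU timers, since the call may return before the GPU has finished. An untested sketch (`time_dot` is a made-up helper name; assumes `d_x` and `d_y` are already device pointers):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

/* Untested sketch: time a cuBLAS call with CUDA events so the measurement
 * covers the actual GPU work, not just the (possibly asynchronous) launch. */
void time_dot(int n, const float *d_x, const float *d_y) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    float result = cublasSdot(n, d_x, 1, d_y, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* wait until the GPU reaches 'stop' */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("dot = %f, took %f ms\n", result, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```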

Thank you for your hints. I used a 10000x10000 matrix and executed cublasSaxpy() a hundred times in a loop. The elapsed time using cuBLAS was 33 ms; a self-programmed function doing the same operations took around 170000 ms, so it seems to be multithreaded.

Thank you for your help.