Questions about cublasSaxpy

Hi,

CUBLAS_Library_2.2.pdf describes cublasSaxpy() as follows: "void cublasSaxpy(int n, float alpha, const float *x, int incx, float *y, int incy) multiplies single-precision vector x by single-precision scalar alpha and adds the result to single-precision vector y; that is, it overwrites single-precision y with single-precision alpha*x + y.

for i = 0 to n-1, it replaces y[ly + i*incy] with alpha*x[lx + i*incx] + y[ly + i*incy], where incx is the storage spacing between elements of x, y is a single-precision vector with n elements, and incy is the storage spacing between elements of y."

According to the above description, when n=3, alpha=1.0f, every x[i]=1, incx=1, every y[i]=0, and incy=1, then after calling cublasSaxpy() at least y[1] should become 1. But after running the code below, all y values are still 0; none becomes 1. Why? Thanks!!!

#include <stdio.h>
#include <cublas.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(){
    int N = 3;
    float x[3], y[3];
    for(int i = 0; i < N; i++){
        x[i] = 1.0;
        y[i] = 0;
    }
    cublasSaxpy(N, 1.0f, x, 1, y, 1);
    for(int i = 0; i < N; i++){
        printf("x[%d] = %f ", i, x[i]);
        printf("y[%d] = %f ", i, y[i]);
    }
    return 0;
}

You can’t operate on host memory in CUDA. You are going to have to allocate and copy your data to memory on the device, then execute the SAXPY, then copy back the results to the host. Chapters 4 & 5 of the programming guide and the “simpleCUBLAS” example in the SDK contain everything you need to know.
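For reference, that allocate/copy/SAXPY/copy-back workflow could be sketched with the CUBLAS 2.x helper routines (cublasInit, cublasAlloc, cublasSetVector, cublasGetVector) plus a basic status check — a sketch in the spirit of the simpleCUBLAS example, not a copy of it:

```c
#include <stdio.h>
#include <cublas.h>

int main(void)
{
    int N = 3;
    float x[3] = {1.0f, 1.0f, 1.0f};
    float y[3] = {0.0f, 0.0f, 0.0f};
    float *xd, *yd;

    cublasInit();                                    /* initialize CUBLAS first */
    cublasAlloc(N, sizeof(float), (void **)&xd);
    cublasAlloc(N, sizeof(float), (void **)&yd);

    cublasSetVector(N, sizeof(float), x, 1, xd, 1);  /* host -> device */
    cublasSetVector(N, sizeof(float), y, 1, yd, 1);

    cublasSaxpy(N, 1.0f, xd, 1, yd, 1);              /* runs on the device */
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSaxpy failed\n");

    cublasGetVector(N, sizeof(float), yd, 1, y, 1);  /* device -> host */

    for (int i = 0; i < N; i++)
        printf("y[%d] = %f\n", i, y[i]);

    cublasFree(xd);
    cublasFree(yd);
    cublasShutdown();
    return 0;
}
```

(Requires a CUDA-capable GPU to run, so no host-only output is shown here.)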

Thank you, but as a matter of fact I have already implemented another version, shown below, and it still does not work. That is, all y values are still 0.

int main(){
    int N = 3;
    float x[3], y[3];
    for(int i = 0; i < N; i++){
        x[i] = 1.0;
        y[i] = 0;
    }
    float *xd, *yd;
    cudaMalloc((void**)&xd, N*sizeof(float));
    cudaMalloc((void**)&yd, N*sizeof(float));
    cudaMemcpy(xd, &x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(yd, &y, N*sizeof(float), cudaMemcpyHostToDevice);
    cublasSaxpy(N, 1.0f, xd, 1, yd, 1);
    cudaMemcpy(&x, xd, N*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&y, yd, N*sizeof(float), cudaMemcpyDeviceToHost);
    for(int i = 0; i < N; i++){
        printf("x[%d] = %f ", i, x[i]);
        printf("y[%d] = %f ", i, y[i]);
    }
    cudaFree(xd); cudaFree(yd);
    return 0;
}

In fact, I originally saw the first calling style in a third-party tool based on CUDA.

Nothing wrong with your code.

#include "cublas.h"
#include "cuda_runtime.h"
#include "stdio.h"

int main(){
    int i, N = 3;
    float x[3], y[3];
    for(i = 0; i < N; i++){
        x[i] = 1.0;
        y[i] = 0;
    }
    float *xd, *yd;
    cudaMalloc((void**)&xd, N*sizeof(float));
    cudaMalloc((void**)&yd, N*sizeof(float));
    cudaMemcpy(xd, &x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(yd, &y, N*sizeof(float), cudaMemcpyHostToDevice);
    cublasSaxpy(N, 1.0f, xd, 1, yd, 1);
    cudaMemcpy(&x, xd, N*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&y, yd, N*sizeof(float), cudaMemcpyDeviceToHost);
    for(i = 0; i < N; i++){
        printf("x[%d] = %f ", i, x[i]);
        printf("y[%d] = %f \n", i, y[i]);
    }
    cudaFree(xd); cudaFree(yd);
    return 0;
}

gcc sax.c -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcublas -lcudart

./a.out
x[0] = 1.000000 y[0] = 1.000000
x[1] = 1.000000 y[1] = 1.000000
x[2] = 1.000000 y[2] = 1.000000

Thanks! But I still get the wrong answer using the code above! :( Maybe the problem lies in the linking process. I use VS2005 under Windows, but I cannot figure out the problem right now. Any suggestions? Thanks!

By the way, the output is “!!! kernel execution error.” when I run the SDK example simpleCUBLAS using VS2005 under windows. Why?? What do I need to do to get the right answer for simpleCUBLAS example? Thanks again!

Are you certain your CUDA installation actually works? Can you build and run the SDK deviceQuery example?

Yes, the following is the output after running deviceQuery example. Thanks!

"
Device 0: “GeForce GT 130M”
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536543232 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Unknown
Test PASSED
Press ENTER to exit…
"

Hmm, never encountered this before. I have no idea why it’s unknown…

N.

One possibility is that you have the CUDA 2.2 toolkit/SDK, but an older driver. That would cause lots of weird random behavior from CUDA apps, including an “unknown” compute mode.

It is always good to program defensively. With CUDA 2.2’s new driver/runtime version checking calls, every program should check that the result from cudaDriverGetVersion is >= the result from cudaRuntimeGetVersion.

Try using x, y instead of &x, &y in cudaMemcpy. The two appear to be the same, but who knows:
printf(" %x ", x);
printf(" %x ", &x);

The driver I used was downloaded together with the CUDA 2.2 toolkit and SDK from the NVIDIA site. And in fact this notebook was made in May this year, so I think its driver should be new :)

I’m not quite clear about what you mean by “program defensively” :( Anyway, thank you!

It never hurts to double check. Maybe the driver you downloaded didn’t install fully?

I mean to check for error conditions regularly. Imagine every possible situation the user could try to run your program under and check if it will work. Specifically here: adding a check in the code to verify that the driver version is up to date. It also means checking for error return values from any cuda function.
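That kind of check could be sketched like this (checkCuda is a hypothetical helper name for this illustration, not an SDK function):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical helper: abort with a readable message when a CUDA
   runtime call fails, instead of silently continuing. */
static void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    int N = 3;
    float *xd;
    /* Wrap every runtime call so a bad driver install fails loudly. */
    checkCuda(cudaMalloc((void **)&xd, N * sizeof(float)), "cudaMalloc");
    /* ... use xd ... */
    checkCuda(cudaFree(xd), "cudaFree");
    return 0;
}
```

(Needs a working CUDA driver to run; on a broken install the point is precisely that it prints the failure instead of returning garbage.)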

A simple:

int driverVersion;

cudaDriverGetVersion(&driverVersion);

cout << "CUDA Driver version: " << driverVersion << endl;

would confirm that your driver is indeed CUDA 2.2 capable. The output should be “2020”.

Yes, you are right. The output of my “driverVersion” is 0. So I uninstalled the original driver and downloaded new drivers from http://www.nvidia.com/object/cuda_get.html. But unfortunately, none of the drivers there, from 1.0 to 2.2, could be installed properly. I’m sure that the 2.1 and 2.2 versions are for notebooks; they are 181.22_notebook_winxp_32bit_english_beta.exe and cudadriver_2.2_winxp_32_185.85_notebook.exe respectively. The other versions have no special notebook drivers.

My GPU is a GeForce GT 130M, which is on the list of NVIDIA CUDA-enabled products (http://www.nvidia.com/object/cuda_learn_products.html). Sigh :( What shall I do??

Basically, you should ignore the drivers linked on the CUDA pages. They are beta drivers only there for the interim between the release of a CUDA version and the next actual driver release on the main page www.nvidia.com. I have no idea why they never update these links.

As for laptop drivers… Well, I thought that NVIDIA was offering laptop-compatible drivers at www.nvidia.com. They had a big press release about it and everything a while back. But now I don’t see the link :( Check out http://laptopvideo2go.com/, they have NVIDIA drivers modified to install on laptops. The latest non-beta version is 186.18.

Thank you very much! 186.18 finally works, although I doubted it at the beginning, because my GPU, the GeForce GT 130M, is not on its list of supported devices.

Then all problems are solved. SDK example “simpleCUBLAS” passes the test and function cublasSaxpy works properly.

Thanks all!!

Indeed - I’ve considered opening a bug ticket about this…