Array multiplied by a constant

Greetings all,

I’m trying to multiply an array A by a constant c.

If A is [1 1 1;1 1 1;1 1 1] and c=3, then

matMultConst(C,A,c);

should give:

C=[3 3 3;3 3 3;3 3 3]

Here is the code so far:

__global__ void matMultConst_kernel(float *C,const float *A,const float *c,int N)

{

	int i=blockDim.x*blockIdx.x+threadIdx.x;

	if (i<N)

		C[i]=A[i]*c[0];

}

void matMultConst(Matrix C,Matrix A,Matrix c)

{

		int N=A.h*A.w;

		int threadsPerBlock=256;

		int blocksPerGrid=(N+threadsPerBlock-1)/threadsPerBlock;

		printf("%d %d %d\n",N,threadsPerBlock,blocksPerGrid);

		matMultConst_kernel<<<blocksPerGrid, threadsPerBlock>>>(C.data,A.data,c.data,N);

}

But when I run this, nothing happens! C remains totally unchanged.

I thought this would be really simple, but I can’t for the life of me figure out what’s wrong…

If nothing happens, then it is quite likely your kernel aborted due to an error (you have to check the return codes from the CUDA functions after your kernel call to catch this). Since you don’t show any calls to cudaMalloc()/cudaMemcpy() in your example, the first thing to verify is that the pointers you are passing to your kernel are device pointers, not host pointers. Accessing a host pointer on the device causes an immediate kernel abort.
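To make that concrete, here is a minimal sketch of the pattern (not your code; someKernel and A_h are just placeholder names): allocate on the device, copy to it, launch with device pointers only, and check every return code:

// Sketch only: every CUDA runtime call returns a status that should be checked.
float *A_d = NULL;
cudaError_t err = cudaMalloc((void**)&A_d, N * sizeof(float));          // device allocation
if (err != cudaSuccess) { fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }

err = cudaMemcpy(A_d, A_h, N * sizeof(float), cudaMemcpyHostToDevice);  // host -> device copy
if (err != cudaSuccess) { fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }

someKernel<<<blocksPerGrid, threadsPerBlock>>>(A_d, N);                 // pass the DEVICE pointer, never A_h
err = cudaThreadSynchronize();                                          // wait, then pick up any execution error
if (err != cudaSuccess) { fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }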

Instead of multiplying, you seem to be adding arrays A[i] and c[i] together and storing the result in C[i]. Also, shouldn’t “c” be a scalar?
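If it really is just a scalar, one simpler option (just a sketch, not something you have to do) is to pass it to the kernel by value instead of through a one-element device array:

// Sketch: the scalar travels as an ordinary kernel argument, so it needs no cudaMalloc/cudaMemcpy of its own.
__global__ void matMultConst_kernel(float *C, const float *A, float c, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * c;
}

// Launch, e.g.: matMultConst_kernel<<<blocksPerGrid, threadsPerBlock>>>(C.data, A.data, 3.0f, N);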

Yeah, I’m pretty sure the mallocs are ok.

I modelled my kernel on NVIDIA’s function for adding two matrices. The addition works perfectly fine.

I’m using a Matrix struct:

typedef struct {
	int w;
	int h;
	float *data;
} Matrix;

And yes: that should be a multiplication operator and not an addition… (modified original post)

Check the return code from cudaThreadSynchronize() after your kernel call. The synchronization function is not required for correctness, but that is the easiest way to ensure you catch all kernel execution errors during debugging.

Hi,

Many thanks for that. I tried cudaThreadSynchronize() and didn’t get any errors…

How does one check the return code?

No CUDA function prints errors to the screen automatically (that would be terrible behavior for a library), so unless you check return codes, you will never know what is failing. :)

A simple error handling scheme that prints the error and aborts is found in a shared header in the CUDA SDK:

cudaError err = cudaThreadSynchronize(); // Put whatever call here

if( cudaSuccess != err) {

	fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString( err) );

	exit(EXIT_FAILURE);

}
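If you end up putting that block after every call, it can be handy to wrap it in a macro along these lines (a convenience sketch; CHECK_CUDA is just a name I made up, not an SDK macro, and it assumes <stdio.h> and <stdlib.h> are included):

// Sketch: wrap any CUDA runtime call and abort with file/line information on failure.
#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t _err = (call);                                          \
        if (_err != cudaSuccess) {                                          \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",   \
                    __FILE__, __LINE__, cudaGetErrorString(_err));          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
//   myKernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaThreadSynchronize());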

Cool. Many thanks for that.

I just ran that inside the kernel and no errors were returned…

I’m thinking of just taking the rest of the evening off and resuming in the morning when my head is clear…

Here’s the full code (I’m importing a .csv file called “X.txt”, which is basically a 512x4800 matrix). First, matMultConst.cu:

__global__ void matMultConst_kernel(float *C,const float *A,const float *d,int N)

{

	int i=blockDim.x*blockIdx.x+threadIdx.x;

	if (i<N)

	{

		C[i]=A[i]*d[0];

	}

	cudaError err = cudaThreadSynchronize(); // Put whatever call here

		if(cudaSuccess != err)

		{

			  fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString( err) );

			  exit(EXIT_FAILURE);

		}

}

void matMultConst(Matrix C,Matrix A,Matrix d)

{

	int N=A.h*A.w;

	int threadsPerBlock=256;

	int blocksPerGrid=(N+threadsPerBlock-1)/threadsPerBlock;

	printf("---%d %d %d\n",N,threadsPerBlock,blocksPerGrid);

	matMultConst_kernel<<<blocksPerGrid, threadsPerBlock>>>(C.data,A.data,d.data,N);

}
And here is the main file:

#include <stdio.h>

#include <stdlib.h>

#include <sys/time.h>

#include <omp.h>

#include <string.h>

#include <sys/types.h>

#include <unistd.h>

#include "cublas.h"

typedef struct{

		int w;

		int h;

		float* data;

}Matrix;

include "./matMultConst.cu"

void read_matrix_col_major(Matrix M,const char *fn)

{

	long i,j;

		char line[M.w*50];   //enough space

		FILE *IN;

		IN=fopen(fn,"r");

		i=0;

		while(fgets(line,sizeof(line),IN)!=NULL)

		{

				char *result;

				result=strtok(line,",");

		j=0;

				while(result)

				{

						//printf("%s ",result);

			//printf("%ld,%ld\n",i,j);

						M.data[j*M.h+i]=(float)atof(result);

			/*if(A.data[j*A.h+i]>1)

				printf("A.data[%ld*%ld+%ld=%ld]=%f\n",j,A.h,i,(j*A.h+i),atof(result));*/

			//printf("%ld\n",A.w);

			//printf("%s,%ld,%ld,%ld,%ld,%ld\n",result,M.w,M.h,i,j,M.h*j+i);

			//A.data[j*A.w]=atof(result);

			//printf("%f ",A.data[j*A.h+i]);

						result=strtok(NULL,",");

			//if(j==10)

			//	exit(0);

			j++;

				}

		//printf("%s,%ld,%ld,%ld,%ld\n",result,A.w,A.h,j,A.w*j);

				//printf("%ld \n",i);

		i++;

		}

	fclose(IN);

}

void read_matrix_row_major(Matrix A,const char *fn)

{

	long i;

	char line[A.w*10];   //enough space

		FILE *IN;

		IN=fopen(fn,"r");

	i=0;

		while(fgets(line,sizeof(line),IN)!=NULL)

		{

				char *result;

		result=strtok(line,",");

				while(result)

		{

			//printf("%s ",result);

			A.data[i]=atof(result);

			result=strtok(NULL,",");

			i++;

		}

		//printf("\n");

	}	

	fclose(IN);

}

void printUL_col_major(Matrix A)   //prints roughly the 10x10 upper-left corner of A

{

		int i,j;

		for(i=0;i<A.w;i++)

		{

				for(j=0;j<A.h;j++)

				{

						printf("%f ",A.data[i*A.h+j]);

						if(j>10)

								break;

				}

				printf("\n");

				if(i>10)

					break;

		}

		printf("\n\n");

}

int main(int argc, char* argv[])

{

	long num_dims,num_hid,batch_len;

	struct timeval start,finish;

	

	num_dims=9600;

	num_hid=4800;

	batch_len=512;	

	

	Matrix X;

	X.h=batch_len;X.w=num_hid;

	X.data=(float*)calloc(X.h*X.w,sizeof(float));

	Matrix Y;

	Y.h=batch_len;Y.w=num_hid;

	Y.data=(float*)calloc(Y.h*Y.w,sizeof(float));

	

	Matrix c;

	c.h=1;c.w=1;

	c.data=(float*)calloc(c.h*c.w,sizeof(float));

	

	read_matrix_col_major(X,"X.txt");

	c.data[0]=(float)0.34;

	

	printUL_col_major(X);

	printUL_col_major(Y);

	printUL_col_major(c);

	

	/////

	//load matrices

	/////

	gettimeofday(&start,NULL);

	Matrix X_d,Y_d,c_d;

	X_d.h=X.h;X_d.w=X.w;

		cudaMalloc((void**)&X_d.data,X_d.h*X_d.w*sizeof(float));

		cudaMemcpy(X_d.data,X.data,X.h*X.w*sizeof(float),cudaMemcpyHostToDevice);

	Y_d.h=Y.h;Y_d.w=Y.w;

		cudaMalloc((void**)&Y_d.data,Y_d.h*Y_d.w*sizeof(float));

		cudaMemcpy(Y_d.data,Y.data,Y.h*Y.w*sizeof(float),cudaMemcpyHostToDevice);

	c_d.h=c.h;c_d.w=c.w;

		cudaMalloc((void**)&c_d.data,c_d.h*c_d.w*sizeof(float));

		cudaMemcpy(c_d.data,c.data,c.h*c.w*sizeof(float),cudaMemcpyHostToDevice);

	gettimeofday(&finish,NULL);

		printf("Load time: %f seconds\n",finish.tv_sec-start.tv_sec+1e-6*(finish.tv_usec - start.tv_usec));

	

	/////

	//perform op

	/////

	gettimeofday(&start,NULL);

	

	matMultConst(Y_d,X_d,c_d);

	gettimeofday(&finish,NULL);

		printf("GPU time: %f seconds\n",finish.tv_sec-start.tv_sec+1e-6*(finish.tv_usec - start.tv_usec));

	/////

	//read result:

	/////

	gettimeofday(&start,NULL);

	cudaMemcpy(Y.data,Y_d.data,Y_d.h*Y_d.w*sizeof(float),cudaMemcpyDeviceToHost);

	printUL_col_major(Y);

	gettimeofday(&finish,NULL);

		printf("Read time: %f seconds\n",finish.tv_sec-start.tv_sec+1e-6*(finish.tv_usec - start.tv_usec));

	

	return 0;

}

It’s a little messy, but I hope you get the drift…

Yes, I’m compiling in emulation mode.

The block that Seibert gave you belongs in host code, after the kernel call. I’m surprised that even compiled, but try this instead:

__global__ void matMultConst_kernel(float *C, const float *A, const float *d, int N)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < N)
	{
		C[i] = A[i] * d[0];
	}
}

void matMultConst(Matrix C, Matrix A, Matrix d)
{
	int N = A.h * A.w;
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

	printf("---%d %d %d\n", N, threadsPerBlock, blocksPerGrid);

	matMultConst_kernel<<<blocksPerGrid, threadsPerBlock>>>(C.data, A.data, d.data, N);

	cudaError err = cudaThreadSynchronize(); // Put whatever call here
	if (cudaSuccess != err)
	{
		fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

Ah, this is interesting. I haven’t used emulation mode in years, but as I recall, it is incredibly permissive. (i.e., it will let you do things that are not allowed on real CUDA devices, like pass host pointers to kernels.) Most of the failure modes I was imagining you having would not even occur in emulation mode…

Seems to work for me. I took out the timing code (it doesn’t compile on Windows), collapsed all the code into one file, hardwired the size of the automatic arrays to “char line[5000]” (nvcc on Windows doesn’t accept automatic arrays with non-constant sizes), hardwired the matrix size to 3 by 3 (but also tried 4800 x 512), and removed num_dims. Input was:

1, 1, 1
1, 1, 1
1, 1, 1

just as you suggested.

Output:

1.000000 1.000000 1.000000
1.000000 1.000000 1.000000
1.000000 1.000000 1.000000

0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000

0.340000

—9 256 1
0.340000 0.340000 0.340000
0.340000 0.340000 0.340000
0.340000 0.340000 0.340000

Note that you never call cudaFree() on the pointers you got from cudaMalloc(). I ran it on a GeForce 470 and a 9800. With these kinds of problems, it is always good to simplify the code as much as possible (I can hardly read read_matrix_col_major with all the commented-out code) and to always check return codes. You don’t do any checking anywhere, a bad habit that creeps into real code; unfortunately, most code I’ve seen posted doesn’t check return codes either. It is also good to check your build to make sure you are compiling and linking what you expect. If you still cannot get anything to work, try going back to a “helloworld.cu” example (one kernel call with a single pointer assignment, one block in the grid, one thread in the block), along the lines of the sketch below, and verify that everything is installed correctly.
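By a “helloworld.cu” I mean something along these lines (a sketch only; hello_kernel is just an illustrative name):

#include <stdio.h>
#include <stdlib.h>

// One kernel, one assignment, one block, one thread: the smallest useful sanity check.
__global__ void hello_kernel(float *out)
{
    out[0] = 42.0f;
}

int main(void)
{
    float h = 0.0f, *d = NULL;

    cudaMalloc((void**)&d, sizeof(float));
    hello_kernel<<<1, 1>>>(d);

    cudaError_t err = cudaThreadSynchronize();   // catch launch/execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "Cuda error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Result: %f (expected 42)\n", h);
    cudaFree(d);
    return 0;
}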

Hi,

Thanks for running that code. It’s good to hear that the basic program logic is working on your end.

I came in this morning and it’s now working. I don’t know how. Must have been something simple.

Thanks for everything,