Point-wise multiplication

Hello,

I have two matrices and I need to multiply them element by element.

Like this:

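__global__ void pixelbypixelmultiplication_kernel(float* d_Data, float* d_Data2, float* d_Product, int data1H, int data1W)
{
	// One thread per element of the flattened data1H x data1W matrix.
	int offset = threadIdx.x + blockIdx.x*blockDim.x;
	// Guard: the last block may contain threads past the end of the data.
	if(offset<data1H*data1W)
	{
		d_Product[offset]=d_Data[offset]*d_Data2[offset];
	}
}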

Is there any way to do this faster?

There is a lot of setup overhead for 1 FLOP of “real” work in that code. Try having each thread do multiple calculations rather than just one.

Okay, thank you !!

I’ll edit my post later.

EDIT:

I tried

__global__ void pixelbypixelmultiplication_kernel(float* d_Data, float* d_Data2, float* d_Product, int data1H, int data1W)
{
	int offset = threadIdx.x + blockIdx.x*blockDim.x;
	// Grid-stride loop: each thread handles several elements,
	// stepping by the total number of threads in the grid.
	while(offset<data1H*data1W)
	{
		d_Product[offset]=d_Data[offset]*d_Data2[offset];
		offset+=gridDim.x*blockDim.x;
	}
}

and launching

	const int N = data0W*data0H/8; // total number of threads (one per 8 elements)
	int T = 512;                   // number of threads per block
	const int B = (N+T-1)/T;       // number of blocks

instead of N = data0W*data0H (so 8 times fewer blocks), but it changes almost nothing :(
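
For anyone who wants to try the launch above end to end, here is a minimal host-side sketch. The kernel is the grid-stride version from the EDIT; the helper names check and multiplyOnDevice are hypothetical and not from the original code. It also rounds N up instead of down, on the assumption that a matrix with fewer than 8 elements should still get at least one block (the launch above would compute B = 0 in that case, which is not a valid grid size).

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from the post): exit with a message on any CUDA error.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Grid-stride kernel, same logic as in the EDIT above.
__global__ void pixelbypixelmultiplication_kernel(float* d_Data, float* d_Data2, float* d_Product, int data1H, int data1W)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    while (offset < data1H * data1W) {
        d_Product[offset] = d_Data[offset] * d_Data2[offset];
        offset += gridDim.x * blockDim.x;
    }
}

// Hypothetical host wrapper: copies both inputs to the device, launches the
// kernel with roughly one thread per 8 elements, and copies the product back.
std::vector<float> multiplyOnDevice(const std::vector<float>& a, const std::vector<float>& b, int data0H, int data0W)
{
    const size_t count = static_cast<size_t>(data0H) * data0W;
    assert(a.size() == count && b.size() == count);
    std::vector<float> product(count);
    if (count == 0) return product;

    const size_t bytes = count * sizeof(float);
    float *d_Data = nullptr, *d_Data2 = nullptr, *d_Product = nullptr;
    check(cudaMalloc(&d_Data, bytes), "cudaMalloc d_Data");
    check(cudaMalloc(&d_Data2, bytes), "cudaMalloc d_Data2");
    check(cudaMalloc(&d_Product, bytes), "cudaMalloc d_Product");
    check(cudaMemcpy(d_Data, a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_Data2, b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    const int N = (data0W * data0H + 7) / 8; // threads, rounded up (assumption)
    const int T = 512;                       // threads per block
    const int B = (N + T - 1) / T;           // blocks
    pixelbypixelmultiplication_kernel<<<B, T>>>(d_Data, d_Data2, d_Product, data0H, data0W);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(product.data(), d_Product, bytes, cudaMemcpyDeviceToHost), "copy product");
    cudaFree(d_Data);
    cudaFree(d_Data2);
    cudaFree(d_Product);
    return product;
}
```

Calling multiplyOnDevice(a, b, 3, 4) on two 12-element vectors should return their element-wise product. As for why the speed barely changes: each multiplication reads two floats and writes one, so this kernel is probably limited by memory traffic rather than by per-thread setup, and giving each thread more elements does not reduce the number of bytes moved.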