RyuKa
1
Hello,
I have two matrices and I need to multiply them element by element.
Like this:
__global__ void pixelbypixelmultiplication_kernel(float* d_Data, float* d_Data2, float* d_Product, int data1H, int data1W)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    if (offset < data1H * data1W)
    {
        d_Product[offset] = d_Data[offset] * d_Data2[offset];
    }
}
Is there any way to make this faster?
There is a lot of setup overhead for 1 FLOP of “real” work in that code. Try having each thread do multiple calculations rather than just one.
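As a rough illustration of that suggestion, here is a minimal sketch in which each thread multiplies a fixed number of elements. The names `ITEMS_PER_THREAD`, `multiply_multi_kernel` and `multiply_multi` are made up for this example, the value 4 is an arbitrary assumption, and error checking is left out, so treat it as a starting point rather than a tuned implementation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical constant: how many elements each thread handles.
constexpr int ITEMS_PER_THREAD = 4;

// Each thread multiplies up to ITEMS_PER_THREAD elements. On every iteration,
// neighbouring threads touch neighbouring addresses, so accesses stay coalesced.
__global__ void multiply_multi_kernel(const float* a, const float* b,
                                      float* out, int n)
{
    int base = blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n) {
            out[idx] = a[idx] * b[idx];
        }
    }
}

// Hypothetical host wrapper. Assumes a.size() == b.size();
// CUDA error checking is omitted for brevity.
std::vector<float> multiply_multi(const std::vector<float>& a,
                                  const std::vector<float>& b)
{
    int n = static_cast<int>(a.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // One block covers threads * ITEMS_PER_THREAD elements; round up.
    const int threads = 256;
    const int perBlock = threads * ITEMS_PER_THREAD;
    const int blocks = (n + perBlock - 1) / perBlock;
    multiply_multi_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // This copy runs on the same (default) stream, so it waits for the kernel.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

The inner loop steps by `blockDim.x` rather than giving each thread a contiguous run of elements, which keeps the memory access pattern of the original one-element-per-thread kernel.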
RyuKa
3
Okay, thank you !!
I’ll edit my post later.
EDIT:
I tried
__global__ void pixelbypixelmultiplication_kernel(float* d_Data, float* d_Data2, float* d_Product, int data1H, int data1W)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    while (offset < data1H * data1W)
    {
        d_Product[offset] = d_Data[offset] * d_Data2[offset];
        offset += gridDim.x * blockDim.x;
    }
}
and launching with
const int N = data0W*data0H/8; // total number of threads (each handles about 8 elements)
int T = 512; // number of threads per block
const int B = (N+T-1)/T; // number of blocks
instead of N = data0W*data0H (so 8 times fewer blocks), but it changes almost nothing :(
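One way to compare the two launch settings is to time each of them with CUDA events. The sketch below assumes the same grid-stride logic as the kernel above. `multiply_stride_kernel`, `blocks_for` and `time_launch_ms` are hypothetical names introduced for this example, the buffers are zero-filled placeholders, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Grid-stride kernel with the same logic as the edited kernel in the post.
__global__ void multiply_stride_kernel(const float* a, const float* b,
                                       float* out, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride) {
        out[i] = a[i] * b[i];
    }
}

// Number of blocks when each thread should cover about elemsPerThread elements.
// Both divisions round up, so the result is at least 1 for n > 0
// (the plain n / 8 in the post would give 0 threads for n < 8).
int blocks_for(int n, int threads, int elemsPerThread)
{
    int threadsNeeded = (n + elemsPerThread - 1) / elemsPerThread;
    return (threadsNeeded + threads - 1) / threads;
}

// Hypothetical helper: time one kernel launch in milliseconds.
// Assumes n > 0. Input values do not matter for timing, so buffers are zeroed.
float time_launch_ms(int n, int threads, int elemsPerThread)
{
    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    int blocks = blocks_for(n, threads, elemsPerThread);

    // Warm-up launch so one-time startup costs are not included in the timing.
    multiply_stride_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    multiply_stride_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return ms;
}
```

Calling `time_launch_ms(n, 512, 1)` and `time_launch_ms(n, 512, 8)` for the same `n` gives the two configurations from the post. Since every multiply reads 8 bytes and writes 4, dividing `3 * n * sizeof(float)` by the measured time gives an effective bandwidth figure that can be compared with what the GPU is able to deliver.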