mway
May 24, 2011, 1:49pm
1
Hi all,
So I have two kernels running on the same data, but organised in different ways. To put it simply, these trivial kernels just add four vectors together.
Four float arrays:
__global__ void addKernelfloats(float *out, float *a, float *b, float *c, float *d)
{
    // Grid-stride loop: each thread handles every totalDim-th element.
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    while (pos < ARRAYSIZE) {
        out[pos] = a[pos] + b[pos] + c[pos] + d[pos];
        pos += totalDim;
    }
}
One array of float4s, using shared memory to achieve coalescing:
#define THREADCOUNT 128

__global__ void addKernelstructs(float *out, float *indata)
{
    // Stage 4*THREADCOUNT consecutive floats per block into shared memory so the
    // global loads are coalesced, then read them back as one float4 per thread.
    __shared__ float sdata[THREADCOUNT * 4];
    int valBlock = blockIdx.x * blockDim.x;
    int outpos = threadIdx.x + valBlock;       // index into the output array
    int pos = (valBlock * 4) + threadIdx.x;    // index into the packed input array
    int totalDim = blockDim.x * gridDim.x;
    int totalDimQuad = totalDim * 4;
    float *num1 = &(sdata[threadIdx.x]);
    float *num2 = &(sdata[threadIdx.x + THREADCOUNT]);
    float *num3 = &(sdata[threadIdx.x + (THREADCOUNT * 2)]);
    float *num4 = &(sdata[threadIdx.x + (THREADCOUNT * 3)]);
    float4 *a = &((float4 *)sdata)[threadIdx.x];   // this thread's float4 view of shared memory
    while (outpos < ARRAYSIZE) {
        (*num1) = indata[pos];
        (*num2) = indata[pos + THREADCOUNT];
        (*num3) = indata[pos + (THREADCOUNT * 2)];
        (*num4) = indata[pos + (THREADCOUNT * 3)];
        __syncthreads();   // the float4 each thread sums was, in general, written by other threads
        out[outpos] = a->w + a->x + a->y + a->z;
        outpos += totalDim;
        pos += totalDimQuad;
    }
}
Now I have tested both kernels and they work correctly (10M+ values), but what I want to know is: why the hell is kernel 2 always faster than kernel 1? I have tried playing with block and thread dimensions, and even using one single float array in kernel 1 for all four vectors. No matter what I do, kernel 2 is still faster. Any ideas why? Intuitively I would have thought kernel 1 would be faster.
Also note that while kernel 2 is faster, the performance difference is tiny, so this isn't a big deal; I am just interested. Am I doing something wrong?
Also: I'm using one GPU of my GTX 590.
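For anyone who wants to reproduce the comparison, a harness along these lines should do it (a simplified sketch, not my actual benchmark code: error checking is omitted and the grid size is just an example):

#include <cstdio>
#include <cuda_runtime.h>

#define ARRAYSIZE (10 * 1024 * 1024)
#define THREADCOUNT 128

// addKernelfloats and addKernelstructs from above are assumed to be defined in this file.

static float timeKernelMs(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main()
{
    float *out, *a, *b, *c, *d, *indata;
    cudaMalloc((void **)&out, ARRAYSIZE * sizeof(float));
    cudaMalloc((void **)&a, ARRAYSIZE * sizeof(float));
    cudaMalloc((void **)&b, ARRAYSIZE * sizeof(float));
    cudaMalloc((void **)&c, ARRAYSIZE * sizeof(float));
    cudaMalloc((void **)&d, ARRAYSIZE * sizeof(float));
    cudaMalloc((void **)&indata, 4ull * ARRAYSIZE * sizeof(float));
    // Contents are left uninitialised here; that is fine for timing, but fill the
    // arrays with real data (and copy the result back) for a correctness check.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int blocks = 256;   // example grid size only

    cudaEventRecord(start);
    addKernelfloats<<<blocks, THREADCOUNT>>>(out, a, b, c, d);
    cudaEventRecord(stop);
    printf("addKernelfloats : %.3f ms\n", timeKernelMs(start, stop));

    cudaEventRecord(start);
    addKernelstructs<<<blocks, THREADCOUNT>>>(out, indata);
    cudaEventRecord(stop);
    printf("addKernelstructs: %.3f ms\n", timeKernelMs(start, stop));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out); cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d); cudaFree(indata);
    return 0;
}

A warm-up launch and averaging over several runs would make the numbers more stable, but the relative ordering of the two kernels is what matters here.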
brano
May 24, 2011, 3:54pm
2
Hi,
I have tested the two kernels and I get better performance on kernel 1 :D. As you said, it is just a tiny difference.
On the CPU side, are you launching kernel 1 first, followed by kernel 2?
If you want a slightly faster solution, you could use the same concept as in kernel 1 but pass in a single float4* instead of four float* arguments:
__global__ void addKernelf4(float *out, float4 *a)
{
    // Same grid-stride loop as kernel 1, but one coalesced float4 load per element.
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    while (pos < ARRAYSIZE) {
        out[pos] = a[pos].x + a[pos].y + a[pos].z + a[pos].w;
        pos += totalDim;
    }
}
I got better performance with the kernel above.
I profiled all three kernels with the Visual Profiler running on a GTX 480.
All three kernels are limited by memory bandwidth.
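To put a number on that: every output element reads 4 floats and writes 1 float, i.e. 20 bytes of DRAM traffic, so you can turn a measured kernel time into an effective bandwidth and compare it against the board's theoretical peak. A small helper along these lines (my own sketch, not part of the profiler output):

#include <cstddef>
#include <cstdio>

// Effective bandwidth estimate: 4 floats read + 1 float written per output
// element = 20 bytes of traffic per element.
static void printEffectiveBandwidth(size_t numElements, float kernelMs)
{
    const double bytesMoved = (double)numElements * 5.0 * sizeof(float);
    const double gbPerSec = bytesMoved / (kernelMs * 1.0e-3) / 1.0e9;
    printf("effective bandwidth: %.1f GB/s\n", gbPerSec);
}

If all three kernels land close to the card's peak, the small differences between them are basically noise, which matches what we are both seeing.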
mway
May 24, 2011, 11:20pm
3
Hi, thanks for testing. I ended up playing with the third kernel, and here is the fastest version I could come up with:
__global__ void addKernelfloat4(float *out, float4 *data)
{
    // Each thread stages its own float4 in shared memory before summing it.
    __shared__ float4 sdata[THREADCOUNT];
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    float4 *a = &sdata[threadIdx.x];
    while (pos < ARRAYSIZE) {
        (*a) = data[pos];
        __syncthreads();
        out[pos] = a->w + a->x + a->y + a->z;
        pos += totalDim;
    }
}
Here is the interesting thing: the kernel above runs fine without the __syncthreads() call, but if I add it, I get a small performance increase. I was of the opinion that you should try to avoid __syncthreads() where possible. Any ideas why adding it would improve performance for me?
tera
May 25, 2011, 12:29am
4
With the threads kept in sync, there might be a slightly higher chance that consecutive reads and writes hit a DRAM row that is already open.
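One way to check that the barrier really is what makes the difference (a quick sketch: the kernel body is the addKernelfloat4 from above, the USE_BARRIER macro is just for illustration) would be to build the same kernel with and without the barrier and time both versions with the same harness:

#define ARRAYSIZE (10 * 1024 * 1024)
#define THREADCOUNT 128

// Compile once with -DUSE_BARRIER and once without, then compare the timings.
__global__ void addKernelfloat4Toggle(float *out, float4 *data)
{
    __shared__ float4 sdata[THREADCOUNT];
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    float4 *a = &sdata[threadIdx.x];
    while (pos < ARRAYSIZE) {
        (*a) = data[pos];
#ifdef USE_BARRIER
        __syncthreads();   // keeps the warps of a block marching in step
#endif
        out[pos] = a->w + a->x + a->y + a->z;
        pos += totalDim;
    }
}

If the open-row explanation holds, the -DUSE_BARRIER build should show the same small but consistent advantage you are seeing.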