# Median Filter (time dimension) for images (bad performance, no memory coalescing)

There are lots of examples of 2D median CUDA filters, but we have a stream of 1024x1024 images that we are trying to process fast. Since there is a lot of noise, we stack 21 images and take, for each pixel, the median of the 21 frames in time. This time-based median filter really helps us get the background and remove the noise. Since the stride between frames is very large, we are getting no memory coalescing, so performance is really bad. We cannot reshape the images, since this is a real-time stream.

So imagine you had to handle 20 FPS images that are 1024x1024 and do a simple 1x1 time median (a thread per pixel, but each thread looks at 21 consecutive images).

The frame data being passed is:

``````frames = bytes[1,048,576 ; 21]  (2D)
result = bytes[1,048,576]  (1D)
frameSize = 1,048,576  // our frame grabber gives us a 1D array for the 1024x1024 image

extern "C" __global__ void filter(unsigned char* frames, int frameSize, unsigned char* result)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int num = x + y * gridDim.x;
    __shared__ unsigned char array[21];
    int framesLen1 = frameSize;  // stride between successive frames, in bytes

    if (num < frameSize)
    {
        // gather this pixel's 21 temporal samples
        for (int i = 0; i < 21; i++)
        {
            array[i] = frames[i * framesLen1 + num];
        }
        // partial selection sort: only positions 0..10 need to be placed
        for (int j = 0; j < 11; j++)
        {
            int num2 = j;
            for (int k = j + 1; k < 21; k++)
            {
                if (array[k] < array[num2])
                {
                    num2 = k;
                }
            }
            unsigned char b = array[j];
            array[j] = array[num2];
            array[num2] = b;
        }
        result[num] = array[10];
    }
}
``````

Lol, it took me a moment to figure it out. I was looking at this line of code thinking "well, that should coalesce nicely":

``````array[i] = frames[i * framesLen1 + num];
``````

You are launching this kernel like this, aren't you:

``````filter<<<some_2D_value, 1>>>(...);
``````

Launching kernels with 1 thread per block leaves ~97% of the performance of your GPU on the table. For every instruction dispatched, 31 out of 32 execution units are idle, on every cycle, everywhere on the GPU. And yes, you are correct: in that configuration it is impossible to get coalesced loads. It has nothing to do with a large stride.

Rework your kernel to launch a grid including multiple threads per block. You will then be able to get coalesced loads from global memory here:

``````array[i] = frames[i * framesLen1 + num];
``````

and stores here:

``````result[num] = array[10];
``````

and your performance will go up dramatically, for several reasons, one of which is memory bandwidth efficiency. You'll need to re-work your kernel indexing a bit for both global and shared memory, and increase the amount of shared memory used by the kernel to at least 21*number_of_threads_per_block bytes. This additional shared memory usage should be no big deal from an occupancy/performance perspective: even at 128 threads per block, you're only using 2688 bytes per threadblock, so you're unlikely to run into significant occupancy constraints due to shared memory usage. And the resultant increase in machine utilization will make a dramatic performance impact.
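To make that concrete, here is a minimal sketch of the rework (illustrative only; the 128-thread block size, the 1D launch, and all names are arbitrary choices on my part, not the only way to do it):

``````#define TPB 128            // threads per block (arbitrary choice)
#define NFRAMES 21

extern "C" __global__ void filter(unsigned char *frames, int frameSize, unsigned char *result)
{
    __shared__ unsigned char array[NFRAMES * TPB];   // 21 samples per thread
    int num = blockIdx.x * blockDim.x + threadIdx.x; // one pixel per thread
    int t = threadIdx.x;
    if (num < frameSize) {
        // adjacent threads now read adjacent bytes of each frame: coalesced
        for (int i = 0; i < NFRAMES; i++)
            array[i * TPB + t] = frames[i * frameSize + num];
        // same partial selection sort as before, on this thread's 21 samples
        for (int j = 0; j <= NFRAMES / 2; j++) {
            int m = j;
            for (int k = j + 1; k < NFRAMES; k++)
                if (array[k * TPB + t] < array[m * TPB + t]) m = k;
            unsigned char b = array[j * TPB + t];
            array[j * TPB + t] = array[m * TPB + t];
            array[m * TPB + t] = b;
        }
        result[num] = array[(NFRAMES / 2) * TPB + t]; // coalesced store
    }
}
// launch: filter<<<(frameSize + TPB - 1) / TPB, TPB>>>(d_frames, frameSize, d_result);
``````

Note the shared array is laid out sample-major (`i * TPB + t`) so that each thread owns one column; no `__syncthreads()` is needed since threads never touch each other's slots.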

From a performance perspective, it is never a good idea to launch kernels with either of these grid configurations:

``````kernel<<<...,1>>>(...);
``````
``````kernel<<<1,...>>>(...);
``````

For my own amusement, I decided to take the code you have shown and re-cast the kernel somewhat, into a form that can be used (via templating) to demonstrate the timing difference when processing the 21-image stack in 3 different cases:

1. one thread per block (your launch configuration)
2. 256 threads per block (32x8), one byte (unsigned char) per thread
3. 256 threads per block (32x8), four bytes (uchar4) per thread

The results of the fully worked example below suggest that there is a ~50x speedup to be gained from case 1 to case 2, and no additional benefit (somewhat surprisingly, to me) going from case 2 to case 3.

``````$ cat t650.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10

// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}

int validate(unsigned char *data){
int pass = 1;
for (int i = 0; i < IMH; i++)
for (int j = 0; j < IMW; j++)
if (data[i*IMW+j] != CHECK_VAL) pass = 0;
return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only

template <typename T>
__global__  void filter( unsigned char* frames, unsigned char* result)
{
// NOTE: the loop bodies of this kernel were mangled when pasted into the
// forum; what follows is a reconstruction of the obvious intent
__shared__ unsigned char array[IMN*nTPB*sizeof(T)];
int x = threadIdx.x + blockDim.x*blockIdx.x;
int y = threadIdx.y + blockDim.y*blockIdx.y;
int t = threadIdx.x + threadIdx.y*blockDim.x;  // thread rank within block
int num = x + y * (IMW/sizeof(T));

T *my_frames = reinterpret_cast<T *>(frames);
T *my_array  = reinterpret_cast<T *>(array);

if ((x < IMW/sizeof(T)) && (y < IMH)){
  // coalesced: adjacent threads load adjacent elements of each frame
  for (int i = 0; i < IMN; i++){
    my_array[i*nTPB + t] = my_frames[i*((IMW*IMH)/sizeof(T)) + num];
    }
  // partial selection sort of each byte lane, up to the median position
  for (int bpp = 0; bpp < (sizeof(T)); bpp++)
    for (int j = 0; j < ((IMN/2)+1); j++){
      int num2 = j;
      for (int k = j + 1; k < IMN; k++){
        if (array[(k*nTPB+t)*sizeof(T)+bpp] < array[(num2*nTPB+t)*sizeof(T)+bpp])
          num2 = k;
        }
      unsigned char b = array[(j*nTPB+t)*sizeof(T)+bpp];
      array[(j*nTPB+t)*sizeof(T)+bpp] = array[(num2*nTPB+t)*sizeof(T)+bpp];
      array[(num2*nTPB+t)*sizeof(T)+bpp] = b;
    }
  T *my_result = reinterpret_cast<T *>(result);
  my_result[num] = my_array[(IMN/2)*nTPB + t];
  }
}

int main(){

unsigned char *din, *dout1, *dout2, *dout3;
unsigned char *hin, *hout1, *hout2, *hout3;
hin = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
hout1 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout2 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout3 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
cudaMalloc(&din, IMW*IMH*IMN*sizeof(unsigned char));
cudaMalloc(&dout1, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout2, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout3, IMW*IMH*sizeof(unsigned char));
for (int i = 0; i < IMN; i++)
for (int j = 0; j< IMH*IMW; j++)
hin[i*IMH*IMW+j] = (i+5)%IMN;
memset(hout1, 0, IMW*IMH*sizeof(unsigned char));
memset(hout2, 0, IMW*IMH*sizeof(unsigned char));
memset(hout3, 0, IMW*IMH*sizeof(unsigned char));
cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
cudaMemset(dout1, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout2, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout3, 0, IMW*IMH*sizeof(unsigned char));
cudaCheckErrors("setup");
unsigned long long my_time = dtime_usec(0);
filter<unsigned char><<<dim3(IMW,IMH), 1>>>(din, dout1);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 1 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout1, dout1, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout1)) {printf("image 1 fail\n"); return 1;}
my_time = dtime_usec(0);
filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout2);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 2 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout2, dout2, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout2)) {printf("image 2 fail\n"); return 1;}
my_time = dtime_usec(0);
filter<uchar4><<<dim3(IMW/(4*BX),IMH/BY), dim3(BX,BY)>>>(din, dout3);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 3 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout3, dout3, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout3)) {printf("image 3 fail\n"); return 1;}
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t650 t650.cu
$ ./t650
time 1 = 0.292929s
time 2 = 0.005511s
time 3 = 0.006170s
$
``````

If the conclusion is reached that there is no additional benefit from the uchar4 loads, then the kernel could be somewhat simplified.

I've been playing a bit more with the code and added 2 modified approaches. The first 3 implementations are essentially the same as above, but I am using random data, and my image validation now compares image pairs for identity between the methods. The added 2 implementations change the sorting method from a bubble sort to a rank-order sort (time 4) and a brute-force sorting-networks based approach (time 5). Both make extensive use of macros to assist with code generation. Here's a fully worked example:

``````$ cat t650.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10

// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}

int validate(unsigned char *data1, unsigned char *data2){
int pass = 1;
for (int i = 0; i < IMH; i++)
for (int j = 0; j < IMW; j++)
if (data1[i*IMW+j] != data2[i*IMW+j])
if (pass) { printf("image error: (%d,%d): %d:%d\n", i,j,data1[i*IMW+j], data2[i*IMW+j]); pass = 0;}
return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only

template <typename T>
__global__  void filter( unsigned char* frames, unsigned char* result)
{
// NOTE: the loop bodies of this kernel were mangled when pasted into the
// forum; what follows is a reconstruction of the obvious intent
__shared__ unsigned char array[IMN*nTPB*sizeof(T)];
int x = threadIdx.x + blockDim.x*blockIdx.x;
int y = threadIdx.y + blockDim.y*blockIdx.y;
int t = threadIdx.x + threadIdx.y*blockDim.x;  // thread rank within block
int num = x + y * (IMW/sizeof(T));

T *my_frames = reinterpret_cast<T *>(frames);
T *my_array  = reinterpret_cast<T *>(array);

if ((x < IMW/sizeof(T)) && (y < IMH)){
  // coalesced: adjacent threads load adjacent elements of each frame
  for (int i = 0; i < IMN; i++){
    my_array[i*nTPB + t] = my_frames[i*((IMW*IMH)/sizeof(T)) + num];
    }
  // partial selection sort of each byte lane, up to the median position
  for (int bpp = 0; bpp < (sizeof(T)); bpp++)
    for (int j = 0; j < ((IMN/2)+1); j++){
      int num2 = j;
      for (int k = j + 1; k < IMN; k++){
        if (array[(k*nTPB+t)*sizeof(T)+bpp] < array[(num2*nTPB+t)*sizeof(T)+bpp])
          num2 = k;
        }
      unsigned char b = array[(j*nTPB+t)*sizeof(T)+bpp];
      array[(j*nTPB+t)*sizeof(T)+bpp] = array[(num2*nTPB+t)*sizeof(T)+bpp];
      array[(num2*nTPB+t)*sizeof(T)+bpp] = b;
    }
  T *my_result = reinterpret_cast<T *>(result);
  my_result[num] = my_array[(IMN/2)*nTPB + t];
  }
}

// rank-order sort kernel: for each pixel stack, find the element whose rank
// is exactly 10 (the median). The comparison macros were lost in posting;
// plausible reconstruction: G = strictly greater-than, E = greater-or-equal
// (E breaks ties by index so exactly one element gets each rank)
#define G(A,B) (array[(A)*nTPB+t] >  array[(B)*nTPB+t])
#define E(A,B) (array[(A)*nTPB+t] >= array[(B)*nTPB+t])

__global__  void filter2( unsigned char* frames, unsigned char* result)
{
__shared__ unsigned char array[IMN*nTPB];
int x = threadIdx.x + blockDim.x*blockIdx.x;
int y = threadIdx.y + blockDim.y*blockIdx.y;
int t = threadIdx.x + threadIdx.y*blockDim.x;
int num = x + y * (IMW);

if ((x < IMW) && (y < IMH)){
for (int i = 0; i < IMN; i++)
array[i*nTPB + t] = frames[i*(IMW*IMH) + num];
int ot;
if (10 == G(0,1) +G(0,2) +G(0,3) +G(0,4) +G(0,5) +G(0,6) +G(0,7) +G(0,8) +G(0,9) +G(0,10) +G(0,11) +G(0,12) +G(0,13) +G(0,14) +G(0,15) +G(0,16) +G(0,17) +G(0,18) +G(0,19) +G(0,20))
ot = 0;
else if (10 == E(1,0) +G(1,2) +G(1,3) +G(1,4) +G(1,5) +G(1,6) +G(1,7) +G(1,8) +G(1,9) +G(1,10) +G(1,11) +G(1,12) +G(1,13) +G(1,14) +G(1,15) +G(1,16) +G(1,17) +G(1,18) +G(1,19) +G(1,20))
ot = 1;
else if (10 == E(2,0) +E(2,1) +G(2,3) +G(2,4) +G(2,5) +G(2,6) +G(2,7) +G(2,8) +G(2,9) +G(2,10) +G(2,11) +G(2,12) +G(2,13) +G(2,14) +G(2,15) +G(2,16) +G(2,17) +G(2,18) +G(2,19) +G(2,20))
ot = 2;
else if (10 == E(3,0) +E(3,1) +E(3,2) +G(3,4) +G(3,5) +G(3,6) +G(3,7) +G(3,8) +G(3,9) +G(3,10) +G(3,11) +G(3,12) +G(3,13) +G(3,14) +G(3,15) +G(3,16) +G(3,17) +G(3,18) +G(3,19) +G(3,20))
ot = 3;
else if (10 == E(4,0) +E(4,1) +E(4,2) +E(4,3) +G(4,5) +G(4,6) +G(4,7) +G(4,8) +G(4,9) +G(4,10) +G(4,11) +G(4,12) +G(4,13) +G(4,14) +G(4,15) +G(4,16) +G(4,17) +G(4,18) +G(4,19) +G(4,20))
ot = 4;
else if (10 == E(5,0) +E(5,1) +E(5,2) +E(5,3) +E(5,4) +G(5,6) +G(5,7) +G(5,8) +G(5,9) +G(5,10) +G(5,11) +G(5,12) +G(5,13) +G(5,14) +G(5,15) +G(5,16) +G(5,17) +G(5,18) +G(5,19) +G(5,20))
ot = 5;
else if (10 == E(6,0) +E(6,1) +E(6,2) +E(6,3) +E(6,4) +E(6,5) +G(6,7) +G(6,8) +G(6,9) +G(6,10) +G(6,11) +G(6,12) +G(6,13) +G(6,14) +G(6,15) +G(6,16) +G(6,17) +G(6,18) +G(6,19) +G(6,20))
ot = 6;
else if (10 == E(7,0) +E(7,1) +E(7,2) +E(7,3) +E(7,4) +E(7,5) +E(7,6) +G(7,8) +G(7,9) +G(7,10) +G(7,11) +G(7,12) +G(7,13) +G(7,14) +G(7,15) +G(7,16) +G(7,17) +G(7,18) +G(7,19) +G(7,20))
ot = 7;
else if (10 == E(8,0) +E(8,1) +E(8,2) +E(8,3) +E(8,4) +E(8,5) +E(8,6) +E(8,7) +G(8,9) +G(8,10) +G(8,11) +G(8,12) +G(8,13) +G(8,14) +G(8,15) +G(8,16) +G(8,17) +G(8,18) +G(8,19) +G(8,20))
ot = 8;
else if (10 == E(9,0) +E(9,1) +E(9,2) +E(9,3) +E(9,4) +E(9,5) +E(9,6) +E(9,7) +E(9,8) +G(9,10) +G(9,11) +G(9,12) +G(9,13) +G(9,14) +G(9,15) +G(9,16) +G(9,17) +G(9,18) +G(9,19) +G(9,20))
ot = 9;
else if (10 == E(10,0)+E(10,1)+E(10,2)+E(10,3)+E(10,4)+E(10,5)+E(10,6)+E(10,7)+E(10,8)+E(10,9) +G(10,11)+G(10,12)+G(10,13)+G(10,14)+G(10,15)+G(10,16)+G(10,17)+G(10,18)+G(10,19)+G(10,20))
ot = 10;
else if (10 == E(11,0)+E(11,1)+E(11,2)+E(11,3)+E(11,4)+E(11,5)+E(11,6)+E(11,7)+E(11,8)+E(11,9) +E(11,10)+G(11,12)+G(11,13)+G(11,14)+G(11,15)+G(11,16)+G(11,17)+G(11,18)+G(11,19)+G(11,20))
ot = 11;
else if (10 == E(12,0)+E(12,1)+E(12,2)+E(12,3)+E(12,4)+E(12,5)+E(12,6)+E(12,7)+E(12,8)+E(12,9) +E(12,10)+E(12,11)+G(12,13)+G(12,14)+G(12,15)+G(12,16)+G(12,17)+G(12,18)+G(12,19)+G(12,20))
ot = 12;
else if (10 == E(13,0)+E(13,1)+E(13,2)+E(13,3)+E(13,4)+E(13,5)+E(13,6)+E(13,7)+E(13,8)+E(13,9) +E(13,10)+E(13,11)+E(13,12)+G(13,14)+G(13,15)+G(13,16)+G(13,17)+G(13,18)+G(13,19)+G(13,20))
ot = 13;
else if (10 == E(14,0)+E(14,1)+E(14,2)+E(14,3)+E(14,4)+E(14,5)+E(14,6)+E(14,7)+E(14,8)+E(14,9) +E(14,10)+E(14,11)+E(14,12)+E(14,13)+G(14,15)+G(14,16)+G(14,17)+G(14,18)+G(14,19)+G(14,20))
ot = 14;
else if (10 == E(15,0)+E(15,1)+E(15,2)+E(15,3)+E(15,4)+E(15,5)+E(15,6)+E(15,7)+E(15,8)+E(15,9) +E(15,10)+E(15,11)+E(15,12)+E(15,13)+E(15,14)+G(15,16)+G(15,17)+G(15,18)+G(15,19)+G(15,20))
ot = 15;
else if (10 == E(16,0)+E(16,1)+E(16,2)+E(16,3)+E(16,4)+E(16,5)+E(16,6)+E(16,7)+E(16,8)+E(16,9) +E(16,10)+E(16,11)+E(16,12)+E(16,13)+E(16,14)+E(16,15)+G(16,17)+G(16,18)+G(16,19)+G(16,20))
ot = 16;
else if (10 == E(17,0)+E(17,1)+E(17,2)+E(17,3)+E(17,4)+E(17,5)+E(17,6)+E(17,7)+E(17,8)+E(17,9) +E(17,10)+E(17,11)+E(17,12)+E(17,13)+E(17,14)+E(17,15)+E(17,16)+G(17,18)+G(17,19)+G(17,20))
ot = 17;
else if (10 == E(18,0)+E(18,1)+E(18,2)+E(18,3)+E(18,4)+E(18,5)+E(18,6)+E(18,7)+E(18,8)+E(18,9) +E(18,10)+E(18,11)+E(18,12)+E(18,13)+E(18,14)+E(18,15)+E(18,16)+E(18,17)+G(18,19)+G(18,20))
ot = 18;
else if (10 == E(19,0)+E(19,1)+E(19,2)+E(19,3)+E(19,4)+E(19,5)+E(19,6)+E(19,7)+E(19,8)+E(19,9) +E(19,10)+E(19,11)+E(19,12)+E(19,13)+E(19,14)+E(19,15)+E(19,16)+E(19,17)+E(19,18)+G(19,20))
ot = 19;
else
ot = 20;
result[num] = array[ot*nTPB + t];
}
}

// brute-force (Bose-Nelson) sorting network on the 21-sample stack.
// The SWAP macro was lost in posting; plausible reconstruction as a
// min/max compare-exchange on the per-thread shared-memory slots:
#define SWAP(A,B) { unsigned char mn = min(array[(A)*nTPB+t], array[(B)*nTPB+t]); \
                    array[(B)*nTPB+t] = max(array[(A)*nTPB+t], array[(B)*nTPB+t]); \
                    array[(A)*nTPB+t] = mn; }

__global__  void filter3( unsigned char* frames, unsigned char* result)
{
__shared__ unsigned char array[IMN*nTPB];
int x = threadIdx.x + blockDim.x*blockIdx.x;
int y = threadIdx.y + blockDim.y*blockIdx.y;
int t = threadIdx.x + threadIdx.y*blockDim.x;
int num = x + y * (IMW);

if ((x < IMW) && (y < IMH)){
for (int i = 0; i < IMN; i++)
array[i*nTPB + t] = frames[i*(IMW*IMH) + num];
SWAP(0, 1);
SWAP(3, 4);
SWAP(2, 4);
SWAP(2, 3);
SWAP(0, 3);
SWAP(0, 2);
SWAP(1, 4);
SWAP(1, 3);
SWAP(1, 2);
SWAP(5, 6);
SWAP(8, 9);
SWAP(7, 9);
SWAP(7, 8);
SWAP(5, 8);
SWAP(5, 7);
SWAP(6, 9);
SWAP(6, 8);
SWAP(6, 7);
SWAP(0, 5);
SWAP(1, 6);
SWAP(1, 5);
SWAP(2, 7);
SWAP(3, 8);
SWAP(4, 9);
SWAP(4, 8);
SWAP(3, 7);
SWAP(4, 7);
SWAP(2, 5);
SWAP(3, 6);
SWAP(4, 6);
SWAP(3, 5);
SWAP(4, 5);
SWAP(10, 11);
SWAP(13, 14);
SWAP(12, 14);
SWAP(12, 13);
SWAP(10, 13);
SWAP(10, 12);
SWAP(11, 14);
SWAP(11, 13);
SWAP(11, 12);
SWAP(16, 17);
SWAP(15, 17);
SWAP(15, 16);
SWAP(19, 20);
SWAP(18, 20);
SWAP(18, 19);
SWAP(15, 18);
SWAP(16, 19);
SWAP(17, 20);
SWAP(17, 19);
SWAP(16, 18);
SWAP(17, 18);
SWAP(10, 16);
SWAP(10, 15);
SWAP(11, 17);
SWAP(11, 16);
SWAP(11, 15);
SWAP(12, 18);
SWAP(13, 19);
SWAP(14, 20);
SWAP(14, 19);
SWAP(13, 18);
SWAP(14, 18);
SWAP(12, 15);
SWAP(13, 16);
SWAP(14, 17);
SWAP(14, 16);
SWAP(13, 15);
SWAP(14, 15);
SWAP(0, 11);
SWAP(0, 10);
SWAP(1, 12);
SWAP(1, 11);
SWAP(1, 10);
SWAP(2, 13);
SWAP(3, 14);
SWAP(4, 15);
SWAP(4, 14);
SWAP(3, 13);
SWAP(4, 13);
SWAP(2, 10);
SWAP(3, 11);
SWAP(4, 12);
SWAP(4, 11);
SWAP(3, 10);
SWAP(4, 10);
SWAP(5, 16);
SWAP(6, 17);
SWAP(6, 16);
SWAP(7, 18);
SWAP(8, 19);
SWAP(9, 20);
SWAP(9, 19);
SWAP(8, 18);
SWAP(9, 18);
SWAP(7, 16);
SWAP(8, 17);
SWAP(9, 17);
SWAP(8, 16);
SWAP(9, 16);
SWAP(5, 11);
SWAP(5, 10);
SWAP(6, 12);
SWAP(6, 11);
SWAP(6, 10);
SWAP(7, 13);
SWAP(8, 14);
SWAP(9, 15);
SWAP(9, 14);
SWAP(8, 13);
SWAP(9, 13);
SWAP(7, 10);
SWAP(8, 11);
SWAP(9, 12);
SWAP(9, 11);
SWAP(8, 10);
SWAP(9, 10);
result[num] = array[10*nTPB + t];
}
}

int main(){

unsigned char *din, *dout1, *dout2, *dout3, *dout4, *dout5;
unsigned char *hin, *hout1, *hout2, *hout3, *hout4, *hout5;
hin = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
hout1 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout2 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout3 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout4 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
hout5 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
cudaMalloc(&din, IMW*IMH*IMN*sizeof(unsigned char));
cudaMalloc(&dout1, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout2, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout3, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout4, IMW*IMH*sizeof(unsigned char));
cudaMalloc(&dout5, IMW*IMH*sizeof(unsigned char));
for (int i = 0; i < IMN; i++)
for (int j = 0; j< IMH*IMW; j++)
//hin[i*IMH*IMW+j] = (i+11)%IMN;
hin[i*IMH*IMW+j] = rand()%255;
memset(hout1, 0, IMW*IMH*sizeof(unsigned char));
memset(hout2, 0, IMW*IMH*sizeof(unsigned char));
memset(hout3, 0, IMW*IMH*sizeof(unsigned char));
memset(hout4, 0, IMW*IMH*sizeof(unsigned char));
memset(hout5, 0, IMW*IMH*sizeof(unsigned char));
cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
cudaMemset(dout1, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout2, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout3, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout4, 0, IMW*IMH*sizeof(unsigned char));
cudaMemset(dout5, 0, IMW*IMH*sizeof(unsigned char));
cudaCheckErrors("setup");
unsigned long long my_time = dtime_usec(0);
filter<unsigned char><<<dim3(IMW,IMH), 1>>>(din, dout1);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 1 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout1, dout1, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
// if (!validate(hout1)) {printf("image 1 fail\n"); }
my_time = dtime_usec(0);
filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout2);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 2 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout2, dout2, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout1, hout2)) {printf("image 2 fail\n"); }
my_time = dtime_usec(0);
filter<uchar4><<<dim3(IMW/(4*BX),IMH/BY), dim3(BX,BY)>>>(din, dout3);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 3 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout3, dout3, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout1, hout3)) {printf("image 3 fail\n"); }
my_time = dtime_usec(0);
filter2<<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout4);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 4 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout4, dout4, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout1, hout4)) {printf("image 4 fail\n"); }
my_time = dtime_usec(0);
filter3<<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout5);
cudaCheckErrors("kernel");
my_time = dtime_usec(my_time);
printf("time 5 = %fs\n", my_time/(float)USECPSEC);
cudaMemcpy(hout5, dout5, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
if (!validate(hout1, hout5)) {printf("image 5 fail\n"); }
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t650 t650.cu
$ ./t650
time 1 = 0.293081s
time 2 = 0.006445s
time 3 = 0.006166s
time 4 = 0.005773s
time 5 = 0.003592s
$
``````

Speedup over 1 thread per block approach (#1):
2: 45x
3: 47x
4: 50x
5: 81x


Wow, you certainly sank your teeth into this. Lots of worked examples. Thank you.

As you might guess, yes, I am new to CUDA, but certainly not to massively parallel computation, or programming (80+ languages, 40 years, many many OSes, and assembler going back to Macro-11).

Do you have the patience for a few questions, for my education (and perhaps others reading this)? Feel free to castigate any of the assumptions and reasoning below; don't worry about my ego, I want to learn. Just read the whole post.

Let me first make the problem context easier. There is a real-time frame grabber getting raw imagery in image frames of size [1024, 1024]. It returns a byte array of the form:

uchar[1048576]

and places it, using a "circular queue", in host memory in the form:

uchar[21][1048576]

So unless I want a transform of the axes (either on the host or on the GPU), there is no hope of avoiding each thread having to stride through memory in 1,048,576-byte strides. [I am considering doing the reshape of the matrix on the GPU, but let's not discuss that for the moment, nor how to deliver frame slices over time to the GPU from the host.] If we focus on the kernel, I assumed:

1. Since each thread touches completely different places in global memory, there was no hope of getting any performance improvement by clustering threads into blocks (so yes, I knew full well that warp processing would be killed by this, but I assumed there was no hope; false, see below, but explain why?).

2. Using uchar4 is out of the question, since the cost of restructuring the data from the frame grabber would kill real-time performance (30 FPS).

3. Since the sort phase uses only L1 (shared memory), only has 21 values, and we only have to sort half of the data, bubble sort is good enough.

OK, but I was blown away by how much effort you put into it. So I tried this with a large thread block. (I can afford large ones since there is not much else going on).

I am also waiting for our Tesla K80, and just doing some testing with an old GTX480.

By using a thread block size of 1024, I got a 60x improvement (down to 2.98 ms). Why? That is really the question I need answered to get wiser.

My guess is that the L2 cache mechanism is smarter than I thought, and that even though each thread strides 1 MB between frames, enough is being placed into the L2 cache that the warps are finding the data in L2 and loading it into L1. Does that make sense?

Below is my source code, which is written in C# using CUDAfy.NET, which makes for a much more succinct solution and cuts out a lot of the C macro and other work. It should be pretty clear to any C/C++ programmer.

``````[Cudafy]
public static void filter(GThread thread, byte[,] frames, int frameSize, byte[] result)
{
// the kernel signature and the per-thread window declaration were lost in
// posting; reconstructed here in plausible CUDAfy style
byte[] window = new byte[frameCount];
int idx = thread.threadIdx.x + (thread.blockIdx.x * thread.blockDim.x);
if (idx < frameSize)
{
for (int i = 0; i < frameCount; i++)
{
window[i] = frames[i, idx];
}
for (int j = 0; j < (frameCount / 2) + 1; j++)
{
int min = j;
for (int l = j + 1; l < frameCount; l++)
{
if (window[l] < window[min])
min = l;
}
byte temp = window[j];
window[j] = window[min];
window[min] = temp;
}
result[idx] = window[frameCount / 2];
}
}
``````
``````public static void setup(int height, int width)
{
CudaWorker.width = width;
CudaWorker.height = height;
CudaWorker.FrameSize = width * height;
CudaWorker.frames = new byte[frameCount, FrameSize];
CudaWorker.result = new byte[FrameSize];
CudaWorker.tickRate = (1000L * 1000L * 1000L) / (Stopwatch.Frequency);
CudafyModule km = CudafyTranslator.Cudafy();
CudafyModes.Target = eGPUType.Cuda;                 // emulation = eGPUType.Emulator
//CudafyModes.Target = eGPUType.Emulator;
CudafyModes.DeviceId = 1;
CudafyTranslator.Language = eLanguage.Cuda;
CudafyTranslator.GenerateDebug = true;
gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
}
``````
``````public static void run()
{
result = new byte[FrameSize];
runTime = "";
gframes = gpu.CopyToDevice(frames);
gresult = gpu.Allocate<byte>(result);
Stopwatch watch = new Stopwatch();
watch.Start();
gpu.StartTimer();
gpu.Launch(height, width).filter(gframes, FrameSize, gresult);
runTime = watch.Elapsed.ToString();
gpu.Synchronize();
CudaWorker.msTime = gpu.StopTimer().ToString();
watch.Stop();
totalRunTime = watch.Elapsed.ToString();
gpu.CopyFromDevice(gresult, result);
gpu.FreeAll();
}
``````

Acknowledged. I think I understand the context. And your observation about the fundamental nature of each thread striding through a set of images is completely correct. But that is not the source of the problem, as both you and I have demonstrated.

This seems to be the crux of your knowledge gap, and may require some unpacking. I'll address it below, where you ask a similar question.

I wouldn't describe it this way. See below, especially the "diagram" illustrating how adjacent threads can access adjacent elements in memory per read transaction.

That's OK, don't use uchar4. I'm not sure you really grokked my code or results: the uchar4 evolution yielded no significant improvement. I thought it might, but it did not. Having said that, I don't actually think using uchar4 to assist in global loads and stores requires any modification to your data storage, and I believe my code demonstrates that. But even that is irrelevant, if you now understand how to get the 60x speedup.

I agree that the bubble sort is fine. Mostly for my own amusement, I was interested in trying other methods. The rank-order sort yielded no significant benefit, and the brute-force sorting networks approach yielded ~1.5x improvement (but suffers in other ways, such as code flexibility and maintainability). In some situations, folks would kill for an additional 1.5x speedup. But if you just figured out how to speed up your code by 60x, an additional 1.5x probably seems like small potatoes.

There are 2 related issues with the 1-thread-per-block structure. First, execution unit utilization: each warp carries only one active thread, so 31 of every 32 execution lanes do no work. Second, memory bandwidth utilization: each memory transaction retrieves a full line, but only a single requested byte of it gets used.

When we combine the above 2 mechanisms, we can start to explain a 50-60x speedup. There appears to be one additional piece of information you don't seem to be grasping: how can adjacent threads request adjacent items in memory? Why is the stride not an issue?

Let's consider the first 32 pixels in your image, across each of the 21 images, and their respective addresses in memory:

``````Image 1:

pixel:        0       1       2       3       4 ...      31
addrs:        0       1       2       3       4 ...      31

Image 2:

pixel:        0       1       2       3       4 ...      31
addrs:  1048576 1048577 1048578 1048579 1048580 ... 1048607

Image 3:

pixel:        0       1       2       3       4 ...      31
addrs:  2097152 2097153 2097154 2097155 2097156 ... 2097183

...
``````

To compute the median filter, thread 0 will need to load pixel 0 from each image; thread 1 will need to load pixel 1 from each image. If I do this in separate blocks, then a given block will be requesting a single byte from image 1 (but 32 bytes will be retrieved), and another block will be requesting a single byte (at some other offset) from each image, with each transaction "throwing away" 31 bytes.

Now let's suppose I have 32 threads executing together in a warp, with thread 0 responsible for pixel 0, thread 1 responsible for pixel 1, etc. Now the load instruction for the pixel from image 1 occurs across all threads in the warp simultaneously, and a single 32-byte read from DRAM satisfies the needs of all 32 threads. No bytes are wasted.
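The transaction arithmetic can be sketched numerically (a toy model I'm making up for illustration, assuming 32-byte DRAM transactions and the frame layout described above):

```python
FRAME = 1024 * 1024  # bytes per image, per the layout above

# Addresses requested by one warp (32 adjacent threads / pixels) for the
# load from image i; they form one contiguous 32-byte run, regardless of i.
def warp_addrs(i, first_pixel):
    return [i * FRAME + first_pixel + t for t in range(32)]

a = warp_addrs(2, 0)
assert a == list(range(2 * FRAME, 2 * FRAME + 32))  # one 32-byte transaction

# Efficiency: useful bytes per 32-byte transaction
coalesced = 32 / 32    # a warp of 32 pixels uses every fetched byte
one_thread = 1 / 32    # 1-thread blocks waste 31 of every 32 fetched bytes
assert coalesced / one_thread == 32
```

The 1 MB inter-frame stride never enters into it: each per-image load is its own fully packed transaction.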

The combination of these two effects (execution unit utilization, memory bandwidth utilization) is what can give rise to the dramatic 50-60x speedup.

Yes, now I grok this. I did read Programming in CUDA (Sanders & Kandrot) in detail (cover to cover) and worked the examples, but I did notice they left out a lot about warps and warp scheduling. This fills in the gap.

So I had one other performance improvement I was considering. Since the median over the image frames is done as a sliding window (across all 21 frames), there might be an additional improvement.

If, when a new frame arrives in the GPU, I transform the axes so that it aligns the "Z" axis with the prior transformed data, would that lead to a performance boost?

In effect, I would be creating a (fictional) uchar21 that would sequentially hold all the successive frames, with the frame ID as the lowest byte.

This transform and insert (or destructive update) into the [1024x1024, 21] array (now ordered along the Z axis) could be done by a dynamically scheduled kernel that would overlap the other computation. (I am thinking of the Tesla K80 when it arrives.)

``````After the transform (time-major, Z innermost):

pixel 0:  img0 img1 ... img20      addrs:  0  1 ...  20
pixel 1:  img0 img1 ... img20      addrs: 21 22 ...  41
pixel 2:  img0 img1 ... img20      addrs: 42 43 ...  62

...
``````

In this case, I would block the threads out differently. So that successive warps of threads in a block would be grabbing adjacent data addresses.

But I am also thinking that it is not a win, since right now the warp scheduler is probably fetching data in optimal successive waves through memory (in the 48-byte GDDR5 read lines).

At the moment, I donâ€™t see any benefit. The data reorganization will not improve load-from-global efficiency, because weâ€™ve already demonstrated that the existing arrangement can yield 100% efficiency (fully coalesced loads, and no extra loads required, i.e. 100% utilization).

Shared memory will still be required, since the bubble-sorting process "depends" on it, and the remainder of the algorithm is essentially unaffected.

As an aside, I'm familiar with "CUDA by Example" (Sanders, Kandrot) but not "Programming in CUDA" (Sanders, Kandrot). One expository method they use is to disconnect the concepts of threads and blocks, and teach each separately/independently. The approach has merit, and is a common way to introduce the concepts, but it has a possible drawback in that students may think it's OK to write programs that consist of one threadblock, or that use only one thread per block (since programs of both types are covered "by example" in the book).

It is not OK (from a performance perspective). Ever. Any such material should have a large disclaimer:

"Don't ever do this in the real world. If you ever find yourself constructing a performance-sensitive kernel launch like this: kernel<<<1, ...>>>(...); or this: kernel<<<...,1>>>(...); you should immediately stop, step away from the computer, and re-think your life."

A lot of detailed posts. My scenario is probably covered in here somewhere, but in case it hasn't been, here is my $0.02f:

1. During ingest, the warp reads 32 (or a multiple thereof) pixels from each of the 21 consecutive frames into shared memory. This gives coalesced access within each frame. There is no need for __syncthreads(), since the threads in a warp are SIMT-synchronous.
2. In a loop of 32 (or a multiple) iterations, the warp performs a Kepler shuffle-based bitonic sort and determines the median for the given pixel position. This gives suboptimal utilization of the warp, as only 21 elements are put through the network in each iteration.
3. Following the loop, the warp writes the results out, again in a coalesced manner.

The obvious question is: does the performance gain from coalesced I/O compensate for the underutilization of the warp during the bitonic sort phase?
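A sketch of what the shuffle-based sort step might look like (untested; `__shfl_xor_sync` is the CUDA 9+ spelling, Kepler-era code would use `__shfl_xor` without the mask argument; the 21 real samples are padded to 32 lanes with 0xff so the median lands in lane 10):

``````// full 32-lane bitonic sort within a warp; afterwards lane k holds the
// k-th smallest value, so with 11 pad lanes of 0xff the median is in lane 10
__device__ unsigned char warp_bitonic_sort(unsigned char v)
{
    int lane = threadIdx.x & 31;
    for (int size = 2; size <= 32; size <<= 1) {
        bool dir = ((lane & size) == 0);              // direction of this run
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            unsigned char other = __shfl_xor_sync(0xffffffffu, v, stride);
            bool keep_min = (((lane & stride) == 0) == dir);
            v = keep_min ? min((int)v, (int)other) : max((int)v, (int)other);
        }
    }
    return v;
}
``````

As noted, 11 of the 32 lanes carry padding, which is exactly the underutilization in question.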

TxBob:

FYI: we are hoping to use P2P RDMA from the video frame grabber to the Tesla K80, to move the data over the PCIe bus without going through the host, exploiting the UM model. If that fails, we will DMA to the host and then DMA to the GPU in frame-size slices. (I am doing the 2nd option now.)

You also seem to say that the only real improvement we can explore is switching to a Batcher Bitonic sort instead of bubble.

I just wanted to inform you that the hardware guys tell us that we can get by with a 15 frame median filter which would be 120 loops on the bubble sort.

Your 21-frame network sort was also 119 steps, and this is all in shared memory (L1). So I am wondering whether changing the sort algorithm helps very much; we are probably talking about shaving microseconds.

It looked to me like switching from the bubble sort to the sorting-networks approach gave a ~1.5x speedup. I'm assuming that would cut 3 ms down to 2 ms or something like that. For the GPU I was using (a little slower than your GTX480), it went from ~6 ms to ~3.6 ms (the data is in a previous posting of mine, above).

I didn't say anything about a bitonic sort. I don't know which sort is best. I tried bubble, rank-order, and sorting networks (Bose-Nelson).

If you've decided that 15 frames is right for the median filter, the sorting-networks approach can easily be cut down to a 15-element sort instead of a 21-element sort, and I'm sure it would run quicker still.

You can auto-generate the 15-element sort SWAP macros here:

http://pages.ripco.net/~jgamble/nw.html
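For illustration, generated sorting-network code has a characteristic shape: a SWAP comparator macro plus a fixed list of exchanges. The sketch below is not the generator's output; to keep the list short it uses a 5-element odd-even transposition network (six alternating rounds, provably sufficient to sort), but a 15- or 21-element Bose-Nelson list from the generator drops in the same way.

```c
#include <assert.h>

/* Comparator macro in the style used with generated sorting networks:
   each SWAP(x,y) orders a fixed pair of positions in array d. */
#define SWAP(x, y) do { \
    if (d[y] < d[x]) { unsigned char t = d[x]; d[x] = d[y]; d[y] = t; } \
} while (0)

/* 5-element odd-even transposition network (12 comparators). The jgamble
   generator would emit a shorter Bose-Nelson list with the same structure. */
static unsigned char median5(unsigned char *d)
{
    SWAP(0, 1); SWAP(2, 3);   /* even round  */
    SWAP(1, 2); SWAP(3, 4);   /* odd round   */
    SWAP(0, 1); SWAP(2, 3);
    SWAP(1, 2); SWAP(3, 4);
    SWAP(0, 1); SWAP(2, 3);
    SWAP(1, 2); SWAP(3, 4);
    return d[2];              /* middle element after the full sort */
}
```

Because the comparison sequence is fixed and branch-light, the same code runs identically in every thread of a warp, which is what makes networks attractive on GPUs.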

You read my mind as to the next question: generating the swap list.

At that URL he calls the bitonic sort the Batcher Merge-Exchange. Back in the 1980s I knew it simply as the Batcher sort (way before its current GPU popularity).

This is an excellent example of an image filter using CUDA, and I would like to know how it could be used in Python with pycuda.

I am new to C/C++, so getting frames from the camera into this filter and back again for use by OpenCV is not totally clear to me.

Assuming I had a Python program that stacked sequential images into a numpy array, what would I need to do to get the format correct to use the t650 filter script above?

I am currently using numpy for the median filter in Python (`median = np.median(buffer, axis=0)`), with a stacked buffer of 50 HD images (1920x1080) acquired through OpenCV (`cam = cv2.VideoCapture(0)`), and getting cycle times of 2 to 3 seconds to generate the median.

How would I use the example given to replace this function and get faster cycle times? Any examples would be greatly appreciated.

Regards h

numba allows you to write CUDA kernels in python code, and to utilize numpy data. For relatively simple kernels such as the ones I have shown, the translation from CUDA-C to numba CUDA-python is pretty much mechanical.

If you want to learn how to use numba, it should be pretty straightforward at that point to convert what I have shown to equivalent numba CUDA kernels.

Since you mention pycuda, that is perhaps even easier. In pycuda, kernels are written in CUDA C/C++, which means you should be able to use this code as-is to duplicate what I have done in CUDA C/C++.
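As a stepping stone, a plain NumPy reference for what the kernel computes is handy for validating a numba or pycuda port against. This is a sketch with illustrative names, not code from the posts above:

```python
import numpy as np

def temporal_median(frames: np.ndarray) -> np.ndarray:
    """Per-pixel median over the time axis.

    frames: uint8 array of shape (IMN, H*W) -- one row per frame,
    matching the [frame][pixel] layout the CUDA kernel expects.
    Returns a uint8 array of shape (H*W,).
    """
    # IMN is odd, so the median is an actual sample value and the
    # cast back to uint8 is exact.
    return np.median(frames, axis=0).astype(np.uint8)
```

A GPU port can then be checked with `np.array_equal(temporal_median(stack), gpu_result)`.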

Hi txbob, thanks for your response.

I received the following errors when incorporating it into a pycuda script:

```
kernel.cu(36): error: this declaration may not have extern "C" linkage
kernel.cu(68): warning: linkage specification is not allowed
```

Is there something extra required? Is this referring to the include statements?

My dummy script to test pycuda is as follows:

```python
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10

// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \\
  do { \\
    cudaError_t __err = cudaGetLastError(); \\
    if (__err != cudaSuccess) { \\
      fprintf(stderr, "Fatal error: %s (%s)\\n", msg, cudaGetErrorString(__err)); \\
      exit(1); \\
    } \\
  } while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
  timeval tv1;
  gettimeofday(&tv1, 0);
  return ((tv1.tv_sec * USECPSEC) + tv1.tv_usec) - prev;
}

int validate(unsigned char *data){
  int pass = 1;
  for (int i = 0; i < IMH; i++)
    for (int j = 0; j < IMW; j++)
      if (data[i*IMW+j] != CHECK_VAL) pass = 0;
  return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only
template <typename T>
__global__ void filter(unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB*sizeof(T)];
  int x = threadIdx.x + blockDim.x*blockIdx.x;
  int y = threadIdx.y + blockDim.y*blockIdx.y;
  int num = x + y * (IMW/sizeof(T));

  T *my_frames = reinterpret_cast<T *>(frames);
  T *my_array  = reinterpret_cast<T *>(array);

  if ((x < IMW/sizeof(T)) && (y < IMH)){
    for (int i = 0; i < IMN; i++){
      my_array[i*nTPB + threadIdx.y*BX + threadIdx.x] =
        my_frames[i*((IMW*IMH)/sizeof(T)) + num];
    }
    for (int bpp = 0; bpp < (sizeof(T)); bpp++)
      for (int j = 0; j < ((IMN/2)+1); j++){
        int num2 = j;
        for (int k = j + 1; k < IMN; k++){
          if (array[(k*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp] <
              array[(num2*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp])
            num2 = k;
        }
        unsigned char b = array[(j*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp];
        array[(j*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp] =
          array[(num2*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp];
        array[(num2*nTPB + threadIdx.y*BX + threadIdx.x)*sizeof(T)+bpp] = b;
      }
    T *my_result = reinterpret_cast<T *>(result);
    my_result[num] = my_array[(IMN/2)*nTPB + threadIdx.y*BX + threadIdx.x];
  }
}

int main(){
  unsigned char *din, *dout;
  unsigned char *hin, *hout;
  hin  = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
  hout = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&din,  IMW*IMH*IMN*sizeof(unsigned char));
  cudaMalloc(&dout, IMW*IMH*sizeof(unsigned char));
  for (int i = 0; i < IMN; i++)
    for (int j = 0; j < IMH*IMW; j++)
      hin[i*IMH*IMW+j] = (i+5)%IMN;
  memset(hout, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
  cudaMemset(dout, 0, IMW*IMH*sizeof(unsigned char));
  cudaCheckErrors("setup");
  unsigned long long my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout);
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  cudaMemcpy(hout, dout, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  my_time = dtime_usec(0);

  return 0;
}
""")

median = mod.get_function("filter")
```

We have also had good experiences employing sorting networks in CUDA implementations of small (e.g., 3x3) rank-order filters, seeing great speedups compared to our CPU implementation.

In addition to the link txbob provided for auto-generating the sequences of SWAP() calls, one can also use the (meta) template class from http://stackoverflow.com/questions/19790522/very-fast-sorting-of-fixed-length-arrays-using-comparator-networks for sorting N elements with a sorting network (Bose-Nelson algorithm). N is the template parameter.

Solving my own question: the errors are resolved by adding `no_extern_c=1` as shown below…

```python
return 0;
}
""", no_extern_c=1)
```

But to get the function assignment to work, I needed to remove `import pycuda.driver as drv`, as it creates an import loop.

Correction: that was not relevant (removing `import pycuda.driver as drv` was not needed).

The actual issue is name mangling by the JIT compiler (ref: https://stackoverflow.com/questions/21405901/pycuda-either-fails-to-find-function-in-nvidia-source-code-or-throws-may-not-ha).

Not yet getting output from this function using pycuda.

You have some things to learn about pycuda.

With pycuda, you would normally write only the kernel code in C/C++, while all other code would be Python (i.e., everything that runs on the host, including the code that invokes the kernel).

So it's not simply a matter of taking an entire C/C++ program (the main function and everything else) and incorporating it into a pycuda SourceModule statement.

Please study some pycuda sample codes. Your SourceModule statement, to a first-order approximation, should contain only the kernel code (and whatever is needed to support its compilation, such as macros). Referring to the t650 code previously posted, the kernel is the code beginning with `__global__ void filter…`
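To make that split concrete, here is a minimal sketch of how such a script could be structured (this assumes a CUDA-capable machine with pycuda installed, and uses the simple one-thread-per-pixel kernel from the top of the thread rather than the templated t650 version, precisely to avoid the name-mangling issue):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

IMN, H, W = 21, 1024, 1024

# Only device code lives in the SourceModule (plus the macros it needs).
# extern "C" keeps the name unmangled so get_function("filter") can find it.
mod = SourceModule("""
#define IMN 21
extern "C" __global__ void filter(unsigned char *frames, int frameSize,
                                  unsigned char *result)
{
    int num = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
    if (num >= frameSize) return;
    unsigned char a[IMN];
    for (int i = 0; i < IMN; i++)                      // coalesced across threads
        a[i] = frames[i * frameSize + num];
    for (int j = 0; j <= IMN/2; j++) {                 // partial selection sort
        int m = j;
        for (int k = j + 1; k < IMN; k++)
            if (a[k] < a[m]) m = k;
        unsigned char t = a[j]; a[j] = a[m]; a[m] = t;
    }
    result[num] = a[IMN/2];                            // median of the 21 samples
}
""", no_extern_c=1)

filter_k = mod.get_function("filter")

# Everything host-side stays in Python: numpy in, numpy out.
frames = np.random.randint(0, 256, (IMN, H * W), dtype=np.uint8)
result = np.empty(H * W, dtype=np.uint8)
filter_k(drv.In(frames), np.int32(H * W), drv.Out(result),
         block=(256, 1, 1), grid=((H * W + 255) // 256, 1))
```

The output can then be compared against `np.median(frames, axis=0)` to validate the port.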