Median filter (time dimension) for images (bad performance, no memory coalescing)

There are lots of examples of 2D median CUDA filters, but we have a stream of images (1024x1024) that we are trying to process fast. Since there is a lot of noise, we stack 21 images and do a 1x1 median filter across the 21 frames in time. This time-based median filter really helps us extract the background and remove the noise. Since the stride is very large, we are getting no memory coalescing, so performance is really bad. We cannot reshape the images, since this is a real-time stream.

So imagine you had to handle 20 FPS images that are 1024x1024 and do a simple 1x1 filter (a thread per pixel, where each thread looks at the 21 consecutive images).

The frame data being passed is:

frames = bytes[1,048,576 ; 21]  (2D)
result = bytes [1,048,576] (1D)
frameSize = 1,048,576 // (our frame grabber gives us a 1D array for the 1024x1024 image)

extern "C" __global__  void filter( unsigned char* frames,  int frameSize,  unsigned char* result)
{
	int x = blockIdx.x;
	int y = blockIdx.y;
	int num = x + y * gridDim.x;
	__shared__ unsigned char array[21];

	int arrayLen0 = 21;
	if (num < frameSize)
	{
		for (int i = 0; i < 21; i++)
		{
			array[(i)] = frames[(i) * frameSize + (num)];	// the same pixel is frameSize bytes apart in consecutive frames
		}
		for (int j = 0; j < 11; j++)
		{
			int num2 = j;
			for (int k = j + 1; k < 21; k++)
			{
				if (array[(k)] < array[(num2)])
				{
					num2 = k;
				}
			}
			unsigned char b = array[(j)];
			array[(j)] = array[(num2)];
			array[(num2)] = b;
		}
		result[(num)] = array[(10)];
	}
}

Lol, it took me a moment to figure it out. I was looking at this line of code thinking “well, that should coalesce nicely”:

array[(i)] = frames[(i) * frameSize + (num)];

You are launching this kernel like this, aren’t you:

filter<<<some_2D_value, 1>>>(...);

Launching kernels with 1 thread per block leaves about 97% of the performance of your GPU on the table. For every single instruction dispatched, 31 out of 32 execution units are idle, on every cycle, everywhere on the GPU. And yes, you are correct, it is impossible to get coalesced loads in that configuration. But it has nothing to do with a large stride.

Rework your kernel to launch a grid including multiple threads per block. You will then be able to get coalesced loads from global memory here:

array[(i)] = frames[(i) * frameSize + (num)];

and stores here:

result[(num)] = array[(10)];

and your performance will go up dramatically, for several reasons, one of which is memory usage efficiency. You’ll need to rework your kernel indexing a bit for both global and shared memory, and increase the amount of shared memory used by the kernel to at least 21*number_of_threads_per_block bytes. This additional shared memory usage should be no big deal from an occupancy/performance perspective: even at 128 threads per block, you’re only using 2688 bytes per threadblock, so you’re unlikely to run into significant occupancy constraints due to shared memory usage. And the resultant increase in machine utilization will make a dramatic performance impact.
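To make the rework concrete, here is a minimal sketch (my own illustration, not your generated code, with an arbitrary 32x8 block shape) of what the re-indexed kernel and its launch could look like:

// Sketch only: one thread per pixel, 256 threads per block, 21 samples per
// thread staged in shared memory. The 32x8 block shape is an arbitrary choice,
// and the launch assumes the 1024x1024 frame divides evenly by it.
#define NFRAMES 21
#define BLKX 32
#define BLKY 8

extern "C" __global__ void filter(unsigned char* frames, int frameSize, unsigned char* result)
{
    __shared__ unsigned char smem[NFRAMES * BLKX * BLKY];
    int x   = threadIdx.x + blockIdx.x * blockDim.x;
    int y   = threadIdx.y + blockIdx.y * blockDim.y;
    int num = x + y * gridDim.x * blockDim.x;           // pixel index within the frame
    int t   = threadIdx.y * BLKX + threadIdx.x;         // this thread's slot within the block

    if (num < frameSize) {
        // adjacent threads read adjacent bytes of each frame -> coalesced loads
        for (int i = 0; i < NFRAMES; i++)
            smem[i * BLKX * BLKY + t] = frames[i * frameSize + num];
        // same partial selection sort as the original, on a private 21-sample window per thread
        for (int j = 0; j < NFRAMES / 2 + 1; j++) {
            int m = j;
            for (int k = j + 1; k < NFRAMES; k++)
                if (smem[k * BLKX * BLKY + t] < smem[m * BLKX * BLKY + t]) m = k;
            unsigned char b = smem[j * BLKX * BLKY + t];
            smem[j * BLKX * BLKY + t] = smem[m * BLKX * BLKY + t];
            smem[m * BLKX * BLKY + t] = b;
        }
        result[num] = smem[(NFRAMES / 2) * BLKX * BLKY + t];   // coalesced store
    }
}

// launched, for a 1024x1024 frame, as e.g.:
// filter<<<dim3(1024/BLKX, 1024/BLKY), dim3(BLKX, BLKY)>>>(d_frames, 1024*1024, d_result);

No __syncthreads() is needed, because each thread only ever touches its own 21 shared-memory slots.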

From a performance perspective, it is never a good idea to launch kernels with either of these grid configurations:

kernel<<<...,1>>>(...);
kernel<<<1,...>>>(...);

For my own amusement, I decided to take the code you have shown and recast the kernel somewhat, into a form that can be used (via templating) to demonstrate the timing difference when processing the 21-image stack in 3 different cases:

  1. launching 1-thread-per-block, and using unsigned char loads/stores
  2. launching 256 threads per block, and using unsigned char loads/stores
  3. launching 256 threads per block, but using uchar4 loads/stores

The results of the fully worked example below suggest that there is a ~50x speedup to be gained from case 1 to case 2, and no additional benefit (somewhat surprisingly, to me) going from case 2 to case 3.

$ cat t650.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10

// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}

int validate(unsigned char *data){
  int pass = 1;
  for (int i = 0; i < IMH; i++)
    for (int j = 0; j < IMW; j++)
      if (data[i*IMW+j] != CHECK_VAL) pass = 0;
  return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only

template <typename T>
__global__  void filter( unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB*sizeof(T)];
  int x = threadIdx.x+blockDim.x*blockIdx.x;
  int y = threadIdx.y+blockDim.y*blockIdx.y;
  int num = x + y * (IMW/sizeof(T));

  T *my_frames = reinterpret_cast<T *>(frames);
  T *my_array  = reinterpret_cast<T *>(array);

  if ((x < IMW/sizeof(T)) && (y < IMH)){
    for (int i = 0; i < IMN; i++){
      my_array[(i*nTPB)+(threadIdx.y*BX)+threadIdx.x] = my_frames[(i * (IMW*IMH/sizeof(T))) + ( num)];
      }
    for (int bpp = 0; bpp < (sizeof(T)); bpp++)
      for (int j = 0; j < ((IMN/2)+1); j++){
        int num2 = j;
        for (int k = j + 1; k < IMN; k++){
          if (array[(k*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] < array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp]){
            num2 = k;
            }
          }
        unsigned char b = array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = b;
      }
    T *my_result = reinterpret_cast<T *>(result);
    my_result[(num)] = my_array[((IMN/2)*nTPB) + (threadIdx.y*BX)+threadIdx.x];
  }
}

int main(){

  unsigned char *din, *dout1, *dout2, *dout3;
  unsigned char *hin, *hout1, *hout2, *hout3;
  hin = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
  hout1 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout2 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout3 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&din, IMW*IMH*IMN*sizeof(unsigned char));
  cudaMalloc(&dout1, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout2, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout3, IMW*IMH*sizeof(unsigned char));
  for (int i = 0; i < IMN; i++)
    for (int j = 0; j< IMH*IMW; j++)
      hin[i*IMH*IMW+j] = (i+5)%IMN;
  memset(hout1, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout2, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout3, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
  cudaMemset(dout1, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout2, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout3, 0, IMW*IMH*sizeof(unsigned char));
  cudaCheckErrors("setup");
  unsigned long long my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW,IMH), 1>>>(din, dout1);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 1 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout1, dout1, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout1)) {printf("image 1 fail\n"); return 1;}
  my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout2);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 2 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout2, dout2, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout2)) {printf("image 2 fail\n"); return 1;}
  my_time = dtime_usec(0);
  filter<uchar4><<<dim3(IMW/(4*BX),IMH/BY), dim3(BX,BY)>>>(din, dout3);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 3 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout3, dout3, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout3)) {printf("image 3 fail\n"); return 1;}
  return 0;
}
$ nvcc -O3 -arch=sm_20 -o t650 t650.cu
$ ./t650
time 1 = 0.292929s
time 2 = 0.005511s
time 3 = 0.006170s
$

If the conclusion is reached that there is no additional benefit from the uchar4 loads, then the kernel could be somewhat simplified.

I’ve been playing a bit more with the code and have added 2 modified approaches. The first 3 implementations are essentially the same as above, except that I am now using random data and my image validation compares image pairs for identity between the methods. The 2 added implementations change the sorting method from a bubble sort to a rank-order sort (time 4) and to a brute-force sorting-network approach (time 5); both make extensive use of macros to assist with code generation. Here’s a fully worked example:

$ cat t650.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>


#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10


#define G(A,B) (array[(A*nTPB)+(threadIdx.y*BX)+threadIdx.x]>array[(B*nTPB)+(threadIdx.y*BX)+threadIdx.x])
#define E(A,B) (array[(A*nTPB)+(threadIdx.y*BX)+threadIdx.x]>=array[(B*nTPB)+(threadIdx.y*BX)+threadIdx.x])

#define SWAP(A,B) if (array[(A*nTPB)+(threadIdx.y*BX)+threadIdx.x]<array[(B*nTPB)+(threadIdx.y*BX)+threadIdx.x])\
     { unsigned char tmp = array[(A*nTPB)+(threadIdx.y*BX)+threadIdx.x]; \
       array[(A*nTPB)+(threadIdx.y*BX)+threadIdx.x] = array[(B*nTPB)+(threadIdx.y*BX)+threadIdx.x]; \
       array[(B*nTPB)+(threadIdx.y*BX)+threadIdx.x] = tmp;}


// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}

int validate(unsigned char *data1, unsigned char *data2){
  int pass = 1;
  for (int i = 0; i < IMH; i++)
    for (int j = 0; j < IMW; j++)
      if (data1[i*IMW+j] != data2[i*IMW+j])
        if (pass) { printf("image error: (%d,%d): %d:%d\n", i,j,data1[i*IMW+j], data2[i*IMW+j]); pass = 0;}
  return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only

template <typename T>
__global__  void filter( unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB*sizeof(T)];
  int x = threadIdx.x+blockDim.x*blockIdx.x;
  int y = threadIdx.y+blockDim.y*blockIdx.y;
  int num = x + y * (IMW/sizeof(T));

  T *my_frames = reinterpret_cast<T *>(frames);
  T *my_array  = reinterpret_cast<T *>(array);

  if ((x < IMW/sizeof(T)) && (y < IMH)){
    for (int i = 0; i < IMN; i++){
      my_array[(i*nTPB)+(threadIdx.y*BX)+threadIdx.x] = my_frames[(i * (IMW*IMH/sizeof(T))) + ( num)];
      }
    for (int bpp = 0; bpp < (sizeof(T)); bpp++)
      for (int j = 0; j < ((IMN/2)+1); j++){
        int num2 = j;
        for (int k = j + 1; k < IMN; k++){
          if (array[(k*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] < array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp]){
            num2 = k;
            }
          }  
	unsigned char b = array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = b;
      }
    T *my_result = reinterpret_cast<T *>(result);
    my_result[(num)] = my_array[((IMN/2)*nTPB) + (threadIdx.y*BX)+threadIdx.x];
  }
}

__global__  void filter2( unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB];
  int x = threadIdx.x+blockDim.x*blockIdx.x;
  int y = threadIdx.y+blockDim.y*blockIdx.y;
  int num = x + y * (IMW);

  if ((x < IMW) && (y < IMH)){
    for (int i = 0; i < IMN; i++)
      array[(i*nTPB)+(threadIdx.y*BX)+threadIdx.x] = frames[(i * (IMW*IMH)) + ( num)];
    int ot;
    if (10 == G(0,1) +G(0,2) +G(0,3) +G(0,4) +G(0,5) +G(0,6) +G(0,7) +G(0,8) +G(0,9) +G(0,10) +G(0,11) +G(0,12) +G(0,13) +G(0,14) +G(0,15) +G(0,16) +G(0,17) +G(0,18) +G(0,19) +G(0,20))
      ot = 0;
    else if (10 == E(1,0) +G(1,2) +G(1,3) +G(1,4) +G(1,5) +G(1,6) +G(1,7) +G(1,8) +G(1,9) +G(1,10) +G(1,11) +G(1,12) +G(1,13) +G(1,14) +G(1,15) +G(1,16) +G(1,17) +G(1,18) +G(1,19) +G(1,20))
      ot = 1;
    else if (10 == E(2,0) +E(2,1) +G(2,3) +G(2,4) +G(2,5) +G(2,6) +G(2,7) +G(2,8) +G(2,9) +G(2,10) +G(2,11) +G(2,12) +G(2,13) +G(2,14) +G(2,15) +G(2,16) +G(2,17) +G(2,18) +G(2,19) +G(2,20))
      ot = 2;
    else if (10 == E(3,0) +E(3,1) +E(3,2) +G(3,4) +G(3,5) +G(3,6) +G(3,7) +G(3,8) +G(3,9) +G(3,10) +G(3,11) +G(3,12) +G(3,13) +G(3,14) +G(3,15) +G(3,16) +G(3,17) +G(3,18) +G(3,19) +G(3,20))
      ot = 3;
    else if (10 == E(4,0) +E(4,1) +E(4,2) +E(4,3) +G(4,5) +G(4,6) +G(4,7) +G(4,8) +G(4,9) +G(4,10) +G(4,11) +G(4,12) +G(4,13) +G(4,14) +G(4,15) +G(4,16) +G(4,17) +G(4,18) +G(4,19) +G(4,20))
      ot = 4;
    else if (10 == E(5,0) +E(5,1) +E(5,2) +E(5,3) +E(5,4) +G(5,6) +G(5,7) +G(5,8) +G(5,9) +G(5,10) +G(5,11) +G(5,12) +G(5,13) +G(5,14) +G(5,15) +G(5,16) +G(5,17) +G(5,18) +G(5,19) +G(5,20))
      ot = 5;
    else if (10 == E(6,0) +E(6,1) +E(6,2) +E(6,3) +E(6,4) +E(6,5) +G(6,7) +G(6,8) +G(6,9) +G(6,10) +G(6,11) +G(6,12) +G(6,13) +G(6,14) +G(6,15) +G(6,16) +G(6,17) +G(6,18) +G(6,19) +G(6,20))
      ot = 6;
    else if (10 == E(7,0) +E(7,1) +E(7,2) +E(7,3) +E(7,4) +E(7,5) +E(7,6) +G(7,8) +G(7,9) +G(7,10) +G(7,11) +G(7,12) +G(7,13) +G(7,14) +G(7,15) +G(7,16) +G(7,17) +G(7,18) +G(7,19) +G(7,20))
      ot = 7;
    else if (10 == E(8,0) +E(8,1) +E(8,2) +E(8,3) +E(8,4) +E(8,5) +E(8,6) +E(8,7) +G(8,9) +G(8,10) +G(8,11) +G(8,12) +G(8,13) +G(8,14) +G(8,15) +G(8,16) +G(8,17) +G(8,18) +G(8,19) +G(8,20))
      ot = 8;
    else if (10 == E(9,0) +E(9,1) +E(9,2) +E(9,3) +E(9,4) +E(9,5) +E(9,6) +E(9,7) +E(9,8) +G(9,10) +G(9,11) +G(9,12) +G(9,13) +G(9,14) +G(9,15) +G(9,16) +G(9,17) +G(9,18) +G(9,19) +G(9,20))
      ot = 9;
    else if (10 == E(10,0)+E(10,1)+E(10,2)+E(10,3)+E(10,4)+E(10,5)+E(10,6)+E(10,7)+E(10,8)+E(10,9) +G(10,11)+G(10,12)+G(10,13)+G(10,14)+G(10,15)+G(10,16)+G(10,17)+G(10,18)+G(10,19)+G(10,20))
      ot = 10;
    else if (10 == E(11,0)+E(11,1)+E(11,2)+E(11,3)+E(11,4)+E(11,5)+E(11,6)+E(11,7)+E(11,8)+E(11,9) +E(11,10)+G(11,12)+G(11,13)+G(11,14)+G(11,15)+G(11,16)+G(11,17)+G(11,18)+G(11,19)+G(11,20))
      ot = 11;
    else if (10 == E(12,0)+E(12,1)+E(12,2)+E(12,3)+E(12,4)+E(12,5)+E(12,6)+E(12,7)+E(12,8)+E(12,9) +E(12,10)+E(12,11)+G(12,13)+G(12,14)+G(12,15)+G(12,16)+G(12,17)+G(12,18)+G(12,19)+G(12,20))
      ot = 12;
    else if (10 == E(13,0)+E(13,1)+E(13,2)+E(13,3)+E(13,4)+E(13,5)+E(13,6)+E(13,7)+E(13,8)+E(13,9) +E(13,10)+E(13,11)+E(13,12)+G(13,14)+G(13,15)+G(13,16)+G(13,17)+G(13,18)+G(13,19)+G(13,20))
      ot = 13;
    else if (10 == E(14,0)+E(14,1)+E(14,2)+E(14,3)+E(14,4)+E(14,5)+E(14,6)+E(14,7)+E(14,8)+E(14,9) +E(14,10)+E(14,11)+E(14,12)+E(14,13)+G(14,15)+G(14,16)+G(14,17)+G(14,18)+G(14,19)+G(14,20))
      ot = 14;
    else if (10 == E(15,0)+E(15,1)+E(15,2)+E(15,3)+E(15,4)+E(15,5)+E(15,6)+E(15,7)+E(15,8)+E(15,9) +E(15,10)+E(15,11)+E(15,12)+E(15,13)+E(15,14)+G(15,16)+G(15,17)+G(15,18)+G(15,19)+G(15,20))
      ot = 15;
    else if (10 == E(16,0)+E(16,1)+E(16,2)+E(16,3)+E(16,4)+E(16,5)+E(16,6)+E(16,7)+E(16,8)+E(16,9) +E(16,10)+E(16,11)+E(16,12)+E(16,13)+E(16,14)+E(16,15)+G(16,17)+G(16,18)+G(16,19)+G(16,20))
      ot = 16;
    else if (10 == E(17,0)+E(17,1)+E(17,2)+E(17,3)+E(17,4)+E(17,5)+E(17,6)+E(17,7)+E(17,8)+E(17,9) +E(17,10)+E(17,11)+E(17,12)+E(17,13)+E(17,14)+E(17,15)+E(17,16)+G(17,18)+G(17,19)+G(17,20))
      ot = 17;
    else if (10 == E(18,0)+E(18,1)+E(18,2)+E(18,3)+E(18,4)+E(18,5)+E(18,6)+E(18,7)+E(18,8)+E(18,9) +E(18,10)+E(18,11)+E(18,12)+E(18,13)+E(18,14)+E(18,15)+E(18,16)+E(18,17)+G(18,19)+G(18,20))
      ot = 18;
    else if (10 == E(19,0)+E(19,1)+E(19,2)+E(19,3)+E(19,4)+E(19,5)+E(19,6)+E(19,7)+E(19,8)+E(19,9) +E(19,10)+E(19,11)+E(19,12)+E(19,13)+E(19,14)+E(19,15)+E(19,16)+E(19,17)+E(19,18)+G(19,20))
      ot = 19;
    else
      ot = 20;
    result[(num)] = array[(ot*nTPB) + (threadIdx.y*BX)+threadIdx.x];
  }
}

__global__  void filter3( unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB];
  int x = threadIdx.x+blockDim.x*blockIdx.x;
  int y = threadIdx.y+blockDim.y*blockIdx.y;
  int num = x + y * (IMW);

  if ((x < IMW) && (y < IMH)){
    for (int i = 0; i < IMN; i++)
      array[(i*nTPB)+(threadIdx.y*BX)+threadIdx.x] = frames[(i * (IMW*IMH)) + ( num)];
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(2, 4);
    SWAP(2, 3);
    SWAP(0, 3);
    SWAP(0, 2);
    SWAP(1, 4);
    SWAP(1, 3);
    SWAP(1, 2);
    SWAP(5, 6);
    SWAP(8, 9);
    SWAP(7, 9);
    SWAP(7, 8);
    SWAP(5, 8);
    SWAP(5, 7);
    SWAP(6, 9);
    SWAP(6, 8);
    SWAP(6, 7);
    SWAP(0, 5);
    SWAP(1, 6);
    SWAP(1, 5);
    SWAP(2, 7);
    SWAP(3, 8);
    SWAP(4, 9);
    SWAP(4, 8);
    SWAP(3, 7);
    SWAP(4, 7);
    SWAP(2, 5);
    SWAP(3, 6);
    SWAP(4, 6);
    SWAP(3, 5);
    SWAP(4, 5);
    SWAP(10, 11);
    SWAP(13, 14);
    SWAP(12, 14);
    SWAP(12, 13);
    SWAP(10, 13);
    SWAP(10, 12);
    SWAP(11, 14);
    SWAP(11, 13);
    SWAP(11, 12);
    SWAP(16, 17);
    SWAP(15, 17);
    SWAP(15, 16);
    SWAP(19, 20);
    SWAP(18, 20);
    SWAP(18, 19);
    SWAP(15, 18);
    SWAP(16, 19);
    SWAP(17, 20);
    SWAP(17, 19);
    SWAP(16, 18);
    SWAP(17, 18);
    SWAP(10, 16);
    SWAP(10, 15);
    SWAP(11, 17);
    SWAP(11, 16);
    SWAP(11, 15);
    SWAP(12, 18);
    SWAP(13, 19);
    SWAP(14, 20);
    SWAP(14, 19);
    SWAP(13, 18);
    SWAP(14, 18);
    SWAP(12, 15);
    SWAP(13, 16);
    SWAP(14, 17);
    SWAP(14, 16);
    SWAP(13, 15);
    SWAP(14, 15);
    SWAP(0, 11);
    SWAP(0, 10);
    SWAP(1, 12);
    SWAP(1, 11);
    SWAP(1, 10);
    SWAP(2, 13);
    SWAP(3, 14);
    SWAP(4, 15);
    SWAP(4, 14);
    SWAP(3, 13);
    SWAP(4, 13);
    SWAP(2, 10);
    SWAP(3, 11);
    SWAP(4, 12);
    SWAP(4, 11);
    SWAP(3, 10);
    SWAP(4, 10);
    SWAP(5, 16);
    SWAP(6, 17);
    SWAP(6, 16);
    SWAP(7, 18);
    SWAP(8, 19);
    SWAP(9, 20);
    SWAP(9, 19);
    SWAP(8, 18);
    SWAP(9, 18);
    SWAP(7, 16);
    SWAP(8, 17);
    SWAP(9, 17);
    SWAP(8, 16);
    SWAP(9, 16);
    SWAP(5, 11);
    SWAP(5, 10);
    SWAP(6, 12);
    SWAP(6, 11);
    SWAP(6, 10);
    SWAP(7, 13);
    SWAP(8, 14);
    SWAP(9, 15);
    SWAP(9, 14);
    SWAP(8, 13);
    SWAP(9, 13);
    SWAP(7, 10);
    SWAP(8, 11);
    SWAP(9, 12);
    SWAP(9, 11);
    SWAP(8, 10);
    SWAP(9, 10);
    result[(num)] = array[(10*nTPB) + (threadIdx.y*BX)+threadIdx.x];
  }
}

int main(){

  unsigned char *din, *dout1, *dout2, *dout3, *dout4, *dout5;
  unsigned char *hin, *hout1, *hout2, *hout3, *hout4, *hout5;
  cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
  hin = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
  hout1 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout2 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout3 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout4 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  hout5 = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&din, IMW*IMH*IMN*sizeof(unsigned char));
  cudaMalloc(&dout1, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout2, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout3, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout4, IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&dout5, IMW*IMH*sizeof(unsigned char));
  for (int i = 0; i < IMN; i++)
    for (int j = 0; j< IMH*IMW; j++)
      //hin[i*IMH*IMW+j] = (i+11)%IMN;
      hin[i*IMH*IMW+j] = rand()%255;
  memset(hout1, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout2, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout3, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout4, 0, IMW*IMH*sizeof(unsigned char));
  memset(hout5, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
  cudaMemset(dout1, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout2, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout3, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout4, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemset(dout5, 0, IMW*IMH*sizeof(unsigned char));
  cudaCheckErrors("setup");
  unsigned long long my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW,IMH), 1>>>(din, dout1);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 1 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout1, dout1, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  // if (!validate(hout1)) {printf("image 1 fail\n"); }
  my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout2);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 2 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout2, dout2, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout1, hout2)) {printf("image 2 fail\n"); }
  my_time = dtime_usec(0);
  filter<uchar4><<<dim3(IMW/(4*BX),IMH/BY), dim3(BX,BY)>>>(din, dout3);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 3 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout3, dout3, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout1, hout3)) {printf("image 3 fail\n"); }
  my_time = dtime_usec(0);
  filter2<<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout4);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 4 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout4, dout4, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout1, hout4)) {printf("image 4 fail\n"); }
  my_time = dtime_usec(0);
  filter3<<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout5);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  printf("time 5 = %fs\n", my_time/(float)USECPSEC);
  cudaMemcpy(hout5, dout5, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  if (!validate(hout1, hout5)) {printf("image 5 fail\n"); }
  return 0;
}
$ nvcc -O3 -arch=sm_20 -o t650 t650.cu
$ ./t650
time 1 = 0.293081s
time 2 = 0.006445s
time 3 = 0.006166s
time 4 = 0.005773s
time 5 = 0.003592s
$

Speedup over the 1-thread-per-block approach (#1):
2: 45x
3: 47x
4: 50x
5: 81x

Wow, you certainly sank your teeth into this. Lots of worked-out examples. Thank you.

As you might guess, yes, I am new to CUDA, but certainly not to massively parallel computation or to programming (80+ languages, 40 years, many, many OSes, and assembler going back to Macro-11).

Do you have the patience for a few questions, for my education (and perhaps that of others reading this)? Feel free to castigate any of the assumptions and reasoning below; don’t worry about my ego, I want to learn. Just read the whole post.

Let me first make the problem context easier. There is a real-time frame grabber getting raw imagery in image frames of size [1024, 1024]. It returns a byte array of the form:

uchar[1048576]

and places it, via a “circular queue” in host memory, in the form:

uchar[21][1048576]

So unless I transpose the axes (either on the host or on the GPU), there is no hope of avoiding each thread having to stride through memory in 1,048,576-byte steps. [I am considering doing the reshape of the matrix on the GPU, but let’s not discuss that for the moment, nor how to deliver frame slices over time from the host to the GPU.] Focusing on the kernel, I assumed:

  1. Since each thread touches completely different places in global memory, there was no hope of getting any performance improvement by clustering threads into blocks. (So yes, I know full well that warp processing would be crippled by this, but I assumed there was nothing to be gained; false, see below, but explain why?)

  2. Also, since each thread in the thread block would be touching completely different memory (completely different from adjacent threads in the warp), it really only made sense to place into shared memory the 21 bytes that that thread was reading. Why stuff things into the L1 cache when the adjacent thread is not going to touch them (they are off by a 1 MB stride)? Yes, it is really smart if you are doing a 2D median filter, where adjacent threads touch data that is off by a row or column, but not here, where we are working in time.

  3. Using uchar4 is out of the question, since the cost of restructuring the data from the frame grabber would kill real-time performance (30 FPS).

  4. Since the sort phase uses only L1 (shared memory) and has only 21 values, and we only have to sort half of the data, it is good enough to use bubble sort.

OK, but I was blown away by how much effort you put into it, so I tried this with a large thread block. (I can afford large ones since there is not much else going on.)

I am also waiting for our Tesla K80, and just doing some testing with an old GTX480.

By using a thread block size of 1024, I got a 60x improvement (down to 2.98 ms). Why? That is really the question that I need to understand to get wiser.

My guess is that the L2 cache mechanism is smarter than I thought, and that even though the stride between threads is 1 MB, it is placing enough into the L2 cache that the warps are finding the data in L2 and loading it into L1. Does that make sense?

Below is my source code, which is written in C# using CudaFy.NET; that makes for much more succinct solutions and cuts out a lot of the C macro and other boilerplate. It should be pretty clear to any C/C++ programmer.

[Cudafy]
        public static void filter(GThread thread, byte[,] frames, int frameSize, byte[] result)
        {
            int y = thread.blockIdx.x;
            int x = thread.threadIdx.x;
            int idx = x + (y * thread.blockDim.x);
            byte[] window = thread.AllocateShared<byte>("window", frameCount);
            if (idx < frameSize)
            {
                for (int i = 0; i < frameCount; i++)
                {
                    window[i] = frames[i, idx];
                }
                for (int j = 0; j < (frameCount / 2) + 1; j++)
                {
                    int min = j;
                    for (int l = j + 1; l < frameCount; l++)
                    {
                        if (window[l] < window[min])
                            min = l;
                    }
                    byte temp = window[j];
                    window[j] = window[min];
                    window[min] = temp;
                }
                result[idx] = window[frameCount / 2];
            }
        }
public static void setup(int height, int width)
        {
            CudaWorker.width = width;
            CudaWorker.height = height;
            CudaWorker.FrameSize = width * height;
            CudaWorker.frames = new byte[frameCount, FrameSize];
            CudaWorker.result = new byte[FrameSize];
            CudaWorker.tickRate = (1000L * 1000L * 1000L) / (Stopwatch.Frequency);
            CudafyModule km = CudafyTranslator.Cudafy();
            CudafyModes.Target = eGPUType.Cuda;                 // emulation = eGPUType.Emulator
            //CudafyModes.Target = eGPUType.Emulator;
            CudafyModes.DeviceId = 1;
            CudafyTranslator.Language = eLanguage.Cuda;
            CudafyTranslator.GenerateDebug = true;
            gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            gpu.LoadModule(km);
        }
public static void run()
        {
            result = new byte[FrameSize];
            runTime = "";
            gframes = gpu.CopyToDevice(frames);
            gresult = gpu.Allocate<byte>(result);
            Stopwatch watch = new Stopwatch();
            watch.Start();
            gpu.StartTimer();
            gpu.Launch(height, width).filter(gframes, FrameSize, gresult);
            runTime = watch.Elapsed.ToString();
            gpu.Synchronize();
            CudaWorker.msTime = gpu.StopTimer().ToString();
            watch.Stop();
            totalRunTime = watch.Elapsed.ToString();
            gpu.CopyFromDevice(gresult, result);
            gpu.FreeAll();
        }

Acknowledged. I think I understand the context. And your observation about the fundamental nature of each thread striding through a set of images is completely correct. But that is not the source of the problem, as both you and I have demonstrated.

This seems to be the crux of your knowledge gap, and may require some unpacking. I’ll address it below, where you ask a similar question.

I wouldn’t describe it this way. See below, especially the “diagram” illustrating how adjacent threads could access adjacent elements in memory per read transaction.

That’s OK, don’t use uchar4. I’m not sure you really grokked my code or results, but the uchar4 evolution yielded no significant improvement. I thought it might, but it did not. Having said that, I don’t actually think using uchar4 to assist in global loads and stores requires any modification to your data storage, and I believe my code demonstrates that. But even that is irrelevant if you now understand how to get a 60x speedup.

I agree that the bubble sort is fine. Mostly for my own amusement, I was interested in trying other methods. The rank-order sort yielded no significant benefit, and the brute-force sorting networks approach yielded ~1.5x improvement (but suffers in other ways, such as code flexibility and maintainability). In some situations, folks would kill for an additional 1.5x speedup. But if you just figured out how to speed up your code by 60x, an additional 1.5x probably seems like small potatoes.

There are 2 related issues with the 1-thread-per-block structure.

  1. Execution unit utilization - The execution units of a GPU are not individually schedulable (ie. on a per-thread basis). Threads execute in groups of 32 (a warp) and the execution units are scheduled in groups of 32 also. It does not matter how many active threads there are in a warp. Let’s divorce ourselves from memory for the moment and consider an instruction that does a floating-point multiply. Let’s pretend that a floating point multiply takes one clock to execute on a GPU (it does not, but that does not matter to make my point). That means if my thread code encounters a floating-point multiply instruction, that instruction will require 32 floating-point units to be scheduled, even if my warp consists of only 1 active thread. This general requirement is true for many other aspects of functional unit scheduling on the GPU. I can choose to get one floating point multiply done per clock, or by reorganizing my threadblock to consist of more active threads, I can get up to 32 floating point multiply operations done in the same clock, from the same instruction, using the exact same resources. So it’s just generally a bad idea to create threadblocks of 1 thread, if you care about performance. For many aspects of execution, you are leaving 31/32 or about 97% of the horsepower of the GPU unutilized. This factor could explain a ~32x speedup, but not 60x. It is only part of the overall answer.

  2. Memory utilization - Modern DRAM subsystems do not allow you to access a byte at a time, or a word at a time. If you want something from DRAM, the smallest item you can request is a 32-byte chunk. If you need all 32 bytes, great, you got them in a single transaction/request. If you only needed 1 byte, you got the byte you wanted, plus 31 others that you don’t care about. Therefore, efficient code on a GPU does not use a byte at a time. It uses 32 bytes or 128 bytes. How is this retrieved? It is retrieved when a memory operation/instruction (load, store) is encountered during the execution of your thread code. Remember that threads are not executing individually, but in the context of a warp. Therefore, memory instructions are encountered in the context of a warp. The memory controller looks at a single load operation, and considers the needs (i.e. respective load addresses) of the entire warp. Based on these load addresses, the memory controller will decide which segments to retrieve from DRAM. A fully coalesced load will result in the minimum number of segments that need to be retrieved in order to satisfy the needs of the warp. For a warp where each thread is requesting 4 bytes, all 32 threads are active, and the 4 bytes needed by each thread are part of a contiguous, 128-byte aligned segment in DRAM, the memory controller need only issue 4 segment load requests to DRAM, and all 32 threads will get what they need. This is “perfect” utilization of the available memory bandwidth, and is referred to as a 100% coalesced load. On the other hand, if the addresses needed by adjacent threads in the warp were fully scattered, then up to 32 segment load requests might have to be issued. The effective bandwidth utilization in that case would drop to something like 12% (and to about 3% for byte-sized loads like yours, where 32 bytes are retrieved for every byte used). Memory requests emanating from different warps, or from different instructions within the same warp, are never coalesced by the memory controller. They always result in separate transactions (although, due to cache, those transactions may not all result in DRAM transactions). But what about the “cache”? (you may ask). Yes, the GPU has caches. A good principle for beginning programmers is to assume it doesn’t. Force your memory utilization to be efficient, even in the absence of a cache. The GPU caches tend to be small (768 Kbytes for your GTX480, I believe) and they really exist for a separate purpose (which I’ll not go into here), other than bulk, long-term data caching. For very small problem sets, you might be able to get away with pretending the cache will provide bulk, long-term storage of the working set, but that’s not the general purpose of GPU caches.

When we combine the above 2 mechanisms, we can start to explain a 50-60x speedup. There is one additional piece of information you don’t seem to be grasping: how can adjacent threads request adjacent items in memory? Why is the stride not an issue?

Let’s consider the first 32 pixels in your image, across each of the 21 images, and their respective addresses in memory:

Image 1:

pixel:        0       1       2       3       4 ...      31
addrs:        0       1       2       3       4 ...      31

Image 2:

pixel:        0       1       2       3       4 ...      31
addrs:  1048576 1048577 1048578 1048579 1048580 ... 1048607

Image 3:

pixel:        0       1       2       3       4 ...      31
addrs:  2097152 2097153 2097154 2097155 2097156 ... 2097183

...

To compute the median filter, thread 0 will need to load pixel 0 from each image. Thread 1 will need to load pixel 1 from each image. If I do this in separate one-thread blocks, then a given block will be requesting a single byte from image 1 (but 32 bytes will be retrieved), and another block will be requesting a single byte (at some other offset) from each image; each transaction is “throwing away” 31 bytes.

Now let’s suppose I have 32 threads executing together in a warp. Let’s suppose that thread 0 is responsible for pixel 0, thread 1 is responsible for pixel 1, etc. Now the load instruction for the pixel from image 1 occurs across all threads in the warp simultaneously, and a single 32-byte read from DRAM satisfies the needs of all 32 threads. No bytes are wasted.

The combination of these two effects (execution unit utilization, memory bandwidth utilization) is what can give rise to the dramatic 50-60x speedup.

Yes, now I grok this. I did read in detail (cover to cover), and work the examples of, Programming in Cuda (Sanders & Kandrot), but I did notice they left out a lot about warps and warp scheduling. This fills in the gap.

But your diagram nails it.

So I had one other performance improvement I was considering. Since the median of all image frames is done as a sliding window (across all 21 frames), there might be an additional improvement.

If, when a new frame arrives on the GPU, I transform the axes so as to align the “Z” axis with the prior transformed data, would that lead to a performance boost?

In effect, I would be creating a (fictional) uchar21 that would sequentially hold all the successive frames, with the frame ID as the lowest byte.

This transform and insert (or destructive update) into the [1024x1024, 21] array (now ordered along the z axis) could be done by a dynamically scheduled kernel that would overlap with the other computation. (I am thinking about the Tesla K80 when it arrives.)

To re-use your diagram:

Image 1:

pixel:        0       1       2
addrs:        0      21      42

Image 2:

pixel:        0       1       2
addrs:        1      22      43

Image 3:

pixel:        0       1       2
addrs:        2      23      44

...

In this case, I would block the threads out differently, so that successive warps of threads in a block would be grabbing adjacent data addresses.

But I am also thinking that this is not a win, since right now the warp scheduler is probably already fetching data in optimal successive waves through memory (in the 48-byte GDDR5 cache read lines).

At the moment, I don’t see any benefit. The data reorganization will not improve load-from-global efficiency, because we’ve already demonstrated that the existing arrangement can yield 100% efficiency (fully coalesced loads, and no extra loads required, i.e. 100% utilization).

Shared memory will still be required, since the bubble-sorting process “depends” on it, and the remainder of the algorithm is essentially unaffected.

If what you’re after is overlap of copy and compute (the copy being the loading of the “next” new frame, the compute being the computation on the “current” 21 frames) I don’t think you have to reorganize your data in any way to achieve that. If you’re concerned about rotating through the circular image buffer, this can be handled via address indexing according to a modulo pattern, with little if any loss of efficiency (remember, the long pole in the loading process is the memory loads themselves, not a few bits of address calculation here or there). Define a circular image buffer of 22 images. One of those images is being loaded, the other 21 are being computed. After the compute cycle is complete, advance the ring pointer, and use that to start your next computation cycle. Use modulo arithmetic on the ring index calculation as you are loading images during computation.
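If it helps, here is a minimal sketch of that ring indexing; the 22-slot device buffer ring and the host-maintained ringStart argument are names of my own invention (they are not in any code above), and the rest simply reuses the macros from the t650 listing:

// Sketch: 22-slot circular frame buffer resident on the device. The slot at
// (ringStart + 21) % 22 is being filled with the next frame while the other
// 21 slots are being processed; the host advances ringStart by one per cycle.
#define NSLOTS 22

__global__ void filter_ring(unsigned char* ring, int ringStart, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB];
  int x = threadIdx.x + blockDim.x*blockIdx.x;
  int y = threadIdx.y + blockDim.y*blockIdx.y;
  int num = x + y*IMW;
  int t = threadIdx.y*BX + threadIdx.x;

  if ((x < IMW) && (y < IMH)){
    for (int i = 0; i < IMN; i++){
      int slot = (ringStart + i) % NSLOTS;             // modulo ring indexing
      array[i*nTPB + t] = ring[slot*(IMW*IMH) + num];  // loads are still fully coalesced
    }
    for (int j = 0; j < (IMN/2)+1; j++){               // same partial sort as before
      int m = j;
      for (int k = j+1; k < IMN; k++)
        if (array[k*nTPB + t] < array[m*nTPB + t]) m = k;
      unsigned char b = array[j*nTPB + t];
      array[j*nTPB + t] = array[m*nTPB + t];
      array[m*nTPB + t] = b;
    }
    result[num] = array[(IMN/2)*nTPB + t];
  }
}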

As an aside, I’m familiar with “Cuda by Example” (Sanders, Kandrot) but not “Programming in CUDA” (Sanders, Kandrot). One expository method they use is to disconnect the concept of threads and blocks, and teach each separately/independently. The approach has merit, and is a common approach to introduction of the concepts, but it has a possible drawback in that students may think that it’s OK to write programs that consist of one threadblock, or that it’s OK to write programs that use only one thread per block (since programs of both types are covered “by example” in the book).

It is not OK (from a performance perspective). Ever. Any such material should have a large disclaimer:

“Don’t ever do this in the real world. If you ever find yourself constructing a performance-sensitive kernel launch like this: kernel<<<1, …>>>(…); or this: kernel<<<…,1>>>(…); you should immediately stop, step away from the computer, and re-think your life.”

A lot of detailed posts. My scenario is probably covered in here somewhere but in case it hasn’t been, here is my $0.02f:

During ingest, the warp reads 32 (or a multiple thereof) pixels from each of the 21 consecutive frames into smem. This will lead to coalesced access within each frame. There will not be a need for __syncthreads() since the threads in a warp are SIMT-synchronous.
In a loop of 32 (or a multiple) iterations, the warp performs a Kepler shuffle-based bitonic sort and determines the median for the given pixel position. This leads to a suboptimal utilization of the warp, as only 21 elements are put through the network in each iteration.
Following the loop, the warp writes the results out; again in a coalesced manner.
The obvious question is: does the performance gain from coalesced I/O compensate for the underutilization of the warp during the bitonic sort phase?

TxBob:

FYI: we are hoping to use P2P RDMA from the video frame grabber to the Tesla K80 to move the data over the PCIe bus without going through the host, exploiting the UM model. If that fails, we will DMA to the host and then DMA to the GPU in frame-size slices. (I am doing the second option now.)

You also seem to say that the only real improvement left to explore is switching to a Batcher bitonic sort instead of bubble sort.

I just wanted to inform you that the hardware guys tell us that we can get by with a 15-frame median filter, which would be 120 loops on the bubble sort.

Your 21-frame network sort was also 119 steps, and this is all in shared memory (L1). So I am wondering if changing the sort algorithm helps very much. We are probably talking about shaving microseconds.

It looked to me like switching from the bubble sort to the sorting networks approach gave a ~1.5x speedup. I’m assuming that would cut 3ms down to 2ms or something like that. For the GPU I was using (a little slower than your GTX480) it went from ~6ms to ~3.6ms (data is above in a previous posting of mine).

I didn’t say anything about a bitonic sort. I don’t know what sort is best. I tried the bubble, rank-order, and sorting-networks (Bose-Nelson).

If you’ve decided that 15 frames is right for the median filter, the sorting networks approach can easily be cut down to a 15-element sort instead of a 21-element sort, and I’m sure it would run quicker still.

You can auto-generate the 15-element sort SWAP macros here:

http://pages.ripco.net/~jgamble/nw.html
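To show the shape of what the generator produces, here is an illustration only: the trivial 3-element network, written as a self-contained device helper rather than with the shared-memory SWAP macro above (the names cswap_desc and median3 are mine):

// A 3-comparator sorting network over 3 values, ordered descending so the
// median lands in the middle; this mirrors what SWAP(A,B) does on the
// shared-memory array (element A ends up >= element B).
__device__ __forceinline__ void cswap_desc(unsigned char &a, unsigned char &b)
{
  if (a < b) { unsigned char t = a; a = b; b = t; }
}

__device__ unsigned char median3(unsigned char v0, unsigned char v1, unsigned char v2)
{
  cswap_desc(v0, v1);   // comparator (0,1)
  cswap_desc(v0, v2);   // comparator (0,2)
  cswap_desc(v1, v2);   // comparator (1,2)
  return v1;            // now v0 >= v1 >= v2, so v1 is the median
}

A generated 15-element network would be used the same way: one SWAP per comparator pair, with the median read from index 7 afterwards.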

You read my mind as to the next question (generating the swap list).

At that URL he calls the bitonic sort the Batcher Merge-Exchange. Back in the 1980s I knew it simply as the Batcher sort (way before its current GPU popularity).

This is an excellent example of an image filter using CUDA, and I would like to know how it could be utilised in Python using pycuda.

I am new to C/C++, so getting frames from the camera into this filter and back again for use by OpenCV is not totally clear.

Assuming I had a Python program that stacked sequential images into a numpy array, what would I need to do to get the format correct to utilise the t650 filter script above?

I am currently using numpy for the median filter in Python (median = np.median(buffer, axis=0)), with a stacked buffer of 50 HD (1920x1080) images, using OpenCV to acquire the images (cam = cv2.VideoCapture(0)), and getting cycle times of 2 to 3 seconds to generate the median.

How would I use the example given to replace this function and get faster cycle times? Any examples would be greatly appreciated.

Regards h

numba allows you to write CUDA kernels in python code, and to utilize numpy data. For relatively simple kernels such as the ones I have shown, the translation from CUDA-C to numba CUDA-python is pretty much mechanical.

If you want to learn how to use numba, it should be pretty straightforward at that point to convert what I have shown to equivalent numba CUDA kernels.

Since you mention pycuda, that is perhaps even easier. In pycuda, kernels are written in CUDA C/C++, meaning you should be able to use this code exactly, to duplicate what I have done in CUDA C/C++.

Hi txbob, thanks for your response.

I received the following error when incorporating it into a pycuda script:

kernel.cu(36): error: this declaration may not have extern "C" linkage

kernel.cu(68): warning: linkage specification is not allowed

Is there something extra required, and is this referring to the include statements?

My dummy script to test pycuda is as follows:

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define IMW 1024
#define IMH 1024
#define BX 32
#define BY 8
#define nTPB (BX*BY)
#define CHECK_VAL 10

// IMN is assumed to be odd, for convenience of median filter
#define IMN 21

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            exit(1); \
        } \
    } while (0)

unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}

int validate(unsigned char *data){
  int pass = 1;
  for (int i = 0; i < IMH; i++)
    for (int j = 0; j < IMW; j++)
      if (data[i*IMW+j] != CHECK_VAL) pass = 0;
  return pass;
}

// template parameter is expected to be either unsigned char or uchar4 only

template <typename T>
__global__  void filter( unsigned char* frames, unsigned char* result)
{
  __shared__ unsigned char array[IMN*nTPB*sizeof(T)];
  int x = threadIdx.x+blockDim.x*blockIdx.x;
  int y = threadIdx.y+blockDim.y*blockIdx.y;
  int num = x + y * (IMW/sizeof(T));

  T *my_frames = reinterpret_cast<T *>(frames);
  T *my_array  = reinterpret_cast<T *>(array);

  if ((x < IMW/sizeof(T)) && (y < IMH)){
    for (int i = 0; i < IMN; i++){
      my_array[(i*nTPB)+(threadIdx.y*BX)+threadIdx.x] = my_frames[(i * (IMW*IMH/sizeof(T))) + ( num)];
      }
    for (int bpp = 0; bpp < (sizeof(T)); bpp++)
      for (int j = 0; j < ((IMN/2)+1); j++){
        int num2 = j;
        for (int k = j + 1; k < IMN; k++){
          if (array[(k*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] < array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp]){
            num2 = k;
            }
          }
        unsigned char b = array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(j*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp];
        array[(num2*nTPB*sizeof(T))+(threadIdx.y*BX*sizeof(T))+threadIdx.x*sizeof(T)+bpp] = b;
      }
    T *my_result = reinterpret_cast<T *>(result);
    my_result[(num)] = my_array[((IMN/2)*nTPB) + (threadIdx.y*BX)+threadIdx.x];
  }
}

int main(){
  unsigned char *din, *dout;
  unsigned char *hin, *hout;
  hin = (unsigned char *)malloc(IMW*IMH*IMN*sizeof(unsigned char));
  hout = (unsigned char *)malloc(IMW*IMH*sizeof(unsigned char));
  cudaMalloc(&din, IMW*IMH*IMN*sizeof(unsigned char));
  cudaMalloc(&dout, IMW*IMH*sizeof(unsigned char));
  for (int i = 0; i < IMN; i++)
    for (int j = 0; j< IMH*IMW; j++)
      hin[i*IMH*IMW+j] = (i+5)%IMN;
  memset(hout, 0, IMW*IMH*sizeof(unsigned char));
  cudaMemcpy(din, hin, IMW*IMH*IMN*sizeof(unsigned char), cudaMemcpyHostToDevice);
  cudaMemset(dout, 0, IMW*IMH*sizeof(unsigned char));
  cudaCheckErrors("setup");
  unsigned long long my_time = dtime_usec(0);
  filter<unsigned char><<<dim3(IMW/BX,IMH/BY), dim3(BX,BY)>>>(din, dout);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel");
  my_time = dtime_usec(my_time);
  cudaMemcpy(hout, dout, IMW*IMH*sizeof(unsigned char), cudaMemcpyDeviceToHost);
  my_time = dtime_usec(0);

  return 0;
}
""")

median = mod.get_function("filter")

We have also had good experiences employing sorting networks for the CUDA implementation of small (e.g., 3x3) rank-order filters, seeing great speedups when compared to our CPU implementation.

In addition to the link provided by txbob for auto-generating the sequences of SWAP() calls, one can also use the (meta) template class from http://stackoverflow.com/questions/19790522/very-fast-sorting-of-fixed-length-arrays-using-comparator-networks for sorting N elements with a sorting network (Bose-Nelson algorithm). N is the template parameter.

Solving my own question: the errors are resolved by adding "no_extern_c=1" as shown below…

return 0;
}
“”", no_extern_c=1)

But to get the function assignment to work I needed to remove "import pycuda.driver as drv" as it creates an import loop.

Correction: this was not relevant ("But to get the function assignment to work I needed to remove import pycuda.driver as drv as it creates an import loop").

The actual issue is mangled names from the JIT compiler (ref: https://stackoverflow.com/questions/21405901/pycuda-either-fails-to-find-function-in-nvidia-source-code-or-throws-may-not-ha).

Not yet getting output from this function using pycuda.

You have some things to learn about pycuda.

With pycuda, you would normally write only the kernel code in C/C++, while all other code would be Python (i.e. everything that runs on the host, including the code that invokes the kernel).

So it’s not simply a matter of taking an entire C/C++ program (with the main function, and everything else) and incorporating it into a pycuda SourceModule statement.

Please study some pycuda sample codes. Your SourceModule statement, to a first order approximation, should contain only the kernel code (and things needed to support compilation of it, such as macros). Referring to the t650 code previously posted, the kernel is the code beginning with __global__ void filter…