1d convolution performance

s002wjh · December 4, 2015, 2:07pm

is there a good example for kernel on 1d conv with good performance? the one I saw in example file are all done in frequency domain. is it because its faster that way?

Robert_Crovella · December 4, 2015, 2:39pm

Yes, frequently the recommendation is to use FFTs because they are faster (it’s O(nlogn) instead of O(n2) or or something like that.)

[url]https://en.wikipedia.org/wiki/Convolution_theorem[/url]

“With the help of the convolution theorem and the fast Fourier transform, the complexity of the convolution can be reduced to O(n log n).”

In practice, actual benefits of using frequency domain methods will vary substantially based on the sizes of the signals being convolved.

Some convolution resources:

cuda sample codes - there are many, click here:

[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#axzz3tMVeN432[/url]

and then search for “convolution”

this stackoverflow question may be of interest:

[url]c - Parallelizing a for loop (1D Naive Convolution) in CUDA - Stack Overflow

NVIDIA has an online class you can take that focuses on 1D convolution, looking at both time-domain and frequency-domain versions:

[url]https://nvidia.qwiklab.com/focuses/preview/74[/url]

Vectorizer · December 4, 2015, 4:08pm

Another benefit of FFT is the (potential) improvement in accuracy due to the decrease in the number of floating point operations.

scottgray · December 4, 2015, 5:28pm

If your filter size is small like 3 (which is very common in deep learning) then I’d also consider the winograd transform:

The current best imagenet result uses a lot of stacked 1x3 and 3x1 1D convolution filters so I’ll be building winograd kernels for that soon.

s002wjh · December 4, 2015, 7:12pm

thanks

seem the lab require purchase, though

Robert_Crovella · December 4, 2015, 7:28pm

I can make some credits available to you so you can take one of the paid labs without paying for it.

If you would like the details, send me a private message on this board.

s002wjh · December 7, 2015, 6:30pm

quick question on the stackoverflow conv from previous post. I try to use different tap and sample length but its not working correct.

if (idx < (2*N)-1){
int my_sum = 0;
for (int i = 0; i < N; i++)
if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
out[idx] = my_sum;
}

any idea what this line is doing?
if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
if I change tap and sample length, seem like I need modify N with my tap length?

Robert_Crovella · December 7, 2015, 6:41pm

That code presumes that the two signals to be convolved are of the same length (N). If that is not the case for your modification, then the code won’t work as-is.

The nvidia online convolution class works through an example where the two signals are the same length (very similar to the stack overflow code) and another example where you are convolving one “large” signal with one “small” signal, which might be more typical of what many think about as “convolution”. In this case the “small” signal might be referred to as the “filter”.

s002wjh · December 7, 2015, 7:06pm

k thx, try to look at the lab, but its on maintenance right now.

Robert_Crovella · December 7, 2015, 7:16pm

The lab code for “task2” covers the large/small signal case, this is what it looks like:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256
#define DSIZE (32768*1024)
#define FSIZE 17

//cuda error checking macros
#ifdef DEBUG
#define CUDA_CALL(F)  if( (F) != cudaSuccess ) \
  {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
   __FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK()  if( (cudaPeekAtLastError()) != cudaSuccess ) \
  {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
   __FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

typedef float mytype;
// host function to compute convolution reference results
void conv(const mytype *A, const mytype *B, mytype* out, int N, int P) {
// P is assumed to be odd, and greater than 1
    int rb = (P-1)/2;
    int lb = -rb;
    for (int i = rb; i < N-rb; ++i)
        for (int j = lb; j <= rb; ++j)
            out[i] += A[i+j] * B[j+rb];
}
// timing measurement function
unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
// convolution GPU kernel - not using shared memory
// Task 2
__global__ void conv_Kernel(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *C, const int N, const int P){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    int radius = (P-1)/2;
    if ((idx < (N-radius)) && (idx >= radius)){
      mytype my_sum = 0;
      for (int j = -radius; j <= radius; j++)
        my_sum += A[idx+j]*B[j+radius];
      C[idx] = my_sum;
    }
}

__global__ void conv_shared_Kernel(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *C, const int N, const int P){
    __shared__ mytype sA[nTPB+FSIZE];
    __shared__ mytype sB[FSIZE];
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    int radius = (P-1)/2;
    int lidx = threadIdx.x + radius;
    if (threadIdx.x < P) sB[threadIdx.x] = B[threadIdx.x];
    if (idx < N){
      sA[lidx] = A[idx];
      if (threadIdx.x < radius) {
        if (idx >= radius)   sA[threadIdx.x] = A[idx - radius];
        if ((idx + nTPB)< N) sA[nTPB + lidx] = A[idx + nTPB];}}
    __syncthreads();
    if ((idx < (N-radius)) && (idx >= radius)){
      mytype my_sum = 0;
      for (int j = -radius; j <= radius; j++)
        my_sum += sA[lidx+j] * sB[j+radius];
      C[idx] = my_sum;
    }
}

int main(int argc, char *argv[]){

  mytype *d_A, *A, *d_B, *B, *d_C, *C, *h_C;
  int my_N = DSIZE;
  int my_P = FSIZE;
// allocate host data
  A   = (mytype *)malloc(my_N*sizeof(mytype));
  B   = (mytype *)malloc(my_P*sizeof(mytype));
  C   = (mytype *)malloc(my_N*sizeof(mytype));
  h_C = (mytype *)malloc(my_N*sizeof(mytype));
// allocate device data
  CUDA_CALL(cudaMalloc(&d_A, my_N*sizeof(mytype)));
  CUDA_CALL(cudaMalloc(&d_B, my_P*sizeof(mytype)));
  CUDA_CALL(cudaMalloc(&d_C, my_N*sizeof(mytype)));
//initialize host input data
  for (int i=0; i < my_N; i++)
    A[i] = rand()%RG;
  for (int i=0; i < my_P; i++)
    B[i] = 1;
//zero out host result data
  for (int i=0; i < my_N; i++){
    C[i]   = 0;
    h_C[i] = 0;}
//begin timing for host reference function
  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, C, my_N, my_P);
  cpu_time = dtime_usec(cpu_time);
//initialize device result data
  CUDA_CALL(cudaMemset(d_C, 0, my_N*sizeof(mytype)));
//begin timing for device function
  unsigned long long gpu_time = dtime_usec(0);
//copy host input data to device
  CUDA_CALL(cudaMemcpy(d_A, A, my_N*sizeof(mytype), cudaMemcpyHostToDevice));
  CUDA_CALL(cudaMemcpy(d_B, B, my_P*sizeof(mytype), cudaMemcpyHostToDevice));
//run convolution kernel on GPU
  conv_shared_Kernel<<<(my_N+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_C, my_N, my_P);
  CUDA_CHECK();
//copy results from device to host
  CUDA_CALL(cudaMemcpy(h_C, d_C, my_N*sizeof(mytype), cudaMemcpyDeviceToHost));
  gpu_time = dtime_usec(gpu_time);
//check validity of results
  for (int i = 0; i < my_N; i++) if (C[i] != h_C[i]) {printf("FAIL at %d, cpu: %f, gpu %f\n", i, C[i], h_C[i]); return 1;}
//print timing and speed comparison
  printf("PASS.  cpu time: %ldus, gpu time: %ldus\n", cpu_time, gpu_time);
  printf("Speedup: cpu/gpu = %f\n", cpu_time/(float)gpu_time);
//all host and device allocated data will be implicitly freed at program termination
  return 0;
}

The “large” signal corresponds to DSIZE.

The “small” signal (the filter) corresponds to FSIZE.

Remember, this is for educational purposes, not necessarily optimally tuned for performance or guaranteed to be defect-free. You can definitely do better.

CudaaduC · December 9, 2015, 1:59am

Really great paper Scott and truly impressive results!
You are so far ahead of everyone else in this specific area, and you explain the topic and implementation very well.

I suggest (if you want) to submit this paper for collection on this site;

http://hgpu.org/

David_DiazGuerra · November 14, 2018, 8:08am

The link is broken, is it possible to access the class in some way? I would be really interested.

Thanks.

Robert_Crovella · November 14, 2018, 1:30pm

The externally available on-line classes from NVIDIA have been revamped, and organized under DLI:

https://www.nvidia.com/en-us/deep-learning-ai/education/

Unfortunately this class is no longer available.

David_DiazGuerra · November 14, 2018, 3:41pm

Ok, thank you anyway.

Topic		Replies	Views
fast convolution CUDA Programming and Performance	2	747	December 31, 2015
conv time vs freq performance CUDA Programming and Performance	1	714	January 20, 2016
About convolution performance CUDA Programming and Performance	3	830	February 17, 2017
2D Convolution problem following example from SDK source code included CUDA Programming and Performance	9	11787	June 7, 2011
Image filtering in frequency domain GPU-Accelerated Libraries	6	3756	October 3, 2014
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	242612	November 16, 2011
non-separable convolution example? CUDA Programming and Performance	2	6372	June 18, 2007
Using 1D Kernel on 2D fft Image filtering CUDA Programming and Performance	1	4300	June 28, 2009
FFT on CUDA Newbie has lots of problems! CUDA Programming and Performance	2	4771	August 17, 2011
convolutionFFT2D for kernels larger than input data CUDA Programming and Performance	1	1596	December 22, 2009

1d convolution performance

Related topics