1d convolution performance

is there a good example for kernel on 1d conv with good performance? the one I saw in example file are all done in frequency domain. is it because its faster that way?

Yes, frequently the recommendation is to use FFTs because they are faster (it’s O(nlogn) instead of O(n2) or or something like that.)


“With the help of the convolution theorem and the fast Fourier transform, the complexity of the convolution can be reduced to O(n log n).”

In practice, actual benefits of using frequency domain methods will vary substantially based on the sizes of the signals being convolved.

Some convolution resources:

cuda sample codes - there are many, click here:


and then search for “convolution”

this stackoverflow question may be of interest:


NVIDIA has an online class you can take that focuses on 1D convolution, looking at both time-domain and frequency-domain versions:


Another benefit of FFT is the (potential) improvement in accuracy due to the decrease in the number of floating point operations.

If your filter size is small like 3 (which is very common in deep learning) then I’d also consider the winograd transform:

The current best imagenet result uses a lot of stacked 1x3 and 3x1 1D convolution filters so I’ll be building winograd kernels for that soon.


seem the lab require purchase, though

I can make some credits available to you so you can take one of the paid labs without paying for it.

If you would like the details, send me a private message on this board.

quick question on the stackoverflow conv from previous post. I try to use different tap and sample length but its not working correct.

if (idx < (2*N)-1){
int my_sum = 0;
for (int i = 0; i < N; i++)
if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
out[idx] = my_sum;

any idea what this line is doing?
if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
if I change tap and sample length, seem like I need modify N with my tap length?

That code presumes that the two signals to be convolved are of the same length (N). If that is not the case for your modification, then the code won’t work as-is.

The nvidia online convolution class works through an example where the two signals are the same length (very similar to the stack overflow code) and another example where you are convolving one “large” signal with one “small” signal, which might be more typical of what many think about as “convolution”. In this case the “small” signal might be referred to as the “filter”.

k thx, try to look at the lab, but its on maintenance right now.

The lab code for “task2” covers the large/small signal case, this is what it looks like:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256
#define DSIZE (32768*1024)
#define FSIZE 17

//cuda error checking macros
#ifdef DEBUG
#define CUDA_CALL(F)  if( (F) != cudaSuccess ) \
  {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
   __FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK()  if( (cudaPeekAtLastError()) != cudaSuccess ) \
  {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
   __FILE__,__LINE__-1); exit(-1);}
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()

typedef float mytype;
// host function to compute convolution reference results
void conv(const mytype *A, const mytype *B, mytype* out, int N, int P) {
// P is assumed to be odd, and greater than 1
    int rb = (P-1)/2;
    int lb = -rb;
    for (int i = rb; i < N-rb; ++i)
        for (int j = lb; j <= rb; ++j)
            out[i] += A[i+j] * B[j+rb];
// timing measurement function
unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
// convolution GPU kernel - not using shared memory
// Task 2
__global__ void conv_Kernel(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *C, const int N, const int P){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    int radius = (P-1)/2;
    if ((idx < (N-radius)) && (idx >= radius)){
      mytype my_sum = 0;
      for (int j = -radius; j <= radius; j++)
        my_sum += A[idx+j]*B[j+radius];
      C[idx] = my_sum;

__global__ void conv_shared_Kernel(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *C, const int N, const int P){
    __shared__ mytype sA[nTPB+FSIZE];
    __shared__ mytype sB[FSIZE];
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    int radius = (P-1)/2;
    int lidx = threadIdx.x + radius;
    if (threadIdx.x < P) sB[threadIdx.x] = B[threadIdx.x];
    if (idx < N){
      sA[lidx] = A[idx];
      if (threadIdx.x < radius) {
        if (idx >= radius)   sA[threadIdx.x] = A[idx - radius];
        if ((idx + nTPB)< N) sA[nTPB + lidx] = A[idx + nTPB];}}
    if ((idx < (N-radius)) && (idx >= radius)){
      mytype my_sum = 0;
      for (int j = -radius; j <= radius; j++)
        my_sum += sA[lidx+j] * sB[j+radius];
      C[idx] = my_sum;

int main(int argc, char *argv[]){

  mytype *d_A, *A, *d_B, *B, *d_C, *C, *h_C;
  int my_N = DSIZE;
  int my_P = FSIZE;
// allocate host data
  A   = (mytype *)malloc(my_N*sizeof(mytype));
  B   = (mytype *)malloc(my_P*sizeof(mytype));
  C   = (mytype *)malloc(my_N*sizeof(mytype));
  h_C = (mytype *)malloc(my_N*sizeof(mytype));
// allocate device data
  CUDA_CALL(cudaMalloc(&d_A, my_N*sizeof(mytype)));
  CUDA_CALL(cudaMalloc(&d_B, my_P*sizeof(mytype)));
  CUDA_CALL(cudaMalloc(&d_C, my_N*sizeof(mytype)));
//initialize host input data
  for (int i=0; i < my_N; i++)
    A[i] = rand()%RG;
  for (int i=0; i < my_P; i++)
    B[i] = 1;
//zero out host result data
  for (int i=0; i < my_N; i++){
    C[i]   = 0;
    h_C[i] = 0;}
//begin timing for host reference function
  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, C, my_N, my_P);
  cpu_time = dtime_usec(cpu_time);
//initialize device result data
  CUDA_CALL(cudaMemset(d_C, 0, my_N*sizeof(mytype)));
//begin timing for device function
  unsigned long long gpu_time = dtime_usec(0);
//copy host input data to device
  CUDA_CALL(cudaMemcpy(d_A, A, my_N*sizeof(mytype), cudaMemcpyHostToDevice));
  CUDA_CALL(cudaMemcpy(d_B, B, my_P*sizeof(mytype), cudaMemcpyHostToDevice));
//run convolution kernel on GPU
  conv_shared_Kernel<<<(my_N+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_C, my_N, my_P);
//copy results from device to host
  CUDA_CALL(cudaMemcpy(h_C, d_C, my_N*sizeof(mytype), cudaMemcpyDeviceToHost));
  gpu_time = dtime_usec(gpu_time);
//check validity of results
  for (int i = 0; i < my_N; i++) if (C[i] != h_C[i]) {printf("FAIL at %d, cpu: %f, gpu %f\n", i, C[i], h_C[i]); return 1;}
//print timing and speed comparison
  printf("PASS.  cpu time: %ldus, gpu time: %ldus\n", cpu_time, gpu_time);
  printf("Speedup: cpu/gpu = %f\n", cpu_time/(float)gpu_time);
//all host and device allocated data will be implicitly freed at program termination
  return 0;

The “large” signal corresponds to DSIZE.

The “small” signal (the filter) corresponds to FSIZE.

Remember, this is for educational purposes, not necessarily optimally tuned for performance or guaranteed to be defect-free. You can definitely do better.

Really great paper Scott and truly impressive results!
You are so far ahead of everyone else in this specific area, and you explain the topic and implementation very well.

I suggest (if you want) to submit this paper for collection on this site;


The link is broken, is it possible to access the class in some way? I would be really interested.


The externally available on-line classes from NVIDIA have been revamped, and organized under DLI:

Unfortunately this class is no longer available.

Ok, thank you anyway.