Faster method of shifting texture data

Hi all, I am currently trying to develop a GUI to render data coming into my computer, which has a Quadro K4000M GPU. Right now, I am creating a 2D waterfall plot using CUDA OpenGL interop. Essentially, I have CUDA access a texture from OpenGL (width: 1967, height: 16384), shift all the pixels to the left, and put the new color values at the end of the rows. This is currently working. However, the data comes in at a 100Hz rate (every 10ms) and the function that performs the shifting takes ~20ms to complete by itself. I’ve tried a couple of methods but have been unsuccessful so far; the fastest method I can get is in the code below, using surf2Dread and surf2Dwrite.

__global__ void ShiftData(cudaSurfaceObject_t out)
{	
	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;

	uchar4 color;

	unsigned int numRowElements = 1967 * 4;
	for(unsigned int i = idx; i < 16384; i += stride)
	{
		for(unsigned int j = 0; j < numRowElements - 4; j += 4)
		{
			surf2Dread(&color, out, j + 4, i);
			surf2Dwrite(color, out, j,     i);
		}
	}
}

Does anybody know of a way to speed up this block of code?

What is your thread configuration?
Is width always 1967 ?
Can you use a separate cudaSurfaceObject for output?
Can you provide a complete minimal benchmark code that can be run by others?

Are you limited by memory throughput? Profiling the code could answer this question.

Have you ever tried partially unrolling the j = 0; j < numRowElements - 4; j += 4 loop?
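
For example, a minimal sketch of what partial unrolling could look like (the factor 4 is just an illustration, not a tuned value):

#pragma unroll 4
for(unsigned int j = 0; j < numRowElements - 4; j += 4)
{
	surf2Dread(&color, out, j + 4, i);
	surf2Dwrite(color, out, j,     i);
}

This only reduces loop overhead; it does not change the memory access pattern, so the benefit may be limited if the kernel is bandwidth bound.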

Is it really necessary to shift the pixels physically in the texture? As an alternative, consider displaying the texture in a way that makes it move and wrap around - for example using a GLSL pixel shader or by changing the texture coordinates used for rendering (with GL_REPEAT wrap-around mode). That way, whatever data is written to the texture stays at the same memory location.
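
A rough sketch of the idea on the OpenGL side (names like texOffsetS, textureWidth and currentColumn are just illustrative, and the textured-quad rendering itself is assumed to exist elsewhere):

// Set up the texture to wrap horizontally instead of clamping
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

// Each frame, instead of moving pixels, advance a horizontal texture offset
texOffsetS += 1.0f / textureWidth;              // one column per incoming data frame
if (texOffsetS >= 1.0f) texOffsetS -= 1.0f;     // keep the offset in [0, 1)
// ...add texOffsetS to the quad's S texture coordinates (or pass it as a shader uniform),
// and have CUDA write only the new column at x = currentColumn,
// with currentColumn = (currentColumn + 1) % textureWidth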

Here is a minimal set of code. The requirements are:

CUDA 8
GLFW-3.3.8
GLAD

I ran this code with VS 2012.

As for your other questions, right now I’m testing with a thread configuration of 16 blocks and 1024 threads.

The width could change. If the texture height becomes smaller, the width becomes larger, and vice versa, but I’m testing with 1967 for the time being.

main.cpp

#include <stdio.h>
#include <string>

#include "kernel.cuh"

int main()
{
	// Setup a window since opengl needs one
	glfwInit();
	glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
	glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
	glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);

	GLFWwindow* window = glfwCreateWindow(800, 600, "LearnOpenGL", NULL, NULL);
	if (window == NULL)
	{
	    glfwTerminate();
	    return -1;
	}
	glfwMakeContextCurrent(window);

	if (!gladLoadGLLoader((GLADloadproc)glfwGetProcAddress))
	{

	    return -1;
	}
	
	cudaGraphicsResource_t cudaTexture;
	cudaChannelFormatDesc  channelDesc = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned);
	cudaSurfaceObject_t    outputSurface;
	cudaResourceDesc       surfRes;
	cudaArray_t            cuArray;
	GLuint                 texture;
	
	// Setup the texture
	glGenTextures(1, &texture);
	glBindTexture(GL_TEXTURE_2D, texture);
	
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1967, 16384, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
	glBindTexture(GL_TEXTURE_2D, 0);

	// Register and map the texture to cuda
	cudaGraphicsGLRegisterImage(&cudaTexture, texture, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsSurfaceLoadStore);
	cudaGraphicsMapResources(1, &cudaTexture);
	cudaGraphicsSubResourceGetMappedArray(&cuArray, cudaTexture, 0, 0);
	
	// Setup the surface object
	memset(&surfRes, 0, sizeof(cudaResourceDesc));
	surfRes.resType = cudaResourceTypeArray;
	surfRes.res.array.array = cuArray;
	cudaCreateSurfaceObject(&outputSurface, &surfRes);

	// Run the kernel
	RunShift(&outputSurface);

	// Cleanup
	cudaGraphicsUnmapResources(1, &cudaTexture);
	cudaDestroySurfaceObject(outputSurface);

	glDeleteTextures(1, &texture);

	return 0;
}

kernel.cuh

#pragma once

#include <glad/glad.h>
#include <GLFW/glfw3.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda_gl_interop.h>

__global__ void ShiftData(cudaSurfaceObject_t out);

void RunShift(cudaSurfaceObject_t * out);

kernel.cu

#include "kernel.cuh"

__global__ void ShiftData(cudaSurfaceObject_t out)
{	
	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;

	uchar4 color;

	unsigned int numRowElements = 1967 * 4;
	for(unsigned int i = idx; i < 16384; i += stride)
	{
		for(unsigned int j = 0; j < numRowElements - 4; j += 4)
		{
			surf2Dread(&color, out, j + 4, i);
			surf2Dwrite(color, out, j,     i);
		}
	}
}

void RunShift(cudaSurfaceObject_t * out)
{
	ShiftData<<<16, 1024>>>(*out);
	cudaDeviceSynchronize();
}

My profiler is measuring the following throughput metrics:

Read Throughput: 9.736 GB/s
Write Throughput: 26.522 GB/s

Could there be any reason why the read throughput is so much lower than the write throughput? Could it be because I am offsetting the read by 4 and thus effectively striding it?

I’ll try playing around with using uint4 for the reads and writes and get back to you on this.

Can you point me to an example or some minimal code that uses the GL_REPEAT flag?

The theoretical memory bandwidth of your device is 89.6 GB/s. So you’re using maybe 40% of it.
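
(Presumably that figure comes from the two measured throughputs combined against the peak: (9.736 + 26.522) GB/s divided by 89.6 GB/s is roughly 0.40.)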

Your code example unexpectedly passes GL_LINEAR for GL_TEXTURE_WRAP_S/T.

Valid parameters are:
GL_CLAMP_TO_EDGE, GL_CLAMP_TO_BORDER, GL_MIRRORED_REPEAT, GL_REPEAT, or GL_MIRROR_CLAMP_TO_EDGE.

The different modes are explained here:

https://www.flipcode.com/archives/Advanced_OpenGL_Texture_Mapping.shtml

I have yet to understand where texture coordinates are passed in your example code and how the rendering is done.

I found a more complete GLFW example that specifies vertices and texture coordinates for rendering:

https://learnopengl.com/code_viewer.php?code=getting-started/textures

By incrementing the texture S (x) coordinates by (1 / texture width) after rendering a frame, you could make the waterfall plot scroll horizontally from one frame to the next, assuming GL_REPEAT is used for the horizontal texture coordinate. Be sure to update the vertex buffer array accordingly.

Then you only need to write new data to whatever pixel column is currently rendered at the very edge of the window. The scrolling effect will no longer consume precious memory bandwidth.
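
As a rough sketch of the per-frame update, assuming a quad laid out like in the learnopengl.com textures example linked above (8 floats per vertex with the S texture coordinate at offset 6 - adjust to your actual vertex layout; names like baseS, vertices and VBO are illustrative):

float sOffset = 0.0f;
const float sStep = 1.0f / 1967.0f;   // 1 / texture width

// after rendering each frame:
sOffset += sStep;
if (sOffset >= 1.0f) sOffset -= 1.0f;           // keep the offset in [0, 1)
for (int v = 0; v < 4; ++v)
	vertices[v * 8 + 6] = baseS[v] + sOffset;   // shift each vertex's S coordinate
glBindBuffer(GL_ARRAY_BUFFER, VBO);
glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(vertices), vertices);
// requires GL_TEXTURE_WRAP_S = GL_REPEAT so that S values above 1.0 wrap around

Here baseS would hold the original S coordinates of the four vertices.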

Christian

Apologies, my actual code is rather extensive and I didn’t want to blow up the thread, so I thought I would provide a minimal example of the shifting. I’ll try and attach the important bits here.

Essentially, when I read a new frame, I pass it to my cuda kernels via a memcpy:

cudaMemcpyAsync(devInput, hostInput, numElements * sizeof(float), cudaMemcpyHostToDevice);

where numElements is 16384. From here, I do some math in some other kernels and get the output in the form of a magnitude (dB):

float * passthrough;
cudaMalloc(&passthrough, 16384 * sizeof(float));
CalculateMagnitudeDb(&passthrough);

where passthrough gets the output from a separate CUDA kernel (not related to the shifting). Here, I pass passthrough as the input to the kernel, along with the surface object that has the mapped cuArray attached:

ShiftDataAndConvertToColor<<<16, 1024>>>(outputSurface, passthrough);

In ShiftDataAndConvertToColor, I first shift all the data to the left, then use the input to calculate a new set of colors and append them at the end of each texture row:

__global__ void ShiftDataAndConvertToColor(cudaSurfaceObject_t out, float * in)
{	
	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	int stride = blockDim.x * gridDim.x;

	float val;
	GLubyte r;
	GLubyte g;
	GLubyte b;
	uchar4 color;

	unsigned int numRowElements = 1967 * 4;
	for(unsigned int j = 0; j < numRowElements - 4; j += 4)
	{
		surf2Dread(&color, out, j + 4, idx);
		surf2Dwrite(color, out, j,     idx);
	}

	val = in[idx];

	if(val < -110.0f)
	{
		r = 0;
		g = 0;
		b = 64 + (191 * (220.0f + val) / 220.0f);
	}
	else if(val < -97.5f)
	{
		r = 0;
		g = 255 * (110.0f + val) / 12.5f;
		b = 255;
	}
	else if(val < -85.0f)
	{
		r = 0;
		g = 255;
		b = 255 - (255 * (85.0f + val) / 12.5f);
	}
	else if(val < -72.5f)
	{
		r = 255 * (72.5f + val) / 12.5f;
		g = 255;
		b = 0;
	}
	else if(val < -67.5f)
	{
		r = 255;
		g = 255 - (255 * (67.5f + val) / 5.0f);
		b = 0;
	}
	else
	{
		r = 255;
		g = 0;
		b = 0;
	}

	surf2Dwrite(make_uchar4(r, g, b, 255), out, numRowElements - 4, idx);
}

Once this kernel completes, I unmap everything and blit the texture to the screen via a framebuffer object:

glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, texture, 0);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(0, 0, 1967, 16384, width * 0.125f, (height * 0.125f) + 1.0f, (width * 0.875f) - 1.0f, height * 0.875f, GL_COLOR_BUFFER_BIT, GL_LINEAR);

I repeat these steps with any new data I receive.

If CUDA surfaces follow the same rules as ordinary global memory, your access pattern causes non-coalesced memory accesses.

In that case I would recommend using one thread block per row, with each thread processing several elements, for example 512 threads each handling 4 uchar4. Then load up to 2048 pixels into registers, sync the block, and write the pixels back shifted by 1. The sync would not be required if the input surface and output surface were separate.

If you would rather stay with the glBlitFramebuffer API for display, try modifying the coordinate arguments passed to this API call to create a horizontal scrolling effect. I am not sure whether you can harness the power of GL_REPEAT here. If not, you may have to blit your buffer in two segments to emulate a texture wraparound.

https://registry.khronos.org/OpenGL-Refpages/gl4/html/glBlitFramebuffer.xhtml

Treat your texture like a ring buffer

                   1967 wide
 <--------------------------------------->

that you display at different offsets in each frame
 ----------------------------------><-----
                                         ^ new data written here
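
A rough sketch of the two-segment blit, assuming head is the texture column holding the oldest data (i.e. the next column to be overwritten), texW/texH are the texture dimensions (1967 x 16384 here), and dstX0/dstY0/dstX1/dstY1 are the window-space bounds you currently pass to glBlitFramebuffer:

int leftW  = texW - head;                               // width of the older segment
int splitX = dstX0 + (dstX1 - dstX0) * leftW / texW;    // screen x where the two segments meet

glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, texture, 0);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);

// older columns [head, texW) land on the left side of the window
glBlitFramebuffer(head, 0, texW, texH, dstX0, dstY0, splitX, dstY1, GL_COLOR_BUFFER_BIT, GL_LINEAR);
// newer columns [0, head) land on the right side, so the newest data ends up at the right edge
glBlitFramebuffer(0, 0, head, texH, splitX, dstY0, dstX1, dstY1, GL_COLOR_BUFFER_BIT, GL_LINEAR);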

Here is a small benchmark with ordinary global memory. The in-place version with 512*4 elements per block (kernel 2) is around ten times faster on an A100 than using 1 thread per row. You could try to adapt this approach with cudaSurface.

shiftRowsKernel 19.2051 ms. 65.0868GB/s
shiftRowsKernel2 1.8135 ms. 689.273GB/s
shiftRowsKernel3 1.84934 ms. 675.915GB/s
#include <iostream>
#include <cassert>

__global__
void shiftRowsKernel(uchar4* data, int width, int height){
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(unsigned int i = idx; i < height; i += stride)
  {
    for(unsigned int j = 0; j < width - 1; j += 1)
    {
      uchar4 color = data[i * width + j + 1];
      data[i * width + j] = color;
    }

    data[i * width + width - 1] = make_uchar4(0,0,0,0);
  }
}

__global__
void shiftRowsKernel2(uchar4* data, int width, int height){

	for(unsigned int i = blockIdx.x; i < height; i += gridDim.x)
	{
    uchar4 color[4]; //assuming blocksize 512, each block can hold 512*4 = 2048 pixels

    for(int k = 0; k < 4; k++){
      const int x = k * blockDim.x + threadIdx.x;
      if(0 < x && x < width){
        color[k] = data[i * width + x];
      }
    }
    __syncthreads();
    for(int k = 0; k < 4; k++){
      const int x = k * blockDim.x + threadIdx.x;
      if(0 < x && x < width){
        data[i * width + x-1] = color[k];
      }
    }
    if(threadIdx.x == 0){
      data[i * width + width-1] = make_uchar4(0,0,0,0);
    }
  }
}

__global__
void shiftRowsKernel3(uchar4* __restrict__ outdata, const uchar4* __restrict__ indata, int width, int height){
	for(unsigned int i = blockIdx.x; i < height; i += gridDim.x)
	{
    for(int x = threadIdx.x; x < 2048; x += 512){
      if(x < width - 1){
        outdata[i * width + x] = indata[i * width + x + 1];
      }else if(x == width - 1){
        outdata[i * width + x] = make_uchar4(0,0,0,0);
      }
    }
  }
}

__global__
void check(uchar4* data, int width, int height, int numIterations){
	for(unsigned int i = blockIdx.x; i < height; i += gridDim.x)
	{
    for(int x = threadIdx.x; x < width - numIterations; x += blockDim.x){
      uchar4 pixel = data[i * width + x];
      assert(pixel.x == 42);
      assert(pixel.y == 42);
      assert(pixel.z == 42);
      assert(pixel.w == 42);
    }

    for(int x = width - numIterations + threadIdx.x; x < width; x += blockDim.x){
      uchar4 pixel = data[i * width + x];
      //printf("%d, %d %d %d %d\n", i * width + x, int(pixel.x), int(pixel.y), int(pixel.z), int(pixel.w));
      assert(pixel.x == 0);
      assert(pixel.y == 0);
      assert(pixel.z == 0);
      assert(pixel.w == 0);
    }
  }
}

int main(){
  const int width = 2048;
  const int height = 16*1024;
  const int numIterations = 10;

  uchar4* d_data; cudaMalloc(&d_data, sizeof(uchar4) * width * height);
  uchar4* d_data2; cudaMalloc(&d_data2, sizeof(uchar4) * width * height);
  cudaMemset(d_data, (char)42, sizeof(uchar4) * width * height);

  cudaEvent_t start; cudaEventCreate(&start);
  cudaEvent_t stop; cudaEventCreate(&stop);

  check<<<height, 512>>>(d_data, width, height, 0);
  cudaDeviceSynchronize();

  cudaEventRecord(start);
  for(int iter = 0; iter < numIterations; iter++){
    //std::cout << "iter " << iter << "\n";
    shiftRowsKernel<<<((height) + 512 - 1) / 512, 512>>>(d_data, width, height);
    // check<<<height, 512>>>(d_data, width, height, iter+1);
    // cudaDeviceSynchronize();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  check<<<height, 512>>>(d_data, width, height, numIterations);
  cudaDeviceSynchronize();

  float timingMS; cudaEventElapsedTime(&timingMS, start, stop);
  double numGB = (sizeof(uchar4) * width * height) * numIterations / 1024. / 1024. / 1024.;
  double GBperSecond = numGB / (timingMS / 1000);
  std::cout << "shiftRowsKernel " << timingMS << " ms. " << GBperSecond << "GB/s\n";



  cudaMemset(d_data, (char)42, sizeof(uchar4) * width * height);
  cudaEventRecord(start);
  for(int iter = 0; iter < numIterations; iter++){
    //std::cout << "iter " << iter << "\n";
    shiftRowsKernel2<<<height, 512>>>(d_data, width, height);
    //check<<<height, 512>>>(d_data, width, height, iter+1);
    //cudaDeviceSynchronize();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  check<<<height, 512>>>(d_data, width, height, numIterations);
  cudaDeviceSynchronize();

  float timingMS2; cudaEventElapsedTime(&timingMS2, start, stop);
  double numGB2 = (sizeof(uchar4) * width * height) * numIterations / 1024. / 1024. / 1024.;
  double GBperSecond2 = numGB2 / (timingMS2 / 1000);
  std::cout << "shiftRowsKernel2 " << timingMS2 << " ms. " << GBperSecond2 << "GB/s\n";


  cudaMemset(d_data, (char)42, sizeof(uchar4) * width * height);
  cudaEventRecord(start);
  for(int iter = 0; iter < numIterations; iter++){
    //std::cout << "iter " << iter << "\n";
    const uchar4* input = (iter % 2 == 0) ? d_data : d_data2;
    uchar4* output = (iter % 2 == 0) ? d_data2 : d_data;
    shiftRowsKernel3<<<height, 512>>>(output, input, width, height);
    // check<<<height, 512>>>(output, width, height, iter+1);
    // cudaDeviceSynchronize();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  uchar4* output = ((numIterations-1) % 2 == 0) ? d_data2 : d_data;
  check<<<height, 512>>>(output, width, height, numIterations);
  cudaDeviceSynchronize();

  float timingMS3; cudaEventElapsedTime(&timingMS3, start, stop);
  double numGB3 = (sizeof(uchar4) * width * height) * numIterations / 1024. / 1024. / 1024.;
  double GBperSecond3 = numGB3 / (timingMS3 / 1000);
  std::cout << "shiftRowsKernel3 " << timingMS3 << " ms. " << GBperSecond3 << "GB/s\n";
}

So after some thorough testing and a bit of a coffee break, here are the results I ended up with:

After running your same code on my environment, I got the following results:

shiftRowsKernel 193.316 ms. 0.646609GB/s
shiftRowsKernel2 6.05754 ms. 20.6355GB/s
shiftRowsKernel3 5.54835 ms. 22.5292GB/s

I then modified the kernel I was running to match your shiftRowsKernel2. For reference, here is the code:

__global__ void ShiftData2(cudaSurfaceObject_t out)
{	
	for(unsigned int i = blockIdx.x; i < 16384; i += gridDim.x)
	{
		uchar4 color[4]; //assuming blocksize 512, each block can hold 512*4 = 2048 pixels

		for(int k = 0; k < 4; k++)
		{
			const int x = k * blockDim.x + threadIdx.x;
			if(0 < x && x < 1967)
			{
				surf2Dread(&color[k], out, x * 4, i);
				//color[k] = data[i * width + x];
			}
		}
		
		__syncthreads();
		
		for(int k = 0; k < 4; k++)
		{
			const int x = k * blockDim.x + threadIdx.x;
			if(0 < x && x < 1967)
			{
				surf2Dwrite(color[k], out, x * 4 - 4, i);
				//data[i * width + x-1] = color[k];
			}
		}
		if(threadIdx.x == 0)
		{
			surf2Dwrite(make_uchar4(255,0,0,255), out, 1967 * 4 - 4, i);
			//data[i * width + width-1] = make_uchar4(0,0,0,0);
		}
	}
}

These are the results I got:
shiftData2 25.0511 ms. 4.79245GB/s

At this point, I decided to test another computer I have access to that has an RTX 2060 card in it. Running these same tests on that computer, I was getting the following results:

shiftRowsKernel 13.8898 ms. 8.99941GB/s
shiftRowsKernel2 1.28515 ms. 97.2648GB/s
shiftRowsKernel3 2.34147 ms. 53.3852GB/s
shiftData2 1.21869 ms. 98.5126GB/s

With these results, my best guess is that the Quadro can’t handle textures very well. I might continue my development on the other computer, as it seems the Quadro can’t do what I want at reasonable speeds.
