Could you please recommend sufficient HW?

Hello, I have been working with CUDA for half a year. My aim is to optimize a relatively simple algorithm that searches for regions in a grayscale image (8 bpp). The regions I am searching for contain pixels with a brightness level higher than some threshold (some value). This is all I need.
This algorithm is used for real-time detection of defects in fabric (textile) while the fabric is being produced. Basically, I receive 200,000 frames per second (I’ve got a 200 kHz camera) from industrial cameras at the input. Each frame has a resolution of 11 kpix, which makes 2.2 GB/s of data (in the form of images). Every image has to be processed by the algorithm mentioned above.

Please could you let me know which NVIDIA GPUs you would recommend for this purpose? It’s a fairly simple kind of data processing, but extremely time-constrained. Thank you very much for your opinions.

Have a nice day.

I can never tell when people say Gb/s whether they mean gigabit per second or gigabyte per second. I prefer GB for gigabyte and Gb for gigabit, but I’m not sure everyone sees it that way.

However, 200 frames per second at 11 kpix per frame does not get you anywhere close to even a 2.2 gigabit per second data rate (unless you have ~1000 bits per pixel).

Perhaps you meant 11Mpix per frame?

Please clarify all your terms. Show the math by which you arrived at the 2.2Gb/s data rate. The math would make sense if it were 11Mpix per frame and 2.2Gigabyte per second, assuming 1 byte per pixel.

Sorry, my fault. I meant 2.2 GB/s, and the camera produces 200,000 fps. If one frame has a resolution of 11 kpix, the amount of data at the input is equal to 2.2 GB/s.
Thank you very much for your attention.

I am wondering whether a GPU is suitable for your application at all. Is there any sort of buffering available? PCIe uses packetized transport, and its throughput strongly varies with packet size. When you send 200,000 tiny frames individually, you may not achieve a 2.2 GB/s throughput to the GPU. How much data is sent back from the GPU to the CPU in your envisioned scheme?

Buffering and batching frames would be the way to achieve good PCIe throughput, but this will add latency to the overall image processing pipeline. What latency requirements exist for this use case? Presumably the cameras are watching textiles fly by in order to adjust or even stop machinery in real time.
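
As a quick sanity check, you could also simply measure the achievable host-to-device throughput for a candidate batch size. A minimal sketch along these lines would do (pinned host memory plus cudaEvent timing; the 4 MB batch size here is just an arbitrary example, not a recommendation):

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

int main()
{
   const size_t BYTES = 4 << 20;  //example batch size: 4 MB
   const int    REPS  = 100;

   uint8_t *h_buf, *d_buf;
   cudaHostAlloc((void**)&h_buf, BYTES, cudaHostAllocDefault); //pinned host memory
   cudaMalloc((void**)&d_buf, BYTES);

   cudaEvent_t start, stop;
   cudaEventCreate(&start);
   cudaEventCreate(&stop);

   cudaEventRecord(start);
   for(int i = 0; i < REPS; i++)
      cudaMemcpyAsync(d_buf, h_buf, BYTES, cudaMemcpyHostToDevice);
   cudaEventRecord(stop);
   cudaEventSynchronize(stop);

   float ms = 0.0f;
   cudaEventElapsedTime(&ms, start, stop);
   printf("host-to-device throughput: %.2f GB/s\n", (REPS * BYTES / 1e9) / (ms / 1e3));

   cudaFreeHost(h_buf);
   cudaFree(d_buf);
   return 0;
}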

From your description of the image processing I can’t tell what the computational complexity is and how much computational processing power is required. 11,000 pixels does not allow a sufficient number of threads to run on a high-end GPU even if each thread handles just one pixel, so again you would want to batch several frames for processing.

I think it would be best to invest in an exploratory prototype based on a relatively cheap consumer card with a PCIe gen3 x16 interface (e.g. a GTX 1060) and evaluate whether moving forward with a GPU-based solution is the right path. Do you have an existing CPU-based processing pipeline that you can use as a baseline?

Your assumption is good. The algorithm scans the textile and looks for defects (which could be, for example, a fly). OK, I will try to answer your questions.

  1. Is there any sort of buffering available?
    I don’t know exactly what you meant by buffering here. The camera used is an industrial, line-scan 200 kHz camera (with a resolution of 11 kpix). It sends the scanned lines to a grabber card through a CoaXPress serial interface. So yes, buffering is available. The current algorithm works with images with a resolution of only 8192 (width) x 500 (height). This image should be sent to a CUDA-capable GPU. Theoretically, I can adjust the height of the image; for example, the acquired image can have a resolution of 8192x1000 as well.

  2. How much data is sent back from the GPU to the CPU in your envisioned scheme?
    After processing one image, the CPU receives only an array of structures. Each structure contains the x,y position of a pixel whose value is higher than some threshold. How many there are depends on how many defects are present in the current image. The structure declaration could look like:

typedef struct{
   uint64_t x;
   uint64_t y;
   uint8_t pixel_value;
} PositionInfo;
  3. Do you have an existing CPU-based processing pipeline that you can use as a baseline?
    Well, the actual algorithm is very complex and contains a lot of computations which are not interesting for us. But the basic structure can look like:
//basics of the CPU algorithm (the image is 8192 px wide and 500 px tall)
for(int y = 0; y < 500 ; y++) //rows
{
   for(int x = 0; x < 8192 ; x++) //columns
   {
       if((image[y][x] < LOW_LIMIT) || (image[y][x] > HIGH_LIMIT))
       {
         //...append a new PositionInfo structure to the output array
       }
   }
}
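
For reference, a rough sketch of how this per-pixel check might map to a CUDA kernel, assuming the frame is already resident in GPU memory and the output buffer is preallocated for the worst case (the kernel and parameter names are only placeholders, not an existing implementation; it relies on the PositionInfo typedef above and <stdint.h>):

//rough sketch: one thread per pixel, defects appended through an atomic counter
__global__ void find_defects(const uint8_t *image, int width, int height,
                             uint8_t low_limit, uint8_t high_limit,
                             PositionInfo *out, unsigned int *out_count)
{
   int x = blockIdx.x * blockDim.x + threadIdx.x; //column
   int y = blockIdx.y * blockDim.y + threadIdx.y; //row
   if(x >= width || y >= height) return;

   uint8_t v = image[y * width + x];
   if((v < low_limit) || (v > high_limit))
   {
      unsigned int i = atomicAdd(out_count, 1u); //reserve a slot in the output array
      out[i].x = x;
      out[i].y = y;
      out[i].pixel_value = v;
   }
}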

Actually, they are of interest if the goal is to size hardware appropriate to the amount of computation required and the predominant memory access patterns. But I understand that you may not be able or authorized to share the details of the processing here.

If I understand correctly, there is a capability to aggregate a number of smaller images into a larger one before sending the data to the GPU for processing. That should help with the potential PCIe bottleneck and insufficient exposed parallelism in the GPU processing of the images that I pointed out.

If you have a solid grasp on the characteristics of this problem, you could try applying a roofline performance model. However, analytical performance models are limited in their predictive power, and my advice would still be to build a prototype with a cheap consumer GPU first, before diving head first into a GPU solution using professional-grade hardware (Tesla). The devil is often in the details, and a prototype will help you flush out the “unknown unknowns”.
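
For reference, the basic roofline bound is simply: attainable throughput = min(peak compute throughput, arithmetic intensity × peak memory bandwidth). With roughly one byte read and only a couple of comparisons per pixel, the arithmetic intensity of this workload is far below the machine balance of any modern GPU, so the kernel itself would be memory-bound, and at your data rates the end-to-end pipeline is more likely limited by PCIe than by computation.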

You would probably want to target a Linux platform for this work, to provide yourself with the maximum flexibility in structuring your software.

If you intend to send 200,000 of those images per second to the GPU, at 1 byte per pixel, you’re in trouble. That’s an ingest rate of over 800 GB/s.

Furthermore, regardless of image size, this output structure has the potential to blow up the data transfer on the other end:

typedef struct{
   uint64_t x;
   uint64_t y;
   uint8_t pixel_value;
} PositionInfo;

That’s 17 bytes per exceptional pixel. If the rate of exceptional pixels is higher than 1/17, the output rate will exceed the input rate. Why would you need uint64_t for x or y? A ushort might suffice, dropping you down from 17 bytes to 5 bytes.
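
One more detail to keep in mind: with default alignment the compiler pads such structs, so the bytes actually transferred can be larger than the sum of the field sizes unless the struct is packed (or the fields are sent as separate arrays). A quick check, with the sizes one would typically see on a 64-bit platform:

#include <cstdio>
#include <cstdint>

typedef struct{
   uint64_t x;
   uint64_t y;
   uint8_t pixel_value;
} PositionInfo64;      //17 bytes of payload, typically padded to 24

typedef struct{
   uint16_t x;
   uint16_t y;
   uint8_t pixel_value;
} PositionInfo16;      //5 bytes of payload, typically padded to 6

int main()
{
   printf("sizeof(PositionInfo64) = %zu\n", sizeof(PositionInfo64));
   printf("sizeof(PositionInfo16) = %zu\n", sizeof(PositionInfo16));
   return 0;
}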

Here is the way I understood the use case (which may well be wrong): the 11K pixel images are narrow strip-like frames coming off a sensor that are then composed into a larger 8192x500 image inside the grabber hardware. In other words, each such image is composed of about 400 of the original “frames”. The total data rate at which data is sent to the GPU is still 11 KB/frame * 200K frames/sec = 2.2 GB/sec.

This would be in analogy to some scanners that use a sensor comprising just a single line of pixels that record an image by moving the sensor array in small steps while stitching together the picture from the individual scanned lines. The difference here would be that in a textile machine, the sensor is stationary but the item to be scanned is streaming by at high speed (many ft/s), requiring a high-speed camera to get the desired spatial resolution.

My question about the return data was driven by concern of generating a faithful performance profile on a consumer GPU with a single DMA engine that can then be used to reliably estimate performance on a high-end Tesla. I understood OP’s reply to indicate that the amount of extracted data returned by the GPU kernel is significantly smaller than the amount of raw data pumped to the GPU. Certainly “compressing” the output along the lines of your suggestion seems like a good idea.

Almost exactly as njuffa said. An 11 kpix frame is basically a long strip/vector of data coming off the camera. The camera has a line rate of 200 kHz (200,000 strips per second). These captured strips are gradually stored in the grabber card’s memory. However, before a strip is stored in memory, some preprocessing is done by the camera (similar to averaging), which takes the long strip of data (an 11,000-pix strip) and shortens it to 8192 pix. This means that only an 8192-pix-long strip is saved to the grabber card. When 500 of these strips have been saved in the memory of the grabber card, the whole composed image with a resolution of 8192x500 is sent to the CPU. And this is where the algorithm mentioned above runs, and where the GPU optimization should apply.
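
To put numbers on that: 200,000 strips/s divided by 500 strips per composed image gives 400 composed images per second; each composed image is 8192 x 500 x 1 byte ≈ 4.1 MB; so roughly 400 x 4.1 MB ≈ 1.6 GB/s would need to be delivered to the GPU after the reduction from 11,000 to 8192 pixels per strip (down from the 2.2 GB/s raw off the camera).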

Here is a link I found on YouTube, for a better understanding of how similar cameras work:
https://www.youtube.com/watch?v=ktSqmv6xF3A

txbob, you are right, it would be a more economical solution if the structure looked like:

typedef struct{
   uint16_t x;
   uint16_t y;
   uint8_t pixel_value;
} PositionInfo;

But the fact is that the amount of data coming off the GPU is a lot smaller than the amount of data going into the GPU.
So your opinion is that a high-end Tesla GPU could be sufficient for this kind of image processing?

Please, if you have time, could you briefly respond to my question in the comment above? Thank you very much.

Have you actually specified the real frame rate?

I don’t see it. I’m not going to read papers and work through some math. I’m not going to guess at it, or make assumptions.

You could have said something like “I am generating frames of 8192x500, where each pixel is 1 byte, that I need to deliver to the GPU at a rate of 26.235 frames per second” right at the beginning of this thread. So far, you still haven’t done that.

You also haven’t specified a rate (or a maximum upper bound on the rate) at which pixel data will be returned by the algorithm.

First of all, sorry for the inaccurate description in my previous comments.

Yes, I am generating frames with a resolution of 8192x500 (8 bpp). I need to deliver to the GPU more than 333 frames per second (333 is the lower limit).
After a single frame has been processed by the GPU, I want to send back to the CPU an array of ‘PositionInfo’ structures:

typedef struct{
   uint16_t x;
   uint16_t y;
} PositionInfo; //4 bytes of memory

This array will contain the coordinates of all pixels whose value is higher or lower than some threshold (basically, the coordinates of pixels which represent a defect in the textile). The size of this array may differ per frame (some frames may not contain any defect at all).
The rate at which the pixel data returns to the CPU should also be more than 333 returns/second.
Thank you for your patience.

The budget for HW is set to $500 per GPU.

I wonder how it was determined that a GPU costing $500 would be sufficient for the task at hand, without an exploratory prototype. You might also want to consider that consumer-grade GPUs are not designed for a 24/7 duty cycle in a relatively harsh industrial environment.

I think you are on a high-risk path here which may well lead to failure. It is not realistic to expect specific hardware recommendations from strangers on the internet who know a lot less about your use case than you do.

Well, it’s not strictly set to $500. It’s an approximate limit price per card (for experimental purposes). This is my final thesis assignment, and so far I am doing some research on the available HW, and opinions from people who properly understand this topic are very, very useful.

100,000,000 is more than 333

right?

So if you need to deliver 100,000,000 of those frames per second to the GPU, it won’t work.

Yes, you are right. 100,000,000 of those frames would be a pain to transfer. But as I mentioned in parentheses, 333 frames per second is A LOWER LIMIT. It means that if I can transfer more, e.g. 334 or 340 or 400 fps, I would be really satisfied.

so 333 frames/s is ~1.3 GB/s

With respect to the data flow to the GPU, you should be able to do that with any recent/modern desktop CUDA GPU in a proper setup (x16, Gen2 or Gen3).
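
To make that concrete, here is a rough host-side sketch of the per-frame flow with pinned buffers and a single CUDA stream (the threshold kernel, the MAX_DEFECTS bound, and the buffer sizes are assumptions following the figures earlier in the thread, not a tested pipeline):

#include <cstdint>
#include <cuda_runtime.h>

struct PositionInfo { uint16_t x, y; };  //4-byte result record, as proposed above

int main()
{
   const int    WIDTH  = 8192, HEIGHT = 500;
   const size_t FRAME_BYTES = (size_t)WIDTH * HEIGHT;  //~4.1 MB per frame, 1 byte per pixel
   const size_t MAX_DEFECTS = 1 << 20;                 //assumed upper bound on defects per frame

   uint8_t      *h_frame, *d_frame;
   PositionInfo *h_out,   *d_out;
   unsigned int *h_count, *d_count;

   //pinned host memory is needed for asynchronous, full-rate PCIe copies
   cudaHostAlloc((void**)&h_frame, FRAME_BYTES, cudaHostAllocDefault);
   cudaHostAlloc((void**)&h_out, MAX_DEFECTS * sizeof(PositionInfo), cudaHostAllocDefault);
   cudaHostAlloc((void**)&h_count, sizeof(unsigned int), cudaHostAllocDefault);
   cudaMalloc((void**)&d_frame, FRAME_BYTES);
   cudaMalloc((void**)&d_out, MAX_DEFECTS * sizeof(PositionInfo));
   cudaMalloc((void**)&d_count, sizeof(unsigned int));

   cudaStream_t stream;
   cudaStreamCreate(&stream);

   //per frame: reset the counter, copy the frame in, run the kernel, copy the results out
   cudaMemsetAsync(d_count, 0, sizeof(unsigned int), stream);
   cudaMemcpyAsync(d_frame, h_frame, FRAME_BYTES, cudaMemcpyHostToDevice, stream);

   dim3 block(32, 8);
   dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);
   //find_defects<<<grid, block, 0, stream>>>(d_frame, WIDTH, HEIGHT,
   //                                         LOW_LIMIT, HIGH_LIMIT, d_out, d_count);

   cudaMemcpyAsync(h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream);
   cudaStreamSynchronize(stream);  //h_count is now valid
   cudaMemcpyAsync(h_out, d_out, *h_count * sizeof(PositionInfo), cudaMemcpyDeviceToHost, stream);
   cudaStreamSynchronize(stream);

   //with two streams and double-buffered frames, the copy-in, kernel, and copy-out
   //of consecutive frames can be overlapped to hide most of the PCIe transfer time

   cudaFree(d_frame); cudaFree(d_out); cudaFree(d_count);
   cudaFreeHost(h_frame); cudaFreeHost(h_out); cudaFreeHost(h_count);
   cudaStreamDestroy(stream);
   return 0;
}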