GPU resource calculator

Is there a calculation tool that can help us determine the exact CUDA core resources we need for real-time workloads in a customer's on-premises environment?

Example: a Mistral 7B LLM with a maximum of 20 simultaneous transactions, processing a total of 3,500 characters per second.

Is there a tool for calculating resource requirements and selecting the right hardware so that the solution we have developed performs well?

Hi there,

There is no such tool to my knowledge, since it usually works the other way around: you adjust your app to the capabilities you have available, while CUDA and other tools automatically make the most of those resources.

There are tools like CUPTI or Nsight that help you analyse the performance of your app and find ways to optimise it, though.
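For example, from Python you could capture a quick profile with torch.profiler, which collects GPU activity via CUPTI under the hood. This is just a minimal sketch; the tiny linear layer is a stand-in for your real workload:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload; replace with your actual model and inputs.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(1, 4096, device="cuda")

# Record both CPU-side calls and the CUDA kernels they launch.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Show where the GPU time actually goes.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```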

But I'll happily move your question over to the CUDA experts; they might have more tips for you.

Thanks!

Hi,

Determining which GPU to use, and the resulting project cost, is an important detail. Expensive hardware we do not need increases the project cost, while choosing a lower-capacity GPU can lead to performance problems in the project.

Our Python app sends questions to a Mistral 7B LLM and receives answers across different sessions at the same time. Everything runs very fast in the development environment, but performing multiple operations simultaneously makes the job more complicated. We know that the way the model is used, and the methods used to drive it, matter a great deal for performance.

In order to offer a solution to our customer, we need to be able to estimate capacity and limits in advance.

For example, we need to know the approximate response time of the model when a total of 3,500 characters are transmitted simultaneously from 20 different sessions. When this grows to 43,750 characters transmitted from 250 different sessions during busy hours, I need to know the waiting time and the response time.

For example, with a total of 3,500 characters across 20 simultaneous requests, I need to be able to tell my customer something like: all requests will be answered in an average of 3.9 seconds.
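To make the arithmetic concrete, this is the kind of naive back-of-envelope we can do today. Both constants below are placeholders: the characters-per-token ratio is a rough heuristic, and the throughput figure is an assumption that would have to be measured on the actual GPU.

```python
# Naive back-of-envelope; ignores queuing and batching effects entirely.
CHARS_PER_TOKEN = 4          # rough heuristic for tokenized text
GPU_TOKENS_PER_SEC = 900     # ASSUMED aggregate throughput; must be measured

def estimated_response_time(total_chars: int) -> float:
    """Estimate seconds to process total_chars at the assumed throughput."""
    return (total_chars / CHARS_PER_TOKEN) / GPU_TOKENS_PER_SEC

print(f"{estimated_response_time(3500):.2f}s")    # 20 sessions, 3,500 chars
print(f"{estimated_response_time(43750):.2f}s")   # 250 sessions, 43,750 chars
```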

Thanks.

You need a queuing model of your process. I do not have the necessary background, but someone who has worked for a cloud provider or communication network provider may have. You would want someone with a background in statistics and operations research.

You will also need to perform some relevant benchmarking on your intended deployment platform. Performance of real-life applications is a function of many more variables than just "CUDA core" count.

Yes, a queue structure is needed; I have developed one, and all task requests are processed sequentially.

What we need is a way to determine the right GPU hardware to get the best results in production: benchmark results for the GPU the software currently runs on, compared against alternative GPU hardware. In short, it is important for us to know in advance how much efficiency we will gain when we purchase more powerful, more expensive GPU hardware.

You could benchmark your software on cloud systems like Amazon AWS for little money to figure out which GPUs are suitable for you.
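A minimal load-test sketch in Python, for example; `generate()` is just a stand-in for however you actually invoke the model, and the concurrency sweep mirrors the session counts mentioned above:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Placeholder: swap in your actual Mistral 7B inference call.
    time.sleep(0.1)  # simulated per-request latency
    return prompt

def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    generate(prompt)
    return time.perf_counter() - start

def run_load_test(n_sessions: int, prompt: str) -> None:
    # Fire n_sessions requests concurrently and record per-request latency.
    with ThreadPoolExecutor(max_workers=n_sessions) as pool:
        latencies = sorted(pool.map(timed_request, [prompt] * n_sessions))
    p95 = latencies[min(int(0.95 * n_sessions), n_sessions - 1)]
    print(f"{n_sessions:>3} sessions: "
          f"mean {statistics.mean(latencies):.2f}s, p95 {p95:.2f}s")

if __name__ == "__main__":
    for n in (1, 5, 10, 20):           # sweep the concurrency levels of interest
        run_load_test(n, "x" * 175)    # ~175 chars/request = 3,500 chars / 20 sessions
```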

That is not what I am talking about. Generally, requests to a shared service arrive randomly, following some statistical distribution. Think of supermarket checkout counters, or calls coming into a call center. While the average arrival rate may be roughly known and/or predictable by time of day or day of the week, the instantaneous rate can exceed it by a large factor, and this can drive up the response time of your system quickly, possibly beyond the QoS limits your application is supposed to maintain.

This kind of process can be modeled mathematically, and that is the queuing model I was referring to. Data from system benchmarks and traffic measurements feed into the model parameters, and the model can then tell you what quality of service your system can support under what conditions, allowing you to make informed decisions. The one time I had to deal with this in the course of my work, it turned out that trying to quickly read up on the topic was infeasible, so I turned to a colleague with a background in statistical models, who whipped up a simple analytical model in R in a few hours.
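To give a flavor of what such a model looks like (this is not the model my colleague built, just a textbook M/M/c sketch with illustrative numbers): treat the GPU's concurrent batch slots as c servers, measure the arrival rate lambda and the per-slot service rate mu, and the Erlang C formula then yields the expected waiting time.

```python
import math

def erlang_c_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean wait in queue (seconds) for an M/M/c system.

    arrival_rate: requests per second arriving (lambda)
    service_rate: requests per second one server completes (mu)
    servers:      number of parallel workers (c), e.g. batch slots
    """
    offered_load = arrival_rate / service_rate       # a = lambda / mu
    utilization = offered_load / servers             # rho
    if utilization >= 1.0:
        return math.inf                              # queue grows without bound
    # Erlang C: probability that an arriving request has to wait.
    top = offered_load**servers / math.factorial(servers) / (1 - utilization)
    bottom = sum(offered_load**k / math.factorial(k) for k in range(servers)) + top
    p_wait = top / bottom
    # Mean waiting time in queue (Wq); add 1/mu for mean response time.
    return p_wait / (servers * service_rate - arrival_rate)

# Illustrative numbers only: 20 requests/s arriving, each slot completing
# 1.5 requests/s, 16 concurrent batch slots on the GPU.
wq = erlang_c_wait(20.0, 1.5, 16)
print(f"mean queue wait {wq:.2f}s, mean response {wq + 1 / 1.5:.2f}s")
```

The useful property of such a model is exactly what matters here: once lambda and mu are measured for a given GPU, you can ask what happens at 250 sessions instead of 20 before buying any hardware.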