Profiling CUDA time and memory during LLM inference

I am running inference on an LLM with many prompts. For each prompt, I want to measure the GPU time taken and the GPU memory consumed. Could you suggest a technique or tool I can use for this? A sketch of my loop is below.
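
For context, here is a minimal sketch of my per-prompt loop, assuming PyTorch and Hugging Face transformers; the model name, prompts, and generation settings are placeholders, not my real setup. The comments mark where I want to capture per-prompt GPU time and memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")
model.eval()

prompts = ["First prompt ...", "Second prompt ..."]  # many prompts in practice

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # <-- want to start measuring GPU time and memory for this prompt here
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)
    # <-- want this prompt's GPU time and peak memory usage here
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

I am aware that CUDA kernels launch asynchronously, so I suspect naive wall-clock timing around `generate` may be misleading, which is why I am asking what the right measurement approach is.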

Thanks