Need help with using a card to its max

timmilliken · October 1, 2014, 1:50pm

I need some help with using a card to 100%. I have a large data set that I have to run the same kernel on. I had planned to set up the data structure on the host and then hand the buffer off to the GPU for processing. How do I set the launch params so that all cores are doing something at all times? The kernel function is not huge but also not small; will I need to put the kernel launch in a for loop to prevent a TDR? I am not too new to CUDA, but I am also not a pro, as you can tell. The launch params need to be calculated at runtime so that the code is portable. What is the best way to lay out the memory block that I will copy to device memory so that access to it from each kernel is streamlined?

thanks for any help,
Tim

MutantJohn · October 1, 2014, 3:43pm

Try nvvp or nvprof?

timmilliken · October 1, 2014, 3:49pm

I need code help; I can’t profile code that has not been written. I have read 2 books, but neither goes into detail about how threads, blocks, warps, and grids work together. Also, nothing has explained how to get deviceProps and translate the returned info into the block size, thread count, etc.

Tim

MutantJohn · October 1, 2014, 4:47pm

Oh, okay. Well, I’m sure there’s a fair bit of information out there if you google it. There are many presentations and slide decks covering the info you’re looking for.

And instead of using deviceProps, there are occupancy calculators now which, again, you should research on your own first. This new API makes the task of choosing block size simpler.

Edit : But I think you should learn enough CUDA to write something compile-able and then you should profile it to identify where the bottlenecks actually are. Learning about CUDA theoretically is good but so is learning it in the wild and you shouldn’t be scared of writing something slow.

timmilliken · October 1, 2014, 5:14pm

This stuff can’t be pre-computed. When you move from one machine to another, the new one might not have the same capabilities as the one you coded for. So for my problem, it has to be solved in code. I have written several utils that use CUDA, and I am comfy with writing CUDA code. I can compile and run CUDA apps. I just asked how to work out the number of blocks, threads, warps, etc. in code.

-Tim

MutantJohn · October 1, 2014, 5:32pm

Did you not see me talk about the occupancy calculator?

timmilliken · October 1, 2014, 7:16pm

Yes, I saw that and I downloaded it. This is a great tool if the hardware does not change, unless I am missing something. What I am asking is how to make code use max resources on any CUDA-enabled card that it might be run on, from a GTX 560 to a Titan to a Tesla. Am I missing something in this tool?

MutantJohn · October 1, 2014, 7:51pm

Edit : Oh, I see now. You can try to write code based off different shared memory configurations and you can detect grid sizes and stuff like that using deviceProps() or w/e it’s called.

Sorry, took me awhile to get what you were really asking.

Edit edit : Are you sure you’re not able to use deviceProps to extract all the info you need? It gives you darn near everything.
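
For readers wondering what "extracting the info" looks like in practice, here is a minimal sketch, assuming the `deviceProps` mentioned above refers to the runtime call `cudaGetDeviceProperties` and its `cudaDeviceProp` struct. Which fields matter for a given kernel is a judgment call; the ones printed here are just the limits that usually constrain a launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query whichever device is current for this host thread.
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Hardware limits that bound any launch configuration on this card.
    printf("Device:                 %s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    printf("Warp size:              %d\n", prop.warpSize);
    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max grid size (x):      %d\n", prop.maxGridSize[0]);
    printf("Shared mem per block:   %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

These values only describe the hardware limits. They do not account for how many registers or how much shared memory a particular kernel uses, which is why they are not enough on their own to pick a block size.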

timmilliken · October 1, 2014, 8:59pm

I am sure that deviceProps does; I am just not sure how to use it for what I want to accomplish. :) Thanks for being patient during the misunderstanding.

-Tim

Robert_Crovella · October 1, 2014, 9:47pm

CUDA 6.5 includes a new occupancy API for making occupancy calculations at run-time. May be of interest:

CUDA Runtime API :: CUDA Toolkit Documentation

CUDA Samples :: CUDA Toolkit Documentation
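
A rough sketch of how the run-time occupancy API can be used to pick launch parameters on whatever card the code lands on. The kernel `scaleKernel`, the buffer size, and the scale factor are hypothetical placeholders, not anything from this thread; the only API calls used are `cudaOccupancyMaxPotentialBlockSize` (available from CUDA 6.5) and ordinary runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies every element by a factor.
// The grid-stride loop lets any grid size cover all n elements.
__global__ void scaleKernel(float *data, float factor, size_t n) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

int main() {
    const size_t n = 1 << 20;  // assumed problem size
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    // Ask the runtime for a block size that maximizes occupancy for this
    // kernel on the current device, plus the smallest grid that reaches it.
    int minGridSize = 0;
    int blockSize = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scaleKernel,
        0 /* dynamic shared memory bytes */, 0 /* no block size limit */);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Round up so every element gets a thread on the first pass.
    int gridSize = (int)((n + blockSize - 1) / blockSize);
    printf("blockSize=%d minGridSize=%d gridSize=%d\n", blockSize, minGridSize, gridSize);

    scaleKernel<<<gridSize, blockSize>>>(d_data, 2.0f, n);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_data);
    return 0;
}
```

Because the query takes the compiled kernel as an argument, the result reflects that kernel's register and shared memory use on the device it is actually running on, so nothing is hard-coded per card. High occupancy does not guarantee the best run time, so the suggested values are a starting point rather than a final answer.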

MutantJohn · October 1, 2014, 9:47pm

That’s what I was trying to find!

timmilliken · October 2, 2014, 2:14pm

Correct me if I am wrong, but this helps determine if your code is using all the resources of the gpu. I want to calculate how many threads, blocks, and grids I can use so that I use the entire gpu.

MutantJohn · October 2, 2014, 3:25pm

The Simple Occupancy example literally shows you how to launch kernels with an estimated ideal block size.

Here’s my output, as an example :

starting Simple Occupancy

[ Manual configuration with 32 threads per block ]
Potential occupancy: 50%
Elapsed time: 0.210944ms

[ Automatic, occupancy-based configuration ]
Suggested block size: 1024
Minimum grid size for maximum occupancy: 10
Potential occupancy: 100%
Elapsed time: 0.137696ms

Test PASSED
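
A sketch of how a "Potential occupancy" percentage like the one above can be computed at run time. This is not the sample's source; the kernel `squareKernel` and the helper `potentialOccupancy` are hypothetical names, and the calculation assumes occupancy is defined as active warps per SM divided by the maximum warps per SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: squares each element in place.
__global__ void squareKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

// Returns potential occupancy in [0, 1] for the given block size,
// or a negative value if the query fails.
static double potentialOccupancy(int blockSize, const cudaDeviceProp &prop) {
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, squareKernel, blockSize, 0 /* dynamic shared memory bytes */);
    if (err != cudaSuccess) return -1.0;

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (double)activeWarps / maxWarps;
}

int main() {
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;

    // Block size suggested by the occupancy API for this kernel and device.
    int minGridSize = 0;
    int suggested = 0;
    if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggested, squareKernel, 0, 0)
            != cudaSuccess) {
        return 1;
    }

    // Compare a hand-picked block size against the suggested one.
    printf("32 threads/block: %.0f%%\n", 100.0 * potentialOccupancy(32, prop));
    printf("%d threads/block: %.0f%%\n", suggested, 100.0 * potentialOccupancy(suggested, prop));
    return 0;
}
```

A small block size such as 32 can report low occupancy because each SM can only hold a limited number of resident blocks, so the thread slots never fill up. That is presumably what the 50% figure in the manual configuration above reflects, though the exact numbers depend on the card.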

Robert_Crovella · October 2, 2014, 4:15pm

more info:

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/

timmilliken · October 2, 2014, 4:50pm

Thanks to all who responded. This article explained what was going on, rather than just assuming it.

Tim
