I get obscure results from this, and even after thread synchronization the timing varies a lot. I can't launch anything larger than a 5000x5000 matrix.
I have a 9800 GX2, and the cubin file always shows 10 registers used. I don't know what the problem is.
Can someone please answer these for me:
1) How is the FSB limitation overcome in CUDA? Intel CPUs bottleneck on multicore performance beyond 2 cores when all the threads make consecutive memory accesses.
2) Maximum sizes of each dimension of a block: 512 x 512 x 64. Maximum sizes of each dimension of a grid: 65535 x 65535 x 1. I wanted to know how these limits are fixed.
The dimensions of a block basically mean the number of threads in the block, but that is limited to 512,
and the grid dimensions would mean the number of blocks.
You should have a look at the matrix multiplication example in the CUDA programming guide. As was already pointed out, your kernel has issues.
clock() is an extremely imprecise time measurement. Your kernel call is probably completing in milliseconds or less, so you need a higher-resolution timer, e.g. gettimeofday on Linux or QueryPerformanceCounter on Windows; these are wrapped in a cross-platform way by CUT, which comes with the CUDA SDK. If you want extremely high-precision timing of just the kernel launch, use the timers in the CUDA event API (see the programming guide).
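For reference, a minimal sketch of event-based timing, with the kernel name and its arguments as placeholders:

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                        // record on the default stream
myKernel<<<grid, block>>>(d_A, d_B, d_C, size);   // placeholder kernel and arguments
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                       // block until the kernel has finished

cudaEventElapsedTime(&elapsed_ms, start, stop);   // result in milliseconds
printf("kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);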
What do you mean? There is no FSB on the GPU. The GTX 280, for instance, has a 512-bit memory bus connecting the RAM directly to the memory controller on the GPU. It is capable of feeding the GPU with ~140 GiB/s of bandwidth (read or write).
Because the hardware scheduler was designed that way.
Note that you can't always run 512 threads in a block; it depends on the number of registers the kernel uses. For example, with 8192 registers per multiprocessor, a kernel using 20 registers per thread cannot launch a 512-thread block, since 512 x 20 = 10240 registers exceeds the limit.
Also, you said that you can't launch a matrix larger than 5000x5000. If this is due to memory restrictions, you could write a kernel that does an "in-place" multiplication (overwriting one of the input matrices with the output to save space). You should get a decent increase in the maximum possible matrix size from doing that.
If you look at my for loop: for (int k = 0; k < (int)(size/blockDim.x*blockDim.y); ++k)
where size is the row/column count of a square matrix and blockDim.x*blockDim.y is the total number of threads, so this gives the number of blocks.
I'm trying to iterate this loop over all the existing blocks, or maybe I should have used gridDim. Please correct me if I'm wrong.
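As an aside on that bound: in C, size/blockDim.x*blockDim.y parses left to right as (size/blockDim.x)*blockDim.y, not as size divided by the total thread count, so the loop probably doesn't run the number of times intended. A minimal sketch of the difference, with illustrative values standing in for size and the block shape:

// Illustrative values: a 5000-wide matrix and a 16x16 thread block.
int size = 5000;
int bx = 16, by = 16;              // stand-ins for blockDim.x and blockDim.y

int asWritten = size / bx * by;    // (5000/16)*16 = 4992, left-to-right evaluation
int intended  = size / (bx * by);  // 5000/256 = 19, size over threads per block

// Inside a kernel, the block count is also available directly as gridDim.x * gridDim.y.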
Yup, thanks for that, I'll definitely consider that advice. But I've already posted 3-4 different threads with no reply yet, so I thought I might as well ask again ;-)
1) The matrix multiplication in the manual involves a lot of optimization, so I thought I would write a simple CUDA-enabled MM first, then optimize it.
I tried using the CUT timer functions from cutil, but I got cutil.h problems. The headers were in the SDK directory and I even gave its path, but there was a linking problem, and I'm not root here so I can't configure the paths. I'll try to make use of the cudaEvent routines instead, thanks.
2) I used "FSB" because I did not find an equivalent term for the GPU architecture. I thought the quickness of the GPU's multicore design came from the 16 banks in shared memory, which let a half-warp be processed, and a full warp in 4 clock cycles, while global memory has a latency of 400-600 clock cycles. Correct me if I'm wrong; I couldn't interpret what these clock cycles mean.
Also, the GPU has 2 frequencies, CORE and SHADER. Which one should be taken into account for FLOPS calculations?
3) Hardware scheduler? Yeah, I accept the 512-thread limitation per block, but can you tell me the limit on the NUMBER OF BLOCKS AND GRIDS?
Yup, I agree that if I want to make lots of threads I need to compromise on shared memory and registers per thread, those being 16 KB and 8192 per multiprocessor.
But I have 1 GB per card; what memory restrictions could I possibly have? I allocate 3 float arrays on the GPU, which is 3*5000*5000*sizeof(float) = 300 MB, so I still have lots of space.
$ ./a.out 5000
cuda error : Kernel Invocation failed : invalid configuration argument
That could be a problem in the host code where you call your kernel. The part between <<< and >>> is the launch configuration.
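For instance, a minimal sketch of a launch with an explicit check afterwards ("invalid configuration argument" usually means the block or grid dimensions exceed the hardware limits); the kernel name and tile size here are placeholders:

dim3 block(16, 16);                           // 256 threads per block, within the 512 limit
dim3 grid((size + block.x - 1) / block.x,     // round up so the grid covers the whole matrix
          (size + block.y - 1) / block.y);

matMulKernel<<<grid, block>>>(d_A, d_B, d_C, size);   // placeholder kernel and arguments

cudaError_t err = cudaGetLastError();         // reports launch-configuration errors
if (err != cudaSuccess)
    printf("cuda error: %s\n", cudaGetErrorString(err));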
You use the shader clock to calculate flops.
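As an illustration (the figures here are assumptions for a G92-class part like the 9800 GX2: 128 stream processors per GPU and a 1.5 GHz shader clock):

// Peak throughput is derived from the shader clock, not the core clock.
// 128 stream processors x 1.5 GHz x 3 flops/cycle (dual-issue MAD + MUL)
const double peak_gflops = 128 * 1.5 * 3;     // ~576 GFLOPS per GPU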
The maximum number of blocks in a grid is 65535 x 65535. Each kernel runs on a single grid. I don't know if there's a limit on how many grids and/or kernels your code can have.
This doesn't change the answer. The hardware scheduler is still responsible for launching all blocks. Think about it: on an MP with 8 blocks running flat out, one of them suddenly completes. The GPU isn't going to go back to the host or anything dumb like that to get another block to run: they are all queued up in the hardware scheduler, so that as soon as a block completes it is replaced by another.
With the GX2, each GPU gets half of the memory, so each one can only access 512MB. I would think you would be able to get closer to that limit, but from what I’ve read, there’s a decent amount of memory that you can’t ever access due to other processes using the GPU. Try testing some smaller numbers and see what the cutoff is between where it works and where it doesn’t.
That's the memory reserved for the mode switch, around 7-10 MB, just the bits on the screen depending on the resolution. So leaving that aside, I still have roughly 500 MB on one card and 512 MB on the other.
No, from what I've read on the forums, you still won't be able to access the full 500 MB. There are other things internal to the video/CUDA driver that use memory as well, reducing the amount you are able to access. There is a method used by some developers on this forum where they "brute-force" the memory allocation: trying to allocate a huge block, then a slightly smaller one, and so on until the allocation succeeds. If you try this as an experiment, you will be able to determine the actual amount of memory available for your specific setup.
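A minimal sketch of that brute-force probe (the starting size and step are arbitrary choices):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t step = 8 << 20;                  // shrink in 8 MB decrements (arbitrary)
    void  *ptr  = NULL;

    // Start near the card's nominal capacity and shrink until cudaMalloc succeeds.
    for (size_t bytes = (size_t)512 << 20; bytes >= step; bytes -= step) {
        if (cudaMalloc(&ptr, bytes) == cudaSuccess) {
            printf("largest single allocation: %zu MB\n", bytes >> 20);
            cudaFree(ptr);
            return 0;
        }
    }
    printf("no allocation succeeded\n");
    return 1;
}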