Appendix G of the CUDA 3.1 C programming guide says
Constant memory size
64KB
My largest kernel uses 55878 bytes of constant memory
in one large and two small (32 int each) arrays.
It runs very slowly.
I can make performance worse by moving where the arrays are declared.
I am using a mixture of short int and unsigned int.
I am not sure of the significance of the 8KB cache.
But I am beginning to suspect that (despite a lot of effort
with shared memory) the kernel is held up by random access
to off-chip memory for “constant” data, as the 8KB cache is overwhelmed.
On the other hand, perhaps the GTX 295 does not like short int constants.
As always, any help, comments or hints would be most welcome.
Bill
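For anyone wanting to reproduce the setup: a minimal sketch of a layout like the one described, one large table plus two small 32-element arrays in `__constant__` memory. All names and sizes here are illustrative, not from the original code.

```cuda
// Hypothetical sketch: ~54KB of short ints plus two small arrays,
// close to the ~56KB of constant memory mentioned above.
__constant__ unsigned short big_table[27000];   // ~54KB
__constant__ unsigned int small_a[32];          // 128 bytes
__constant__ unsigned int small_b[32];          // 128 bytes

__global__ void kernel(const int *idx, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // A data-dependent index like idx[i] makes each thread read a
        // different constant address. Such scattered reads serialize
        // within the warp and can thrash the small per-SM constant cache.
        out[i] = big_table[idx[i]] + small_a[i & 31];
}
```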
tera
November 3, 2010, 5:57pm
3
Do all threads of the warp access the same array elements? As far as I remember, constant cache accesses to different elements get serialized.
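To illustrate the distinction: constant memory is optimized for the broadcast case, where a warp reads one address. A small hypothetical sketch (names illustrative) of the two access patterns:

```cuda
__constant__ float table[2048];

__global__ void broadcast_vs_serialized(const int *idx, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Broadcast: every thread in the warp reads the same address, so
    // the constant cache can serve the whole warp in one transaction.
    float a = table[blockIdx.x & 1023];

    // Divergent: each thread reads its own address; the hardware
    // replays the access once per distinct address in the warp.
    float b = table[idx[i] & 2047];

    out[i] = a + b;
}
```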
Dear tera,
Whilst there is some structure, each thread tends to read at random from the array.
My initial assumption was that this would be good enough since I had struggled to fit all
the data into 64KB. Now my plan is to split the kernel in two and force each half to limit
itself to reading < 8KB.
How big a deal is serialised access? I am hoping that using less than the cache size will
ensure no off-chip reads and that will be a big enough win. Any thoughts on how to check this?
Once again many thanks
Bill
My guess is that using textured reads (using e.g. cudaArrays) will give better performance.
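A sketch of that suggestion, assuming the CUDA 3.x texture-reference API: move the large read-only table from `__constant__` memory into device linear memory bound to a texture, whose cache copes better with scattered reads. Names and sizes are illustrative.

```cuda
// Texture reference for the large read-only table (CUDA 3.x style API).
texture<unsigned short, 1, cudaReadModeElementType> tex_table;

__global__ void kernel_tex(const int *idx, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // tex1Dfetch goes through the texture cache, which tolerates
        // per-thread scattered indices better than constant memory.
        out[i] = tex1Dfetch(tex_table, idx[i]);
}

// Host side: copy the table to device memory and bind the texture.
void setup(const unsigned short *h_table, size_t bytes)
{
    unsigned short *d_table;
    cudaMalloc(&d_table, bytes);
    cudaMemcpy(d_table, h_table, bytes, cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, tex_table, d_table, bytes);
}
```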
tera
November 4, 2010, 4:05pm
9
I agree.
If you can split your kernel operation on ~64KB of data into two operating on ~8KB each, that would indeed be a perfect solution (is the constant data somehow separable, or why is that possible?)
Cached serialized access should indeed still be faster than uncached access to the same address.
Dear tera and cbuchner1,
Many thanks for your suggestions. I am indeed trying to restructure the code
so that it works on each column (row) one at a time. The hope is that this will
limit the volume of data read by each multiprocessor so that it fits inside the 8KB cache.
Is there an easy way to see how effective each constant cache is?
E.g. hit or miss rates?
Many thanks
Bill
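One possible check, assuming the CUDA 3.x command-line profiler: on GT200 the `warp_serialize` counter includes warps that serialize on constant (or shared) memory address conflicts, so watching it before and after the restructuring should show whether the constant accesses are the problem. The application name here is a placeholder.

```
# Sketch: enable the CUDA command-line profiler with a config file
# listing the warp_serialize signal (counter name per the CUDA 3.x
# profiler docs; ./my_app is a placeholder for the real binary).
cat > prof_config <<EOF
warp_serialize
EOF
CUDA_PROFILE=1 CUDA_PROFILE_CONFIG=prof_config ./my_app
cat cuda_profile_0.log
```

This does not give a direct hit/miss rate for the constant cache, but a large drop in `warp_serialize` would suggest the scattered constant reads were indeed the bottleneck.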
Still working on this :-(
However people may be interested in a recent article in which they probe the GTX 280’s caches
in huge detail.
Bill
Wong, H.; Papadopoulou, M.-M.; Sadooghi-Alvandi, M.; Moshovos, A., “Demystifying GPU microarchitecture through microbenchmarking,” IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 235-246, 28-30 March 2010.
Sarnath
February 15, 2011, 4:48am
14
Dear Dr. Langdon,
Have you evaluated Textures for your requirement?
Warp serialization can badly hurt performance, especially for FLOP-intensive apps which cannot hide this latency.
Best Regards,
Sarnath