I have a program that needs to suck in a day’s worth of data: about 300–350MB of text files. Once the proper arrays are set up and copied to the GPU, this takes up about 540MB of GPU memory. Everything I am doing has to read from global memory, but I am trying to start thinking about speeding up my kernel. The algorithm is very complex and will be tough to parallelize beyond what it is, so I am more wondering about storage and speed.
I think someone once said I should be looking into CUDA arrays and texture memory? As of now, the algorithm and all threads pull from global memory. LOTS of gmem accesses, trust me. What can I do about this?
This week I’m going to get a bit more detail from a coworker about how the algorithm works and how it picks elements to analyze for its computation. This isn’t a simple matrix multiplication kind of problem; it’s a thousand times more complex. I’m starting to think about faster memory access. Some people have suggested using Texture Memory, but what I read tells me it’s best for graphics processing. Also, isn’t there a limit on how large texture memory is on the GPU? I’m trying to think about how to speed things up from the memory/storage point of view first.
By the way, I didn’t see your original post, so you don’t have to worry about that.
The way you should read that is “Texture memory is good for non-sequential memory access patterns with spatial locality in a 1D, 2D, or 3D array.” So unless you are processing these text files into some kind of multidimensional array and then operating on it, texture memory probably won’t help.
Exactly. I have four 2D matrices that get copied to the GPU, and some computation is done using all 4 of those matrices. A 5th matrix is allocated on the GPU just to hold the results of the algorithm. Then it gets copied back to the host. I’m not sure yet because I haven’t counted these things just yet, but I’m sure a full day’s worth of data will make at least thousands of GMEM accesses during the lifetime of the algorithm. I do plan on using shared memory to help me out once I talk to one of our developers about the nature of the algorithm (trust me, it’s complicated).
So, using textures and CUDA arrays is ideal for my situation? What about limitations on size? What are some really good, clear examples that I can look at to guide me through? Sometimes the programming guide isn’t correct or clear enough.
Possibly. You haven’t said anything about the array access pattern yet. For totally random reads, texture caching is no help. If each thread is reading an element and its N spatially adjacent neighbors (like in a convolution kernel), then the texture cache will help. If you are reading long contiguous runs of elements (for example, matrix multiplication), then you generally get the best performance managing that yourself in global memory, with the help of shared memory as needed.
Appendix A.1.1 in the programming guide says that for textures bound to CUDA arrays (which you have to use to get the spatial caching benefit) the maximum dimensions are:
1D = 2^13 elements
2D = 2^16 by 2^15
3D = 2^11 by 2^11 by 2^11
Beyond those indexing limits, your only limitation is of course the actual memory capacity of the card.
Those are in bytes, correct? EACH of my 2D matrices will hold at least 50MB worth of data (they are full of doubles and __int64 values).
Anyway, as far as the array access pattern, it works kind of like this:
I have 5 2D arrays that are the exact same size. They are 2D jagged arrays with dimx rows, with the length of each row differing from one to the next, but each of the 5 arrays is the exact same shape with different data. There are 2 2D arrays that get returned, but they are not jagged like the previous 5. They are dimx x dimx in dimensions… square matrices. The algorithm works on one row of the returned matrices at a time.
So, let’s call the 2 matrices “c” and “p”. We start by calculating c[i][0] and p[i][0]. To do this, we start reading various values of each row of the 5 2D jagged arrays, making some computations, then changing c[i][0] and p[i][0]. Then we do c[i][1] and p[i][1]. Then, after that row is done, we do c[i+1][0] and p[i+1][0]. I haven’t talked to the developer yet, so I can’t yet explain the process of how the values are chosen and explain it right now. I’m working on it.
Seems to me like I’m working with data too large for texture binding to CUDA arrays. Sounds like I should just manage global memory myself using shared memory to help speed things up.
No, those are index limits. The actual texture elements can be 1-, 2-, or 4-component values of 8/16/32-bit integer or float type. There is no direct support for 64-bit types, which means that if textures were beneficial, you would need to kludge it with an int2 texture.
Your arrays probably fit within the texture limits, but if you are reading across whole rows, I don’t think the texture cache will help you.
I am starting to adjust the algorithm to incorporate better usage of shared memory. When applying what I have to 1/10 of a day’s worth of data, generating two 211 x 211 square matrices as output, I managed to go from 22,237,465 global memory accesses down to 13,935,240. I am getting the correct output, too. However, my program is about 3 seconds (~2.5%) slower after adding these shared memory accesses. My first thought was that I wasn’t allowing for coalesced reads into shared memory.
I suppose there are bank conflicts. I have it so that for double and __int64 values (both are 8 bytes), smem[0]–smem[210] are used to hold a value from gmem that the current thread needs. Then smem[211]–smem[421] are used to hold a different value across all 211 threads, etc. By doing this, I suppose smem[0]–smem[127] are all trying to access from the same bank. I got those figures because for the Tesla T10, there are 16 banks for 16KB shared memory… So, that is 1KB per bank, aka 1024 bytes. So, for 8-byte values, this means smem[0]–smem[127] are bank 0, smem[128]–smem[255] are bank 1, etc.?
I can’t help but feel that I’m really missing something here.
Is that 22,237,465 global memory accesses per thread?
Or across the entire grid? Assuming 8-byte accesses, that is only 0.16 GB read/written. A fast GPU should be able to deliver that in under 2 milliseconds (assuming reasonable coalescing to give you ~100 GB/s).
That is the total number of GMEM accesses by all threads on a certain day’s worth of data. It doesn’t do all those GMEM accesses at once; we end up going into at least 3 layers of loops to do all this. There are a great number of doubles being multiplied and subtracted in the algorithm.
If I try to use more SMEM than what is available, neither the compiler nor the runtime complains. Does it spill over into GMEM when this happens? Also, how am I to think about how much SMEM is available? The maximum number of blocks (for the Tesla T10) that can be scheduled simultaneously on a single multiprocessor is 8, and each multiprocessor is allowed 16KB worth of shared memory. So, if the maximum of 8 blocks are scheduled simultaneously, then each block, essentially, is allowed 2KB worth of SMEM storage?
EDIT: Well, actually the compiler does complain whenever you try to allocate an amount of SMEM larger than what is available. It (sneakily) lets you think you built a new solution, but it actually uses the last successful build at that point. I was wondering why nothing was changing in my program anymore.
I don’t know where I can really speed up my program. It makes hundreds of millions, possibly billions of mathematical calculations, and I am not sure where I can do any more optimization. I’m still confused on GMEM/SMEM coalescing.
Yes - the banks are interleaved. For int (4-byte) values, bank 0 consists of smem[0], smem[16], … while bank 1 holds smem[1], smem[17], …
This particularly means that 8-byte values can never be accessed without (two-way) bank conflicts. If it’s possible, it would be beneficial to decompose the 8-byte values into two 4-byte values and reorder data while transferring it to shared memory. OTOH 2-way bank conflicts are not too bad and might not be worth working around.
For double values the 2-way bank conflict is not worth bothering about. Any arithmetic operation on them is 8x slower than on float (unless you are on Fermi, but even there they are slower). Although I’m not aware of this being documented, I assume the bank conflict is fully hidden behind the slower arithmetic. At worst, the penalty is marginal.