Optimal Data Structure for Jagged, Static, and Sparse Data

Hi All,

I wanted to get some advice on the most efficient way to interact with my data on the GPU. The data is stored in a Python dictionary of Sorted Dict. My plan was to transfer/ transform the dictionary data into the following format to the device: Row i: [key_index_0: float_0 ,key_index_1: float_1 ,...,key_index_n: float_n], where key_index_0: float_0 are stored in a struct and all key_index are already in sorted order. Then I want to perform some calculation (for example the dot product) on all matching key values present in both A_i, and B_j. To do this calculation between A_0i, and B_j, I was planning on performing binary search to find the matching key values, then performing the dot product on those results, and finally return the maximum result values for (A_i. * B_j). Below is an example:

Let A = [ [key_0: float_0  ,key_1: float_1  ,...,key_3: float_3 ] ] 
Let B = [ [key_0: float_0  ,key_1: float_1  ,…,key_5: float_5 ],
                [key_0: float_0  ,key_1: float_1  ,…,key_10:float_10 ] ]
Dot_product_results = [value, value] = (A_i. * B_j) 

I’m currently considering the following three approaches:

  1. Manually refactor/flatten the jagged array into a flat (1D) array. In this approach, I believe I would need one array containing all categories and one (1D) arrays that stores the start of each individual sub-array. For example, the following array [(1:0.5), (2:3.43), (3:0.00032)], [(2:1.0), (6:1.5)] would be transformed as data = [(1:0.5), (2:3.43), (3:0.00032), (2:1.0), (6:1.5)] and index = [0,3]. I am hoping that there is a way to use this data structure without pointer chasing, use coalesced memory, and somehow take advantage of shared memory.
  2. The second approach would be to create a 2D matrix that is equal to the largest array of my data, and, then append a bunch of empty structs to ensure uniformity. In theory, the B matrix will be larger than the A matrix, and thus I will keep it in column-major order to ensure memory coalescing on the GPU. Which means, the A matrix would need to be transposed; meaning that the x-dimension would contain categories and the y-dimension contain key-values. In this scenario then, the calculation operation would be more like (B_i. * A_j). I think that this approach would be horrible for GPU memory as less than 0.05% of columns would be filled with non-empty structs. To fix that, I could use a sparse matrix representation to reduce the memory overhead for this approach.
  3. My final approach would be to utilize a more traditional approach of using (SpMV) or (SpMM) representation, such as ELLPACK (ELL). Basically, my data would no longer be stored inside a struct, rather the columns indexes value would be mapped to the key values, the row indexes would mapped the category index, and the data stored inside each cell would be the floating-point value. Again, to ensure that the memory access for B’s matrix was coalesced, it would be in column-major order. A’s sparse matrix would require that the x-dimension contains categories (rows from array) and the y-dimension would contain all keys present in my data. However, I believe it would be more computationally expensive to transform my host data into this format.

So, any suggestions, any alternative ideas, or any help be greatly appreciated.

I would lean toward choice 1 for most situations. It generally results in efficient presentation of data (i.e. how to achieve coalescing is fairly obvious), it works well if by chance you should use a library like thrust, and it doesn’t necessarily involve pointer-chasing per access, which is a desirable feature.

Choice 2 tends to be not much better in my experience, because you still have to keep track of the length per row. Indexing might be simpler, but strap on your big-boy pants and tackle whatever indexing challenges may come, is my view. Also choice 2 can result in significant “wasted” space depending on your length pattern.

Choice 3 really only makes sense if your data is truly “sparse” i.e. it exists at random x,y positions in a matrix and the fill ratio is like a few percent or less, and/or if you intend to use a sparse linear algebra library. At first glance, your data organization (jagged rows) doesn’t seem to need this.

1 Like

Thank you for getting back to me in such a timely manner. I truly appreciate your time. I was actually quite surprised by your reply as I figured most people were going to recommend approach two or three.

If possible, do you mind elaborating a little bit more on how coalescing would be achieved in approach one. I understand, fairly well, how to achieve coalescing in approach two and three; as well as, utilizing shared memory. However, I was under the impression that approach one would require me to do pointer chasing by doing the following val = data [index [j] + i]. My extremely, almost not thought out, idea for the first approach would be to do something as follows.

-device- void Calculation(CatVector_A * a, CatVector_B * b, float * o){
	Loop over (a-entries; i+=blockIdx){
		Value = DoSearch(entries, b)
		P += DotPr(a -entries, Value)}
	If(threadidx == 0)
	* o = P
}}

-device- void Main(Flatten_A * a, flatten_B * b, index_B * indB,  index_A * indA, float * r){
	- kernel that is responsible for breaking A and B into there correct vectorv
	- Then using dynamic parallelism to lunch Calculation(CatVector_A * a, CatVector_B * b, float * o)
	- kernel for parallel Max reduce on all results of Dot product 
	- return that maximum dot product result for A&B

I’ve only barely used thrust before, but my understanding is that it is a little slower than using straight CUDA kernels. Since the point of this endeavor is to get the optimal speedup factor, I have kind of shied away from utilizing thrust. Truthfully, I haven’t spent a lot of time researching how to do the first approach on thrust; so, if you have any resources that do something similar or if you can point me in the right thought direction, I would greatly appreciate it.

Again thank you for taking the time to share your knowledge with me.