Hi All,

I wanted to get some advice on the most efficient way to interact with my data on the GPU. The data is stored in a Python dictionary of Sorted Dict. My plan was to transfer/ transform the dictionary data into the following format to the device: `Row i: [key_index_0: float_0 ,key_index_1: float_1 ,...,key_index_n: float_n]`

, where `key_index_0: float_0`

are stored in a struct and all key_index are already in sorted order. Then I want to perform some calculation (for example the dot product) on all matching key values present in both A_i, and B_j. To do this calculation between A_0i, and B_j, I was planning on performing binary search to find the matching key values, then performing the dot product on those results, and finally return the maximum result values for (A_i. * B_j). Below is an example:

```
Let A = [ [key_0: float_0 ,key_1: float_1 ,...,key_3: float_3 ] ]
Let B = [ [key_0: float_0 ,key_1: float_1 ,â€¦,key_5: float_5 ],
[key_0: float_0 ,key_1: float_1 ,â€¦,key_10:float_10 ] ]
Dot_product_results = [value, value] = (A_i. * B_j)
```

Iâ€™m currently considering the following three approaches:

- Manually refactor/flatten the jagged array into a flat (1D) array. In this approach, I believe I would need one array containing all categories and one (1D) arrays that stores the start of each individual sub-array. For example, the following array
`[(1:0.5), (2:3.43), (3:0.00032)], [(2:1.0), (6:1.5)]`

would be transformed as`data = [(1:0.5), (2:3.43), (3:0.00032), (2:1.0), (6:1.5)]`

and`index = [0,3]`

. I am hoping that there is a way to use this data structure without pointer chasing, use coalesced memory, and somehow take advantage of shared memory. - The second approach would be to create a 2D matrix that is equal to the largest array of my data, and, then append a bunch of empty structs to ensure uniformity. In theory, the B matrix will be larger than the A matrix, and thus I will keep it in column-major order to ensure memory coalescing on the GPU. Which means, the A matrix would need to be transposed; meaning that the x-dimension would contain categories and the y-dimension contain key-values. In this scenario then, the calculation operation would be more like
`(B_i. * A_j)`

. I think that this approach would be horrible for GPU memory as less than 0.05% of columns would be filled with non-empty structs. To fix that, I could use a sparse matrix representation to reduce the memory overhead for this approach. - My final approach would be to utilize a more traditional approach of using (SpMV) or (SpMM) representation, such as ELLPACK (ELL). Basically, my data would no longer be stored inside a struct, rather the columns indexes value would be mapped to the key values, the row indexes would mapped the category index, and the data stored inside each cell would be the floating-point value. Again, to ensure that the memory access for Bâ€™s matrix was coalesced, it would be in column-major order. Aâ€™s sparse matrix would require that the x-dimension contains categories (rows from array) and the y-dimension would contain all keys present in my data. However, I believe it would be more computationally expensive to transform my host data into this format.

So, any suggestions, any alternative ideas, or any help be greatly appreciated.