Hello
I'm building a k-means application where one of the steps is to compare a set of vectors (the data) to another set of vectors (the centers). My first implementation uses float vectors of length 16 and simply reads them into shared memory (each block reads 32 floats (2 vectors) from the data and from the centers).
The reads should coalesce, but won't the reads from the centers have problems, since all the blocks are trying to access the same memory? If the 16-float center vectors are denoted C1, C2, C3… and the blocks are denoted B1, B2, B3…, the accesses to the centers look like this:
(Vectors are in global memory)
B1: C1C2 C3C4 C5C6…
B2: C1C2 C3C4 C5C6…
B3: C1C2 C3C4 C5C6…
…
…
B128… (or higher)
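The access pattern above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual code: the names `nearestCenter` and `assignToCenters` are hypothetical, the block size is assumed to be 32 threads (one float per thread), the distance is assumed to be squared Euclidean, the number of data vectors and the number of centers are assumed to be multiples of 2, and error checking is left out.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int DIM = 16;                    // floats per vector
constexpr int VECS_PER_TILE = 2;           // vectors read per block per step
constexpr int TILE = DIM * VECS_PER_TILE;  // 32 floats, one per thread

// Hypothetical kernel: launch with TILE threads per block and one block
// per pair of data vectors.
__global__ void nearestCenter(const float* data, const float* centers,
                              int numCenters, int* assignment)
{
    __shared__ float sData[TILE];
    __shared__ float sCent[TILE];

    const int t = threadIdx.x;
    sData[t] = data[blockIdx.x * TILE + t];  // block-private, contiguous read

    float best = FLT_MAX;
    int bestIdx = -1;

    for (int c = 0; c < numCenters; c += VECS_PER_TILE) {
        __syncthreads();                  // previous tile is no longer in use
        sCent[t] = centers[c * DIM + t];  // every block reads these same addresses
        __syncthreads();                  // tile (and sData) fully loaded

        if (t < VECS_PER_TILE) {          // thread t handles data vector t
            for (int k = 0; k < VECS_PER_TILE; ++k) {
                float d = 0.0f;
                for (int i = 0; i < DIM; ++i) {
                    const float diff = sData[t * DIM + i] - sCent[k * DIM + i];
                    d += diff * diff;
                }
                if (d < best) { best = d; bestIdx = c + k; }
            }
        }
    }
    if (t < VECS_PER_TILE)
        assignment[blockIdx.x * VECS_PER_TILE + t] = bestIdx;
}

// Hypothetical host wrapper: copies both sets to global memory, runs the
// kernel, and returns the index of the closest center for each data vector.
std::vector<int> assignToCenters(const std::vector<float>& data,
                                 const std::vector<float>& centers)
{
    const int numData = static_cast<int>(data.size()) / DIM;
    const int numCenters = static_cast<int>(centers.size()) / DIM;

    float* dData = nullptr;
    float* dCent = nullptr;
    int* dAssign = nullptr;
    cudaMalloc(&dData, data.size() * sizeof(float));
    cudaMalloc(&dCent, centers.size() * sizeof(float));
    cudaMalloc(&dAssign, numData * sizeof(int));
    cudaMemcpy(dData, data.data(), data.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dCent, centers.data(), centers.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    nearestCenter<<<numData / VECS_PER_TILE, TILE>>>(dData, dCent,
                                                     numCenters, dAssign);

    std::vector<int> out(numData);
    cudaMemcpy(out.data(), dAssign, numData * sizeof(int),
               cudaMemcpyDeviceToHost);  // waits for the kernel to finish
    cudaFree(dData);
    cudaFree(dCent);
    cudaFree(dAssign);
    return out;
}
```

The line in question is the load into `sCent`: its address does not depend on `blockIdx.x`, so B1, B2, B3… all request C1C2, then C3C4, and so on.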
Won't there be a problem when all blocks want to access C1C2 at the same time, or does the hardware account for this? I'm mostly wondering whether B2 must wait for B1 to finish; if the reads are serialized like that, it could take a while before all blocks have done their read. Is there a better way of doing this than coalesced reads into shared memory?
Kind Regards
Brian