Parallelized cluster finding in trd::Hitfind across rows.
Enabled parallelization across rows of the cluster-finding step in trd::Hitfind. Cluster building now scales nicely with the number of cores; not exactly linearly, but in that ballpark.
When comparing the output of this code with the original on a high-multiplicity timeslice of run 2457, I noticed a slight difference in the number of produced hits: 7595248 instead of 7595161, an increase of around 0.001 percent.
I traced this back to the 2D clusterizer. With the new version, the number of produced 2D clusters can differ for the same number of input digis. I believe I have isolated the origin of this deviation: in the original code, digis are sorted by time for each module; in the new code, they are first bucket-sorted by row and then by time. One would expect this not to matter, but in very rare cases two digis have the same time stamp but different T- and R-values, and these can in principle end up in the stream in reverse order. I confirmed on a few examples that this indeed happens. From the logic of the algorithm, I'm not completely sure how this can affect the number of produced clusters, but it is not entirely implausible that it does. Maybe @a.bercuci can comment.
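One way to make the two sorting paths agree would be to break exact-time ties deterministically in the comparator, so that the relative order of equal-time digis no longer depends on whether the stream was sorted per module or per row. A minimal sketch, with a hypothetical simplified Digi struct (field names are illustrative, not the actual CbmTrdDigi interface):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a TRD digi; only the fields relevant here.
struct Digi {
  uint64_t time;     // timestamp
  uint16_t row;      // pad row
  uint16_t channel;  // channel/pad index, used below as a tie-breaker
};

// Deterministic comparator: order by time, break exact-time ties by channel.
// With a strict tie-break, any comparison-based sort (stable or not, module-
// level or per-row) yields the same relative order for equal-time digis.
inline bool DigiLess(const Digi& a, const Digi& b)
{
  if (a.time != b.time) return a.time < b.time;
  return a.channel < b.channel;
}

// Sort a digi stream deterministically.
std::vector<Digi> SortStream(std::vector<Digi> v)
{
  std::sort(v.begin(), v.end(), DigiLess);
  return v;
}
```

With this, two shuffled copies of the same digi set sort to identical streams, which would make the old and new code paths bit-comparable again (at the cost of fixing an arbitrary but reproducible tie order).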
Furthermore: when looking at the hitfinders once more, I realized that the first stage of hitfinding, i.e. the assignment of physical properties to the clusters, is completely row-parallel, both for 1D and 2D. Only the row-merge step, which currently exists only for 2D, is not. This means the parallelization done here can be pushed even further. It is conceivable that even the row-merge step can be parallelized, using even-odd partitioning. All in all, this suggests we can probably speed up this code by at least another order of magnitude.
I also have some bad news: I think the hit-merging as currently implemented in HitFinder2D doesn't do what is intended, neither in the present version nor in the original class CbmTrdModuleRec2D. The problem is that the nested loops inside HitFinder2D::operator() rely on time order to restrict their ranges. This is a valid optimization in principle, but the output of Clusterizer2D is not actually time-sorted: it is sorted by row index, and then by time within each row.
The effect is that hit-merging is (mostly) only applied within rows, but not across rows. One way to fix this is to add another time-sort step at the module level; I have included this here as a commented-out line. Enabling it incurs a sizeable performance hit, however, and rather strongly affects the number of produced hits, making a backwards comparison impossible. @a.bercuci, we should address these issues at some point.
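For illustration, a module-level re-sort along the lines of that commented-out step could look like this; the Cluster struct and field names are a hypothetical simplification, not the actual interface:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified cluster: only the fields relevant here.
struct Cluster {
  uint64_t startTime;  // time used by the merge windows
  uint16_t row;
};

// Flatten per-row cluster buckets (already time-sorted within each row) into
// one module-level stream and re-sort it by time, so the range-limiting logic
// in the hit merger sees a genuinely time-ordered input across rows.
std::vector<Cluster> TimeSortModule(const std::vector<std::vector<Cluster>>& rowBuckets)
{
  std::vector<Cluster> all;
  for (const auto& row : rowBuckets) all.insert(all.end(), row.begin(), row.end());
  std::sort(all.begin(), all.end(),
            [](const Cluster& a, const Cluster& b) { return a.startTime < b.startTime; });
  return all;
}
```

This is exactly the extra O(n log n) pass that causes the performance hit mentioned above; a k-way merge of the already-sorted row buckets would be a cheaper alternative if this ever gets enabled.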
I should add that the 1D hitfinder doesn't care whether the input clusters are time-sorted; only the clusterizer needs time-sorted digis. Switching to the row-parallel version hence had no effect there.
@fweig As a side remark: I am currently missing an "add_bytes" call for the hitfinder timer, because determining the input size of the row-ordered cluster array is a bit convoluted. Should I just use the digi input size?
I'm not sure whether we should merge this before the next beamtime. We have a version that works and is fast enough (or is it?), so I don't want to add further complications. On the other hand, if we have some sort of test system that confirms the correctness of the entire chain (which we probably need anyhow), I don't see a strong reason not to.