traffic_replay: Improve assign_groups() performance with large domains
When assigning 10,000 users to 15 groups each (on average),
assign_groups() would take over 30 seconds. This did not include any DB
operations whatsoever. This patch improves things, so that it takes less
than a second in the same situation.
The problem was the code was looping ~23 million times where the
'random.random() < probability * 10000' condition was not met. The
problem is individual group/user probabilities get lower as the number
of groups/users increases. And so with large numbers of users, most of
the time the calculated probability was very small and didn't meet the
threshold.
This patch changes it so we can select a user/group in one go, avoiding
the need to loop multiple times.
Basically we distribute the users (or groups) between 0.0 and 1.0, so
that each user has their own 'slice', and this slice is proporational to
their weighted probability. random.random() generates a value between
0.0 and 1.0, so we can use this to pick a 'slice' (or rather, we use
this as an index into the list, using .bisect()). Users/groups with
larger probabilities end up with larger slices, so are more likely to
get picked.
The end result is roughly the same distribution as before, although the
first 10 or so user/groups seem to get picked more frequently, so the
weighted-probability calculations may need tweaking some more.
Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>