High latency is a prevalent performance problem in streaming systems, and it can have multiple underlying causes. Excessive latency not only affects the freshness of data but also poses a risk to the overall stability of the system, potentially leading to out-of-memory (OOM) errors.
http://meta-node:5691
. If the barrier is stuck, the Await Tree Dump will reveal the barrier waiting for a specific operation to finish. This fragment is likely to be the bottleneck of the streaming job.A join B on A.nation = B.nation
, any operation on a row in A or B will be amplified by the join to potentially thousands or millions of rows, depending on how many rows share that nation. This could lead to extremely high latency.high_join_amplification
with the problematic join keys will be printed, such as
product_id = 1
is a hot-selling product, an update from stream product_description
with product_id=1
can match 100K rows from orders
.
We can split the MV into multiple MVs:
orders.user_id % 7 = 0
and perform the join.orders.user_id % 7 = 1
.orders.user_id % 7 = 6
. In total, we will have 7 MVs.user_id
cannot be directly applied with the modulo operation, it needs to be processed by a hash function first. In cases where the hash function can return a negative value, it’s worth noting that in both PostgreSQL and RisingWave, -1 % 7 = -1
instead of 6
. Therefore, we need to include additional MVs where hash_function(orders.user_id) % 7 = -1/-2/-3/-4/-5/-6
.
Finally, we can union the results from all 7 MVs. If user_id
is uniformly distributed given product_id = 1
, the join amplification in each actor is reduced by a factor of 7.
Please note that it depends on finding a user_id
column that is highly cardinality and as uniformly distributed as possible.