Slow Clap is too slow on edge cases #58
Labels
No Label
bug
duplicate
enhancement
help wanted
invalid
question
wontfix
bug
duplicate
enhancement
help wanted
invalid
question
wontfix
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: ghostchain/ghost-node#58
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Slow Clap Description
The validator list can change with each session, so the original concept was to divide each clap round based on session timeframes. In other words, once a transaction appears on the foreign chain, we establish the current session along with the corresponding list of validators responsible for claps. Each validator maintains its own specific
known EVM block
, which represents the latest block on the foreign chain and is solely recognized by that validator.As you can see from above,
EVM tx #1
should be validated insession T+1
byvalidators K2
only, because any other validator set could differ.Session Fold
The issue arose due to the overly strict timeframe measurements and the unpredictability of the
latest block
for each validator. This problem occurred several times during bridging transactions when a transaction was sent at the end of a session. An example is provided below:This issue can only occur if a bridging transaction is executed right before a session rotation, which lasts approximately 15 minutes, depending on the max range of
latest block
among validators, which is unknown to anyone. Specifically, the bridging transaction could receive claps in bothsession T+1
andsession T+2
simultaneously. For example:In both sessions, no consensus was reached, even though the transaction was fully covered by all validators. This highlights the need to address and manage such situations effectively.
Potential Solutions
Given that this issue occurs too frequently - especially with a larger set of validators - I believe we need to make the system "fail-proof" to ensure that cross-session consensus is always achieved. This will help prevent similar problems in the future.
We can rely on
block number
from the transaction receipt.Invlolved Code
From my perspective, there's no need to alter the off-chain behavior, as the primary issue lies with the authority in session checks. Therefore, I propose the following measures:
try_slow_clap
function only.Steps needed for the fix of
try_slow_clap
:T::AuthorityId
with help ofclap.session_index
andclap.authority_index
.authority_index
in previous session.ReceivedClaps
already contains proposed transaction metadata in previous session.previous_authority_index
andprevious_session
.Clap
structure.Potential challenges in backward compatibility
Transaction pool
Currently, we are removing all claps from the transaction pool where the
session_index
does not match the current one. Based on your proposal, we should implement a similar check. This would prevent issues if a transaction remains in the pool for too long and thesession_index
increments. However, the likelihood of this occurring is quite low, so it might be more efficient to maintain the current approach to avoid unnecessary computational overhead.Transaction validity
Currently, the constant is set to LOCK_BLOCK_EXPIRATION: u64 = 10;. We should assess whether extending this limit is necessary or if keeping the expiration at 10 blocks is sufficient for transactions to remain in the pool. This question is tightly coupled with previous one.
Slashing Logic
Based on your proposal, we will extend the session to T+1. This means that by the end of T+1, we can perform checks for session T, specifically examining session T-1 for slashing and disabling logic. This approach should be acceptable, as validators are required to wait a minimum of one era to withdraw funds, which corresponds to a single session delay.
Incorrect threshold
Look at this code. It appears that the current threshold does not take into account validators that have already been disabled. A more effective approach would be to utilize
ClapsInSession
. Here’s an example to illustrate this: