Slow Clap is too slow on edge cases #58

Open
opened 2025-10-15 16:46:16 +02:00 by st1nky · 1 comment
Owner

Slow Clap Description

The validator list can change with each session, so the original concept was to divide each clap round based on session timeframes. In other words, once a transaction appears on the foreign chain, we establish the current session along with the corresponding list of validators responsible for claps. Each validator maintains its own specific known EVM block, which represents the latest block on the foreign chain and is solely recognized by that validator.

==>  session T+0  ==> session T+1   ==>  session T+2  ==> ... == session T+n
==> validators K1 ==> validators K2 ==> validators K3 ==> ... == validators Kn
==>   no evm txs  ==>   EVM tx 1    ==>  no evm txs   ==> ... == no evm txs 

As you can see from above, EVM tx #1 should be validated in session T+1 by validators K2 only, because any other validator set could differ.

Session Fold

The issue arose due to the overly strict timeframe measurements and the unpredictability of the latest block for each validator. This problem occurred several times during bridging transactions when a transaction was sent at the end of a session. An example is provided below:

==>  session T+0  ==> session T+1   ==>  session T+2  ==> ... == session T+n
==> validators K1 ==> validators K2 ==> validators K3 ==> ... == validators Kn
==>   no evm txs  ==>   EVM tx 1    ==>   EVM tx 1    ==> ... == no evm txs 

This issue can only occur if a bridging transaction is executed right before a session rotation, which lasts approximately 15 minutes, depending on the max range of latest block among validators, which is unknown to anyone. Specifically, the bridging transaction could receive claps in both session T+1 and session T+2 simultaneously. For example:

EVM tx 1 is sent at 11:45 AM

Validator K2 = [1, 2, 3, 4, 5]
Validator K3 = [1, 2, 3, 4, 5]

Validators 1, 2, 3 has delay 5 mins
Validators 4, 5 has delay 20 mins

In session T+1 claps for EVM tx 1  from 1, 2, 3.
Session T+1 changed to session T+2 at 00:00 PM 
In session T+2 claps for EVM tx 1  from 4, 5

Applause threshold = total validators * 2 / 3 + 1 = 5 * 2 / 3 + 1 = 4

In both sessions, no consensus was reached, even though the transaction was fully covered by all validators. This highlights the need to address and manage such situations effectively.

Potential Solutions

Given that this issue occurs too frequently - especially with a larger set of validators - I believe we need to make the system "fail-proof" to ensure that cross-session consensus is always achieved. This will help prevent similar problems in the future.

We can rely on

  • Validators in each session will have intersection with validators from previous session.
  • Unused block number from the transaction receipt.
  • Validator could not withdraw staking bond instantly, for now it's one era aka 24 hours.

Invlolved Code

  • Validation of transaction happens here.
  • During transaction execution check could be found here.
  • Offchain worker is fully dependant on session here.

From my perspective, there's no need to alter the off-chain behavior, as the primary issue lies with the authority in session checks. Therefore, I propose the following measures:

  • Do NOT change the offchain worker's logic.
  • Do NOT change the check of transaction validity.
  • Update the logic of try_slow_clap function only.

Steps needed for the fix of try_slow_clap :

  1. Get the T::AuthorityId with help of clap.session_index and clap.authority_index.
  2. Find the authority_index in previous session.
  3. Check if ReceivedClaps already contains proposed transaction metadata in previous session.
  4. If yes, return previous_authority_index and previous_session.
  5. Otherwise, stick with current values from Clap structure.
# Slow Clap Description The validator list can change with each session, so the original concept was to divide each clap round based on session timeframes. In other words, once a transaction appears on the foreign chain, we establish the current session along with the corresponding list of validators responsible for claps. Each validator maintains its own specific `known EVM block`, which represents the latest block on the foreign chain and is solely recognized by that validator. ```bash ==> session T+0 ==> session T+1 ==> session T+2 ==> ... == session T+n ==> validators K1 ==> validators K2 ==> validators K3 ==> ... == validators Kn ==> no evm txs ==> EVM tx 1 ==> no evm txs ==> ... == no evm txs ``` As you can see from above, `EVM tx #1` should be validated in `session T+1` by `validators K2` only, because any other validator set could differ. # Session Fold The issue arose due to the overly strict timeframe measurements and the unpredictability of the `latest block` for each validator. This problem occurred several times during bridging transactions when a transaction was sent at the end of a session. An example is provided below: ```bash ==> session T+0 ==> session T+1 ==> session T+2 ==> ... == session T+n ==> validators K1 ==> validators K2 ==> validators K3 ==> ... == validators Kn ==> no evm txs ==> EVM tx 1 ==> EVM tx 1 ==> ... == no evm txs ``` This issue can only occur if a bridging transaction is executed right before a session rotation, which lasts approximately 15 minutes, depending on the max range of `latest block` among validators, which is unknown to anyone. Specifically, the bridging transaction could receive claps in both `session T+1` and `session T+2` simultaneously. For example: ```bash EVM tx 1 is sent at 11:45 AM Validator K2 = [1, 2, 3, 4, 5] Validator K3 = [1, 2, 3, 4, 5] Validators 1, 2, 3 has delay 5 mins Validators 4, 5 has delay 20 mins In session T+1 claps for EVM tx 1 from 1, 2, 3. Session T+1 changed to session T+2 at 00:00 PM In session T+2 claps for EVM tx 1 from 4, 5 Applause threshold = total validators * 2 / 3 + 1 = 5 * 2 / 3 + 1 = 4 ``` In both sessions, no consensus was reached, even though the transaction was fully covered by all validators. This highlights the need to address and manage such situations effectively. # Potential Solutions Given that this issue occurs too frequently - especially with a larger set of validators - I believe we need to make the system "fail-proof" to ensure that cross-session consensus is always achieved. This will help prevent similar problems in the future. ## We can rely on * Validators in each session will have intersection with validators from previous session. * Unused `block number` from the transaction receipt. * Validator could not withdraw staking bond instantly, for now it's one era aka 24 hours. ## Invlolved Code * Validation of transaction happens [here](https://git.ghostchain.io/ghostchain/ghost-node/src/commit/fae0fa4d7be5aeb0c05ac54c3cd7acb024ad324c/pallets/slow-clap/src/lib.rs#L382). * During transaction execution check could be found [here](https://git.ghostchain.io/ghostchain/ghost-node/src/commit/fae0fa4d7be5aeb0c05ac54c3cd7acb024ad324c/pallets/slow-clap/src/lib.rs#L464). * Offchain worker is fully dependant on session [here](https://git.ghostchain.io/ghostchain/ghost-node/src/commit/fae0fa4d7be5aeb0c05ac54c3cd7acb024ad324c/pallets/slow-clap/src/lib.rs#L614). From my perspective, there's no need to alter the off-chain behavior, as the primary issue lies with the authority in session checks. Therefore, I propose the following measures: * Do NOT change the offchain worker's logic. * Do NOT change the check of transaction validity. * Update the logic of `try_slow_clap` function only. Steps needed for the fix of `try_slow_clap` : 0. Get the `T::AuthorityId` with help of `clap.session_index` and `clap.authority_index`. 1. Find the `authority_index` in previous session. 2. Check if `ReceivedClaps` already contains proposed transaction metadata in previous session. 3. If yes, return `previous_authority_index` and `previous_session`. 4. Otherwise, stick with current values from `Clap` structure.
Owner

Potential challenges in backward compatibility

Transaction pool

Currently, we are removing all claps from the transaction pool where the session_index does not match the current one. Based on your proposal, we should implement a similar check. This would prevent issues if a transaction remains in the pool for too long and the session_index increments. However, the likelihood of this occurring is quite low, so it might be more efficient to maintain the current approach to avoid unnecessary computational overhead.

Transaction validity

Currently, the constant is set to LOCK_BLOCK_EXPIRATION: u64 = 10;. We should assess whether extending this limit is necessary or if keeping the expiration at 10 blocks is sufficient for transactions to remain in the pool. This question is tightly coupled with previous one.

Slashing Logic

Based on your proposal, we will extend the session to T+1. This means that by the end of T+1, we can perform checks for session T, specifically examining session T-1 for slashing and disabling logic. This approach should be acceptable, as validators are required to wait a minimum of one era to withdraw funds, which corresponds to a single session delay.

Incorrect threshold

Look at this code. It appears that the current threshold does not take into account validators that have already been disabled. A more effective approach would be to utilize ClapsInSession. Here’s an example to illustrate this:

10 validators
4 disabled validators

Applause threshold = 10 * 2 / 3 + 1 = 7 (incorrect)
Corrected applause threshold = (10 - 4) * 2 / 3 + 1 = 5
## Potential challenges in backward compatibility ### Transaction pool Currently, we are removing all claps from the transaction pool where the `session_index` does not match the current one. Based on your proposal, we should implement a similar check. This would prevent issues if a transaction remains in the pool for too long and the `session_index` increments. However, the likelihood of this occurring is quite low, so it might be more efficient to maintain the current approach to avoid unnecessary computational overhead. ### Transaction validity Currently, the constant is set to LOCK_BLOCK_EXPIRATION: u64 = 10;. We should assess whether extending this limit is necessary or if keeping the expiration at 10 blocks is sufficient for transactions to remain in the pool. This question is tightly coupled with previous one. ### Slashing Logic Based on your proposal, we will extend the session to T+1. This means that by the end of T+1, we can perform checks for session T, specifically examining session T-1 for slashing and disabling logic. This approach should be acceptable, as validators are required to wait a minimum of one era to withdraw funds, which corresponds to a single session delay. ### Incorrect threshold [Look at this code](https://git.ghostchain.io/ghostchain/ghost-node/src/commit/fae0fa4d7be5aeb0c05ac54c3cd7acb024ad324c/pallets/slow-clap/src/lib.rs#L519). It appears that the current threshold does not take into account validators that have already been disabled. A more effective approach would be to utilize `ClapsInSession`. Here’s an example to illustrate this: ``` 10 validators 4 disabled validators Applause threshold = 10 * 2 / 3 + 1 = 7 (incorrect) Corrected applause threshold = (10 - 4) * 2 / 3 + 1 = 5 ```
Sign in to join this conversation.
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: ghostchain/ghost-node#58
No description provided.