In our previous blog post and most recent EOS Hot Sauce, we took a look behind the scenes at the collaborative work that’s been happening in the last few weeks between Block.one and the key block producers of the EOS Mainnet. Today we want to take a more technical look at the details of the architectural changes recently deployed by block producers.
Troubleshooting
In mid-January, the EOS Mainnet started experiencing a much larger volume of incoming transactions, which led to a subsequent surge in microforks. The first hypothesis was that the high number of transactions was overwhelming the signing nodes (aka producing nodes).
Key block producers first attempted to increase the CPU speed on signing nodes to compensate for the large transaction volume. This helped, but didn’t fix the problem. The signing nodes were now producing bigger blocks but were still overloaded while trying to process incoming transactions. Furthermore, blocks were too slow to propagate to the next block producer, and the number of microforks remained too high.
After significant investigation, Block.one and key block producers identified the source of the problem: the high number of transactions hitting the signing node from multiple connections meant the node could not work efficiently. The solution required a reduction in the number of connections on the signing node. Ultimately, that meant having only 2 connections.
This is a significant change, as block producers might have had multiple connections to a public P2P cluster, multiple connections to API nodes, and various other connections. All of those needed to be replaced by only 2 connections.
First Connection: Blocks Only
Block producers organized to create a ‘blocks only’ P2P network to quickly deliver blocks between each other. Allowing block producers to ignore transactions means a reduced processing load – and when blocks arrive on time, microforks don’t happen.
So a newly produced block has to go from producer 1 -> block peer 1 -> block peer 2 -> producer 2.
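The path above can be modelled as a small graph. The following is a minimal sketch, not a description of any block producer's real setup: the node names are hypothetical, and the second link on each producer stands in for the transactions connection described later in this post. It checks the two properties discussed so far: a block takes the route producer 1 -> block peer 1 -> block peer 2 -> producer 2, and each signing node has exactly 2 connections.

```python
from collections import deque

# Hypothetical node names; only the roles come from the article.
LINKS = [
    ("producer-1", "block-peer-1"),    # blocks-only connection
    ("block-peer-1", "block-peer-2"),  # blocks-only P2P network between producers
    ("block-peer-2", "producer-2"),    # blocks-only connection
    ("producer-1", "tx-peer-1"),       # transactions connection
    ("producer-2", "tx-peer-2"),       # transactions connection
]


def build_graph(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


GRAPH = build_graph(LINKS)
BLOCK_PATH = shortest_path(GRAPH, "producer-1", "producer-2")
# Number of connections on each signing node.
CONNECTIONS = {
    name: len(peers) for name, peers in GRAPH.items() if name.startswith("producer")
}

print(" -> ".join(BLOCK_PATH))
print(CONNECTIONS)
```

In this model, adding any extra link to a producer would show up immediately as a connection count above 2.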
For this to work properly, a few changes were needed in EOSIO from Block.one:
- Ability to establish bi-directional blocks-only connections (added in EOSIO 2.0.2 only)
- Ability to limit the CPU used to create blocks in the signing node (added in EOSIO 1.8.11 and 2.0.2).
As these changes were made available, block producers quickly deployed them and assigned a CPU-effort value of 50%. This dramatically reduced the number of dropped blocks.
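To give a feel for what a CPU-effort value of 50% means, here is a rough back-of-the-envelope sketch. It assumes the simplified reading that the percentage is the share of the 500 ms EOSIO block interval spent filling a block, with the remainder left for the block to travel to the next producer. The function name is hypothetical, and real nodeos scheduling involves additional offsets that are ignored here.

```python
BLOCK_INTERVAL_MS = 500  # EOSIO produces one block every 0.5 seconds


def production_budget_ms(cpu_effort_percent, block_interval_ms=BLOCK_INTERVAL_MS):
    """Split one block interval into (time spent producing, time left for propagation).

    Simplified model: ignores any other timing offsets a node may apply.
    """
    if not 0 <= cpu_effort_percent <= 100:
        raise ValueError("cpu_effort_percent must be between 0 and 100")
    produce_ms = block_interval_ms * cpu_effort_percent / 100
    return produce_ms, block_interval_ms - produce_ms


for percent in (50, 80, 100):
    produce_ms, spare_ms = production_budget_ms(percent)
    print(f"{percent}% effort: {produce_ms:.0f} ms producing, {spare_ms:.0f} ms left to propagate")
```

Under this reading, a lower value trades block capacity for propagation time, which is consistent with the drop in late blocks described above.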
This architecture is working very well at the time of publication, but a few more adjustments are needed to further reduce the remaining dropped blocks. To guide future changes, Block.one wrote a detailed explanation of how to configure settings to ensure blocks arrive on time.
Second Connection: Transactions
In this new node architecture, block producers also had to create a transactions network. Note: the transactions network also includes blocks.
All incoming transactions, no matter where they are coming from, must consolidate into a single “transaction peer” node that connects to the signing node. This node must handle as many transactions as possible, but also must not overwhelm the signing node with too many transactions so as to allow the signing node to continue producing blocks efficiently.
To handle the incoming transaction volume, block producers might wish to use EOSIO 2.0 with EOS-VM enabled. However, a transaction peer using EOS-VM would overwhelm a block signing node running version 1.8. In this case, an extra “barrier” 1.8 node is needed to slow down the transactions and allow the signing node to function efficiently.
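The effect of the barrier node can be illustrated with a toy model in which each node in the chain forwards at most a fixed number of transactions per second. This is only an analogy for the architecture, not how nodeos works internally, and every number below is invented for illustration rather than measured on the EOS Mainnet.

```python
def delivered_rate(incoming_tps, stage_capacities_tps):
    """Rate that reaches the signing node after passing through each stage in order.

    Each stage forwards at most its own capacity, so the slowest stage sets the rate.
    """
    rate = incoming_tps
    for capacity in stage_capacities_tps:
        rate = min(rate, capacity)
    return rate


# All figures are hypothetical placeholders.
INCOMING_TPS = 5000  # transactions arriving from the public network
TX_PEER_TPS = 4000   # fast transaction peer (2.0 with EOS-VM)
BARRIER_TPS = 1500   # slower barrier node (1.8)

without_barrier = delivered_rate(INCOMING_TPS, [TX_PEER_TPS])
with_barrier = delivered_rate(INCOMING_TPS, [TX_PEER_TPS, BARRIER_TPS])

print(f"without barrier: {without_barrier} tps reach the signing node")
print(f"with barrier:    {with_barrier} tps reach the signing node")
```

The point of the sketch is only that placing a slower node between a fast transaction peer and a 1.8 signing node caps what the signing node has to process.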
While block producers were optimizing the block-only delivery mechanism, the ability to process transactions was temporarily reduced. This negatively impacted some dapps due to transactions getting lost. This should no longer be an issue.
Conclusion
Figuring out the problem and coming up with a solution was a very collaborative effort between the Top 21 and Block.one. This took a tremendous amount of work and coordination on the part of many teams located all around the world. For example, at one point several adjacent block producers sent their logs to Block.one so that issues could be traced from one block producer to the next. Naturally, some teams were more involved than others, but every team contributed what was needed. We also want to give a special shout-out to WhaleEx, who provided great leadership by coming up with the solution and offered a lot of help implementing it.
As a final note, it’s important to know that not all block producers have the exact same configuration. Node topology, CPU speed, transaction load and EOSIO versions vary, as different block producers have different needs and requirements. EOS Nation continues to work with Block.one towards supporting more than 2 connections to the block signing node, so that it can operate with both efficiency and redundancy.
This summary should be helpful for other block producers on EOS, as well as block producers on other EOSIO networks who might want to come up with a similar design to prevent overloading.
Issues That Would Help With Troubleshooting:
GitHub EOSIO Pull Requests:
- Remove new block id notify feature – 2.0 #8471
- Report block header diff when digests do not match – 2.0 #8472
- Drop late blocks – 1.8 #8495
- Drop late blocks – 2.0 #8496
- CPU block effort – 1.8 #8507
- Consolidated Security Fixes for 2.0.1 #8514
- Consolidated Security Fixes for 1.8.10 #8516
- Handle socket close before async callback – 2.0 #8526
- Net plugin dispatch – 2.0 #8546
- Net plugin send priority – 1.8 #8547
- Net plugin unlinkable blocks – 2.0 #8552
- Drop late check – 2.0 #8555
- Read-only with drop-late-block – 2.0 #8557
- Net plugin post – 2.0 #8561
- Delayed production time – 2.0 #8564
- CPU block effort – 2.0 #8571
- Net plugin unlinkable – 1.8 #8572
- CPU effort last block – 1.8 #8574
- CPU effort last block – 2.0 #8578
- P2P read only – 2.0 #8583
- Producer plugin log – 2.0 #8589
- Init net_plugin member variables – 1.8 #8616
- Init net_plugin member variables – 2.0 #8617
- Exit transaction early when insufficient account cpu – 2.0 #8638
- Produce block when max_block_cpu_usage reached #8649
- Produce block immediately if exhausted – 2.0 #8651
Diagram Details
Here are the technical configuration details for each node type:
- Producer Node: 1.8.x or 2.0.x, wabt, speculative, cpu-effort-percent = 50 – 80
- Blocks Peer Node: 2.0.x, eos-vm-jit, read-only, full validation
- Transactions Barrier Node: 1.8.x, wabt, speculative, light validation
- Transactions Peer Node: 2.0.x, eos-vm-jit, speculative, full validation
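As a rough illustration, the shorthand above can be expanded into config.ini-style settings. The sketch below is an assumption about how each item maps onto nodeos option names (`wasm-runtime`, `read-mode`, `validation-mode`, `cpu-effort-percent`), so check it against the `nodeos --help` output for your version before relying on it. The role keys and the `render_config` helper are hypothetical, the value 50 is simply the low end of the 50 – 80 range, and peer addresses and other networking options are left out.

```python
# Assumed mapping from the shorthand above to nodeos config.ini option names.
ROLES = {
    "producer": {
        "wasm-runtime": "wabt",
        "read-mode": "speculative",
        "cpu-effort-percent": "50",  # the article gives a range of 50 to 80
    },
    "blocks-peer": {
        "wasm-runtime": "eos-vm-jit",
        "read-mode": "read-only",
        "validation-mode": "full",
    },
    "transactions-barrier": {
        "wasm-runtime": "wabt",
        "read-mode": "speculative",
        "validation-mode": "light",
    },
    "transactions-peer": {
        "wasm-runtime": "eos-vm-jit",
        "read-mode": "speculative",
        "validation-mode": "full",
    },
}


def render_config(role):
    """Render the settings for one role as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in ROLES[role].items())


for role in ROLES:
    print(f"# {role}")
    print(render_config(role))
    print()
```

Keeping the roles in one table makes it easy to see at a glance that only the producer sets a CPU effort value and only the blocks peer runs in read-only mode.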
EOS Nation is a top 21 Block Producer on the EOS public network. We earn inflation rewards based on the percentage of tokens staked towards us. Those rewards are shared back with token holders through our Proxy4Nation Reward Proxy and also reinvested into EOSIO community, tools, and infrastructure. Help grow the ecosystem by staking your vote to eosnationftw or proxying to proxy4nation.
These updates for EOSNation are awesome and help me believe my investment in EOS is in good hands. It would be great if all BPs published a similar report from each of their perspectives on a regular basis. One of the greatest fuels for FUD in this space (other than plain old lies) is silence. Thanks for your great work and please encourage other BPs to follow your lead in great comms.
Thank you for reading and for the wonderful comment! As for other BPs publishing similar reports, we don’t feel there is that much of a need for it. We are happy to take on this role of “leading communicator” within the Top21.