Many services running on the EOS mainnet have struggled with the recent massive increase in transaction rates and traffic levels, and this past Tuesday (November 26) dfuse also experienced an outage. These changed network conditions revealed a subtle bug in one of our systems and amplified its consequences.
We know that the reliability of our services is of the utmost importance to our users who are building their applications on dfuse, and we take the trust that you place in dfuse extremely seriously. Accordingly, in the interest of transparency, and also because we think it’s an interesting story, we wanted to share this post mortem on the issue we encountered.
The internal stack supporting the dfuse platform is composed of many micro-services, organized into a series of logical layers. These layers range from the low-level data acquisition layer, through storage management and post-processing layers like our dfuse Search indexers, up to our user-facing APIs like the GraphQL and REST layers.
Our low-level data layer is the main entry point for blockchain data (blocks, transactions, etc.) into the dfuse data pipelines. At this layer we not only ingest data, but also aggregate all the extra instrumentation data extracted from blocks and transactions.
The data acquisition layer consists of a number of mission critical services that interact with the native blockchain node processes, receiving raw data from them. If this data flow is interrupted, this causes all of the dependent services to stop receiving data for their own processing. For this reason, we pay special attention to the reliability of this layer.
The EOSIO JSON format has the particularity of using different JSON representations for numbers depending on their size: values that fit in 32 bits are rendered as a numeric type, while larger values are rendered as a string type. This has been well known to us since the early days of dfuse, and we have a special wrapper type to handle this quirk when deserializing these fields.
On Nov 26, 2019 6:49:53 PM EST, at block 91,961,540 on the EOS mainnet, the action_receipt:auth_sequence value went beyond 32 bits, flipping its serialization to a string. Unfortunately, this value was contained within a C++ struct that EOSIO serializes into a JSON list, rather than the more common representation of a struct as a JSON object. In our Go-based deserializer, the code for handling the string representation wasn't included in the specialized AuthSequence parser; in this case it assumed a numeric value (patched here).
At 18:50 EST, blocks stopped flowing into dfuse, and a few minutes later alarms went off in many of our services, as these services were starving for blocks. Note that if, instead of failing, our data ingestion layer had started delivering incorrect data, this would have been trapped by our data integrity protection mechanisms, and the services halted regardless. These mechanisms, which are deployed at numerous locations in our processing pipeline, ensure the integrity of the data you receive as a customer.
It took us roughly 20 minutes to identify the issue, correct it, and build a fixed image of the service involved. Another 15 minutes were needed to complete deployments.
Forty minutes after deploying the patch, a period spent racing through the backlog of blocks that had accumulated since the beginning of the outage, our pipeline was back in sync with live blocks and feeding our various services.
Meanwhile, we discovered that a single block had been dropped during the disruption. Many of our services cannot operate without a strictly continuous sequence of blocks, so another 30 minutes were needed for selective reprocessing to recover that block. Once this hole was filled, all flows were restarted and most services were back up (state database, WebSocket streams, most REST calls) -- all except dfuse Search.
dfuse Search has special requirements, because of its massive indexes and the sheer amount of data moving through the blockchain. Without any help, we knew it would take Search about 2.5 hours to catch up. But we had another trick we could deploy, using a different part of the dfuse stack to help Search catch up faster. However, that component wasn’t ready for the task and required some preparation. While that was happening, Search was catching up on its own, eventually reaching the live point before we were able to ready the other system.
At 23:04 EST, all services had caught up, and the full dfuse platform was back to healthy behaviour, a little over 4 hours after the start of the outage.
While the downtime was unfortunate, we are confident that we are now in a better position to recover from any future issues. We've identified several opportunities for improvement, including changes to our monitoring systems to detect issues even more quickly, and we are working on tweaks to our system architecture to greatly shorten downtime in any future event.
When managing a database like a blockchain -- a public database, where everyone has write permission -- it's likely we will see other unforeseen events, so we are designing for extremely fast recovery. Catch-up time for dfuse Search is the item this event highlighted as most critical, and the team has work in progress on a set of mechanisms to accelerate it further. While we hope these mechanisms will never need to be triggered, we want to improve every aspect of the developer experience, including best-in-class capability to manage and respond to unwanted and unforeseen events.
During the event, we were actively communicating with our users in real time through Telegram. If you ever experience an issue with dfuse, please be sure to check our Telegram channel for updates. We apologize for any inconvenience this may have caused, and we hope that our transparency about this issue helps others who have built their own internal tooling.