Validators are the fundamental pillars of every Proof of Stake (PoS) blockchain: the performance and security of a network rely heavily on the efficient and responsible operation of its validators. As a validator ourselves, we would like to shed some light on the inner workings of this industry. The role is a huge commitment, requiring us to uphold the integrity of multiple blockchain ecosystems. Our team of expert Site Reliability Engineers, located across multiple time zones, works to guarantee 99.99% uptime and unwavering support. How they do it is what we explore in this article. We take you on a guided tour, providing a sneak peek into the workings of a staking provider, so you can get a clearer picture of the staking world and see how our team, and our friends in the industry, strive for optimum performance.
Before setting up a validator node, it is essential to have a comprehensive understanding of validating on the blockchain in question. At Luganodes, our SRE team works closely with our research team to document critical information about the network and its operation. To ensure a structured approach, we ask the following questions:
- How does the network work and what makes it special?
- How do the delegation/undelegation/redelegation time periods and amounts work?
- How long is an epoch and how long does it take for a validator to go live?
- How many tokens do you need to create a validator and where can one obtain them?
- What is the minimum self-stake, if any?
- What is the necessary delegation amount to become active?
- What are the validator hardware requirements?
- What are the slashing criteria, if any?
These questions help us understand the economic feasibility of running a validator on the blockchain in question.
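The economics behind these questions come down to simple arithmetic: expected commission income versus the cost of running the infrastructure. Below is a minimal Python sketch of such a feasibility estimate; all figures (stake, APR, commission, costs) are hypothetical placeholders, not any real network's parameters.

```python
# Back-of-envelope feasibility check for running a validator.
# Every figure below is a hypothetical placeholder.

def annual_profit(delegated_stake: float,
                  network_apr: float,
                  commission_rate: float,
                  yearly_infra_cost: float) -> float:
    """Estimate yearly validator profit in token terms.

    delegated_stake   -- total tokens delegated to the validator
    network_apr       -- staking rewards rate, e.g. 0.08 for 8%
    commission_rate   -- validator's cut of delegator rewards
    yearly_infra_cost -- hardware + monitoring costs, in tokens
    """
    gross_rewards = delegated_stake * network_apr
    commission_income = gross_rewards * commission_rate
    return commission_income - yearly_infra_cost

# Example: 1M tokens delegated, 8% APR, 5% commission, 3,000 tokens/year in costs
profit = annual_profit(1_000_000, 0.08, 0.05, 3_000)
print(round(profit, 2))  # 1000.0
```

If the result is negative at realistic delegation levels, the chain may not be economically viable to validate.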
After performing the due diligence, our operations team conducts a dry run to set up a validator — somewhat like a rehearsal dinner. This test node setup process receives the same level of meticulous documentation as a production node. The operations team diligently reviews and comprehends the documentation, following each step to establish a validator on the blockchain’s testnet. Throughout this process, they take notes on any unique concepts encountered during the validator setup. Here is a snippet of the documentation that our SREs were kind enough to share.
This is an example of a test node that was set up before the Cardano hard fork, to test the protocol version upgrade to 8.1.1.
In the snippet above, the node operators aim to deploy binaries built from source along with their dependencies, e.g. libsodium (a version forked by IOHK).
As evident from our setup process – comprehensive documentation plays a vital role in ensuring the long-term success of a validator setup. Our operations team at Luganodes understands the importance of meticulous documentation and goes the extra mile to include every little detail, even those that might have been overlooked in the official validator documentation.
When documenting the validator setup process, we cover various aspects, including validator upgrades, validator log checks, validator identification information, and any other pertinent information that facilitates a thorough understanding of the setup. This detailed documentation gives our new SREs a seamless onboarding workflow, equipping them with the knowledge needed to navigate and comprehend the intricacies of our validator setup process.
Despite a meticulous setup of validator nodes, it is not uncommon for them to encounter connection errors, memory errors, and other network-specific issues. To tackle these challenges, we have implemented a robust monitoring system utilizing Grafana OnCall.
Grafana OnCall is an open-source incident response management tool. It integrates with a wide range of monitoring systems, including Prometheus, Alertmanager, and Zabbix, which allows us to centralize all of our alerts in one place, making it easy to see what's happening and who needs to be notified. Grafana OnCall has a plethora of features that we utilize daily to make sure alerts are handled in a timely manner.
By filtering and prioritizing alerts, we ensure that only the most important and urgent alerts reach the relevant team members. This helps us maintain an efficient response system while minimizing distractions caused by non-essential notifications.
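As a rough illustration of such filtering, the routing decision can be reduced to a severity lookup. The alert names and destinations below are hypothetical, and in practice this logic lives in Grafana OnCall's routes and escalation chains rather than in custom code.

```python
# Minimal sketch of severity-based alert routing.
# Alert names, severity sets, and destinations are hypothetical.

CRITICAL = {"node_down", "missed_blocks", "disk_full"}
WARNING = {"high_memory", "peer_count_low"}

def route_alert(alert_name: str) -> str:
    """Decide who gets notified for a given alert."""
    if alert_name in CRITICAL:
        return "page-oncall"      # page the on-call engineer immediately
    if alert_name in WARNING:
        return "team-channel"     # post to the team chat, no page
    return "log-only"             # record it without notifying anyone

print(route_alert("missed_blocks"))  # page-oncall
print(route_alert("high_memory"))    # team-channel
```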
In addition, we proactively set up secondary monitoring for each chain, or even develop customized monitoring solutions tailored to specific chains if necessary. Secondary monitoring is a fail-safe mechanism, kept separate from the primary monitoring stack, that can track similar or different metrics of the concerned nodes.
This is usually achieved by monitoring third-party nodes via remote procedure calls (RPCs) and comparing their metrics with those of our own nodes. Thresholds are set in place to fire alerts when incidents occur.
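A minimal sketch of that comparison, assuming block height as the metric: the threshold and heights below are hypothetical, and a real check would fetch both heights over the nodes' RPC endpoints.

```python
# Sketch of a secondary monitoring check: compare our node's block height
# against a third-party reference node and flag lag beyond a threshold.
# The threshold and sample heights are hypothetical.

MAX_LAG = 5  # blocks

def is_lagging(our_height: int, reference_height: int, max_lag: int = MAX_LAG) -> bool:
    """Return True if our node has fallen too far behind the reference node."""
    return reference_height - our_height > max_lag

# Reference node 12 blocks ahead of ours -> alert should fire
print(is_lagging(our_height=1_000_000, reference_height=1_000_012))  # True
print(is_lagging(our_height=1_000_000, reference_height=1_000_003))  # False
```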
We also self-host services like Tenderduty, UptimeRobot, Watchtower, Suibot, Netdata (for MultiversX), and various other community-driven initiatives and widely used secondary monitoring solutions.
This approach allows us to distribute reliability throughout our setups, enhancing the overall stability and performance of our validator nodes.
Peer Reviewing Setup
After the aforementioned steps are implemented, the whole setup is reviewed by the operations team, with every member lending their expertise to ensure the sanity of the setup. Additionally, a separate team of senior engineers tests the setup against a standard procedure. This helps us confirm the longevity and robustness of the setup.
The job is not over after a validator node is set up and reinforced with our scrupulous alerting system! With frequent releases of bug fixes, security enhancements, and feature additions, it is imperative that our nodes are always up to date. To this end, our SRE team has set up a monitoring system to track the latest releases for our validator nodes, and we strive to perform these upgrades promptly after release, with zero downtime.
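The core of such release tracking is a version comparison between the running binary and the latest published tag. The sketch below shows only that comparison; in production the latest tag would be fetched from the project's release feed (e.g. a GitHub releases endpoint), which is omitted here.

```python
# Sketch of release tracking: compare the running binary's version tag
# against the latest published tag. Tags here are illustrative.

def parse_version(tag: str) -> tuple:
    """Turn a tag like 'v1.9.2' into (1, 9, 2) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

def needs_upgrade(running: str, latest: str) -> bool:
    """True if the published version is newer than what we run."""
    return parse_version(latest) > parse_version(running)

print(needs_upgrade("v1.9.2", "v1.10.0"))   # True
print(needs_upgrade("v1.10.0", "v1.10.0"))  # False
```

Numeric comparison matters here: a naive string comparison would wrongly rank "v1.9.2" above "v1.10.0".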
Optimization through Automation
By now it should be clear that node setup and maintenance is a task that requires skill, patience and grit, along with continuous 24×7 monitoring. As in all other fields, the optimal setup must have a degree of automation to account for human limitations.
We make use of Docker images as a toolchain for building network-specific binaries. Docker is a containerization tool: it allows developers to package applications and their dependencies into isolated, lightweight containers that run consistently across systems. Docker images are standalone, executable software packages that include everything needed to run a piece of software, including the code, runtime, libraries, and system tools. Building binaries refers to compiling source code into executables for a specific network, so Docker makes the process of building and running node software for various networks reproducible and straightforward.
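As a hedged sketch of the idea, a multi-stage Dockerfile can pin the build toolchain and ship only the resulting binary. The repository, binary name, and versions below are placeholders, not our actual build configuration.

```dockerfile
# Hypothetical multi-stage build: compile in a pinned toolchain image,
# then ship only the resulting binary in a slim runtime image.
FROM golang:1.21 AS builder
WORKDIR /src
# Placeholder repository; substitute the network's actual node source
RUN git clone https://github.com/example/examplechain . && \
    make build

FROM debian:bookworm-slim
COPY --from=builder /src/build/exampled /usr/local/bin/exampled
ENTRYPOINT ["exampled"]
```

Pinning the builder image means every SRE, and every CI run, produces the binary with the same compiler and dependencies.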
We use Ansible, a powerful automation tool, to manage various aspects of Docker containers and dockerized environments. Ansible is agentless: it does not require any software to be installed on the managed hosts. Instead, it uses SSH (Secure Shell) to connect to remote systems and execute commands, making it easy to manage a diverse range of systems. Ansible uses playbooks, which are YAML files containing a set of instructions (tasks) that define the desired state of systems or configurations.
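A playbook along these lines could keep a dockerized node running at a pinned version. The host group, image, container name, and paths below are placeholders, not our actual configuration.

```yaml
# Hypothetical playbook sketch: ensure a dockerized node container is
# running with a pinned image tag.
- name: Manage validator container
  hosts: validators
  become: true
  tasks:
    - name: Run node container at the pinned version
      community.docker.docker_container:
        name: examplechain-node
        image: examplechain/node:v1.10.0
        state: started
        restart_policy: unless-stopped
        volumes:
          - /data/examplechain:/root/.examplechain
```

Because the playbook declares a desired state rather than a sequence of commands, re-running it is safe: Ansible only acts when the host drifts from that state.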
Automating Key-management and Backups
Key management is an important part of maintaining the security and integrity of digital assets, and automation is necessary to make it fool-proof with multiple layers of security. Private keys, being the most sensitive pieces of information, require proper protection; we use remote signers for this very purpose. A remote signer secures transaction signing without exposing the private keys to the online environment, acting as a safe intermediary.
A remote signer accesses the offline private keys and signs the transactions it receives. Transactions get signed and sent back while the private keys remain in their safe, isolated, offline environment, which leads us to the next point: hardware wallets.
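The boundary a remote signer enforces can be illustrated conceptually: callers hand over payloads and receive signatures, but the key material itself is never returned. The Python sketch below uses HMAC-SHA256 purely as a stand-in for a real consensus signature scheme such as Ed25519, and the key and payload are placeholders.

```python
# Conceptual sketch of the remote-signer boundary: the signer holds the
# key and returns signatures, never the key itself. HMAC-SHA256 stands
# in for a real signature scheme.
import hashlib
import hmac

class RemoteSigner:
    def __init__(self, private_key: bytes):
        self.__key = private_key  # kept private; never returned to callers

    def sign(self, payload: bytes) -> str:
        """Return a signature for the payload; the key never leaves this object."""
        return hmac.new(self.__key, payload, hashlib.sha256).hexdigest()

signer = RemoteSigner(b"hypothetical-secret-key")
sig = signer.sign(b"block-proposal-bytes")
print(len(sig))  # 64 hex characters for a SHA-256 digest
```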
These private keys are stored in a hardware wallet as an extra layer of security, decreasing the risk of unauthorised access or theft. Oftentimes these wallets also require a physical confirmation, adding another layer of security.
Coming to backups, an essential feature is snapshotting. Snapshotting has an important role to play in the blockchain world: a snapshot is a comprehensive record of the blockchain ledger, encompassing all current addresses along with their associated data, such as transactions, fees, balances, metadata, and additional relevant information. While syncing the entire blockchain from the genesis block consumes large amounts of time and storage, a snapshot lets a node catch up to the network by loading the most recent state. A node operator can even choose to use third-party snapshotting services. We choose to maintain our own snapshots on multiple chains and also provide snapshots to the communities.
Chains such as Sui and Aptos store snapshots on Amazon’s Simple Storage Service (S3) – similar instances can be used for backup and storing snapshots.
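One piece of backup hygiene worth sketching is checksum verification before restoring from a snapshot. The helper below is an illustrative example; the snapshot path and published checksum in the usage note are placeholders.

```python
# Sketch of snapshot integrity checking: verify a downloaded snapshot's
# SHA-256 checksum against the published value before restoring from it.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte snapshots fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_snapshot(path: str, published_checksum: str) -> bool:
    """True only if the local file matches the published checksum."""
    return sha256_of(path) == published_checksum

# Usage (placeholder values):
# ok = verify_snapshot("/data/snapshot.tar.lz4", "abc123...")
```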
Automating Node Monitoring
Probably the biggest challenge in node operations is monitoring and ensuring maximum node uptime. Automation and layered monitoring setups streamline this tedious task for our SREs so that every alert is properly addressed. For this purpose, we have a status-check solution called the Unified Dashboard, which provides a global view of Luganodes' entire infrastructure. We also utilize monitoring tools such as Tenderduty for Tendermint-based chains. These services monitor metrics via the RPCs exposed by nodes, checking for missed blocks and similar conditions.
Moreover, for every network that we set up, we have dedicated Ansible scripts in place that check the health and performance status of every node in the infrastructure.
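A health check of that sort might look like the following sketch. The endpoint and JSON shape assume a Tendermint-style node's /status RPC; the host group is a placeholder.

```yaml
# Hypothetical health-check tasks: query each node's status RPC and fail
# if the node reports that it is still catching up.
- name: Check node health
  hosts: validators
  tasks:
    - name: Query node status endpoint
      ansible.builtin.uri:
        url: "http://localhost:26657/status"
        return_content: true
      register: status

    - name: Fail if the node is still syncing
      ansible.builtin.assert:
        that:
          - not (status.json.result.sync_info.catching_up | bool)
```

Run on a schedule, a playbook like this turns a manual round of node checks into a single command whose failures point directly at the unhealthy hosts.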
That was quite a process: after multiple steps, each with nuances of its own, the nodes have been meticulously set up and are being maintained by the SREs. This process is similar across staking providers, with diligent engineers working to keep the chains running. Even as we compete for the best uptimes and service, all validators are cornerstones of Proof of Stake networks, and together they underpin the greater decentralized future.
Speaking from a validator's perspective, the effort put into this is truly rewarding, as our unrelenting performance is acknowledged through the continuous trust and support of institutions and individuals. These efforts have led Luganodes to achieve the acclaimed AAA rating from Staking Rewards, recognising our provision of institutional-grade staking node services.