Blockchains are a good way to secure data, but they also add complications for retrieving that data. In this article we’ll explore some of the technical challenges surrounding data storage inside blockchains and how solving these challenges can create better user experiences.
Data and State
Business applications rarely store records only once. Normally records are added, edited, and deleted. The records are stored in databases, and those databases change over time.
State is a particular configuration of data. As changes are made to data, that data transitions from one state to another. A stored state is a snapshot. There are three primary methods for storing state.
State Storage Methods
1. Mutable Single Snapshot
The first method is mutable: storing the latest state as a single snapshot. Any time the data changes, the current snapshot is edited in place, resulting in a new and slightly different state.
The main problem with method #1 is there is no history. As the single snapshot is modified, any previous state is lost.
2. Immutable Snapshots
The second method is immutable: storing each state as a separate snapshot. Any time a change is made to the data, a new snapshot is created and saved, leaving any previous snapshots untouched. This method requires a storage capacity roughly equal to the size of the data set multiplied by the number of times the data is changed.
The cost of method #2 can be acceptable if the data set is small and the rate of change is low, but for millions of rapidly modified records, method #2 can lead to astronomical costs and business scaling problems.
3. Immutable Transitions
The third method is also immutable: storing the differences between each state and its previous state. Each of these entries is a transition.
This is the most space-efficient method of storing historical data. A data set could contain millions of rows, but if only one row is changed, only one record needs to be created to reflect that change, all while preserving history.
The downside of method #3 is that it has no readily available snapshots. To acquire a snapshot of the state at a particular point in time, the sequence of transitions needs to be replayed from the beginning up to that point. Doing so can be slow and expensive.
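The replay cost described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Transition` shape (a key plus a new value, with `None` meaning deletion) and the `replay` helper are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transition format: (key, new_value).
# A value of None means "delete this key".
Transition = Tuple[str, Optional[int]]


def replay(transitions: List[Transition], upto: int) -> Dict[str, int]:
    """Rebuild the snapshot that exists after the first `upto` transitions."""
    state: Dict[str, int] = {}
    for key, value in transitions[:upto]:
        if value is None:
            state.pop(key, None)  # deletion
        else:
            state[key] = value    # insert or update
    return state


# Four transitions are stored; no snapshot is stored anywhere.
log: List[Transition] = [
    ("alice", 10),
    ("bob", 5),
    ("alice", 7),
    ("bob", None),
]

snapshot_after_two = replay(log, 2)    # {"alice": 10, "bob": 5}
latest_snapshot = replay(log, len(log))  # {"alice": 7}
```

Every call to `replay` walks the log from the start, so the cost of obtaining a snapshot grows with the length of the history. That is the trade-off that makes cached snapshots attractive alongside the transition log.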
In practice, systems often use combinations of these three methods. For example, some aspects of the data can be stored in transitions, and one or more snapshots can be maintained.
Even though methods #2 and #3 are labeled immutable, they are only immutable in the sense that the intended process for managing them performs no mutation. That does not prevent someone from directly mutating previous records and causing security vulnerabilities.
Enter the Blockchain
Securing Your Data
Blockchains add security to data storage because they are immutable and self-verifiable. Any modification to a previous record will result in identifiable discrepancies between the modified record and other links of the chain. While blockchains do not directly help repair such modifications, they do act as an effective alarm.
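The alarm behavior can be illustrated with a toy hash chain. The sketch below is an assumption-laden simplification: the record layout, the JSON serialization, and the function names are invented for this example and do not match the block format of any real blockchain.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash a record's payload together with the hash of the previous record."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


def build_chain(payloads: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Link each payload to its predecessor by hash."""
    chain: List[Dict[str, Any]] = []
    prev = GENESIS_HASH
    for payload in payloads:
        current = record_hash(prev, payload)
        chain.append({"prev_hash": prev, "payload": payload, "hash": current})
        prev = current
    return chain


def first_invalid_index(chain: List[Dict[str, Any]]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS_HASH
    for i, rec in enumerate(chain):
        expected = record_hash(prev, rec["payload"])
        if rec["prev_hash"] != prev or rec["hash"] != expected:
            return i
        prev = rec["hash"]
    return None


chain = build_chain([{"amount": 1}, {"amount": 2}, {"amount": 3}])
untampered = first_invalid_index(chain)  # None: every link verifies

chain[1]["payload"]["amount"] = 200      # someone edits an old record
tampered = first_invalid_index(chain)    # 1: the edited record no longer matches
```

Verification reports where the chain broke, but it cannot restore the original value. Recovery has to come from somewhere else, such as other copies of the chain.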
So which storage method do blockchains use? Blockchain data is immutable, which rules out method #1. Blockchains often store data sets with millions of records, ruling out method #2. This leaves method #3. Blockchains store data as an immutable sequence of transitions.
A blockchain is composed of blocks. Each block represents a single state at a particular time. Blocks contain transactions, which are the transitions between blocks. Each transaction describes a data change between two blocks. Most blockchains allow a block to have zero transactions. An empty block has the exact same state as the previous block.
Blockchain nodes perform two types of validation: Merkle proofing and transition validation.
Merkle proofing ensures that the content of a block and the reference to its previous block are both valid. Merkle proofing can be performed while the blockchain data is in its natural state: a sequence of transitions.
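As a rough sketch of the content check, the following Python computes a Merkle root over a block's transactions. It assumes a single SHA-256 per node and duplicates the last hash when a level has an odd number of nodes. Real chains differ in the details (Bitcoin, for example, uses double SHA-256), so treat this as an illustration of the idea rather than a compatible implementation.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transactions: List[bytes]) -> bytes:
    """Fold transaction hashes pairwise until a single root hash remains."""
    if not transactions:
        return sha256(b"")  # assumption: an empty block hashes the empty string
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [
            sha256(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


transactions = [b"tx-a", b"tx-b", b"tx-c"]
stored_root = merkle_root(transactions)

# A validator recomputes the root from the block's content and compares it
# with the root recorded in the block header.
content_is_valid = merkle_root(transactions) == stored_root
tamper_detected = merkle_root([b"tx-a", b"tx-X", b"tx-c"]) != stored_root
```

The check needs only the block's own transactions and header, which is why it works directly on the sequence of transitions without any snapshot of state.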
Transition validation ensures that the changes in state between two consecutive blocks do not violate business logic. The simplest and most common business logic for a blockchain is that a transaction cannot send an amount that exceeds the balance of the sending address.
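The balance rule can be sketched as a function that takes the previous snapshot and a block's transactions and either returns the next snapshot or rejects the block. The `Transfer` type and `apply_block` name are hypothetical, and the account-balance model is a simplification chosen for this example.

```python
from typing import Dict, List, NamedTuple


class Transfer(NamedTuple):
    sender: str
    recipient: str
    amount: int


def apply_block(balances: Dict[str, int], transfers: List[Transfer]) -> Dict[str, int]:
    """Return the next snapshot, or raise ValueError if any transfer breaks the rules."""
    next_balances = dict(balances)  # leave the previous snapshot untouched
    for t in transfers:
        if t.amount <= 0:
            raise ValueError(f"non-positive amount: {t}")
        if next_balances.get(t.sender, 0) < t.amount:
            raise ValueError(f"insufficient balance: {t}")
        next_balances[t.sender] -= t.amount
        next_balances[t.recipient] = next_balances.get(t.recipient, 0) + t.amount
    return next_balances


previous_state = {"alice": 10}
next_state = apply_block(previous_state, [Transfer("alice", "bob", 4)])
# next_state == {"alice": 6, "bob": 4}

try:
    apply_block(next_state, [Transfer("bob", "carol", 5)])
    overspend_rejected = False
except ValueError:
    overspend_rejected = True  # bob holds 4 and cannot send 5
```

Note that the function cannot run without `previous_state`. Unlike the hash checks, this kind of validation depends on having a snapshot available.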
Transition validation requires more readily available data than Merkle proofing. It requires at least a snapshot of the previous state and, in Bitcoin’s case, indexed historical data of unspent transaction outputs.
Validation has been the primary need dictating the data querying features of a blockchain node. Such features work great for processing transactions and reaching consensus, but leave a void for business applications. The data needs of a business application go far beyond the data needs of block validation. Business applications need indexed snapshots and indexed historical data so they can promptly answer questions like, “Did Customer X send me a payment yesterday?” or “How much money have I sent to address Y?” Blockchain validation does not need to answer these questions.
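To make the two example questions concrete, here is a small sketch of the kind of index a business application would want. The `Payment` record, the index keyed by sender and recipient, and the integer timestamps are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, NamedTuple, Tuple


class Payment(NamedTuple):
    sender: str
    recipient: str
    amount: int
    timestamp: int  # assumed: seconds since some epoch


PaymentIndex = DefaultDict[Tuple[str, str], List[Payment]]


def build_index(payments: Iterable[Payment]) -> PaymentIndex:
    """Group payments by (sender, recipient) so lookups avoid scanning the chain."""
    index: PaymentIndex = defaultdict(list)
    for p in payments:
        index[(p.sender, p.recipient)].append(p)
    return index


def paid_between(index: PaymentIndex, sender: str, recipient: str,
                 start: int, end: int) -> bool:
    """Did `sender` pay `recipient` at some time in [start, end)?"""
    return any(start <= p.timestamp < end
               for p in index.get((sender, recipient), []))


def total_sent(index: PaymentIndex, sender: str, recipient: str) -> int:
    """How much has `sender` sent to `recipient` in total?"""
    return sum(p.amount for p in index.get((sender, recipient), []))


index = build_index([
    Payment("customer_x", "me", 25, 1_000),
    Payment("me", "address_y", 10, 1_500),
    Payment("me", "address_y", 15, 2_500),
])

paid_yesterday = paid_between(index, "customer_x", "me", 900, 1_100)
sent_to_y = total_sent(index, "me", "address_y")
```

Neither lookup is needed to validate a block, which is why a node built around validation has no reason to maintain an index like this one.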
Some blockchains, like Ethereum, have additional integrated indexing features, but that has only muddied the water. The querying needs of business applications are an unbounded and complex problem domain. No data solution built into a node can hope to cover that domain; such solutions instead end up bloating the node’s code base and overhead.
Because nodes provide some indexing and query features, application developers have stretched those features to the breaking point instead of investing in more robust solutions.
At its root this is a problem of query optimization. When it comes to optimizing queries, there is no automated, universal solution. Different applications need different data, and even when two different applications happen to need the same data, they won’t necessarily use that data in the same way.
Blockchain nodes can never solve that problem. Instead, applications need ‘lenses.’ Blockchain lenses are software services that scan data from blockchain nodes and store it in a format that is optimized for business applications. Some of these optimizations are fairly universal, in that the same lens can provide optimized querying for a wide variety of applications. Some applications need specialized lenses with specialized data handling.
Three examples show where lenses help.
1. A business reporting annual taxes
Depending on the country, a business with cryptocurrency assets may need the exact balance of its wallets at the turn of the year for tax reporting. Most blockchain nodes can only provide current balances, so unless an accountant records each wallet’s balance at the stroke of midnight on New Year’s Eve, a historical balance will be needed, and only a lens can provide that.
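A lens that answers this question could keep a running balance per address and look it up by time. The sketch below is one possible shape, assuming balance changes arrive in chronological order. The `BalanceHistoryLens` class and its methods are hypothetical names, not part of any existing tool.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import DefaultDict, List


class BalanceHistoryLens:
    """Keeps a running balance per address so past balances can be looked up."""

    def __init__(self) -> None:
        self._times: DefaultDict[str, List[int]] = defaultdict(list)
        self._balances: DefaultDict[str, List[int]] = defaultdict(list)

    def record(self, address: str, timestamp: int, delta: int) -> None:
        """Store a balance change. Assumes calls arrive in chronological order."""
        history = self._balances[address]
        previous = history[-1] if history else 0
        self._times[address].append(timestamp)
        history.append(previous + delta)

    def balance_at(self, address: str, timestamp: int) -> int:
        """Balance of `address` as of `timestamp`, inclusive."""
        times = self._times.get(address, [])
        i = bisect_right(times, timestamp)  # count of changes at or before timestamp
        return self._balances[address][i - 1] if i else 0


lens = BalanceHistoryLens()
lens.record("wallet", 100, 50)    # received 50
lens.record("wallet", 200, -20)   # sent 20
lens.record("wallet", 300, 5)     # received 5

year_end_balance = lens.balance_at("wallet", 250)  # 30
```

Because the running balance is stored with each change, a lookup is a binary search rather than a replay of the wallet's full history.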
2. An online store that accepts cryptocurrency
All crypto purchases are sent to store-owned hot wallets. The store then transfers that money to a cold wallet. The blockchain ledger for each of those wallets contains millions of transactions. The store regularly runs a security report on all transactions to and from those wallets. The report helps identify suspicious activity and potential vulnerabilities. The report is complex, including features like grouping transactions by customer and flagging any discrepancies between orders and payments. That complexity would cause a general-purpose blockchain lens to choke on a data set with millions of records. Because of this, the store uses a specialized lens that is optimized for its wallet reporting.
3. A game using smart contracts for items
In this game, players can purchase and trade items. As items are used within the game, they can also be upgraded and customized. All item data is stored inside an Ethereum smart contract. The player’s item dashboard needs a variety of data for both items and their constantly changing customizations.
Querying the data directly from an Ethereum node requires calling custom getter functions inside the smart contract. This approach has two problems. First, it is hard to maintain: whenever the game changes how it queries item data, it needs to migrate to a new contract with modified getter functions. Second, Ethereum nodes are expensive to run due to their considerable hardware requirements.
Instead, the game can use a lens that maintains a snapshot of the item data in a SQL database. The item data can then be readily queried however the game needs it without changes to a smart contract, and a SQL database is less expensive to run than an Ethereum node.
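A minimal sketch of such a lens, using Python's built-in `sqlite3` module, might look like the following. The event tuple format, the table schema, and the function names are assumptions for illustration. A real lens would decode events from the contract's logs, which is outside the scope of this example.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical decoded contract event: (item_id, owner, level).
ItemEvent = Tuple[int, str, int]


def open_lens() -> sqlite3.Connection:
    """Create an in-memory snapshot table; a real lens would use a persistent database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        " item_id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " level INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX items_by_owner ON items (owner)")
    return conn


def apply_events(conn: sqlite3.Connection, events: Iterable[ItemEvent]) -> None:
    """Apply events in order; the latest event for an item replaces its row."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO items (item_id, owner, level) VALUES (?, ?, ?)",
            events,
        )


def items_for(conn: sqlite3.Connection, owner: str) -> List[Tuple[int, int]]:
    """The dashboard query: (item_id, level) for every item a player owns."""
    cursor = conn.execute(
        "SELECT item_id, level FROM items WHERE owner = ? ORDER BY item_id",
        (owner,),
    )
    return cursor.fetchall()


conn = open_lens()
apply_events(conn, [
    (1, "alice", 1),  # alice buys item 1
    (2, "bob", 1),    # bob buys item 2
    (1, "alice", 2),  # alice upgrades item 1
    (2, "alice", 1),  # bob trades item 2 to alice
])

alice_items = items_for(conn, "alice")  # [(1, 2), (2, 1)]
```

If the dashboard later needs a different view, such as items sorted by level, only the SQL query changes. The smart contract stays as it is.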
- Blockchains add security to data storage
- Blockchain storage is not directly usable by applications
- Blockchain nodes should be simple and not directly service applications
- Applications need lenses to abstract and optimize blockchain data