Entropy is a notoriously tricky concept. It’s associated with many ideas: disorder, randomness, messiness, uncertainty, information, energy and the arrow of time to name a few. To complicate matters, the notion of entropy originated from different fields: thermodynamics and statistical mechanics (which are both branches of physics) and later, information theory (which is a branch of mathematics and computer science).
In this post we’ll explore the evolution and generalisation of the concept of entropy, see how it relates to both physics and information, and discuss paradoxes, demons, black holes and the end of the universe.
Prior to researching this post, I understood entropy only at a surface level, having used it in a machine learning context and read about it in popular science books. This is my attempt to understand entropy in its various forms at a deeper level.
Entropy is basically a measure of randomness. If your room is tidy and organised, you can say it has a low entropy arrangement. However, if your room is a mess with things everywhere randomly, you can say it has a high entropy arrangement. In some sense, entropy is related to the number of ways objects can be arranged: there are many more ways for a room to be messy compared to organised. This imbalance between messy vs. organised states is at the heart of entropy and leads to very interesting results for any system, including your room or even the entire universe!
Wikipedia describes entropy as: “a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognised, to the microscopic description of nature in statistical physics, and to the principles of information theory.”
The use of the term entropy in multiple fields is one reason it can be a confusing concept, as it is defined and interpreted somewhat differently in each. Let’s review the three main fields it’s used in:
Thermodynamics is a branch of physics that deals with the relationships between heat, energy, and temperature. It is fundamental to understanding the behaviour of a physical system, such as an engine. In classical thermodynamics, entropy is a quantitative, measurable, macroscopic physical property invented to describe heat transfer in systems. Famously, the second law of thermodynamics says that processes in an isolated system have a preferred direction, from lower to higher entropy, which reduces the amount of usable energy available. The unusable energy takes the form of heat. This asymmetry within a system can be used empirically to distinguish past from future and establishes entropy as an arrow of time.
Statistical mechanics (also known as statistical thermodynamics) is considered the third pillar of modern physics, next to quantum theory and relativity theory. It emerged with the development of atomic theories, explaining classical macroscopic thermodynamics as a result of statistical methods applied to large numbers of microscopic entities. In statistical mechanics, entropy is a measure of the number of ways a system can be arranged: less likely “ordered” arrangements have low entropy, while more likely “unordered” arrangements have high entropy.
Information theory is a mathematical approach to studying the coding of information for storage and transmission. Information entropy is analogous to statistical mechanics entropy, while being a more general concept. In information theory, entropy is the average level of “information content” or “surprise” of outcomes observed from a random process. For example, flipping a fair coin 10 times and observing 10 heads is a more surprising (more informative) outcome than observing some mix of heads and tails, simply because far more sequences contain a mix; entropy is this surprise averaged over all possible outcomes. Another interpretation of information entropy is the amount of information (measured in bits) needed to fully describe a system.
So entropy has different formulations, and varying interpretations, depending on the field it’s used in. Even within a single field entropy can be defined in different ways, as we will see later. However, over time it’s been shown that all formulations are related.
While the concept of entropy may seem abstract, it has many real world applications across physics, chemistry, biological systems, economics, climate change, telecommunications and even the big bang and the arrow of time!
The following sections take a deeper dive into the concept of entropy in the fields described above.
In 1865, Rudolf Clausius (1822 - 1888) gave irreversible heat loss a name: Entropy. He was building on the work of Carnot (1796 - 1832) who laid the foundations of the discipline of thermodynamics while looking to improve the performance of steam engines. Classical thermodynamics is the description of systems near-equilibrium using macroscopic, measurable properties. It models exchanges of energy, work and heat based on the laws of thermodynamics. The qualifier classical reflects the fact that it represents the first level of understanding of the subject as it developed in the 19th century and describes the changes of a system in terms of macroscopic parameters.
There are 4 laws of thermodynamics. The second law establishes the concept of entropy as a physical property of a thermodynamic system and states that if a physical process is irreversible, the combined entropy of the system and the environment must increase. Importantly, this implies that a perpetual motion machine (of the second kind) is physically impossible. For a reversible process, which is free of dissipative losses, total entropy may be conserved; however, such idealised physical systems cannot exist in practice. The Carnot cycle is an idealised reversible process that provides a theoretical upper limit on the efficiency of any classical thermodynamic engine.
The 1865 paper where Clausius introduced the concept of entropy ends with the following summary of the first and second laws of thermodynamics:
The energy of the universe is constant.
The entropy of the universe tends to a maximum.
Clausius also gives the expression for the entropy production for a cyclical process in a closed system, which he denotes by N and which can be written as N = S − S0 − ∫ dQ/T. Here Q is the quantity of heat, T is the temperature, S is the final state entropy and S0 the initial state entropy. S0 − S is the entropy difference for the backwards part of the process, while the integral is taken from the initial state to the final state, giving the entropy difference for the forwards part of the process. From the context, it is clear that N = 0 if the process is reversible and N > 0 in case of an irreversible process.
Next we shift gears from the macroscopic to the microscopic realm…
Statistical mechanics, also known as statistical thermodynamics, emerged with the development of atomic and molecular theories in the late 19th century and early 20th century. It supplemented classical thermodynamics with an interpretation of the microscopic interactions between individual particles and relates the microscopic properties of individual atoms to the macroscopic, bulk properties of materials that can be observed on the human scale, thereby explaining classical thermodynamics as a natural result of statistics and classical mechanics. It is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. It does not assume or postulate any natural laws, but explains the macroscopic behaviour of nature from the behaviour of such ensembles.
Josiah Willard Gibbs (1839 - 1903) coined the term statistical mechanics which explains the laws of thermodynamics as consequences of the statistical properties of the possible states of a physical system which is composed of many particles. In Elementary Principles in Statistical Mechanics (1902), which is considered to be the foundation of modern statistical mechanics, he writes:
Although, as a matter of history, statistical mechanics owes its origin
to investigations in thermodynamics, it seems eminently worthy of an
independent development, both on account of the elegance and simplicity
of its principles, and because it yields new results and places old truths
in a new light. - Gibbs
Ludwig Boltzmann (1844 - 1906) is the other leading figure of statistical mechanics and developed the statistical explanation of the second law of thermodynamics. In 1877 he provided what is known as the Boltzmann definition of entropy: S = kB ln Ω
where Ω is the number of distinct microscopic states available to the system given a fixed total energy, and kB is the Boltzmann constant. The Boltzmann constant, and therefore Boltzmann entropy, have dimensions of energy divided by temperature, with units of joules per kelvin (J⋅K⁻¹), or kg⋅m²⋅s⁻²⋅K⁻¹ in terms of base units. It could have been chosen to have any value, including 1 (i.e. dimensionless), however for historical reasons it was chosen to have the value kB = 1.380649 × 10⁻²³ joules per kelvin. As described by danielsank, kB only exists because people defined temperature and entropy before they understood statistical mechanics. If temperature had dimensions of energy, then under this definition entropy would have been dimensionless.
Gibbs refined this formulation and generalised Boltzmann's statistical interpretation of entropy in his work Elementary Principles in Statistical Mechanics (1902) by defining the entropy of an arbitrary ensemble as: S = −kB Σi pi ln pi
where kB is the Boltzmann constant, while the sum is over all possible microstates i, with pi the corresponding probability of the microstate. Both Boltzmann and Gibbs entropies are the pillars of the foundation of statistical mechanics and are the basis of all the entropy concepts in modern physics.
As described by ACuriousMind, the Gibbs entropy is the generalisation of the Boltzmann entropy holding for all systems, while the Boltzmann entropy is only the entropy if the system is in global thermodynamical equilibrium (when there are no net macroscopic flows of matter or energy within the system). Both are a measure for the microstates available to a system, but the Gibbs entropy does not require the system to be in a single, well-defined macrostate.
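As a concrete illustration (my own sketch, not taken from Boltzmann or Gibbs), the following Python snippet computes the Gibbs entropy of a set of microstate probabilities and shows that it reduces to the Boltzmann entropy kB ln Ω when all Ω microstates are equally likely:

    import numpy as np

    k_B = 1.380649e-23  # Boltzmann constant in J/K

    def gibbs_entropy(probabilities):
        """S = -k_B * sum_i p_i * ln(p_i) over microstate probabilities."""
        p = np.asarray(probabilities, dtype=float)
        p = p[p > 0]                          # zero-probability microstates contribute nothing
        return -k_B * np.sum(p * np.log(p))

    omega = 1_000_000                         # number of accessible microstates
    uniform = np.full(omega, 1 / omega)       # equilibrium: all microstates equally likely

    print(gibbs_entropy(uniform))             # matches the Boltzmann entropy...
    print(k_B * np.log(omega))                # ... S = k_B * ln(omega)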
The second law of thermodynamics states that the total entropy of an isolated system always increases over time. It is a statistical law rather than an absolute law. The statistical nature arises from the fact that it is based on the probability of different configurations of the particles in a system, and how those probabilities change over time. It is not impossible, in principle, for all atoms in a box of a gas to spontaneously migrate to one half; it is only astronomically unlikely.
Entropy is one of the few quantities in the physical sciences that require a particular direction for time and provides a natural explanation for why we observe an arrow of time in the universe. It explains why systems tend to become more disordered over time, and why we perceive time as having a certain direction from past to future. In cosmology, the past hypothesis postulates that the universe started in a low-entropy state which was highly ordered and uniform, and this is responsible for the observed structure and organisation of the universe today, which is compatible with the second law. At the other end of the spectrum, the heat death of the universe is a hypothesis on the fate of the universe stating the universe will evolve to a state of no thermodynamic free energy, and will therefore be unable to sustain processes that increase entropy. Luckily this should take over 10^100 years. It has been suggested that, over vast periods of time, a spontaneous entropy decrease could eventually occur, creating another universe, but in reality we have no idea. To wrap up the cosmological angle, it’s interesting to note that a black hole has an entropy that is proportional to its surface area, rather than its volume, implying that its entropy increases as it absorbs more matter.
The Gibbs paradox is a thought experiment puzzle in statistical mechanics that arises from the different ways of counting the number of possible arrangements of particles when two samples of gas are mixed within a box. One way assumes the particles are distinguishable, while the other assumes they are indistinguishable. These two ways of counting lead to different predictions for the entropy of the system, and result in a paradox if we are not careful to specify particle distinguishability: naive, distinguishable counting predicts an entropy increase even when two samples of the same gas are mixed. The paradox is resolved when we realise that identical particles can be swapped with each other without changing the overall arrangement, which reduces the count of distinct arrangements. When we do this, we find that mixing two samples of the same gas produces no entropy change, while mixing two different gases does. This explains why the entropy of mixing is positive for different gases but zero for identical gases.
Next we shift gears again, from the physical world to the informational world…
Information theory is the scientific study of the quantification, storage, and communication of information. In 1948, Claude Shannon set out to mathematically quantify the statistical nature of “lost information” in phone-line signals. To do this, Shannon developed the very general concept of information entropy which was published in his A Mathematical Theory of Communication. Shannon considered various ways to encode, compress, and transmit messages, and proved that the entropy represents an absolute mathematical limit on how well data from a source can be losslessly compressed onto a perfectly noiseless channel.
The core idea of information theory is that the “informational value” of a message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. However, if a highly unlikely event occurs, the message is much more informative. Information theory often concerns itself with measures of information of the distributions associated with random variables. The entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable's possible outcomes.
Shannon entropy, H, can be defined as: H = −Σx p(x) logb p(x)
where p(x) is the probability of outcome x and b is the logarithm base; choosing b = 2 gives the entropy in bits (binary digits).
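As a quick illustration (a toy example of my own, not from Shannon's paper), here is a small Python function that computes the entropy of a discrete distribution, applied to a fair coin, a biased coin, and a certain outcome:

    import numpy as np

    def shannon_entropy(probabilities, base=2):
        """H = -sum_x p(x) * log_b(p(x)); base=2 gives the answer in bits."""
        p = np.asarray(probabilities, dtype=float)
        p = p[p > 0]                           # terms with p(x) = 0 contribute nothing
        return -np.sum(p * np.log(p)) / np.log(base)

    print(shannon_entropy([0.5, 0.5]))     # fair coin: 1.0 bit per flip (maximum average surprise)
    print(shannon_entropy([0.9, 0.1]))     # biased coin: ~0.47 bits (less surprising on average)
    print(shannon_entropy([1.0, 0.0]))     # certain outcome: 0 bits (no surprise at all)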
Shannon entropy is clearly analogous to Gibbs entropy, without the Boltzmann constant. The analogy results when the values of the random variable designate energies of microstates. In the view of Jaynes, entropy within statistical mechanics should be seen as an application of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system, that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics. For example, adding heat to a system increases its thermodynamic entropy because it increases the number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer.
Maxwell's demon is a thought experiment, proposed in 1867, that would hypothetically violate the second law of thermodynamics. In the experiment a demon controls a door between two chambers of gas. As gas molecules approach the door, the demon allows only fast-moving molecules through in one direction, and only slow-moving molecules through in the other direction, causing one chamber to warm up and the other to cool down. This decreases the total entropy of the system without applying any work, hence the violation. It stimulated work on the relationship between thermodynamics and information theory.
Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as Landauer showed in 1961, to function the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox). Landauer's principle is a physical principle pertaining to the lower theoretical limit of energy consumption of computation. It imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.
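To get a feel for the numbers, here is a back-of-the-envelope calculation of the Landauer limit, kB·T·ln 2, the minimum heat released when one bit of information is erased (the room temperature of 300 K is my assumed value):

    import math

    k_B = 1.380649e-23       # Boltzmann constant, J/K
    T = 300                  # assumed room temperature, K

    landauer_limit = k_B * T * math.log(2)   # minimum heat to erase one bit
    print(landauer_limit)                    # ~2.9e-21 joules per bit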
The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data. The principle was first expounded by Jaynes where he emphasised a natural correspondence between statistical mechanics and information theory. It can be said to express a claim of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
The principle of maximum entropy is commonly applied to inferential problems, for example, to obtain prior probability distributions for Bayesian inference or making predictions with logistic regression, which corresponds to the maximum entropy classifier for independent observations. Giffin and Caticha (2007) state that Bayes' theorem and the principle of maximum entropy are completely compatible and can be seen as special cases of the “method of maximum relative entropy”. Jaynes stated, in 1988, that Bayes' theorem was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution.
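As a small worked example (my own sketch of Jaynes' classic dice problem, with an assumed average roll of 4.5), the snippet below finds the maximum entropy distribution over the faces of a die subject only to a known mean, using scipy's general-purpose optimiser:

    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)        # outcomes of a six-sided die
    target_mean = 4.5              # the only stated prior data: the average roll

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)    # avoid log(0)
        return np.sum(p * np.log(p))  # minimising this maximises entropy

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},             # probabilities sum to 1
        {"type": "eq", "fun": lambda p: p @ faces - target_mean},   # match the stated mean
    ]

    result = minimize(neg_entropy, x0=np.full(6, 1 / 6),
                      bounds=[(0, 1)] * 6, constraints=constraints)
    print(result.x)   # an exponential-family distribution tilted towards the higher faces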
We’ve touched upon the key measures of entropy as the concept matured and generalised since around 1865. There are many additional definitions that are related to each other in different ways.
The Entropy Universe (2021) by Ribeiro et al, is a paper which aims to review the many variants of entropy definitions and how they relate to each other. The authors describe the relationship between the most applied entropies for different scientific fields, establishing bases for researchers to properly choose the variant of entropy most suitable for their data. It’s well worth checking out.
For example, here is the timeline (in logarithmic scale) of the universe of entropies covered in the paper:
And here is the full entropy relation diagram, which can be found on page 21:
It’s helpful to understand the timeline over which the concept of entropy has evolved, from 19th century combustion engines to modern communications.
In the 19th century people were building combustion engines and observed that some energy released from combustion reactions was always lost and not transformed into useful work. Investigations into trying to solve this problem led to the initial concept of entropy.
Carnot was a French mechanical engineer and the “father of thermodynamics”. His work was used by Clausius and Kelvin to formalise the second law of thermodynamics and define the concept of entropy.
In 1865 Clausius, a German physicist, introduced and named the concept of entropy. He studied the mechanical theory of heat and built upon Carnot’s work. His most important paper, “On the Moving Force of Heat” first stated the basic ideas of the second law of thermodynamics.
Kelvin did important work in the mathematics of electricity and the formulation of the first and second laws of thermodynamics, and did much to unify the emerging discipline of physics in its contemporary form. He also coined the term thermodynamics in his publication “An Account of Carnot's Theory of the Motive Power of Heat”.
Maxwell, a Scottish mathematician and scientist, helped develop the Maxwell–Boltzmann distribution, a statistical means of describing aspects of the kinetic theory of gases. He also proposed “Maxwell's demon”, a paradoxical thought experiment in which entropy appears to decrease.
Boltzmann, an Austrian physicist and philosopher, developed the statistical explanation of the second law of thermodynamics. He also developed the fundamental statistical interpretation of entropy (Boltzmann entropy) in terms of a collection of microstates.
Gibbs, an American scientist, generalised Boltzmann's entropy, so that a system could exchange energy with its surroundings (Gibbs entropy). He also coined the term statistical mechanics which explains the laws of thermodynamics as consequences of the statistical properties of the possible states of a physical system which is composed of many particles. His papers from the 1870s introduced the idea of expressing the internal energy of a system in terms of the entropy, in addition to the usual state-variables of volume, pressure, and temperature.
In physics, the von Neumann entropy, named after John von Neumann, is an extension of the concept of Gibbs entropy from classical statistical mechanics to quantum statistical mechanics.
Shannon, known as the “father of information theory”, developed information entropy as a measure of the information content in a message, which is a measure of the uncertainty reduced by the message. In so doing, he essentially invented the field of information theory.
Jaynes wrote extensively on statistical mechanics and on the foundations of probability and statistical inference, initiating in 1957 the maximum entropy interpretation of thermodynamics.
As stated in the introduction, entropy is a notoriously tricky concept associated with many ideas, and we’ve only scratched the surface of this fascinating topic here.
I’ve certainly enjoyed researching for this post and apologies if it’s a bit rough, but it’s time to click publish and get it out as it’s been sitting in draft for a while.
I hope to write another related post, particularly expanding on Shannon information entropy, quantities of information, complexity, randomness, quantum information and how they’re all related.
I’ll wrap up with links to resources I found valuable reading. Thanks for reading this far, I hope you got something out of it!
Entropic Physics: Probability, Entropy, and the Foundations of Physics (Caticha, 2022)
Entropy, Information, and the Updating of Probabilities (Caticha, 2021)
Lectures on Probability, Entropy, and Statistical Physics (Ariel Caticha, 2008)
The Entropy Universe - a timeline of entropies (Ribeiro et al, 2021)
Researchers in an Entropy Wonderland - A Review of the Entropy Concept (Popovic, 2017)
A Mathematical Theory of Communication (Shannon, 1948)
Information Theory and Statistical Mechanics (Jaynes, 1957)
Information Theory and Statistical Mechanics II (Jaynes, 1957)
Information Theory and Statistical Mechanics - Lecture Notes (Jaynes, 1962)
Where do we Stand on Maximum Entropy (Jaynes, 1978)
The Relation of Bayesian and Maximum Entropy Methods (Jaynes, 1988)
Irreversibility and heat generation in the computing process (Landauer, 1961)
Is There a Unique Physical Entropy? Micro versus Macro (Dieks, 2012)
Entropy? Exercices de Style (Gaudenzi, 2019)
Quantifying the Rise and Fall of Complexity in Closed Systems (Carroll and Aaronson, 2014)
This is Part 2 in the machine learning series, covering model training and learning from data. Part 1 covers an overview of machine learning and common algorithms.
There are many types of ML algorithms. In this post I will discuss supervised learning algorithms. These operate on labelled data, where each data point has one or more features (also known as attributes) and an associated known true label value.
The goal of supervised machine learning is to develop an algorithm that can learn from labelled data to train a model and then use the model to make accurate predictions on unseen test data. All without being explicitly programmed to do so.
An algorithm learns from training data by iterating over labeled examples and optimising model parameters to minimise the difference between label predictions and true labels. It is this optimisation component that enables “learning”.
Some algorithms are quite simple with few parameters to tune (for example logistic regression) while others are very complex (for example transformer deep learning models). Below we’ll cover the trade-offs between different algorithms and the subtleties of training a model that generalises well for predictions.
A typical ML algorithm consists of a few essential components:
A loss function (which applies to a single training example)
An objective (or cost) function (usually the summation of the loss function over all the examples in the dataset)
An optimisation algorithm to update learned parameters and improve the objective function
Let’s use a linear regression model as an example, where the weights (w) and bias (b) are the parameters of the model to learn, x is a single input data instance and f(x) is the prediction for x.
Here’s the linear model formulation: f(x) = w·x + b
And the components:
The squared error loss function (applies to a single training instance, where y is the true label): L = (f(x) − y)²
The mean squared error (MSE) objective function (applies to the whole dataset of N examples): J(w, b) = (1/N) Σi (f(xi) − yi)²
An optimisation algorithm. One possible example is Gradient Descent, since the objective function is differentiable. (Another option is a closed-form solution, but that becomes computationally expensive when the dataset is large.)
Gradient Descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It’s commonly used to optimise linear regression, logistic regression and neural networks.
The basic idea behind gradient descent is to take steps in the direction of the negative gradient of the objective function with respect to the parameters. The negative gradient informs the direction in which the objective function is decreasing most rapidly, so taking a step in that direction should quickly reach a minimum.
There are several variations of gradient descent, such as stochastic gradient descent, which uses a random subset of the data to update the parameters in each iteration, and mini-batch gradient descent, which uses a small batch of data to update the parameters in each iteration.
It shouldn’t be confused with backpropagation, which is an efficient method of computing gradients in a directed graph of computations (for example in neural networks).
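Putting the pieces together, here is a minimal NumPy sketch (with made-up toy data) of batch gradient descent fitting the linear model f(x) = w·x + b by minimising the MSE objective described above:

    import numpy as np

    # Toy data generated from y = 2x + 1 plus noise (illustrative values only)
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

    w, b = 0.0, 0.0        # parameters to learn
    lr = 0.1               # learning rate (a hyperparameter)

    for _ in range(500):
        pred = w * x + b                    # model prediction f(x)
        error = pred - y
        grad_w = 2 * np.mean(error * x)     # d(MSE)/dw
        grad_b = 2 * np.mean(error)         # d(MSE)/db
        w -= lr * grad_w                    # step against the gradient
        b -= lr * grad_b

    print(w, b)   # should approach 2 and 1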
The No Free Lunch Theorem is a concept that states that there is no one model that works best for every problem. The theorem essentially says that if an algorithm performs well on a certain set of problems, then it must necessarily perform worse on others.
The assumptions of a great model for one problem may not hold for another problem, so it is common to try multiple models and find one that works best for a particular problem.
Underfitting and overfitting are common problems that can occur when training models.
Underfitting occurs when the model is too simple to capture the underlying patterns in the data. In other words, the model is not complex enough to fit the training data, and it performs poorly on both the training and testing data.
Overfitting occurs when the model is too complex and fits the training data too well, including the noise in the data. In this case, the model is not able to generalise to new, unseen data and performs poorly on the test data.
Both underfitting and overfitting can lead to poor performance and inaccurate predictions. The goal is to find the right balance between model complexity and performance on the training and testing data. Techniques such as cross-validation, regularisation, and early stopping can help to prevent overfitting and underfitting in machine learning models.
Bias and variance are two important concepts that relate to the ability of a model to accurately capture the underlying patterns in the data.
Bias refers to the error that is introduced by approximating a real-world problem with a simplified model. A model with high bias is unable to capture the complexity of the underlying patterns in the data and may result in underfitting.
Variance, on the other hand, refers to the error that is introduced by the model's sensitivity to small fluctuations in the training data. A model with high variance is too complex and may result in overfitting. In other words, a model with high variance is too sensitive to the noise in the training data and may fail to generalise to new, unseen data.
Bias and variance trade off against each other. This tradeoff is a central problem in supervised learning.
The goal is to find the right balance between bias and variance to achieve the best predictive performance on new, unseen data. Techniques such as regularisation, cross-validation, and ensembling can help to balance bias and variance in machine learning models.
Ideally, we want a model that accurately captures the regularities in its training data and generalises well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
Expected generalisation error is the sum of the squared bias, the variance, and the irreducible error
Overfitting: low bias, high variance
Underfitting: high bias, low variance
Training, validation, and test data splits are used to evaluate the performance of a model on new, unseen data. These data splits are used to train the model, tune its hyperparameters, and evaluate its performance.
The training set is the part of the data that is used to train the model. It is the data on which the model is fitted, and its parameters are optimised to minimise the objective function.
The validation set is used to evaluate the performance of the model with different hyperparameter values and select the best set of hyperparameters.
The test set is the part of the data that is used to evaluate the final performance of the model. It is a new, unseen dataset that the model has not been trained on or used to tune hyperparameters. The test set is used to estimate the performance of the model on new, unseen data and to determine if the model is generalising well.
It is important to keep the test set separate from the training and validation sets to avoid overfitting and to obtain an unbiased estimate of the model's performance. If there is not enough labelled data, cross-validation is another option.
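For illustration, here is one common way to create a 60/20/20 train/validation/test split with scikit-learn (the synthetic dataset and split ratios are my own choices):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)   # toy labelled dataset

    # First carve off 40%, then split that 40% evenly into validation and test sets
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)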
Hyperparameters in machine learning are model settings that cannot be learned during training but must be set before the training process begins. They control the behaviour of the model and can have a significant impact on its performance.
Hyperparameters are typically set by the user and are not learned from the data. They can include settings such as the learning rate, the number of hidden layers in a neural network, the number of trees in a random forest, the regularisation strength, or the kernel type in a support vector machine.
Finding the right hyperparameter values is essential to ensure that the model performs well on new, unseen data. They can be difficult to set, as their optimal values can depend on the dataset, the model architecture, and the specific problem being solved.
There are many approaches to hyper-parameter optimisation, for example:
Grid search – simple parameter sweep
Bayesian optimisation - builds a model of the function mapping from hyper-parameter values to the objective
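As a minimal sketch of the first approach, here is a grid search over the regularisation strength of a logistic regression model with scikit-learn (the dataset and parameter grid are placeholders of mine):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # inverse regularisation strength
        cv=5,                                       # 5-fold cross-validation
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)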
Regularisation is an umbrella term for methods that force the learning algorithm to build a less complex model. It is any modification we make to a learning algorithm that is intended to reduce its generalisation error but not its training error.
Regularisation examples:
L1 regularisation (aka lasso regression) – adds the absolute value of the model coefficients as a penalty term to the loss function. (many coefficients tend to 0 which helps with feature selection)
L2 regularisation (aka ridge regression) – adds the squared value of the model coefficients as a penalty term to the loss function. (L2 is differentiable meaning gradient descent can be used to optimise the objective function)
For neural networks:
Dropout: randomly "dropping out" units during training
Batch normalisation: re-centering and re-scaling layer inputs
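To see the difference between L1 and L2 in practice, here is a small scikit-learn sketch (the synthetic data and alpha values are my own assumptions) showing that lasso zeroes out many coefficients while ridge merely shrinks them:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: many coefficients driven to exactly 0
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk towards 0

    print((lasso.coef_ == 0).sum(), "lasso coefficients are exactly zero")
    print((ridge.coef_ == 0).sum(), "ridge coefficients are exactly zero")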
A learning curve is a graph that shows how the performance of a model changes as the size of the training set increases or the model complexity increases.
Learning curves are used to diagnose the bias-variance tradeoff of a model and to determine whether the model is underfitting or overfitting. They can also be used to determine if more data will improve the model, evaluate the effect of regularisation, perform feature selection, and hyperparameter tuning on the performance of the model.
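Here is a quick sketch of generating a learning curve with scikit-learn (the model and dataset are arbitrary stand-ins); a large persistent gap between training and validation scores suggests overfitting, while two low, converged scores suggest underfitting:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    )
    print(sizes)                        # increasing training set sizes
    print(train_scores.mean(axis=1))    # training accuracy at each size
    print(val_scores.mean(axis=1))      # cross-validated accuracy at each size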
To assess the performance of a machine learning model, several evaluation metrics can be used. The choice of the metric depends on the specific problem and the type of model being used.
Here are some common evaluation methods and metrics:
Confusion matrix summarises predictions vs true labels
Precision: fraction of relevant instances among the retrieved instances
Recall: fraction of relevant instances that were retrieved
F1-score: balance between precision and recall
ROC curve: plot of model performance for all classification thresholds
AUC: Area under the ROC curve, provides an aggregate measure between 0 and 1 of performance across all possible classification thresholds
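Most of these metrics are a single function call in scikit-learn. Here is an end-to-end sketch on a synthetic, imbalanced dataset (the model and data are placeholders of mine):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    print(confusion_matrix(y_test, y_pred))        # predictions vs true labels
    print(classification_report(y_test, y_pred))   # precision, recall and F1 per class
    print(roc_auc_score(y_test, y_prob))           # area under the ROC curve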
There are many further important considerations when training a good machine learning model. I plan to cover these in a future post. For now I’ll list them here:
Handling imbalanced datasets with over & under sampling
Choosing a good objective function to optimise
Normalisation/standardisation of features
Prevent information leakage from training to test sets (especially with time series data)
Model interpretation & explainability
Data and model/concept drift over time
Model performance at training (and inference) time
Check out Part 3 of this machine learning series covering deploying machine learning models into a production system and maintaining them over time.
These are my recommended machine and deep learning lecture videos, all available via Youtube:
Stanford CS229: Machine Learning with Andrew Ng, 2018 (online notes)
Stanford CS230: Deep Learning with Andrew Ng, 2018 (online notes)
CS231n: DL for Computer Vision with Andrej Karpathy, 2016 (online notes)
MIT 6.S191 Intro to DL Home with Alexander Amini, 2022 (github repo)
UW-Madison STAT 451 Introduction to ML with Sebastian Raschka, 2021 (github repo)
Google also has good foundational and advanced ML courses.
Additionally, here are some amazing machine learning notes by Christopher Olah that everyone should check out: colah.github.io
Machine Learning (ML) is a subfield of AI that develops algorithms to automatically improve performance on a specific task from data, without being explicitly programmed to perform the task. For example, using historical price data to do well at the task of predicting future stock prices. Or using text data from the internet to do well at the task of engaging in human-like conversation.
ML has a rich, and at times fuzzy, history stemming from many fields including Statistics, Computational Statistics, Artificial intelligence and Computer Science.
Regardless of the precise definition of ML, there are a few essential ingredients: a specific user defined task, input data matching the task, a method or algorithm that uses the data to produce output results, and a way of evaluating the result.
An algorithm is a set of rules used for solving a problem or performing a computation. There are several families of ML algorithms that typically have different qualities of input data and achieve different task objectives.
Supervised learning algorithms operate on labelled examples, where each data point has features (also known as attributes) and an associated known true label value. For example, features of a patient medical condition with a known label of the disease.
If you have labelled data, supervised learning algorithms learn a function that maps input feature vectors to the output label. Within supervised learning there are different types of algorithms depending on your data and task objective.
Classification algorithms are used when the outputs are restricted to a limited set of label values.
Regression algorithms are used when the outputs may have any numerical value.
If you have data with no labels, unsupervised learning algorithms can be used to discover patterns. Within unsupervised learning there are also different types of algorithms depending on your data and task objective.
Clustering, for example discovering customer segmentations based on behavioural patterns where the segments are unknown ahead of time.
Association mining, discovering interesting relations between features in large datasets, for example to power a product recommendation engine.
Dimensionality reduction, where data is transformed from a high dimension (with many features) into a lower dimension (fewer features). This can help with data visualisation, data summarisation and improve downstream efficiency for further analysis.
Semi-supervised learning (closely related to weak supervision) is a middle ground between supervised and unsupervised approaches, especially useful when gathering labelled datasets is costly. This approach uses labelled and unlabelled data, for example to create weak labels on the unlabelled subset, and may produce better models than supervised methods alone.
Semi-supervised learning is widely used in many applications such as natural language processing, computer vision, and speech recognition, where labeled data is often scarce, but large amounts of unlabelled data are available.
Reinforcement learning is quite different from the other methods described above. It is the training of an ML model that directs an agent to make a sequence of actions in an environment. For example, self-driving cars, playing chess, minimising energy consumption in data centres.
The agent interacts with the environment and receives feedback in the form of rewards or penalties, and uses this information to improve its decision making over time. Reinforcement learning is well-suited to problems where it’s difficult to explicitly program an optimal solution, or where the optimal solution may change over time.
Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, and multi-agent systems.
Deep learning involves training artificial neural networks (inspired by the structure of the brain) on large amounts of data and using the learned representations for a variety of tasks. The word “deep” in “deep learning” refers to the number of layers through which the data is transformed.
Deep learning is an approach that encompasses each of the families of algorithms described above. For supervised learning tasks, deep learning methods can reduce the need for manual feature engineering by learning compact intermediate representations from the data. An example of unsupervised deep learning is a deep belief network, while deep reinforcement learning is an active area of research.
Deep learning has been very successful in a variety of applications and has outperformed traditional machine learning algorithms on many tasks. This is partially due to an abundance of data and faster GPUs in recent years, as well as advancements in training methods.
Choosing the right machine learning algorithm depends on a number of factors, including the type of problem you are trying to solve, the size and quality of your data, the desired interpretability of the result, and the computational resources you have available.
Identifying the family of algorithms based on your specific task is the first step, followed by trying one or more algorithms. Often there is no single “best” algorithm for a given problem. In many cases, an ensemble of multiple algorithms can produce better results than a single algorithm.
Here is a quick start cheat-sheet to help select an algorithm from the popular Python ML package, scikit-learn:
And here are a couple more algorithm selection cheat-sheets, from Datacamp and Microsoft.
The following list gives you an idea of the more common classification algorithms. For a much broader list see Outline of Machine Learning on Wikipedia or the scikit-learn documentation.
It can be useful to understand the different types of algorithms and be aware of the different qualities they may have compared to each other.
Experimentation is important when developing a machine learning model. It allows testing of different hypotheses and evaluating the effectiveness of various algorithms, parameters, and architectures. This provides insights into the strengths and weaknesses of models, identifying areas for improvement to achieve better performance.
Experimentation also enables the practitioner to evaluate the model’s generalisation ability, which is crucial for ensuring that the model can make accurate predictions on unseen data. This aspect of training a model will be covered in a future post.
I recommend playing with the TensorFlow Playground which lets you create, run and experiment with simple neural networks on sample data. This is a great way to get an intuition about what a network can and cannot learn.
Check out Part 2 of the machine learning series covering how algorithms learn from data, as well as ways to ensure you train a good model by addressing model complexity, overfitting and data splits.
If you really want to jump into ML, I also recommend Andrew Ng’s fantastic Stanford CS229 lecture series available on YouTube. Have fun!
This is Part 3 in the machine learning series, covering productionisation of models. Part 1 covers an overview of machine learning and common algorithms, while Part 2 covers model training and learning from data.
Machine learning productionisation can mean different things. It could mean taking a pre-trained ML model and making it available as an API. It could also involve building a pipeline for data preparation, model training, optimisation, and validation.
Hopefully you are also monitoring your productionised system to ensure everything is running well, for example catching unexpected errors or measuring model drift on new data over time. If you’re able to reproduce experimental results, iterate quickly and push to production automatically then you get bonus points.
Regardless of the precise definition, getting machine learning models into a production system and maintaining them over time is hard.
Best practices are emerging to ensure you can successfully prepare data, train a model, validate predictions, deploy, serve and monitor your model.
This post is a 10,000-foot overview of things to consider during the life cycle of a machine learning project and includes pointers to useful resources.
1) Production software requirements
2) Production ML System Requirements
3) Machine Learning Canvas
4) Google's Rules of Machine Learning
5) Technical debt in Machine Learning systems
6) Automated Machine Learning pipelines - basic to advanced
7) Machine Learning frameworks
8) Final thoughts
As a baseline, a robust production ML system should adhere to good software engineering practices. These foundations must be solid to successfully deploy and operationalise ML models, which come with additional challenges.
Baseline examples:
Atlassian has a series of articles on software best practices that you may find useful.
ML systems have additional requirements over and above traditional software systems due to their reliance on training data to construct a model.
Data is as important to ML systems as code is to traditional software. Also, many of the same issues affecting code affect data as well, for example versioning.
A comparison of traditional programming vs ML is shown in this diagram. It highlights the shift in thinking required for constructing models and learning rules from data.
There are several emerging fields growing to address the ML system life cycle.
These fields are still being defined and created and some may disappear in time. What these descriptions highlight is the interdisciplinary nature required to successfully productionise ML models.
Highly desirable ML system requirements:
Desirable ML system requirements:
Update Nov 2022:
Operationalizing Machine Learning: An Interview Study from University of California, Berkeley is a great read. It covers interviews with 18 MLEs working across many applications and summarises common practices for successful ML experimentation, deployment, sustaining production performance, pain points and anti-patterns.
It’s worth remembering that not all models need to be productionised. Before going down this path hopefully you’ve determined that the value from productionising a model is greater than the cost of productionising and the associated ongoing system maintenance.
The Machine Learning Canvas (adapted from the popular Lean Canvas) identifies requirements, problems and scope of an ML model and is useful to get all parties on the same page early in an ML project.
It helps describe how your ML system will turn predictions into value for end-users, which data it will learn from, and how to make sure it will work as intended. It also helps anticipate costs, identify bottlenecks, specify requirements, and create a roadmap.
Google has published its machine learning best practices (Rules of Machine Learning), which contain deep wisdom.
These 4 points are recommended when starting to productionise ML:
Followed by this wise caveat: “This approach will work well for a long period of time. Diverge from this approach only when there are no more simple tricks to get you any farther. Adding complexity slows future releases.”
There are 43 rules in total. Here are some highlights:
It’s helpful to know what you’re getting into. If you are new on this journey, the paper Hidden Technical Debt in Machine Learning Systems is a must-read.
Here’s the gist:
“ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code plus an additional set of ML-specific issues. This debt may be difficult to detect because it exists at the system level rather than the code level. Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ML system behaviour. Typical methods for paying down code level technical debt are not sufficient to address ML-specific technical debt at the system level.”
“Code dependencies can be identified via static analysis by compilers and linkers. Without similar tooling for data dependencies, it can be inappropriately easy to build large data dependency chains that can be difficult to untangle.”
This image summarises the paper very well and puts the core ML code into perspective when productionising:
There are several steps in a full ML pipeline. Some steps may be manual at first, especially when proving value to the business.
The key is to understand that the manual steps should be automated in time rather than continuing to accumulate tech debt and potentially produce error-prone results. It is best to understand this commitment at the start.
A productionised predictive ML model is just one piece, with a fair bit going on, and it’s the best place to start.
Further down the track data quality, data splits, feature engineering, model optimisation and training comprise other pieces.
The first step is to productionise the trained model to make automated predictions on unlabelled data.
This diagram only shows the input data and output prediction flow - it doesn’t address where the data comes from or how the predictions are used.
An extension is to automate the data splits, model training and performance evaluation. This helps with reproducibility, removes error-prone manual steps, and saves time.
A next possible step is to automate training and evaluation of multiple models with multiple training input data feature sets.
This gets complex very quickly and requires pre-requisite systems to be in place to run efficiently. The emerging field of MLOps (machine learning operations) is building tools, techniques and companies to handle these type of scenarios.
What isn’t addressed in the pipeline diagrams above is model deployment, serving and monitoring.
This is dependent on many things: your current infrastructure, who will be consuming the model results, performance requirements, size of data etc.
For example, if you have a Kubernetes cluster available, you could containerise your model and deploy using Kubeflow. Or maybe your company uses AWS and your model is small? In this case you could use existing deployment practices and deploy to Lambda or else utilise the managed AWS SageMaker service. If you’re just starting out, you may be able to set up a Python webserver and utilise a framework like FastAPI to serve model results.
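For the last option, a minimal FastAPI sketch might look like the following (the model file name, feature format and endpoint are assumptions for illustration, not a recommended production setup):

    # Assumes a pre-trained scikit-learn model saved to "model.joblib"
    from typing import List

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")

    class Features(BaseModel):
        values: List[float]   # one feature vector per request

    @app.post("/predict")
    def predict(features: Features):
        prediction = model.predict([features.values])[0]
        return {"prediction": float(prediction)}

    # Run locally with: uvicorn main:app --reload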
Whichever path you choose, follow the software and ML system best practices described above where possible.
To round off this post, here are some open-source and commercial offerings that you could check out.
It is best to build on the shoulders of giants. Here are some open-source ML frameworks you might want to explore in your ML productionisation journey:
The big cloud players are in MLOps now too.
See more platforms here: mlops.neptune.ai
The machine learning space is evolving rapidly and it’s great to see best practices and tools emerging to make productionisation of models faster and more robust. However, there is still a long way to go!!
The following is an annotated version of the 2013 Ethereum Whitepaper by Vitalik Buterin where the original is displayed in dark text and annotations in lighter text.
I previously annotated the Bitcoin Whitepaper with the primary purpose of understanding each section thoroughly by reading and summarising various background references. This was a very useful exercise and it helped me appreciate both the history leading to Satoshi's discovery and also why Bitcoin and blockchains are such an important invention.
Annotating the Ethereum Whitepaper was a more involved task as the scope is much larger and the ongoing research much deeper. Ethereum learns from and improves upon certain Bitcoin limitations. It includes a Turing-complete programmatic smart contract layer and describes distributed trustless applications (that are today implemented and working) across finance, identity, reputation, prediction markets, file storage and even decentralized autonomous organizations. It also, importantly, is working towards a more environmentally friendly consensus mechanism than Bitcoin.
I found sections describing cryptoeconomic scenarios (where systems combine economic incentives with cryptography, particularly in adversarial environments) the most fascinating as currently implemented solutions draw from many different disciplines. Ultimately Ethereum is super exciting in that it offers a platform for automating certain archaic systems and removing the inherent weaknesses of the trust based model not only for payments and value transfer, but also for many more advanced applications and marketplaces.
Without further ado, may I present...
1. Introduction to Bitcoin and Existing Concepts
2. Ethereum
3. Applications
4. Miscellanea And Concerns
5. Putting It All Together: Decentralized Applications
6. Conclusion
7. Notes and Further Reading
The original Ethereum whitepaper (2013) is displayed in dark text with annotations below in lighter text.
Ethereum (Ξ) is a decentralized, open-source blockchain with smart contract functionality.
Vitalik Buterin is a Russian-Canadian programmer and writer who is best known as one of the co-founders of Ethereum (vitalik.ca)
ethereum.org was registered on 27 Nov 2013.
Satoshi Nakamoto's development of Bitcoin in 2009 has often been hailed as a radical development in money and currency, being the first example of a digital asset which simultaneously has no backing or "intrinsic value" and no centralized issuer or controller. However, another, arguably more important, part of the Bitcoin experiment is the underlying blockchain technology as a tool of distributed consensus, and attention is rapidly starting to shift to this other aspect of Bitcoin.
Commonly cited alternative applications of blockchain technology include using on-blockchain digital assets to represent custom currencies and financial instruments ("colored coins"), the ownership of an underlying physical device ("smart property"), non-fungible assets such as domain names ("Namecoin"), as well as more complex applications involving having digital assets being directly controlled by a piece of code implementing arbitrary rules ("smart contracts") or even blockchain-based "decentralized autonomous organizations" (DAOs). What Ethereum intends to provide is a blockchain with a built-in fully fledged Turing-complete programming language that can be used to create "contracts" that can be used to encode arbitrary state transition functions, allowing users to create any of the systems described above, as well as many others that we have not yet imagined, simply by writing up the logic in a few lines of code.
Satoshi Nakamoto is the name used by the presumed pseudonymous person or persons who developed bitcoin.
Bitcoin (₿) is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries.
A blockchain is a list of cryptographically linked blocks typically managed by a peer-to-peer network for use as a publicly distributed ledger.
In 2013, Buterin briefly worked with eToro CEO Yoni Assia on the Colored Coins project.
Smart property is property whose ownership is controlled via a blockchain.
Namecoin is a key/value pair registration and transfer system based on the Bitcoin technology
A smart contract is a program that runs on a blockchain with the objectives of reduction in trusted intermediators, arbitrations, enforcement costs and fraud losses.
A decentralized autonomous organization (DAO) is an organization represented by rules encoded as a computer program that is transparent, controlled by the organization members and not influenced by a central government.
A programming language that is Turing complete is theoretically capable of expressing all tasks accomplishable by computers
The concept of decentralized digital currency, as well as alternative applications like property registries, has been around for decades. The anonymous e-cash protocols of the 1980s and the 1990s, mostly reliant on a cryptographic primitive known as Chaumian blinding, provided a currency with a high degree of privacy, but the protocols largely failed to gain traction because of their reliance on a centralized intermediary. In 1998, Wei Dai's b-money became the first proposal to introduce the idea of creating money through solving computational puzzles as well as decentralized consensus, but the proposal was scant on details as to how decentralized consensus could actually be implemented. In 2005, Hal Finney introduced a concept of "reusable proofs of work", a system which uses ideas from b-money together with Adam Back's computationally difficult Hashcash puzzles to create a concept for a cryptocurrency, but once again fell short of the ideal by relying on trusted computing as a backend.
Because currency is a first-to-file application, where the order of transactions is often of critical importance, decentralized currencies require a solution to decentralized consensus. The main roadblock that all pre-Bitcoin currency protocols faced is the fact that, while there had been plenty of research on creating secure Byzantine-fault-tolerant multiparty consensus systems for many years, all of the protocols described were solving only half of the problem. The protocols assumed that all participants in the system were known, and produced security margins of the form "if N parties participate, then the system can tolerate up to N/4 malicious actors". The problem is, however, that in an anonymous setting such security margins are vulnerable to sybil attacks, where a single attacker creates thousands of simulated nodes on a server or botnet and uses these nodes to unilaterally secure a majority share.
The innovation provided by Satoshi is the idea of combining a very simple decentralized consensus protocol, based on nodes combining transactions into a "block" every ten minutes creating an ever-growing blockchain, with proof of work as a mechanism through which nodes gain the right to participate in the system. While nodes with a large amount of computational power do have proportionately greater influence, coming up with more computational power than the entire network combined is much harder than simulating a million nodes. Despite the Bitcoin blockchain model's crudeness and simplicity, it has proven to be good enough, and would over the next five years become the bedrock of over two hundred currencies and protocols around the world.
A digital currency is any money-like asset managed, stored or exchanged on digital computer systems.
Ecash was conceived by David Chaum as an anonymous (but centralised) cryptographic electronic money system in 1983.
A Chaumian blinding is a form of digital signature in which the content of a message is disguised (blinded) before it is signed.
Wei Dai proposed B-money in 1998 as an "anonymous, distributed electronic cash system".
Hal Finney was an early bitcoin contributor.
Proof-of-work is a cryptographic zero-knowledge proof in which one party (the prover) proves to others (the verifiers) that an amount of computational effort has been expended. It is an example of a permissionless consensus protocol that allows anyone in the network to join dynamically and participate without prior permission.
Hashcash is a proof-of-work system proposed in 1997 by Adam Back.
A cryptocurrency is a sub-type of digital currency that relies on cryptography to chain together digital signatures of asset transfers.
A fundamental problem in distributed computing is to coordinate processes to reach consensus on some data value that is needed during computation.
A Byzantine fault is any fault presenting different symptoms to different observers. Byzantine fault tolerance is the dependability of a computer system to such conditions and can be achieved if the non-faulty participants have a consensus.
Nodes can be anonymous and there are no access controls in an open, permissionless blockchain.
A Sybil attack is when an attacker creates a large number of pseudonymous identities and uses them to gain a large influence. Imposing economic costs may be used to make Sybil attacks more expensive.
Proof-of-work is a cryptographic zero-knowledge proof in which one party (the prover) proves to others (the verifiers) that an amount of computational effort has been expended.
From a technical standpoint, the Bitcoin ledger can be thought of as a state transition system, where there is a "state" consisting of the ownership status of all existing bitcoins and a "state transition function" that takes a state and a transaction and outputs a new state which is the result. In a standard banking system, for example, the state is a balance sheet, a transaction is a request to move $X from A to B, and the state transition function reduces the value in A's account by $X and increases the value in B's account by $X. If A's account has less than $X in the first place, the state transition function returns an error. Hence, one can formally define:
APPLY(S,TX) -> S' or ERROR
In the banking system defined above:
APPLY({ Alice: $50, Bob: $50 },"send $20 from Alice to Bob") = { Alice: $30, Bob: $70 }
But:
APPLY({ Alice: $50, Bob: $50 },"send $70 from Alice to Bob") = ERROR
The "state" in Bitcoin is the collection of all coins (technically, unspent transaction outputs" or UTXO) that have been minted and not yet spent, with each UTXO having a denomination and an owner (defined by a 20-byte address which is essentially a cryptographic public key[1]). A transaction contains one or more inputs, with each input containing a reference to an existing UTXO and a cryptographic signature produced by the private key associated with the owner's address, and one or more outputs, with each output containing a new UTXO to be added to the state.
The state transition function APPLY(S,TX) -> S' can be defined roughly as follows:
1. For each input in TX:
i. If the referenced UTXO is not in S, return an error.
ii. If the provided signature does not match the owner of the UTXO, return an error.
2. If the sum of the denominations of all input UTXO is less than the sum of the denominations of all output UTXO, return an error.
3. Return S with all input UTXO removed and all output UTXO added.
The first half of the first step prevents transaction senders from spending coins that do not exist, the second half of the first step prevents transaction senders from spending other people's coins, and the second step enforces conservation of value. In order to use this for payment, the protocol is as follows. Suppose Alice wants to send 11.7 BTC to Bob. First, Alice will look for a set of available UTXO that she owns that totals up to at least 11.7 BTC. Realistically, Alice will not be able to get exactly 11.7 BTC; say that the smallest she can get is 6+4+2=12. She then creates a transaction with those three inputs and two outputs. The first output will be 11.7 BTC with Bob's address as its owner, and the second output will be the remaining 0.3 BTC "change", with the owner being Alice herself.
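To make those three rules concrete, here is a minimal sketch in Python of the UTXO state transition (the UTXO and Input types and the verify_signature stand-in are illustrative assumptions, not real Bitcoin data structures; actual clients verify ECDSA signatures over serialized transactions):

from dataclasses import dataclass

@dataclass(frozen=True)
class UTXO:
    txid: str      # transaction that created this output
    index: int     # position within that transaction's outputs
    owner: str     # owner's address (hex string for simplicity)
    value: int     # denomination in satoshi

@dataclass(frozen=True)
class Input:
    utxo: UTXO     # reference to an existing unspent output
    signature: str # placeholder for the owner's signature

def verify_signature(sig, owner):
    # Stand-in for real signature verification against the owner's address.
    return sig == "signed-by-" + owner

def apply_tx(state, inputs, outputs):
    # APPLY(S,TX) -> S' or error, following the three rules above.
    state = set(state)
    for inp in inputs:
        if inp.utxo not in state:                                # rule 1.i: the referenced UTXO must exist
            raise ValueError("referenced UTXO not in state")
        if not verify_signature(inp.signature, inp.utxo.owner):  # rule 1.ii: signature must match the owner
            raise ValueError("signature does not match owner")
    if sum(i.utxo.value for i in inputs) < sum(o.value for o in outputs):
        raise ValueError("outputs exceed inputs")                # rule 2: conservation of value
    state -= {i.utxo for i in inputs}                            # rule 3: remove spent inputs...
    state |= set(outputs)                                        # ...and add the newly created outputs
    return state

Alice's payment above would be a call with three inputs totalling 12 BTC and two outputs: 11.7 BTC owned by Bob and 0.3 BTC of change back to Alice.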
A state transition system is a concept used in the study of computation to describe the potential behavior of discrete systems consisting of states and transitions between states.
An Unspent Transaction Output (UTXO) defines an output of a transaction that has not been spent, i.e. can be used as an input in a new transaction.
If we had access to a trustworthy centralized service, this system would be trivial to implement; it could simply be coded exactly as described. However, with Bitcoin we are trying to build a decentralized currency system, so we will need to combine the state transition system with a consensus system in order to ensure that everyone agrees on the order of transactions. Bitcoin's decentralized consensus process requires nodes in the network to continuously attempt to produce packages of transactions called "blocks". The network is intended to produce roughly one block every ten minutes, with each block containing a timestamp, a nonce, a reference to (ie. hash of) the previous block and a list of all of the transactions that have taken place since the previous block. Over time, this creates a persistent, ever-growing, "blockchain" that constantly updates to represent the latest state of the Bitcoin ledger.
The algorithm for checking if a block is valid, expressed in this paradigm, is as follows:
1. Check if the previous block referenced by the block exists and is valid
2. Check that the timestamp of the block is greater than that of the previous block[2] and less than 2 hours into the future.
3. Check that the proof of work on the block is valid.
4. Let S[0] be the state at the end of the previous block.
5. Suppose TX is the block's transaction list with n transactions. For all i in 0...n-1, set S[i+1] = APPLY(S[i],TX[i]). If any application returns an error, exit and return false.
6. Return true, and register S[n] as the state at the end of this block.
Essentially, each transaction in the block must provide a state transition that is valid. Note that the state is not encoded in the block in any way; it is purely an abstraction to be remembered by the validating node and can only be (securely) computed for any block by starting from the genesis state and sequentially applying every transaction in every block. Additionally, note that the order in which the miner includes transactions into the block matters; if there are two transactions A and B in a block such that B spends a UTXO created by A, then the block will be valid if A comes before B but not otherwise.
The interesting part of the block validation algorithm is the concept of "proof of work": the condition is that the SHA256 hash of every block, treated as a 256-bit number, must be less than a dynamically adjusted target, which as of the time of this writing is approximately 2^190. The purpose of this is to make block creation computationally "hard", thereby preventing sybil attackers from remaking the entire blockchain in their favor. Because SHA256 is designed to be a completely unpredictable pseudorandom function, the only way to create a valid block is simply trial and error, repeatedly incrementing the nonce and seeing if the new hash matches. At the current target of 2^192, this means an average of 2^64 tries; in general, the target is recalibrated by the network every 2016 blocks so that on average a new block is produced by some node in the network every ten minutes. In order to compensate miners for this computational work, the miner of every block is entitled to include a transaction giving themselves 25 BTC out of nowhere. Additionally, if any transaction has a higher total denomination in its inputs than in its outputs, the difference also goes to the miner as a "transaction fee". Incidentally, this is also the only mechanism by which BTC are issued; the genesis state contained no coins at all.
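The trial-and-error search can be sketched in a few lines of Python (a toy target of 2^240, chosen so the loop finishes in a fraction of a second, replaces Bitcoin's real target, and a simple string stands in for a serialized block header):

import hashlib

def mine(header, target):
    # Increment the nonce until SHA256(header + nonce), read as a 256-bit number, falls below the target.
    nonce = 0
    while True:
        h = hashlib.sha256((header + str(nonce)).encode()).digest()
        if int.from_bytes(h, "big") < target:
            return nonce, h.hex()
        nonce += 1

# A target of 2**240 means about 2**16 (~65,000) tries on average, versus ~2**64 at a target of 2**192.
nonce, block_hash = mine("prev_hash|merkle_root|timestamp|", 2 ** 240)
print(nonce, block_hash)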
In order to better understand the purpose of mining, let us examine what happens in the event of a malicious attacker. Since Bitcoin's underlying cryptography is known to be secure, the attacker will target the one part of the Bitcoin system that is not protected by cryptography directly: the order of transactions. The attacker's strategy is simple:
1. Send 100 BTC to a merchant in exchange for some product (preferably a rapid-delivery digital good)
2. Wait for the delivery of the product
3. Produce another transaction sending the same 100 BTC to himself
4. Try to convince the network that his transaction to himself was the one that came first.
Once step (1) has taken place, after a few minutes some miner will include the transaction in a block, say block number 270000. After about one hour, five more blocks will have been added to the chain after that block, with each of those blocks indirectly pointing to the transaction and thus "confirming" it. At this point, the merchant will accept the payment as finalized and deliver the product; since we are assuming this is a digital good, delivery is instant. Now, the attacker creates another transaction sending the 100 BTC to himself. If the attacker simply releases it into the wild, the transaction will not be processed; miners will attempt to run APPLY(S,TX) and notice that TX consumes a UTXO which is no longer in the state. So instead, the attacker creates a "fork" of the blockchain, starting by mining another version of block 270000 pointing to the same block 269999 as a parent but with the new transaction in place of the old one. Because the block data is different, this requires redoing the proof of work. Furthermore, the attacker's new version of block 270000 has a different hash, so the original blocks 270001 to 270005 do not "point" to it; thus, the original chain and the attacker's new chain are completely separate. The rule is that in a fork the longest blockchain (ie. the one backed by the largest quantity of proof of work) is taken to be the truth, and so legitimate miners will work on the 270005 chain while the attacker alone is working on the 270000 chain. In order for the attacker to make his blockchain the longest, he would need to have more computational power than the rest of the network combined in order to catch up (hence, "51% attack").
In regards to blockchain, reaching consensus means that at least 51% of the nodes on the network agree on the next global state of the network.
Blocks are batches of transactions with a hash of the previous block in the chain. This links blocks together (in a chain) because hashes are cryptographically derived from the block data.
SHA-2 (Secure Hash Algorithm 2) is a set of cryptographic hash functions designed by the NSA in 2001 and includes SHA-256.
Each transaction requires a fee since each Ethereum transaction uses computational resources to execute.
The way transaction fees on the Ethereum network were calculated changed with the London Upgrade of August 2021 (see EIP-1559).
Blockchain forks can be defined as changes in the protocol of the network or in situations when two or more blocks have the same block height.
When upgrades are needed in centrally-controlled software a company will publish a new version for the end-user. Blockchains work differently. Ethereum clients must update their software to implement the new fork rules.
Proof-of-work is the mechanism that allows the decentralized Ethereum network to come to consensus, or agree on things like account balances and the order of transactions.
Ethereum is moving to a proof-of-stake consensus mechanism in 2022. PoS has validators who stake ETH to participate in the system, rather than miners. This makes it much more energy efficient.
Any attacker that achieves 51% hashing power can effectively overturn network transactions, resulting in double spending.
Figure: in the left Merkle tree it suffices to present only a small number of nodes to give a proof of the validity of a branch; in the right tree, any attempt to change any part of the tree will eventually lead to an inconsistency somewhere up the chain.
An important scalability feature of Bitcoin is that the block is stored in a multi-level data structure. The "hash" of a block is actually only the hash of the block header, a roughly 200-byte piece of data that contains the timestamp, nonce, previous block hash and the root hash of a data structure called the Merkle tree storing all transactions in the block.
A Merkle tree is a type of binary tree, composed of a set of nodes with a large number of leaf nodes at the bottom of the tree containing the underlying data, a set of intermediate nodes where each node is the hash of its two children, and finally a single root node, also formed from the hash of its two children, representing the "top" of the tree. The purpose of the Merkle tree is to allow the data in a block to be delivered piecemeal: a node can download only the header of a block from one source, the small part of the tree relevant to them from another source, and still be assured that all of the data is correct. The reason why this works is that hashes propagate upward: if a malicious user attempts to swap in a fake transaction into the bottom of a Merkle tree, this change will cause a change in the node above, and then a change in the node above that, finally changing the root of the tree and therefore the hash of the block, causing the protocol to register it as a completely different block (almost certainly with an invalid proof of work).
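The upward propagation of hashes, and the piecemeal verification it enables, can be sketched as follows (a simplified Merkle tree: real Bitcoin trees use double SHA-256 and hash serialized transactions, which is glossed over here; odd-length levels duplicate their last node, as Bitcoin does):

import hashlib

def H(x):
    return hashlib.sha256(x).digest()

def merkle_root(leaves):
    # Hash pairs of nodes upward until a single root remains.
    level = [H(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])      # duplicate the last node on odd-length levels
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_branch(leaves, index):
    # Collect the sibling hashes needed to prove that one leaf belongs to the root.
    level = [H(leaf) for leaf in leaves]
    branch = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = index ^ 1              # the node paired with ours at this level
        branch.append((level[sibling], sibling < index))
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return branch

def verify_branch(leaf, branch, root):
    # Recompute the path to the root; any tampered node changes the final hash.
    node = H(leaf)
    for sibling, sibling_is_left in branch:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node == root

txs = [b"tx-a", b"tx-b", b"tx-c", b"tx-d", b"tx-e"]
root = merkle_root(txs)
proof = merkle_branch(txs, 2)
print(verify_branch(b"tx-c", proof, root))   # True: the branch checks out against the root
print(verify_branch(b"tx-x", proof, root))   # False: a swapped transaction is detected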
The Merkle tree protocol is arguably essential to long-term sustainability. A "full node" in the Bitcoin network, one that stores and processes the entirety of every block, takes up about 15 GB of disk space in the Bitcoin network as of April 2014, and is growing by over a gigabyte per month. Currently, this is viable for some desktop computers and not phones, and later on in the future only businesses and hobbyists will be able to participate. A protocol known as "simplified payment verification" (SPV) allows for another class of nodes to exist, called "light nodes", which download the block headers, verify the proof of work on the block headers, and then download only the "branches" associated with transactions that are relevant to them. This allows light nodes to determine with a strong guarantee of security what the status of any Bitcoin transaction, and their current balance, is while downloading only a very small portion of the entire blockchain.
A Merkle tree is a tree in which every leaf node is labelled with the cryptographic hash of a data block, and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes.
Merkle tree allows efficient and secure verification of the contents of large data structures.
Ralph Merkle is one of the inventors of public-key cryptography and the inventor of cryptographic hashing.
Full nodes download every block and transaction and check them against Ethereum's consensus rules. A light node stores the header chain and requests everything else. Further, an archive node stores everything kept in the full node and builds an archive of historical states.
In Simplified Payment Verification mode, clients connect to an arbitrary full node and download only the block headers. They verify the chain headers connect together correctly and that the difficulty is high enough.
The idea of taking the underlying blockchain idea and applying it to other concepts also has a long history. In 2005, Nick Szabo came out with the concept of "secure property titles with owner authority", a document describing how "new advances in replicated database technology" will allow for a blockchain-based system for storing a registry of who owns what land, creating an elaborate framework including concepts such as homesteading, adverse possession and Georgian land tax. However, there was unfortunately no effective replicated database system available at the time, and so the protocol was never implemented in practice. After 2009, however, once Bitcoin's decentralized consensus was developed a number of alternative applications rapidly began to emerge:
● Namecoin - created in 2010, Namecoin is best described as a decentralized name registration database. In decentralized protocols like Tor, Bitcoin and BitMessage, there needs to be some way of identifying accounts so that other people can interact with them, but in all existing solutions the only kind of identifier available is a pseudorandom hash like 1LW79wp5ZBqaHW1jL5TCiBCrhQYtHagUWy. Ideally, one would like to be able to have an account with a name like "george". However, the problem is that if one person can create an account named "george" then someone else can use the same process to register "george" for themselves as well and impersonate them. The only solution is a first-to-file paradigm, where the first registrant succeeds and the second fails - a problem perfectly suited for the Bitcoin consensus protocol. Namecoin is the oldest, and most successful, implementation of a name registration system using such an idea.
● Colored coins - the purpose of colored coins is to serve as a protocol to allow people to create their own digital currencies - or, in the important trivial case of a currency with one unit, digital tokens, on the Bitcoin blockchain. In the colored coins protocol, one "issues" a new currency by publicly assigning a color to a specific Bitcoin UTXO, and the protocol recursively defines the color of other UTXO to be the same as the color of the inputs that the transaction creating them spent (some special rules apply in the case of mixed-color inputs). This allows users to maintain wallets containing only UTXO of a specific color and send them around much like regular bitcoins, backtracking through the blockchain to determine the color of any UTXO that they receive.
● Metacoins - the idea behind a metacoin is to have a protocol that lives on top of Bitcoin, using Bitcoin transactions to store metacoin transactions but having a different state transition function, APPLY'. Because the metacoin protocol cannot prevent invalid metacoin transactions from appearing in the Bitcoin blockchain, a rule is added that if APPLY'(S,TX) returns an error, the protocol defaults to APPLY'(S,TX) = S. This provides an easy mechanism for creating an arbitrary cryptocurrency protocol, potentially with advanced features that cannot be implemented inside of Bitcoin itself, but with a very low development cost since the complexities of mining and networking are already handled by the Bitcoin protocol.
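The fallback rule for metacoins fits in a few lines (a sketch; apply_prime stands in for the metacoin's own state transition function APPLY'):

def metacoin_apply(state, tx, apply_prime):
    # If APPLY'(S,TX) returns an error, the metacoin simply ignores the transaction: APPLY'(S,TX) = S.
    try:
        return apply_prime(state, tx)
    except Exception:
        return state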
Thus, in general, there are two approaches toward building a consensus protocol: building an independent network, and building a protocol on top of Bitcoin. The former approach, while reasonably successful in the case of applications like Namecoin, is difficult to implement; each individual implementation needs to bootstrap an independent blockchain, as well as building and testing all of the necessary state transition and networking code. Additionally, we predict that the set of applications for decentralized consensus technology will follow a power law distribution where the vast majority of applications would be too small to warrant their own blockchain, and we note that there exist large classes of decentralized applications, particularly decentralized autonomous organizations, that need to interact with each other.
The Bitcoin-based approach, on the other hand, has the flaw that it does not inherit the simplified payment verification features of Bitcoin. SPV works for Bitcoin because it can use blockchain depth as a proxy for validity; at some point, once the ancestors of a transaction go far enough back, it is safe to say that they were legitimately part of the state. Blockchain-based meta-protocols, on the other hand, cannot force the blockchain not to include transactions that are not valid within the context of their own protocols. Hence, a fully secure SPV meta-protocol implementation would need to backward scan all the way to the beginning of the Bitcoin blockchain to determine whether or not certain transactions are valid. Currently, all "light" implementations of Bitcoin-based meta-protocols rely on a trusted server to provide the data, arguably a highly suboptimal result especially when one of the primary purposes of a cryptocurrency is to eliminate the need for trust.
Nick Szabo is a computer scientist, legal scholar, and cryptographer known for his research in digital contracts and digital currency.
N. Szabo, Secure Property Titles with Owner Authority, 1998.
Namecoin is a key/value pair registration and transfer system based on the Bitcoin technology. Similarly, the Ethereum Name Service is a distributed, open, and extensible naming system based on the Ethereum blockchain that was started in 2016.
Tor is free, open-source software enabling anonymous communication by directing traffic through a worldwide, volunteer network.
Bitmessage is a decentralized, encrypted, peer-to-peer, trustless communications protocol released in November 2012.
In a first-to-file system, the right to the grant of a patent lies with the first person to file a patent application, regardless of the invention date.
The term "Colored Coins" loosely describes a class of methods for representing and managing real world assets on a blockchain.
In 2013, Buterin briefly worked with eToro CEO Yoni Assia on the Colored Coins project.
The Ethereum ERC20 Token Standard (proposed by Fabian Vogelsteller in 2015) allows for fungible tokens and is a similar concept to Colored Coins.
A metacoin protocol allows for advanced transactions (custom currencies, decentralized exchanges, derivatives etc) that are implemented on top of another blockchain. In this sense, colored coins are an example of a metacoin implementation.
Metacoins require a full client and cannot be supported by a Bitcoin light client. This was one of the reasons Vitalik decided to create Ethereum rather than build on the Bitcoin network.
Even without any extensions, the Bitcoin protocol actually does facilitate a weak version of a concept of "smart contracts". UTXO in Bitcoin can be owned not just by a public key, but also by a more complicated script expressed in a simple stack-based programming language. In this paradigm, a transaction spending that UTXO must provide data that satisfies the script. Indeed, even the basic public key ownership mechanism is implemented via a script: the script takes an elliptic curve signature as input, verifies it against the transaction and the address that owns the UTXO, and returns 1 if the verification is successful and 0 otherwise. Other, more complicated, scripts exist for various additional use cases. For example, one can construct a script that requires signatures from two out of a given three private keys to validate ("multisig"), a setup useful for corporate accounts, secure savings accounts and some merchant escrow situations. Scripts can also be used to pay bounties for solutions to computational problems, and one can even construct a script that says something like "this Bitcoin UTXO is yours if you can provide an SPV proof that you sent a Dogecoin transaction of this denomination to me", essentially allowing decentralized cross-cryptocurrency exchange.
However, the scripting language as implemented in Bitcoin has several important limitations:
● Lack of Turing-completeness - that is to say, while there is a large subset of computation that the Bitcoin scripting language supports, it does not nearly support everything. The main category that is missing is loops. This is done to avoid infinite loops during transaction verification; theoretically it is a surmountable obstacle for script programmers, since any loop can be simulated by simply repeating the underlying code many times with an if statement, but it does lead to scripts that are very space-inefficient. For example, implementing an alternative elliptic curve signature algorithm would likely require 256 repeated multiplication rounds all individually included in the code.
● Value-blindness - there is no way for a UTXO script to provide fine-grained control over the amount that can be withdrawn. For example, one powerful use case of an oracle contract would be a hedging contract, where A and B put in $1000 worth of BTC and after 30 days the script sends $1000 worth of BTC to A and the rest to B. This would require an oracle to determine the value of 1 BTC in USD, but even then it is a massive improvement in terms of trust and infrastructure requirement over the fully centralized solutions that are available now. However, because UTXO are all-or-nothing, the only way to achieve this is through the very inefficient hack of having many UTXO of varying denominations (eg. one UTXO of 2^k for every k up to 30) and having the oracle pick which UTXO to send to A and which to B.
● Blockchain-blindness - UTXO are blind to blockchain data such as the nonce and previous hash. This severely limits applications in gambling, and several other categories, by depriving the scripting language of a potentially valuable source of randomness.
Thus, we see three approaches to building advanced applications on top of cryptocurrency: building a new blockchain, using scripting on top of Bitcoin, and building a meta-protocol on top of Bitcoin. Building a new blockchain allows for unlimited freedom in building a feature set, but at the cost of development time and bootstrapping effort. Using scripting is easy to implement and standardize, but is very limited in its capabilities, and meta-protocols, while easy, suffer from faults in scalability. With Ethereum, we intend to build a generalized framework that can provide the advantages of all three paradigms at the same time.
A smart contract is a program that runs on a blockchain with the objectives of reduction in trusted intermediators, arbitrations, enforcement costs and fraud losses.
Many ideas underlying Bitcoin contracts were first described by Nick Szabo in Formalizing and Securing Relationships on Public Networks.
The Elliptic Curve Digital Signature Algorithm (ECDSA) offers a variant of the Digital Signature Algorithm (DSA) which uses elliptic curve cryptography.
Multisig is a digital signature scheme which allows a group of users to sign a single document.
Decentralized exchanges (DEX) are a type of cryptocurrency exchange which allows for direct peer-to-peer cryptocurrency transactions without the need for an intermediary.
A programming language that is Turing complete is theoretically capable of expressing all tasks accomplishable by computers.
A bitcoin wallet balance is the sum of the UTXOs controlled by the wallet's private keys. It doesn't have a defined account balance.
UTXO code is deliberately limited. Coins can't tell what they're being divided into - this creates value-blindness in the case of an oracle contract.
Ethereum solves value and blockchain blindness through the use of accounts with balance and nonce fields.
The intent of Ethereum is to merge together and improve upon the concepts of scripting, altcoins and on-chain meta-protocols, and allow developers to create arbitrary consensus-based applications that have the scalability, standardization, feature-completeness, ease of development and interoperability offered by these different paradigms all at the same time. Ethereum does this by building what is essentially the ultimate abstract foundational layer: a blockchain with a built-in Turing-complete programming language, allowing anyone to write smart contracts and decentralized applications where they can create their own arbitrary rules for ownership, transaction formats and state transition functions. A bare-bones version of Namecoin can be written in two lines of code, and other protocols like currencies and reputation systems can be built in under twenty. Smart contracts, cryptographic "boxes" that contain value and only unlock it if certain conditions are met, can also be built on top of our platform, with vastly more power than that offered by Bitcoin scripting because of the added powers of Turing-completeness, value-awareness, blockchain-awareness and state.
In Ethereum, the state is made up of objects called "accounts", with each account having a 20-byte address and state transitions being direct transfers of value and information between accounts. An Ethereum account contains four fields:
● The nonce, a counter used to make sure each transaction can only be processed once
● The account's current ether balance
● The account's contract code, if present
● The account's storage (empty by default)
"Ether" is the main internal crypto-fuel of Ethereum, and is used to pay transaction fees. In general, there are two types of accounts: externally owned accounts, controlled by private keys, and contract accounts, controlled by their contract code. An externally owned account has no code, and one can send messages from an externally owned account by creating and signing a transaction; in a contract account, every time the contract account receives a message its code activates, allowing it to read and write to internal storage and send other messages or create contracts in turn.
An Ethereum account is an entity with an ether (ETH) balance that can send transactions on Ethereum. Accounts can be user-controlled via private key (Externally-owned) or deployed as smart contracts (Contract account).
An Externally-owned account costs nothing to create, can initiate transactions, and its transactions can only be ETH/token transfers.
A Contract account costs ETH to create because you're using network storage, can only send transactions in response to receiving a transaction and transactions from an external account can trigger code.
"Messages" in Ethereum are somewhat similar to “transactions” in Bitcoin, but with three important differences. First, an Ethereum message can be created either by an external entity or a contract, whereas a Bitcoin transaction can only be created externally. Second, there is an explicit option for Ethereum messages to contain data. Finally, the recipient of an Ethereum message, if it is a contract account, has the option to return a response; this means that Ethereum messages also encompass the concept of functions.
The term "transaction" is used in Ethereum to refer to the signed data package that stores a message to be sent from an externally owned account. Transactions contain the recipient of the message, a signature identifying the sender, the amount of ether and the data to send, as well as two values called STARTGAS and GASPRICE. In order to prevent exponential blowup and infinite loops in code, each transaction is required to set a limit to how many computational steps of code execution it can spawn, including both the initial message and any additional messages that get spawned during execution. STARTGAS is this limit, and GASPRICE is the fee to pay to the miner per computational step. If transaction execution "runs out of gas", all state changes revert - except for the payment of the fees, and if transaction execution halts with some gas remaining then the remaining portion of the fees is refunded to the sender. There is also a separate transaction type, and corresponding message type, for creating a contract; the address of a contract is calculated based on the hash of the account nonce and transaction data.
An important consequence of the message mechanism is the "first class citizen" property of Ethereum - the idea that contracts have equivalent powers to external accounts, including the ability to send messages and create other contracts. This allows contracts to simultaneously serve many different roles: for example, one might have a member of a decentralized organization (a contract) be an escrow account (another contract) between a paranoid individual employing custom quantum-proof Lamport signatures (a third contract) and a co-signing entity which itself uses an account with five keys for security (a fourth contract). The strength of the Ethereum platform is that the decentralized organization and the escrow contract do not need to care about what kind of account each party to the contract is.
Transactions are cryptographically signed instructions from accounts. An account will initiate a transaction to update the state of the Ethereum network.
Any node can broadcast a request for a transaction to be executed on the Ethereum Virtual Machine (EVM). After this happens, a miner will execute the transaction and propagate the resulting state change to the rest of the network.
At any given block in the chain, Ethereum has one and only one 'canonical' state, and the EVM is what defines the rules for computing a new valid state from block to block.
Ethereum is moving to a consensus mechanism called proof-of-stake (PoS) from proof-of-work (PoW) in 2022.
Gas refers to the unit that measures the amount of computational effort required to execute specific operations on the Ethereum network. Gas fees are paid in Ethereum's native currency, ether (ETH).
There are 2 main transaction types: Regular transactions from one wallet to another and contract deployment transactions without a 'to' address, where the data field is used for the contract code.
There are more transaction types, including legacy 'v' transactions and new Typed Transaction Envelope transactions defined by EIP-2718.
The Ethereum state transition function, APPLY(S,TX) -> S' can be defined as follows:
1. Check if the transaction is well-formed (ie. has the right number of values), the signature is valid, and the nonce matches the nonce in the sender's account. If not, return an error.
2. Calculate the transaction fee as STARTGAS * GASPRICE, and determine the sending address from the signature. Subtract the fee from the sender's account balance and increment the sender's nonce. If there is not enough balance to spend, return an error.
3. Initialize GAS = STARTGAS, and take off a certain quantity of gas per byte to pay for the bytes in the transaction.
4. Transfer the transaction value from the sender's account to the receiving account. If the receiving account does not yet exist, create it. If the receiving account is a contract, run the contract's code either to completion or until the execution runs out of gas.
5. If the value transfer failed because the sender did not have enough money, or the code execution ran out of gas, revert all state changes except the payment of the fees, and add the fees to the miner's account.
6. Otherwise, refund the fees for all remaining gas to the sender, and send the fees paid for gas consumed to the miner.
For example, suppose that the contract's code is:
if !contract.storage[msg.data[0]]:
    contract.storage[msg.data[0]] = msg.data[1]
Note that in reality the contract code is written in the low-level EVM code; this example is written in Serpent, our high-level language, for clarity, and can be compiled down to EVM code. Suppose that the contract's storage starts off empty, and a transaction is sent with 10 ether value, 2000 gas, 0.001 ether gasprice, and two data fields: [ 2, 'CHARLIE' ][3]. The process for the state transition function in this case is as follows:
1. Check that the transaction is valid and well formed.
2. Check that the transaction sender has at least 2000 * 0.001 = 2 ether. If so, subtract 2 ether from the sender's account.
3. Initialize gas = 2000; assuming the transaction is 170 bytes long and the byte-fee is 5, subtract 850 so that there is 1150 gas left.
4. Subtract 10 more ether from the sender's account, and add it to the contract's account.
5. Run the code. In this case, this is simple: it checks if the contract's storage at index 2 is used, notices that it is not, and so it sets the storage at index 2 to the value CHARLIE. Suppose this takes 187 gas, so the remaining amount of gas is 1150 - 187 = 963.
6. Add 963 * 0.001 = 0.963 ether back to the sender's account, and return the resulting state.
If there was no contract at the receiving end of the transaction, then the total transaction fee would simply be equal to the provided GASPRICE multiplied by the length of the transaction in bytes, and the data sent alongside the transaction would be irrelevant. Additionally, note that contract-initiated messages can assign a gas limit to the computation that they spawn, and if the sub-computation runs out of gas it gets reverted only to the point of the message call. Hence, just like transactions, contracts can secure their limited computational resources by setting strict limits on the sub-computations that they spawn.
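Under the same assumptions as the worked example, the whole flow can be sketched as a short Python script (Account, apply_eth_tx and registry are illustrative names, not a real client API; real clients track gas in wei and charge per-opcode costs, which is glossed over here):

from dataclasses import dataclass, field

@dataclass
class Account:
    balance: float = 0          # whole ether here for readability; real clients use wei
    nonce: int = 0
    code: object = None         # callable stand-in for contract code, or None for an externally owned account
    storage: dict = field(default_factory=dict)

def apply_eth_tx(state, sender, to, value, data, startgas, gasprice, tx_bytes=170, byte_fee=5):
    # A sketch of steps 1-6: up-front fee, per-byte gas charge, value transfer, code execution, refund.
    s, r = state[sender], state.setdefault(to, Account())
    fee = startgas * gasprice
    if s.balance < fee:
        raise ValueError("cannot pay the up-front fee")
    s.balance -= fee                       # step 2: charge STARTGAS * GASPRICE and bump the nonce
    s.nonce += 1
    gas = startgas - tx_bytes * byte_fee   # step 3: pay gas for each byte in the transaction
    snapshot = (s.balance, r.balance, dict(r.storage))
    try:
        if s.balance < value:
            raise ValueError("insufficient balance for the value transfer")
        s.balance -= value                 # step 4: transfer the value...
        r.balance += value
        if r.code is not None:
            gas = r.code(r.storage, data, gas)   # ...and run contract code until it finishes or gas is gone
        if gas < 0:
            raise ValueError("out of gas")
    except ValueError:
        s.balance, r.balance, r.storage = snapshot  # step 5: revert everything except the up-front fee
        gas = 0
    s.balance += gas * gasprice            # step 6: refund unused gas (a full client pays the rest to the miner)
    return state

def registry(storage, data, gas):
    # The example contract: store data[1] at index data[0] if that index is unused; assume this costs 187 gas.
    if not storage.get(data[0]):
        storage[data[0]] = data[1]
    return gas - 187

state = {"alice": Account(balance=20), "contract": Account(code=registry)}
apply_eth_tx(state, "alice", "contract", value=10, data=[2, "CHARLIE"], startgas=2000, gasprice=0.001)
print(round(state["alice"].balance, 3))    # 8.963 = 20 - 2 (fee) - 10 (sent) + 0.963 (refund)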
The EVM behaves as a mathematical function would: Given an input, it produces a deterministic output. This is referred to as a state transition function.
The state is an enormous data structure called a modified Merkle Patricia Trie, which keeps all accounts linked by hashes and reducible to a single root hash stored on the blockchain.
Serpent is a deprecated, Python-like high-level language that compiles to EVM code. Not recommended.
Solidity is recommended as a high-level language for implementing smart contracts with a JavaScript/C++ style.
Vyper is the 2nd most popular language with a Pythonic style, although not as complete as Solidity.
Lisp Like Language (LLL) is good for close-to-the-metal optimizations.
There are also additional languages like Yul, Yul+, and Fe.
Gas limit refers to the maximum amount of gas you are willing to consume on a transaction. More complicated transactions involving smart contracts require more computational work, so they require a higher gas limit than a simple payment. Although a transaction includes a limit, any gas not used in a transaction is returned to the user.
The code in Ethereum contracts is written in a low-level, stack-based bytecode language, referred to as "Ethereum virtual machine code" or "EVM code". The code consists of a series of bytes, where each byte represents an operation. In general, code execution is an infinite loop that consists of repeatedly carrying out the operation at the current program counter (which begins at zero) and then incrementing the program counter by one, until the end of the code is reached or an error or STOP or RETURN instruction is detected. The operations have access to three types of space in which to store data:
● The stack, a last-in-first-out container to which 32-byte values can be pushed and popped
● Memory, an infinitely expandable byte array
● The contract's long-term storage, a key/value store where keys and values are both 32 bytes. Unlike stack and memory, which reset after computation ends, storage persists for the long term.
The code can also access the value, sender and data of the incoming message, as well as block header data, and the code can also return a byte array of data as an output.
The formal execution model of EVM code is surprisingly simple. While the Ethereum virtual machine is running, its full computational state can be defined by the tuple (block_state, transaction, message, code, memory, stack, pc, gas), where block_state is the global state containing all accounts and includes balances and storage. Every round of execution, the current instruction is found by taking the pc-th byte of code, and each instruction has its own definition in terms of how it affects the tuple. For example, ADD pops two items off the stack and pushes their sum, reduces gas by 1 and increments pc by 1, and SSTORE pops the top two items off the stack and inserts the second item into the contract's storage at the index specified by the first item, as well as reducing gas by up to 200 and incrementing pc by 1. Although there are many ways to optimize Ethereum via just-in-time compilation, a basic implementation of Ethereum can be done in a few hundred lines of code.
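A toy interpreter for that loop takes only a few lines (the opcode numbering here is made up for illustration and is not the real EVM instruction set; real opcodes, gas costs and the 1024-item stack limit are specified in the Yellow Paper):

STOP, PUSH1, ADD, SSTORE = 0x00, 0x01, 0x02, 0x03   # toy opcode numbers, not the EVM's

def run(code, storage, gas):
    # Repeatedly execute the byte at pc, updating (storage, stack, pc, gas), until STOP or the code ends.
    stack, pc = [], 0
    while pc < len(code):
        if gas <= 0:
            raise RuntimeError("out of gas")
        op = code[pc]
        if op == STOP:
            break
        elif op == PUSH1:                            # push the next byte onto the stack as a value
            stack.append(code[pc + 1]); pc += 1; gas -= 1
        elif op == ADD:                              # pop two items and push their sum
            stack.append(stack.pop() + stack.pop()); gas -= 1
        elif op == SSTORE:                           # pop an index and a value and write to storage
            index, value = stack.pop(), stack.pop()
            storage[index] = value; gas -= 200
        pc += 1
    return storage, gas

# PUSH1 7, PUSH1 3, ADD, PUSH1 0, SSTORE, STOP: stores 7 + 3 = 10 at storage index 0
storage, gas_left = run(bytes([PUSH1, 7, PUSH1, 3, ADD, PUSH1, 0, SSTORE, STOP]), {}, 1000)
print(storage, gas_left)                             # {0: 10} 796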
The EVM executes as a stack machine with a depth of 1024 items. Each item is a 256-bit word, which was chosen for the ease of use with 256-bit cryptography.
Compiled smart contract bytecode executes as a number of EVM opcodes, which perform standard stack operations like XOR, AND, ADD, SUB, etc. The EVM also implements a number of blockchain-specific stack operations, such as ADDRESS, BALANCE, BLOCKHASH, etc. A list is available in the go-ethereum source code.
During execution, the EVM maintains a transient memory which does not persist between transactions. Contracts, however, do contain a Merkle Patricia storage trie associated with the account in question and part of the global state.
All EVM implementations must adhere to the specification described in the Ethereum Yellowpaper.
The Ethereum blockchain is in many ways similar to the Bitcoin blockchain, although it does have some differences. The main difference between Ethereum and Bitcoin with regard to the blockchain architecture is that, unlike Bitcoin, Ethereum blocks contain a copy of both the transaction list and the most recent state. Aside from that, two other values, the block number and the difficulty, are also stored in the block. The block validation algorithm in Ethereum is as follows:
1. Check if the previous block referenced exists and is valid.
2. Check that the timestamp of the block is greater than that of the referenced previous block and less than 15 minutes into the future
3. Check that the block number, difficulty, transaction root, uncle root and gas limit (various low-level Ethereum-specific concepts) are valid.
4. Check that the proof of work on the block is valid.
5. Let S[0] be the STATE_ROOT of the previous block.
6. Let TX be the block's transaction list, with n transactions. For all i in 0...n-1, set S[i+1] = APPLY(S[i],TX[i]). If any application returns an error, or if the total gas consumed in the block up until this point exceeds the GASLIMIT, return an error.
7. Let S_FINAL be S[n], but adding the block reward paid to the miner.
8. Check if S_FINAL is the same as the STATE_ROOT. If it is, the block is valid; otherwise, it is not valid.
The approach may seem highly inefficient at first glance, because it needs to store the entire state with each block, but in reality efficiency should be comparable to that of Bitcoin. The reason is that the state is stored in the tree structure, and after every block only a small part of the tree needs to be changed. Thus, in general, between two adjacent blocks the vast majority of the tree should be the same, and therefore the data can be stored once and referenced twice using pointers (ie. hashes of subtrees). A special kind of tree known as a "Patricia tree" is used to accomplish this, including a modification to the Merkle tree concept that allows for nodes to be inserted and deleted, and not just changed, efficiently. Additionally, because all of the state information is part of the last block, there is no need to store the entire blockchain history - a strategy which, if it could be applied to Bitcoin, can be calculated to provide 5-20x savings in space.
Blocks are batches of transactions with a hash of the previous block in the chain. This links blocks together (in a chain) because hashes are cryptographically derived from the block data.
What's in an Ethereum block?
1. timestamp
2. blockNumber
3. baseFeePerGas
4. difficulty
5. mixHash
6. parentHash
7. transactions
8. stateRoot
9. nonce
The proof-of-work protocol, Ethash, requires miners to go through a race of trial and error to find the nonce for a block. Only blocks with a valid nonce can be added to the chain.
When racing to mine a block, a miner will repeatedly put a dataset, that you can only get from downloading and running the full chain (as a miner does), through a mathematical function. The dataset gets used to generate a mixHash below a target nonce, as dictated by the block difficulty.
The difficulty determines the target for the hash. The lower the target, the smaller the set of valid hashes. Once generated, this is incredibly easy for other miners and clients to verify.
The objective of proof-of-work is to extend the chain. The longest chain is most believable as the valid one because it's had the most computational work done. To consistently create malicious yet valid blocks, you'd need over 51% of the network mining power.
A major criticism of proof-of-work is the amount of energy output required to keep the network safe.
Ethereum is moving to a consensus mechanism called proof-of-stake from proof-of-work in 2022. At a high level, proof-of-stake has the same end goal as proof-of-work: to help the decentralized network reach consensus securely.
In general, there are three types of applications on top of Ethereum. The first category is financial applications, providing users with more powerful ways of managing and entering into contracts using their money. This includes sub-currencies, financial derivatives, hedging contracts, savings wallets, wills, and ultimately even some classes of full-scale employment contracts. The second category is semi-financial applications, where money is involved but there is also a heavy non-monetary side to what is being done; a perfect example is self-enforcing bounties for solutions to computational problems. Finally, there are applications such as online voting and decentralized governance that are not financial at all.
On-blockchain token systems have many applications ranging from sub-currencies representing assets such as USD or gold to company stocks, individual tokens representing smart property, secure unforgeable coupons, and even token systems with no ties to conventional value at all, used as point systems for incentivization. Token systems are surprisingly easy to implement in Ethereum. The key point to understand is that all a currency, or token system, fundamentally is, is a database with one operation: subtract X units from A and give X units to B, with the proviso that (i) A had at least X units before the transaction and (ii) the transaction is approved by A. All that it takes to implement a token system is to implement this logic into a contract.
The basic code for implementing a token system in Serpent looks as follows:
from = msg.sender
to = msg.data[0]
value = msg.data[1]
if contract.storage[from] >= value:
    contract.storage[from] = contract.storage[from] - value
    contract.storage[to] = contract.storage[to] + value
This is essentially a literal implementation of the "banking system" state transition function described further above in this document. A few extra lines of code need to be added to provide for the initial step of distributing the currency units in the first place and a few other edge cases, and ideally a function would be added to let other contracts query for the balance of an address. But that's all there is to it. Theoretically, Ethereum-based token systems acting as sub-currencies can potentially include another important feature that on-chain Bitcoin-based meta-currencies lack: the ability to pay transaction fees directly in that currency. The way this would be implemented is that the contract would maintain an ether balance with which it would refund ether used to pay fees to the sender, and it would refill this balance by collecting the internal currency units that it takes in fees and reselling them in a constant running auction. Users would thus need to "activate" their accounts with ether, but once the ether is there it would be reusable because the contract would refund it each time.
There are thousands of Ethereum tokens standardised by the ERC-20 fungible token standard, whereby fungibility is the property of a good or a commodity whose individual units are essentially interchangeable.
Additionally, there are other ERC token standards for non-fungible tokens (NFTs).
Financial derivatives are the most common application of a "smart contract", and one of the simplest to implement in code. The main challenge in implementing financial contracts is that the majority of them require reference to an external price ticker; for example, a very desirable application is a smart contract that hedges against the volatility of ether (or another cryptocurrency) with respect to the US dollar, but doing this requires the contract to know what the value of ETH/USD is. The simplest way to do this is through a "data feed" contract maintained by a specific party (eg. NASDAQ) designed so that that party has the ability to update the contract as needed, and providing an interface that allows other contracts to send a message to that contract and get back a response that provides the price.
Given that critical ingredient, the hedging contract would look as follows:
1. Wait for party A to input 1000 ether.
2. Wait for party B to input 1000 ether.
3. Record the USD value of 1000 ether, calculated by querying the data feed contract, in storage, say this is $x.
4. After 30 days, allow A or B to "ping" the contract in order to send $x worth of ether (calculated by querying the data feed contract again to get the new price) to A and the rest to B.
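A minimal sketch of those four steps in Python (HedgeContract and price_feed are illustrative names, not a real contract interface; an on-chain version would hold the ether itself and query the data feed contract via messages):

class HedgeContract:
    # price_feed is a callable returning the current USD value of 1 ETH, standing in for the data feed contract.
    def __init__(self, price_feed):
        self.price_feed = price_feed
        self.deposits = {}        # party -> ether deposited
        self.usd_locked = None    # the $x recorded when both deposits are in
        self.start_day = None
        self.settled = False

    def deposit(self, party, ether, day):
        # Steps 1 and 2: wait for A and then B to put in 1000 ether each.
        assert ether == 1000 and party not in self.deposits
        self.deposits[party] = ether
        if len(self.deposits) == 2:
            self.usd_locked = 1000 * self.price_feed()   # step 3: record the USD value of 1000 ether
            self.start_day = day

    def ping(self, day):
        # Step 4: after 30 days, send $x worth of ether to A and the rest to B.
        assert self.usd_locked is not None and day >= self.start_day + 30 and not self.settled
        ether_to_a = min(self.usd_locked / self.price_feed(), 2000)   # capped at the 2000 ether held
        self.settled = True
        return {"A": ether_to_a, "B": 2000 - ether_to_a}

price = [10]
c = HedgeContract(price_feed=lambda: price[0])
c.deposit("A", 1000, day=0)
c.deposit("B", 1000, day=0)
price[0] = 8                        # ether falls from $10 to $8 over the 30 days
print(c.ping(day=30))               # {'A': 1250.0, 'B': 750.0}: A still receives $10,000 worth of ether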
Such a contract would have significant potential in crypto-commerce. One of the main problems cited about cryptocurrency is the fact that it's volatile; although many users and merchants may want the security and convenience of dealing with cryptographic assets, they may not wish to face the prospect of losing 23% of the value of their funds in a single day. Up until now, the most commonly proposed solution has been issuer-backed assets; the idea is that an issuer creates a sub-currency in which they have the right to issue and revoke units, and provide one unit of the currency to anyone who provides them (offline) with one unit of a specified underlying asset (eg. gold, USD). The issuer then promises to provide one unit of the underlying asset to anyone who sends back one unit of the crypto-asset. This mechanism allows any non-cryptographic asset to be "uplifted" into a cryptographic asset, provided that the issuer can be trusted.
In practice, however, issuers are not always trustworthy, and in some cases the banking infrastructure is too weak, or too hostile, for such services to exist. Financial derivatives provide an alternative. Here, instead of a single issuer providing the funds to back up an asset, a decentralized market of speculators, betting that the price of a cryptographic reference asset will go up, plays that role. Unlike issuers, speculators have no option to default on their side of the bargain because the hedging contract holds their funds in escrow. Note that this approach is not fully decentralized, because a trusted source is still needed to provide the price ticker, although arguably even still this is a massive improvement in terms of reducing infrastructure requirements (unlike being an issuer, issuing a price feed requires no licenses and can likely be categorized as free speech) and reducing the potential for fraud.
Decentralized Finance (DeFi) is a blockchain-based form of finance that does not rely on central financial intermediaries such as brokerages, exchanges, or banks to offer traditional financial instruments, and instead utilizes smart contracts on blockchains.
As of Nov 2021, approximately $106 billion USD was invested in DeFi according to defipulse.com, with a majority powered by Ethereum.
DAI is an example of an Ethereum based stable coin that is soft pegged to the US Dollar.
Chainlink is an example of an Ethereum-based distributed oracle enabling the creation of hybrid smart contracts by connecting with the outside world.
Uniswap is an example of a decentralized finance protocol that is used to exchange cryptocurrencies.
The earliest alternative cryptocurrency of all, Namecoin, attempted to use a Bitcoin-like blockchain to provide a name registration system, where users can register their names in a public database alongside other data. The major cited use case is for a DNS system, mapping domain names like "bitcoin.org" (or, in Namecoin's case, "bitcoin.bit") to an IP address. Other use cases include email authentication and potentially more advanced reputation systems. Here is the basic contract to provide a Namecoin-like name registration system on Ethereum:
if !contract.storage[tx.data[0]]:
    contract.storage[tx.data[0]] = tx.data[1]
The contract is very simple; all it is is a database inside the Ethereum network that can be added to, but not modified or removed from. Anyone can register a name with some value, and that registration then sticks forever. A more sophisticated name registration contract will also have a "function clause" allowing other contracts to query it, as well as a mechanism for the "owner" (ie. the first registerer) of a name to change the data or transfer ownership. One can even add reputation and web-of-trust functionality on top.
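A sketch of that slightly richer registry (the class and method names are illustrative, not a real Ethereum interface; on-chain, the sender would be msg.sender rather than an explicit argument):

class NameRegistry:
    # Names map to (owner, value); only the first registrant, or a later owner, may change a record.
    def __init__(self):
        self.records = {}

    def register(self, sender, name, value):
        if name not in self.records:             # first-to-file: later registrations simply fail
            self.records[name] = (sender, value)

    def get(self, name):
        # The "function clause": lets other contracts (or users) query the stored value.
        return self.records.get(name, (None, None))[1]

    def set_value(self, sender, name, value):
        owner, _ = self.records.get(name, (None, None))
        if owner == sender:                      # only the owner may change the data...
            self.records[name] = (owner, value)

    def transfer(self, sender, name, new_owner):
        owner, value = self.records.get(name, (None, None))
        if owner == sender:                      # ...or hand the name over to a new owner
            self.records[name] = (new_owner, value)

reg = NameRegistry()
reg.register("alice", "george", "10.0.0.1")
reg.register("mallory", "george", "6.6.6.6")     # ignored: the name is already taken
print(reg.get("george"))                         # 10.0.0.1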
Ethereum Name Service (ENS) is a decentralized naming system allowing human-readable names like "bob.eth" to map to identifiers such as addresses, content hashes, and metadata.
EIP-4361 is an early proposal to standardize off-chain authentication for Ethereum accounts to establish sessions.
Over the past few years, there have emerged a number of popular online file storage startups, the most prominent being Dropbox, seeking to allow users to upload a backup of their hard drive and have the service store the backup and allow the user to access it in exchange for a monthly fee. However, at this point the file storage market is at times relatively inefficient; a cursory look at various existing solutions shows that, particularly at the "uncanny valley" 20-200 GB level at which neither free quotas nor enterprise-level discounts kick in, monthly prices for mainstream file storage costs are such that you are paying for more than the cost of the entire hard drive in a single month. Ethereum contracts can allow for the development of a decentralized file storage ecosystem, where individual users can earn small quantities of money by renting out their own hard drives and unused space can be used to further drive down the costs of file storage.
The key underpinning piece of such a device would be what we have termed the "decentralized Dropbox contract". This contract works as follows. First, one splits the desired data up into blocks, encrypting each block for privacy, and builds a Merkle tree out of it. One then makes a contract with the rule that, every N blocks, the contract would pick a random index in the Merkle tree (using the previous block hash, accessible from contract code, as a source of randomness), and give X ether to the first entity to supply a transaction with a simplified payment verification-like proof of ownership of the block at that particular index in the tree. When a user wants to re-download their file, they can use a micropayment channel protocol (eg. pay 1 szabo per 32 kilobytes) to recover the file; the most fee-efficient approach is for the payer not to publish the transaction until the end, instead replacing the transaction with a slightly more lucrative one with the same nonce after every 32 kilobytes.
An important feature of the protocol is that, although it may seem like one is trusting many random nodes not to decide to forget the file, one can reduce that risk down to near-zero by splitting the file into many pieces via secret sharing, and watching the contracts to see each piece is still in some node's possession. If a contract is still paying out money, that provides a cryptographic proof that someone out there is still storing the file.
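A rough sketch of the challenge mechanism (challenge_index and claim_bounty are illustrative names; a real contract would verify a full Merkle branch against the file's root hash, as in the Merkle proof sketch earlier, rather than one published hash per chunk):

import hashlib

def challenge_index(prev_block_hash, num_chunks):
    # Derive this round's random chunk index from the previous block hash.
    return int.from_bytes(hashlib.sha256(prev_block_hash).digest(), "big") % num_chunks

def claim_bounty(chunk, committed_hash, reward_eth):
    # Pay the bounty only if the submitted chunk hashes to the value committed when the contract was set up.
    if hashlib.sha256(chunk).digest() == committed_hash:
        return reward_eth
    return 0

chunks = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
commitments = [hashlib.sha256(c).digest() for c in chunks]     # published at contract creation
i = challenge_index(b"\x00" * 32, len(chunks))                 # a dummy 32-byte previous block hash
print(i, claim_bounty(chunks[i], commitments[i], reward_eth=1))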
Decentralized storage systems consist of a peer-to-peer network of user-operators who hold a portion of the overall data, creating a resilient file storage sharing system.
Ethereum itself can be used as a decentralized storage system, and it is when it comes to code storage in all the smart contracts. However, when it comes to large amounts of data, that isn't what Ethereum was designed for.
In most cases, instead of storing all data on-chain, the hash of where the data is located on a chain gets stored. This way, the entire chain doesn't need to scale to keep all of the data.
Filecoin (FIL) is a decentralized storage network based on the Interplanetary File Storage (IPFS) protocol.
Storj (STORJ) is an open-source platform that leverages the blockchain to provide end-to-end encrypted cloud storage services.
Swarm (BZZ) is a system of peer-to-peer networked nodes that create a decentralised storage and communication service. The system is economically self-sustaining due to a built-in incentive system enforced through smart contracts on the Ethereum blockchain.
The general concept of a "decentralized organization" is that of a virtual entity that has a certain set of members or shareholders which, perhaps with a 67% majority, have the right to spend the entity's funds and modify its code. The members would collectively decide on how the organization should allocate its funds. Methods for allocating a DAO's funds could range from bounties, salaries to even more exotic mechanisms such as an internal currency to reward work. This essentially replicates the legal trappings of a traditional company or nonprofit but using only cryptographic blockchain technology for enforcement. So far much of the talk around DAOs has been around the "capitalist" model of a "decentralized autonomous corporation" (DAC) with dividend-receiving shareholders and tradable shares; an alternative, perhaps described as a "decentralized autonomous community", would have all members have an equal share in the decision making and require 67% of existing members to agree to add or remove a member. The requirement that one person can only have one membership would then need to be enforced collectively by the group.
A general outline for how to code a DO is as follows. The simplest design is simply a piece of self-modifying code that changes if two thirds of members agree on a change. Although code is theoretically immutable, one can easily get around this and have de-facto mutability by having chunks of the code in separate contracts, and having the address of which contracts to call stored in the modifiable storage. In a simple implementation of such a DAO contract, there would be three transaction types, distinguished by the data provided in the transaction:
● [0,i,K,V] to register a proposal with index i to change the address at storage index K to value V
● [1,i] to register a vote in favor of proposal i
● [2,i] to finalize proposal i if enough votes have been made
The contract would then have clauses for each of these. It would maintain a record of all open storage changes, along with a list of who voted for them. It would also have a list of all members. When any storage change gets to two thirds of members voting for it, a finalizing transaction could execute the change. A more sophisticated skeleton would also have built-in voting ability for features like sending a transaction, adding members and removing members, and may even provide for Liquid Democracy-style vote delegation (ie. anyone can assign someone to vote for them, and assignment is transitive so if A assigns B and B assigns C then C determines A's vote). This design would allow the DO to grow organically as a decentralized community, allowing people to eventually delegate the task of filtering out who is a member to specialists, although unlike in the "current system" specialists can easily pop in and out of existence over time as individual community members change their alignments.
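As a minimal sketch of the voting flow above, here is an in-memory Python model of the three transaction types and the two-thirds rule; the member set, function names and storage keys are illustrative assumptions, not actual contract code:

members = {"alice", "bob", "carol"}
storage = {}
proposals = {}   # proposal index -> (K, V)
votes = {}       # proposal index -> set of members in favour

def register_proposal(i, K, V):              # [0, i, K, V]
    proposals[i] = (K, V)
    votes[i] = set()

def vote(member, i):                         # [1, i]
    if member in members:
        votes[i].add(member)

def finalize(i):                             # [2, i]
    if len(votes[i]) * 3 >= len(members) * 2:    # two thirds of members agree
        K, V = proposals[i]
        storage[K] = V                       # execute the storage change

register_proposal(1, "oracle_address", "0xabc")
vote("alice", 1)
vote("bob", 1)
finalize(1)                                  # storage["oracle_address"] is now "0xabc"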
An alternative model is for a decentralized corporation, where any account can have zero or more shares, and two thirds of the shares are required to make a decision. A complete skeleton would involve asset management functionality, the ability to make an offer to buy or sell shares, and the ability to accept offers (preferably with an order-matching mechanism inside the contract). Delegation would also exist Liquid Democracy-style, generalizing the concept of a "board of directors".
In the future, more advanced mechanisms for organizational governance may be implemented; it is at this point that a decentralized organization (DO) can start to be described as a decentralized autonomous organization (DAO). The difference between a DO and a DAO is fuzzy, but the general dividing line is whether the governance is generally carried out via a political-like process or an “automatic” process; a good intuitive test is the “no common language” criterion: can the organization still function if no two members spoke the same language? Clearly, a simple traditional shareholder-style corporation would fail, whereas something like the Bitcoin protocol would be much more likely to succeed. Robin Hanson’s futarchy, a mechanism for organizational governance via prediction markets, is a good example of what truly “autonomous” governance might look like. Note that one should not necessarily assume that all DAOs are superior to all DOs; automation is simply a paradigm that is likely to have very large benefits in certain particular places and may not be practical in others, and many semi-DAOs are also likely to exist.
A decentralized autonomous organization (DAO) is an organization represented by rules encoded as a computer program, transparent and controlled by the organization's members. A DAO's financial transaction record and program rules are maintained on a blockchain.
MakerDAO is a project on Ethereum and a DAO created in 2014. The project is managed by people around the world who hold its governance token, MKR. MKR holders are responsible for governing the Maker Protocol, which includes adjusting policy for the Dai stablecoin, choosing new collateral types, and improving governance itself.
The ENS DAO is a DAO, created in 2021, that governs the ENS protocol, a decentralized naming system. The project is managed by people around the world who hold its governance token, ENS.
Once a contract is in the blockchain, it is final and cannot be changed. Certain parameters, of course, can be changed if they are allowed to change via the original code. One method of updating contracts is to use a versioning system. For example, you could have an entryway contract that just forwards all calls to the most recent version of the contract, as defined by an updatable address parameter.
Futarchy is a form of government proposed by economist Robin Hanson, in which elected officials define measures of national wellbeing, and prediction markets are used to determine which policies will have the most positive effect.
1. Savings wallets. Suppose that Alice wants to keep her funds safe, but is worried that she will lose or someone will hack her private key. She puts ether into a contract with Bob, a bank, as follows:
● Alice alone can withdraw a maximum of 1% of the funds per day.
● Bob alone can withdraw a maximum of 1% of the funds per day, but Alice has the ability to make a transaction with her key shutting off this ability.
● Alice and Bob together can withdraw anything.
Normally, 1% per day is enough for Alice, and if Alice wants to withdraw more she can contact Bob for help. If Alice's key gets hacked, she runs to Bob to move the funds to a new contract. If she loses her key, Bob will get the funds out eventually. If Bob turns out to be malicious, then she can turn off his ability to withdraw.
2. Crop insurance. One can easily make a financial derivatives contract but using a data feed of the weather instead of any price index. If a farmer in Iowa purchases a derivative that pays out inversely based on the precipitation in Iowa, then if there is a drought, the farmer will automatically receive money and if there is enough rain the farmer will be happy because their crops would do well.
3. A decentralized data feed. For financial contracts for difference, it may actually be possible to decentralize the data feed via a protocol called "SchellingCoin". SchellingCoin basically works as follows: N parties all put into the system the value of a given datum (eg. the ETH/USD price), the values are sorted, and everyone between the 25th and 75th percentile gets one token as a reward. Everyone has the incentive to provide the answer that everyone else will provide, and the only value that a large number of players can realistically agree on is the obvious default: the truth. This creates a decentralized protocol that can theoretically provide any number of values, including the ETH/USD price, the temperature in Berlin or even the result of a particular hard computation.
4. Smart multi-signature escrow. Bitcoin allows multisignature transaction contracts where, for example, three out of a given five keys can spend the funds. Ethereum allows for more granularity; for example, four out of five can spend everything, three out of five can spend up to 10% per day, and two out of five can spend up to 0.5% per day. Additionally, Ethereum multisig is asynchronous - two parties can register their signatures on the blockchain at different times and the last signature will automatically send the transaction.
5. Cloud computing. The EVM technology can also be used to create a verifiable computing environment, allowing users to ask others to carry out computations and then optionally ask for proofs that computations at certain randomly selected checkpoints were done correctly. This allows for the creation of a cloud computing market where any user can participate with their desktop, laptop or specialized server, and spot-checking together with security deposits can be used to ensure that the system is trustworthy (ie. nodes cannot profitably cheat). Note that such a system may not be suitable for all tasks; tasks that require a high level of inter-process communication, for example, cannot easily be done on a large cloud of nodes. Other tasks, however, are much easier to parallelize; projects like SETI@home, folding@home and genetic algorithms can easily be implemented on top of such a platform.
6. Peer-to-peer gambling. Any number of peer-to-peer gambling protocols, such as Frank Stajano and Richard Clayton's Cyberdice, can be implemented on the Ethereum blockchain. The simplest gambling protocol is actually simply a contract for difference on the next block hash, and more advanced protocols can be built up from there, creating gambling services with near-zero fees that have no ability to cheat.
7. Prediction markets. Provided an oracle or SchellingCoin, prediction markets are also easy to implement, and prediction markets together with SchellingCoin may prove to be the first mainstream application of futarchy as a governance protocol for decentralized organizations.
8. On-chain decentralized marketplaces, using the identity and reputation system as a base.
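Returning to example 3 above, the SchellingCoin payout rule can be sketched in a few lines of Python; the exact percentile convention and the feed values below are assumptions for illustration only:

def schelling_rewards(submissions):          # submissions: {party: reported value}
    ordered = sorted(submissions.items(), key=lambda kv: kv[1])
    n = len(ordered)
    lo, hi = n // 4, (3 * n) // 4            # positions of the 25th and 75th percentiles
    return {party: 1 for party, _ in ordered[lo:hi]}   # one token each

feeds = {"a": 3010, "b": 3005, "c": 2990, "d": 9999, "e": 3000}
print(schelling_rewards(feeds))              # the outlier "d" earns nothing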
Applications that run with a blockchain backend are called decentralized apps (dapps).
Today there are many dapps around finance, collectibles, gaming, prediction markets, exchanges, marketplaces, etc.
Some dapp lists (current as of Nov 2021) are:
ethereum.org dapps
stateofthedapps.com dapps
dappradar.com dapps
In game theory, a focal point (or Schelling point) is a solution that people tend to choose by default in the absence of communication.
A SchellingCoin allows for a minimal-trust universal data feed.
Cyberdice refers to "Cyberdice: Peer-to-peer gambling in the presence of cheaters", a paper by Frank Stajano and Richard Clayton.
The "Greedy Heaviest Observed Subtree" (GHOST) protocol is an innovation first introduced by Yonatan Sompolinsky and Aviv Zohar in December 2013. The motivation behind GHOST is that blockchains with fast confirmation times currently suffer from reduced security due to a high stale rate - because blocks take a certain time to propagate through the network, if miner A mines a block and then miner B happens to mine another block before miner A's block propagates to B, miner B's block will end up wasted and will not contribute to network security. Furthermore, there is a centralization issue: if miner A is a mining pool with 30% hashpower and B has 10% hashpower, A will have a risk of producing a stale block 70% of the time (since the other 30% of the time A produced the last block and so will get mining data immediately) whereas B will have a risk of producing a stale block 90% of the time. Thus, if the block interval is short enough for the stale rate to be high, A will be substantially more efficient simply by virtue of its size. With these two effects combined, blockchains which produce blocks quickly are very likely to lead to one mining pool having a large enough percentage of the network hashpower to have de facto control over the mining process.
As described by Sompolinsky and Zohar, GHOST solves the first issue of network security loss by including stale blocks in the calculation of which chain is the "longest"; that is to say, not just the parent and further ancestors of a block, but also the stale children of the block's ancestors (in Ethereum jargon, "uncles") are added to the calculation of which block has the largest total proof of work backing it. To solve the second issue of centralization bias, we go beyond the protocol described by Sompolinsky and Zohar, and also allow stales to be registered into the main chain to receive a block reward: a stale block receives 93.75% of its base reward, and the nephew that includes the stale block receives the remaining 6.25%. Transaction fees, however, are not awarded to uncles.
Ethereum implements a simplified version of GHOST which only goes down five levels. Specifically, a stale block can only be included as an uncle by the 2nd to 5th generation child of its parent, and not any block with a more distant relation (eg. 6th generation child of a parent, or 3rd generation child of a grandparent). This was done for several reasons. First, unlimited GHOST would include too many complications into the calculation of which uncles for a given block are valid. Second, unlimited GHOST with compensation as used in Ethereum removes the incentive for a miner to mine on the main chain and not the chain of a public attacker. Finally, calculations show that five-level GHOST with incentivization is over 95% efficient even with a 15s block time, and miners with 25% hashpower show centralization gains of less than 3%.
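The uncle-reward split above works out as follows; the 5 ether base reward is purely illustrative (Ethereum's actual reward schedule has changed several times since):

base_reward = 5.0
uncle_reward = base_reward * 0.9375    # 93.75% goes to the miner of the stale block
nephew_bonus = base_reward * 0.0625    # 6.25% goes to the block that includes it
print(uncle_reward, nephew_bonus)      # 4.6875 0.3125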
The GHOST protocol was introduced by Sompolinsky and Zohar in "Accelerating Bitcoin’s Transaction Processing", 2013.
It appears that Ethereum no longer uses the GHOST protocol. Ethereum does, however, use a modified version of the Inclusive protocol as described by Lewenberg, Sompolinsky and Zohar in "Inclusive Block Chain Protocols", 2015.
Yonatan Sompolinsky discusses Ethereum dropping the GHOST protocol in this talk: SPECTRE - BPASE '18 at 23m38s.
When a miner finds a valid block, another miner may have published a competing block which is added to the tip of the blockchain first. This valid, but stale, block can be included by newer blocks as ommers and receive a partial block reward. The term "ommer" is the preferred gender-neutral term for the sibling of a parent block, but this is also sometimes referred to as an "uncle".
Because every transaction published into the blockchain imposes on the network the cost of needing to download and verify it, there is a need for some regulatory mechanism, typically involving transaction fees, to prevent abuse. The default approach, used in Bitcoin, is to have purely voluntary fees, relying on miners to act as the gatekeepers and set dynamic minimums. This approach has been received very favorably in the Bitcoin community particularly because it is "market-based", allowing supply and demand between miners and transaction senders to determine the price. The problem with this line of reasoning is, however, that transaction processing is not a market; although it is intuitively attractive to construe transaction processing as a service that the miner is offering to the sender, in reality every transaction that a miner includes will need to be processed by every node in the network, so the vast majority of the cost of transaction processing is borne by third parties and not the miner that is making the decision of whether or not to include it. Hence, tragedy-of-the-commons problems are very likely to occur.
However, as it turns out this flaw in the market-based mechanism, when given a particular inaccurate simplifying assumption, magically cancels itself out. The argument is as follows. Suppose that:
1. A transaction leads to k operations, offering the reward kR to any miner that includes it where R is set by the sender and k and R are (roughly) visible to the miner beforehand.
2. An operation has a processing cost of C to any node (ie. all nodes have equal efficiency)
3. There are N mining nodes, each with exactly equal processing power (ie. 1/N of total)
4. No non-mining full nodes exist.
A miner would be willing to process a transaction if the expected reward is greater than the cost. Thus, the expected reward is kR/N since the miner has a 1/N chance of processing the next block, and the processing cost for the miner is simply kC. Hence, miners will include transactions where kR/N > kC, or R > NC. Note that R is the per-operation fee provided by the sender, and is thus a lower bound on the benefit that the sender derives from the transaction, and NC is the cost to the entire network together of processing an operation. Hence, miners have the incentive to include only those transactions for which the total utilitarian benefit exceeds the cost.
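As a quick numeric check of this argument (all values below are made up for illustration):

k, R = 100, 0.002          # operations in the transaction, per-operation fee
N, C = 1000, 0.000001      # mining nodes, per-operation cost to each node
expected_reward = k * R / N    # the miner mines the next block with probability 1/N
cost_to_miner = k * C
print(expected_reward > cost_to_miner)   # True here, and equivalent to R > N * C
print(R > N * C)                         # the same inclusion condition, as stated above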
However, there are several important deviations from those assumptions in reality:
1. The miner does pay a higher cost to process the transaction than the other verifying nodes, since the extra verification time delays block propagation and thus increases the chance the block will become a stale.
2. There do exist non-mining full nodes.
3. The mining power distribution may end up radically inegalitarian in practice.
4. Speculators, political enemies and crazies whose utility function includes causing harm to the network do exist, and they can cleverly set up contracts whose cost is much lower than the cost paid by other verifying nodes.
Point 1 above provides a tendency for the miner to include fewer transactions, and point 2 increases NC; hence, these two effects at least partially cancel each other out. Points 3 and 4 are the major issue; to solve them we simply institute a floating cap: no block can have more operations than BLK_LIMIT_FACTOR times the long-term exponential moving average. Specifically:
blk.oplimit = floor((blk.parent.oplimit * (EMA_FACTOR - 1) + floor(parent.opcount * BLK_LIMIT_FACTOR)) / EMA_FACTOR)
BLK_LIMIT_FACTOR and EMA_FACTOR are constants that will be set to 65536 and 1.5 for the time being, but will likely be changed after further analysis.
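A direct Python transcription of the formula above, with made-up parent values for illustration:

from math import floor

BLK_LIMIT_FACTOR = 65536
EMA_FACTOR = 1.5

def next_op_limit(parent_oplimit, parent_opcount):
    return floor((parent_oplimit * (EMA_FACTOR - 1)
                  + floor(parent_opcount * BLK_LIMIT_FACTOR)) / EMA_FACTOR)

print(next_op_limit(1000000, 10))   # the limit drifts toward an exponential moving average of usage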
Since each transaction requires computational resources to execute, each transaction requires a fee. Gas refers to the fee required to conduct a transaction successfully.
Gas fees help keep the Ethereum network secure. By requiring a fee for every computation executed on the network, we prevent bad actors from spamming the network. In order to avoid accidental or hostile infinite loops or other computational wastage in code, each transaction is required to set a limit to how many computational steps of code execution it can use.
The tragedy of the commons is a situation in which individual users, who have open access to a resource unhampered by shared social structures or formal rules act independently according to their own self-interest and, contrary to the common good of all users, cause depletion of the resource through their uncoordinated action.
Around 2019 a hidden "fee" was identified with blockchains: Maximal extractable value (MEV). This refers to the maximum value that can be extracted from block production in excess of the standard block reward and gas fees by including, excluding, and changing the order of transactions in a block.
MEV applies to both proof-of-work and proof-of-stake blockchains since it comes from being able to order transactions in a block.
For further MEV details see "Flash Boys 2.0: Frontrunning, Transaction Reordering, and Consensus Instability in Decentralized Exchanges" (Daian et al, 2019) and Flashbots: Frontrunning the MEV crisis.
An important note is that the Ethereum virtual machine is Turing-complete; this means that EVM code can encode any computation that can be conceivably carried out, including infinite loops. EVM code allows looping in two ways. First, there is a JUMP instruction that allows the program to jump back to a previous spot in the code, and a JUMPI instruction to do conditional jumping, allowing for statements like while x < 27: x = x * 2. Second, contracts can call other contracts, potentially allowing for looping through recursion. This naturally leads to a problem: can malicious users essentially shut miners and full nodes down by forcing them to enter into an infinite loop? The issue arises because of a problem in computer science known as the halting problem: there is no way to tell, in the general case, whether or not a given program will ever halt.
As described in the state transition section, our solution works by requiring a transaction to set a maximum number of computational steps that it is allowed to take, and if execution takes longer computation is reverted but fees are still paid. Messages work in the same way. To show the motivation behind our solution, consider the following examples:
● An attacker creates a contract which runs an infinite loop, and then sends a transaction activating that loop to the miner. The miner will process the transaction, running the infinite loop, and wait for it to run out of gas. Even though the execution runs out of gas and stops halfway through, the transaction is still valid and the miner still claims the fee from the attacker for each computational step.
● An attacker creates a very long infinite loop with the intent of forcing the miner to keep computing for such a long time that by the time computation finishes a few more blocks will have come out and it will not be possible for the miner to include the transaction to claim the fee. However, the attacker will be required to submit a value for STARTGAS limiting the number of computational steps that execution can take, so the miner will know ahead of time that the computation will take an excessively large number of steps.
● An attacker sees a contract with code of some form like send(A,contract.storage[A]); contract.storage[A] = 0, and sends a transaction with just enough gas to run the first step but not the second (ie. making a withdrawal but not letting the balance go down). The contract author does not need to worry about protecting against such attacks, because if execution stops halfway through the changes get reverted.
● A financial contract works by taking the median of nine proprietary data feeds in order to minimize risk. An attacker takes over one of the data feeds, which is designed to be modifiable via the variable-address-call mechanism described in the section on DAOs, and converts it to run an infinite loop, thereby attempting to force any attempts to claim funds from the financial contract to run out of gas. However, the financial contract can set a gas limit on the message to prevent this problem.
The alternative to Turing-completeness is Turing-incompleteness, where JUMP and JUMPI do not exist and only one copy of each contract is allowed to exist in the call stack at any given time. With this system, the fee system described and the uncertainties around the effectiveness of our solution might not be necessary, as the cost of executing a contract would be bounded above by its size. Additionally, Turing-incompleteness is not even that big a limitation; out of all the contract examples we have conceived internally, so far only one required a loop, and even that loop could be removed by making 26 repetitions of a one-line piece of code. Given the serious implications of Turing-completeness, and the limited benefit, why not simply have a Turing-incomplete language? In reality, however, Turing-incompleteness is far from a neat solution to the problem. To see why, consider the following contracts:
C0: call(C1); call(C1);
C1: call(C2); call(C2);
C2: call(C3); call(C3);
...
C49: call(C50); call(C50);
C50: (run one step of a program and record the change in storage)
Now, send a transaction to C0. Thus, in 51 transactions, we have a contract that takes up 2^50 computational steps. Miners could try to detect such logic bombs ahead of time by maintaining a value alongside each contract specifying the maximum number of computational steps that it can take, and calculating this for contracts calling other contracts recursively, but that would require miners to forbid contracts that create other contracts (since the creation and execution of all 50 contracts above could easily be rolled into a single contract). Another problematic point is that the address field of a message is a variable, so in general it may not even be possible to tell which other contracts a given contract will call ahead of time. Hence, all in all, we have a surprising conclusion: Turing-completeness is surprisingly easy to manage, and the lack of Turing-completeness is equally surprisingly difficult to manage unless the exact same controls are in place - but in that case why not just let the protocol be Turing-complete?
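To see the blowup concretely, a tiny sketch counting how many times the last contract runs (illustrative only):

def executions_of_last_contract(levels):
    runs = 1
    for _ in range(levels):
        runs *= 2          # each level calls the next contract twice
    return runs

print(executions_of_last_contract(50))   # 1125899906842624, i.e. 2^50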
Gas limit refers to the maximum amount of gas you are willing to consume on a transaction. More complicated transactions involving smart contracts require more computational work than a simple payment.
For example, if you put a gas limit of L for a transaction, and the transaction consumed C gas where C<L, you would get back the remaining L-C gas.
However, if L is insufficient, the EVM will consume your L gas units attempting to fulfill the transaction and it will not complete. The EVM then reverts any changes, but since the miner has already done L gas units worth of work, that gas is consumed.
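A numeric illustration of the refund rule (the gas figures are made up):

gas_limit = 50000        # L, set by the sender
gas_used = 32000         # C, actually consumed by the EVM
refund = gas_limit - gas_used
print(refund)            # 18000 gas units returned to the sender
# had the limit been too low, all 50000 units would be consumed,
# the state changes reverted, and nothing refunded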
The Ethereum network includes its own built-in currency, ether, which serves the dual purpose of providing a primary liquidity layer to allow for efficient exchange between various types of digital assets and, more importantly, of providing a mechanism for paying transaction fees. For convenience and to avoid future argument (see the current mBTC/uBTC/satoshi debate in Bitcoin), the denominations will be pre-labelled:
● 1: wei
● 10^12: szabo
● 10^15: finney
● 10^18: ether
This should be taken as an expanded version of the concept of "dollars" and "cents" or "BTC" and "satoshi". In the near future, we expect "ether" to be used for ordinary transactions, "finney" for microtransactions and "szabo" and "wei" for technical discussions around fees and protocol implementation.
The issuance model will be as follows:
● Ether will be released in a currency sale at the price of 1337-2000 ether per BTC, a mechanism intended to fund the Ethereum organization and pay for development that has been used with success by a number of other cryptographic platforms. Earlier buyers will benefit from larger discounts. The BTC received from the sale will be used entirely to pay salaries and bounties to developers, researchers and projects in the cryptocurrency ecosystem.
● 0.099x the total amount sold will be allocated to early contributors who participated in development before BTC funding or certainty of funding was available, and another 0.099x will be allocated to long-term research projects.
● 0.26x the total amount sold will be allocated to miners per year forever after that point.
Issuance Breakdown
The permanent linear supply growth model reduces the risk of what some see as excessive wealth concentration in Bitcoin, and gives individuals living in present and future eras a fair chance to acquire currency units, while at the same time discouraging depreciation of ether because the "supply growth rate" as a percentage still tends to zero over time. We also theorize that because coins are always lost over time due to carelessness, death, etc, and coin loss can be modeled as a percentage of the total supply per year, that the total currency supply in circulation will in fact eventually stabilize at a value equal to the annual issuance divided by the loss rate (eg. at a loss rate of 1%, once the supply reaches 26X then 0.26X will be mined and 0.26X lost every year, creating an equilibrium).
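A quick simulation of this equilibrium argument, using the figures from the text (the 1.198X launch supply is taken from the issuance table; X is the amount sold in the sale):

X = 1.0
issuance_per_year = 0.26 * X
loss_rate = 0.01
supply = 1.198 * X
for year in range(2000):
    supply = supply * (1 - loss_rate) + issuance_per_year
print(round(supply, 3))   # approaches 26.0, i.e. annual issuance divided by the loss rate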
Group | At launch | After 1 year | After 5 years |
---|---|---|---|
Currency units | 1.198X | 1.458X | 2.498X |
Purchasers | 83.5% | 68.6% | 40.0% |
Reserve spent pre-sale | 8.26% | 6.79% | 3.96% |
Reserve used post-sale | 8.26% | 6.79% | 3.96% |
Miners | 0% | 17.8% | 52.0% |
Despite the linear currency issuance, just like with Bitcoin over time the supply growth rate nevertheless tends to zero.
Ethereum does not have a fixed supply because a fixed supply would also require a fixed security budget for the Ethereum network. Rather than arbitrarily fix Ethereum's security, Ethereum's monetary policy is best described as "minimum issuance to secure the network".
The London Upgrade was implemented in Aug 2021, to make transacting on Ethereum more predictable for users by overhauling Ethereum's transaction-fee-mechanism. This upgrade included offsetting the ETH issuance by burning a percentage of transaction fees (see EIP-1559).
The supply of Ethereum should decrease after the network transitions from proof of work to proof-of-stake.
According to simulations from Ethereum tracker Ultrasound Money, after the transition to proof-of-stake, the supply of ETH is set to decline 2% annually. If current rates hold, the blockchain will start burning more Ethereum than it produces with each new block.
EIP-3675 details the status of the consensus mechanism upgrade on Ethereum Mainnet that introduces proof-of-stake (expected in 2022 as of Nov 2021).
The Bitcoin mining algorithm basically works by having miners compute SHA256 on slightly modified versions of the block header millions of times over and over again, until eventually one node comes up with a version whose hash is less than the target (currently around 2^190). However, this mining algorithm is vulnerable to two forms of centralization. First, the mining ecosystem has come to be dominated by ASICs (application-specific integrated circuits), computer chips designed for, and therefore thousands of times more efficient at, the specific task of Bitcoin mining. This means that Bitcoin mining is no longer a highly decentralized and egalitarian pursuit, requiring millions of dollars of capital to effectively participate in. Second, most Bitcoin miners do not actually perform block validation locally; instead, they rely on a centralized mining pool to provide the block headers. This problem is arguably worse: as of the time of this writing, the top two mining pools indirectly control roughly 50% of processing power in the Bitcoin network, although this is mitigated by the fact that miners can switch to other mining pools if a pool or coalition attempts a 51% attack.
The current intent at Ethereum is to use a mining algorithm based on randomly generating a unique hash function for every 1000 nonces, using a sufficiently broad range of computation to remove the benefit of specialized hardware. Such a strategy will certainly not reduce the gain of centralization to zero, but it does not need to. Note that each individual user, on their private laptop or desktop, can perform a certain quantity of mining activity almost for free, paying only electricity costs, but after the point of 100% CPU utilization of their computer additional mining will require them to pay for both electricity and hardware. ASIC mining companies need to pay for electricity and hardware starting from the first hash. Hence, if the centralization gain can be kept to below this ratio, (E + H) / E, then even if ASICs are made there will still be room for ordinary miners.
Additionally, we intend to design the mining algorithm so that mining requires access to the entire blockchain, forcing miners to store the entire blockchain and at least be capable of verifying every transaction. This removes the need for centralized mining pools; although mining pools can still serve the legitimate role of evening out the randomness of reward distribution, this function can be served equally well by peer-to-peer pools with no central control. It additionally helps fight centralization, by increasing the number of full nodes in the network so that the network remains reasonably decentralized even if most ordinary users prefer light clients.
Bitcoin uses the hashcash double SHA-256 function for mining.
An ASIC (application-specific integrated circuit) is an integrated circuit customized for a particular (rather than general) use. In mining hardware, ASICs were the next step of development after CPUs, GPUs and FPGAs.
Ethash is the PoW algorithm for Ethereum 1.0. It uses a variant of SHA3 often referred to as Keccak-256.
Ethash was developed with a strong focus on protection from ASIC miners; however, in April 2018, the first ASIC miners for Ethash were announced by Bitmain. Afterwards, EIP-1057 proposed moving to the ProgPoW algorithm, but this effort stagnated.
Hashrate is the total combined computational power used to mine transactions on a PoW blockchain and is a metric for the security of a blockchain network.
Ethermine (245 TH/s) and F2Pool (153 TH/s) are the largest Ethereum mining pools (as at Nov 2021).
Total hashrate of the Ethereum network is ~800 TH/s while the Bitcoin network is significantly greater at ~150 EH/s or ~150 million TH/s (as at Nov 2021).
Ethereum is moving to a consensus mechanism called proof-of-stake from proof-of-work in 2022. PoS requires users to stake 32 ETH to become a validator in the network. Validators are responsible for the same thing as miners in PoW: ordering transactions and creating new blocks so that all nodes can agree on the state of the network.
As of Nov 2021, there are 250,000+ active validators providing 8+ million ETH (currently USD$35 billion) as economic security for Ethereum 2.0.
One common concern about Ethereum is the issue of scalability. Like Bitcoin, Ethereum suffers from the flaw that every transaction needs to be processed by every node in the network. With Bitcoin, the size of the current blockchain rests at about 20 GB, growing by about 1 MB per hour. If the Bitcoin network were to process Visa's 2000 transactions per second, it would grow by 1 MB per three seconds (1 GB per hour, 8 TB per year). Ethereum is likely to suffer a similar growth pattern, worsened by the fact that there will be many applications on top of the Ethereum blockchain instead of just a currency as is the case with Bitcoin, but ameliorated by the fact that Ethereum full nodes need to store just the state instead of the entire blockchain history.
The problem with such a large blockchain size is centralization risk. If the blockchain size increases to, say, 100 TB, then the likely scenario would be that only a very small number of large businesses would run full nodes, with all regular users using light SPV nodes. In such a situation, there arises the potential concern that the full nodes could band together and all agree to cheat in some profitable fashion (eg. change the block reward, give themselves BTC). Light nodes would have no way of detecting this immediately. Of course, at least one honest full node would likely exist, and after a few hours information about the fraud would trickle out through channels like Reddit, but at that point it would be too late: it would be up to the ordinary users to organize an effort to blacklist the given blocks, a massive and likely infeasible coordination problem on a similar scale as that of pulling off a successful 51% attack. In the case of Bitcoin, this is currently a problem, but there exists a blockchain modification suggested by Peter Todd which will alleviate this issue.
In the near term, Ethereum will use two additional strategies to cope with this problem. First, because of the blockchain-based mining algorithms, at least every miner will be forced to be a full node, creating a lower bound on the number of full nodes. Second and more importantly, however, we will include an intermediate state tree root in the blockchain after processing each transaction. Even if block validation is centralized, as long as one honest verifying node exists, the centralization problem can be circumvented via a verification protocol. If a miner publishes an invalid block, that block must either be badly formatted, or the state S[n] is incorrect. Since S[0] is known to be correct, there must be some first state S[i] that is incorrect where S[i-1] is correct. The verifying node would provide the index i, along with a "proof of invalidity" consisting of the subset of Patricia tree nodes needing to process APPLY(S[i-1],TX[i]) -> S[i]. Nodes would be able to use those nodes to run that part of the computation, and see that the S[i] generated does not match the S[i] provided.
Another, more sophisticated, attack would involve the malicious miners publishing incomplete blocks, so the full information does not even exist to determine whether or not blocks are valid. The solution to this is a challenge-response protocol: verification nodes issue "challenges" in the form of target transaction indices, and upon receiving a challenge a light node treats the block as untrusted until another node, whether the miner or another verifier, provides a subset of Patricia nodes as a proof of validity.
Ethereum’s vision is to be more scalable and secure, but also to remain decentralized. Achieving these 3 qualities is a problem known as the scalability trilemma.
The main goal of scalability is to increase transaction speed (faster finality), and transaction throughput (high transactions per second), without sacrificing decentralization or security.
On-chain scaling requires changes to the Ethereum protocol (layer 1 Mainnet). Sharding is currently the main focus for this method of scaling.
Off-chain solutions are implemented separately from layer 1 Mainnet - they require no changes to the existing Ethereum protocol. Some solutions, known as "layer 2" solutions, derive their security directly from layer 1 Ethereum consensus, such as rollups or state channels.
As of Nov 2021, Bitcoin processes ~3 txn/second, while Ethereum processes ~14 txn/second. Ethereum sharding will, hopefully, provide thousands of transactions per second.
The contract mechanism described above allows anyone to build what is essentially a command line application run on a virtual machine that is executed by consensus across the entire network, allowing it to modify a globally accessible state as its “hard drive”. However, for most people, the command line interface that is the transaction sending mechanism is not sufficiently user-friendly to make decentralization an attractive mainstream alternative. To this end, a complete “decentralized application” should consist of both low-level business-logic components, whether implemented entirely on Ethereum, using a combination of Ethereum and other systems (eg. a P2P messaging layer, one of which is currently planned to be put into the Ethereum clients) or other systems entirely, and high-level graphical user interface components. The Ethereum client’s design is to serve as a web browser, but include support for an “eth” Javascript API object, which specialized web pages viewed in the client will be able to use to interact with the Ethereum blockchain. From the point of view of the “traditional” web, these web pages are entirely static content, since the blockchain and other decentralized protocols will serve as a complete replacement for the server for the purpose of handling user-initiated requests. Eventually, decentralized protocols, hopefully themselves in some fashion using Ethereum, may be used to store the web pages themselves.
Applications that run with a blockchain backend are called decentralized apps (dapps).
web3.js is a collection of libraries that allow you to interact with a local or remote ethereum node using HTTP, IPC or WebSocket.
The Ethereum protocol was originally conceived as an upgraded version of a cryptocurrency, providing advanced features such as on-blockchain escrow, withdrawal limits and financial contracts, gambling markets and the like via a highly generalized programming language. The Ethereum protocol would not "support" any of the applications directly, but the existence of a Turing-complete programming language means that arbitrary contracts can theoretically be created for any transaction type or application. What is more interesting about Ethereum, however, is that the Ethereum protocol moves far beyond just currency. Protocols and decentralized applications around decentralized file storage, decentralized computation and decentralized prediction markets, among dozens of other such concepts, have the potential to substantially increase the efficiency of the computational industry, and provide a massive boost to other peer-to-peer protocols by adding for the first time an economic layer. Finally, there is also a substantial array of applications that have nothing to do with money at all.
The concept of an arbitrary state transition function as implemented by the Ethereum protocol provides for a platform with unique potential; rather than being a closed-ended, single-purpose protocol intended for a specific array of applications in data storage, gambling or finance, Ethereum is open-ended by design, and we believe that it is extremely well-suited to serving as a foundational layer for a very large number of both financial and non-financial protocols in the years to come.
A Short History of Ethereum 2015 - 2019: An overview of the upgrades and hard forks of Ethereum’s past, with an eye towards what lies ahead.
Ethereum 2 vision: A digital future on a global scale. Grow Ethereum until it's powerful enough to help all of humanity.
Notes
1. A sophisticated reader may notice that in fact a Bitcoin address is the hash of the elliptic curve public key, and not the public key itself. However, it is in fact perfectly legitimate cryptographic terminology to refer to the pubkey hash as a public key itself. This is because Bitcoin's cryptography can be considered to be a custom digital signature algorithm, where the public key consists of the hash of the ECC pubkey, the signature consists of the ECC pubkey concatenated with the ECC signature, and the verification algorithm involves checking the ECC pubkey in the signature against the ECC pubkey hash provided as a public key and then verifying the ECC signature against the ECC pubkey.
2. Technically, the median of the 11 previous blocks.
3. Internally, 2 and "CHARLIE" are both numbers, with the latter being in big-endian base 256 representation. Numbers can be at least 0 and at most 2^256-1.
Further Reading
1. Intrinsic value: https://tinyurl.com/BitcoinMag-IntrinsicValue
2. Smart property: Smart property
3. Smart contracts: Smart contracts
4. B-money: http://www.weidai.com/bmoney.txt
5. Reusable proofs of work: http://www.finney.org/~hal/rpow/ (fixed link)
6. Secure property titles with owner authority: http://szabo.best.vwh.net/securetitle.html (fixed link)
7. Bitcoin whitepaper: http://bitcoin.org/bitcoin.pdf
8. Namecoin: https://namecoin.org/
9. Zooko's triangle: http://en.wikipedia.org/wiki/Zooko's_triangle
10. Colored coins whitepaper: https://tinyurl.com/coloredcoin-whitepaper
11. Mastercoin whitepaper: https://github.com/mastercoin-MSC/spec
12. Decentralized autonomous corporations, Bitcoin Magazine: https://tinyurl.com/Bootstrapping-DACs
13. Simplified payment verification: https://en.bitcoin.it/wiki/Scalability#Simplifiedpaymentverification
14. Merkle trees: http://en.wikipedia.org/wiki/Merkle_tree
15. Patricia trees: http://en.wikipedia.org/wiki/Patricia_tree
16. GHOST: http://www.cs.huji.ac.il/~avivz/pubs/13/btc_scalability_full.pdf (fixed link)
17. StorJ and Autonomous Agents, Jeff Garzik: https://tinyurl.com/storj-agents
18. Mike Hearn on Smart Property at Turing Festival: http://www.youtube.com/watch?v=Pu4PAMFPo5Y (video missing)
19. Ethereum RLP: https://github.com/ethereum/wiki/wiki/%5BEnglish%5D-RLP
20. Ethereum Merkle Patricia trees: https://github.com/ethereum/wiki/wiki/%5BEnglish%5D-Patricia-Tree
21. Peter Todd on Merkle sum trees: http://sourceforge.net/p/bitcoin/mailman/message/31709140/ (link broken)
The following is an annotated version of the 2008 Bitcoin Whitepaper by the pseudonymous Satoshi Nakamoto where the original is displayed in dark text and annotations in lighter text.
This was a very useful exercise and it helped me appreciate both the history leading to Satoshi's discovery and also why Bitcoin and blockchains are such an important invention. The succinctness of the paper also stands out as quite beautiful.
Ultimately Bitcoin is a work of genius. The fact that a trusted, permissionless, distributed, immutable ledger exists and has processed enormous value for many years is truly amazing. It demarcates an important point for human history.
Without further ado, may I present...
Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
satoshin@gmx.com
www.bitcoin.org
1. Introduction
2. Transactions
3. Timestamp Server
4. Proof-of-Work
5. Network
6. Incentive
7. Reclaiming Disk Space
8. Simplified Payment Verification
9. Combining and Splitting Value
10. Privacy
11. Calculations
12. Conclusion
References
The original Bitcoin whitepaper (2008) is displayed in dark text with annotations below in lighter text.
Bitcoin (₿) is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries.
Satoshi Nakamoto is the name used by the presumed pseudonymous person or persons who developed bitcoin.
bitcoin.org was registered on 18 Aug 2008.
Abstract. A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution. Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers. The network itself requires minimal structure. Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone.
Commerce on the Internet has come to rely almost exclusively on financial institutions serving as trusted third parties to process electronic payments. While the system works well enough for most transactions, it still suffers from the inherent weaknesses of the trust based model. Completely non-reversible transactions are not really possible, since financial institutions cannot avoid mediating disputes. The cost of mediation increases transaction costs, limiting the minimum practical transaction size and cutting off the possibility for small casual transactions, and there is a broader cost in the loss of ability to make non-reversible payments for non-reversible services. With the possibility of reversal, the need for trust spreads. Merchants must be wary of their customers, hassling them for more information than they would otherwise need. A certain percentage of fraud is accepted as unavoidable. These costs and payment uncertainties can be avoided in person by using physical currency, but no mechanism exists to make payments over a communications channel without a trusted party.
What is needed is an electronic payment system based on cryptographic proof instead of trust, allowing any two willing parties to transact directly with each other without the need for a trusted third party. Transactions that are computationally impractical to reverse would protect sellers from fraud, and routine escrow mechanisms could easily be implemented to protect buyers. In this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed timestamp server to generate computational proof of the chronological order of transactions. The system is secure as long as honest nodes collectively control more CPU power than any cooperating group of attacker nodes.
One of the most valuable properties of many blockchain applications is trustlessness: the ability of the application to continue operating in an expected way without needing to rely on a specific actor to behave in a specific way even when their interests might change.
Double-spending is a potential flaw in a digital cash scheme in which the same single digital token can be spent more than once.
Peer-to-peer computing or networking is a distributed application architecture that partitions tasks or workloads between peers.
Trusted timestamping is the process of securely keeping track of the creation and modification time of a document. Security here means that no one—not even the owner of the document—should be able to change it once it has been recorded.
In a distributed system, the nodes are clients, servers or peers. A peer may sometimes serve as client, sometimes server.
We define an electronic coin as a chain of digital signatures. Each owner transfers the coin to the next by digitally signing a hash of the previous transaction and the public key of the next owner and adding these to the end of the coin. A payee can verify the signatures to verify the chain of ownership.
The problem of course is the payee can't verify that one of the owners did not double-spend the coin. A common solution is to introduce a trusted central authority, or mint, that checks every transaction for double spending. After each transaction, the coin must be returned to the mint to issue a new coin, and only coins issued directly from the mint are trusted not to be double-spent. The problem with this solution is that the fate of the entire money system depends on the company running the mint, with every transaction having to go through them, just like a bank.
We need a way for the payee to know that the previous owners did not sign any earlier transactions. For our purposes, the earliest transaction is the one that counts, so we don't care about later attempts to double-spend. The only way to confirm the absence of a transaction is to be aware of all transactions. In the mint based model, the mint was aware of all transactions and decided which arrived first. To accomplish this without a trusted party, transactions must be publicly announced [1], and we need a system for participants to agree on a single history of the order in which they were received. The payee needs proof that at the time of each transaction, the majority of nodes agreed it was the first received.
A digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents.
A cryptographic hash function is a mathematical algorithm that maps data of an arbitrary size to a bit array of a fixed size (the hash value). It is a one-way function, in that it is practically infeasible to reverse the computation.
A transaction is a transfer of Bitcoin value that is broadcast to the network and collected into blocks.
Public-key cryptography, or asymmetric cryptography, is a cryptographic system that uses pairs of keys. Each pair consists of a public key (which may be known to others) and a private key (which may not be known by anyone except the owner).
A mint is an industrial facility which manufactures coins that can be used as currency.
[1] W. Dai, "b-money", 1998
Wei Dai is a computer engineer known for contributions to cryptography and cryptocurrencies.
The solution we propose begins with a timestamp server. A timestamp server works by taking a hash of a block of items to be timestamped and widely publishing the hash, such as in a newspaper or Usenet post [2-5]. The timestamp proves that the data must have existed at the time, obviously, in order to get into the hash. Each timestamp includes the previous timestamp in its hash, forming a chain, with each additional timestamp reinforcing the ones before it.
To form a distributed timestamp server as a peer-to-peer network, bitcoin uses a proof-of-work system. This work is often called bitcoin mining.
Blocks contain the transactions on the bitcoin network. The on-chain transaction processing capacity of the network is limited by the average block creation time and the block size limit. These jointly constrain the network's throughput.
A blockchain is a list of cryptographically linked blocks typically managed by a peer-to-peer network for use as a publicly distributed ledger.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timestamping service with minimal trust requirements," In 20th Symposium on Information Theory in the Benelux, May 1999.
[3] S. Haber, W.S. Stornetta, "How to time-stamp a digital document," In Journal of Cryptology, vol 3, no 2, pages 99-111, 1991.
[4] D. Bayer, S. Haber, W.S. Stornetta, "Improving the efficiency and reliability of digital time-stamping," In Sequences II: Methods in Communication, Security and Computer Science, pages 329-334, 1993.
[5] S. Haber, W.S. Stornetta, "Secure names for bit-strings," In Proceedings of the 4th ACM Conference on Computer and Communications Security, pages 28-35, April 1997.
To implement a distributed timestamp server on a peer-to-peer basis, we will need to use a proof-of-work system similar to Adam Back's Hashcash [6], rather than newspaper or Usenet posts. The proof-of-work involves scanning for a value that when hashed, such as with SHA-256, the hash begins with a number of zero bits. The average work required is exponential in the number of zero bits required and can be verified by executing a single hash.
For our timestamp network, we implement the proof-of-work by incrementing a nonce in the block until a value is found that gives the block's hash the required zero bits. Once the CPU effort has been expended to make it satisfy the proof-of-work, the block cannot be changed without redoing the work. As later blocks are chained after it, the work to change the block would include redoing all the blocks after it.
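The nonce search can be sketched as follows; this toy Python version uses a tiny difficulty so it finishes instantly, and it is not how Bitcoin actually encodes its difficulty target:

import hashlib

def mine(header, zero_bits):
    target = 1 << (256 - zero_bits)      # hashes below this value have the required leading zero bits
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
        nonce += 1

nonce, digest = mine(b"block header bytes", 16)
print(nonce, digest.hex())               # anyone can verify by executing this single hash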
The proof-of-work also solves the problem of determining representation in majority decision making. If the majority were based on one-IP-address-one-vote, it could be subverted by anyone able to allocate many IPs. Proof-of-work is essentially one-CPU-one-vote. The majority decision is represented by the longest chain, which has the greatest proof-of-work effort invested in it. If a majority of CPU power is controlled by honest nodes, the honest chain will grow the fastest and outpace any competing chains. To modify a past block, an attacker would have to redo the proof-of-work of the block and all blocks after it and then catch up with and surpass the work of the honest nodes. We will show later that the probability of a slower attacker catching up diminishes exponentially as subsequent blocks are added.
To compensate for increasing hardware speed and varying interest in running nodes over time, the proof-of-work difficulty is determined by a moving average targeting an average number of blocks per hour. If they're generated too fast, the difficulty increases.
Proof-of-work is a form of cryptographic proof in which one party (the prover) proves to others (the verifiers) that a certain amount of computational effort has been expended.
Bitcoin uses the hashcash double SHA-256 function for mining. This article by Ken Shirriff explains Bitcoin mining in detail, right down to the hex data and network traffic.
[6] A. Back, "Hashcash - a denial of service counter-measure", 2002.
A nonce (number once) is an arbitrary number that can be used just once in a cryptographic communication.
PoW (one-CPU-one-vote) is a type of Sybil attack defence.
A Sybil attack is when an attacker creates a large number of pseudonymous identities and uses them to gain a large influence. Imposing economic costs may be used to make Sybil attacks more expensive.
Difficulty is a measure of how difficult it is to find a hash below a given target. A block is only accepted by the network if its hash meets the network's difficulty target.
The steps to run the network are as follows:
1) New transactions are broadcast to all nodes.
2) Each node collects new transactions into a block.
3) Each node works on finding a difficult proof-of-work for its block.
4) When a node finds a proof-of-work, it broadcasts the block to all nodes.
5) Nodes accept the block only if all transactions in it are valid and not already spent.
6) Nodes express their acceptance of the block by working on creating the next block in the chain, using the hash of the accepted block as the previous hash.
Bitcoin forks can be defined as changes in the protocol of the bitcoin network or in situations when two or more blocks have the same block height.
The fork is resolved when subsequent blocks are added and one of the chains becomes longer than the alternative.
Forks require consensus to be resolved or else a permanent split emerges.
A soft fork is a change that makes some previously valid blocks invalid.
A hard fork is a change that makes some previously invalid blocks valid.
With a soft fork the network may occasionally split but will quickly converge, whereas with a hard fork the network can split into two separate chains that continue growing in parallel indefinitely.
By convention, the first transaction in a block is a special transaction that starts a new coin owned by the creator of the block. This adds an incentive for nodes to support the network, and provides a way to initially distribute coins into circulation, since there is no central authority to issue them. The steady addition of a constant amount of new coins is analogous to gold miners expending resources to add gold to circulation. In our case, it is CPU time and electricity that is expended.
The incentive can also be funded with transaction fees. If the output value of a transaction is less than its input value, the difference is a transaction fee that is added to the incentive value of the block containing the transaction. Once a predetermined number of coins have entered circulation, the incentive can transition entirely to transaction fees and be completely inflation free.
The incentive may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments, or using it to generate new coins. He ought to find it more profitable to play by the rules, such rules that favour him with more new coins than everyone else combined, than to undermine the system and the validity of his own wealth.
Satoshi Nakamoto mined the Bitcoin genesis block on 3 Jan 2009 and announced v0.1 of the project on 9 Jan 2009.
The genesis block reward was 50 bitcoin, and embedded in its coinbase transaction was the text: "The Times 03/Jan/2009 Chancellor on brink of second bailout for banks".
A miner finding a new block collects all transaction fees in the block and a pre-determined block reward of newly created bitcoins.
Transaction fees are optional and miners can choose which transactions to process and prioritize those paying higher fees.
Bitcoin supply is limited and will approach a maximum of 21 million with issuance permanently halting c. 2140.
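As a quick sanity check on the 21 million cap, the issuance schedule (an initial 50 BTC subsidy halving every 210,000 blocks; these constants come from Bitcoin's implementation rather than the paper) sums to just under 21 million coins:

# Sum the geometric issuance schedule in satoshis (integer halving, as Bitcoin does).
subsidy = 50 * 100_000_000   # initial block reward in satoshis
total = 0
while subsidy > 0:
    total += 210_000 * subsidy
    subsidy //= 2
print(total / 100_000_000)   # ≈ 20,999,999.97 BTC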
Once the latest transaction in a coin is buried under enough blocks, the spent transactions before it can be discarded to save disk space. To facilitate this without breaking the block's hash, transactions are hashed in a Merkle Tree [7][2][5], with only the root included in the block's hash. Old blocks can then be compacted by stubbing off branches of the tree. The interior hashes do not need to be stored.
A block header with no transactions would be about 80 bytes. If we suppose blocks are generated every 10 minutes, 80 bytes * 6 * 24 * 365 = 4.2MB per year. With computer systems typically selling with 2GB of RAM as of 2008, and Moore's Law predicting current growth of 1.2GB per year, storage should not be a problem even if the block headers must be kept in memory.
A Merkle tree is a tree in which every leaf node is labelled with the cryptographic hash of a data block, and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes.
Merkle tree allows efficient and secure verification of the contents of large data structures.
Ralph Merkle is one of the inventors of public-key cryptography and the inventor of cryptographic hashing.
[7] R.C. Merkle, "Protocols for public key cryptosystems," In Proc. 1980 Symposium on Security and Privacy, IEEE Computer Society, pages 122-133, April 1980.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timestamping service with minimal trust requirements," In 20th Symposium on Information Theory in the Benelux, May 1999.
[5] S. Haber, W.S. Stornetta, "Secure names for bit-strings," In Proceedings of the 4th ACM Conference on Computer and Communications Security, pages 28-35, April 1997.
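A simplified sketch of computing a Merkle root from transaction hashes (it follows Bitcoin's conventions of double SHA-256 and duplicating the last node when a level has an odd number of entries, but ignores serialisation details):

import hashlib

def sha256d(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    """Repeatedly hash pairs of nodes until a single root remains."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256d(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

txs = [sha256d(bytes([i])) for i in range(5)]  # stand-in transaction hashes
print(merkle_root(txs).hex())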
It is possible to verify payments without running a full network node. A user only needs to keep a copy of the block headers of the longest proof-of-work chain, which he can get by querying network nodes until he's convinced he has the longest chain, and obtain the Merkle branch linking the transaction to the block it's timestamped in. He can't check the transaction for himself, but by linking it to a place in the chain, he can see that a network node has accepted it, and blocks added after it further confirm the network has accepted it.
As such, the verification is reliable as long as honest nodes control the network, but is more vulnerable if the network is overpowered by an attacker. While network nodes can verify transactions for themselves, the simplified method can be fooled by an attacker's fabricated transactions for as long as the attacker can continue to overpower the network. One strategy to protect against this would be to accept alerts from network nodes when they detect an invalid block, prompting the user's software to download the full block and alerted transactions to confirm the inconsistency. Businesses that receive frequent payments will probably still want to run their own nodes for more independent security and quicker verification.
Full nodes download every block and transaction and check them against Bitcoin's consensus rules.
In Simplified Payment Verification (SPV) mode, clients connect to an arbitrary full node and download only the block headers.
They verify that the headers chain together correctly and that the difficulty is high enough.
The level of difficulty required to obtain confidence the remote node is not feeding you fictional transactions depends on your threat model.
By changing how deeply buried the block must be, you can trade off confirmation time vs cost of an attack.
Running a full node is the only way you can use Bitcoin in a trustless way.
A full node verifies all rules of the network: no bitcoins are spent unless owned, no coins are spent twice, inflation happens according to the schedule, blocks meet the network difficulty, etc.
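To make the simplified verification concrete, here is a sketch of checking that a transaction hash links to a block's Merkle root via a supplied branch, reusing sha256d from the Merkle sketch above (the branch encoding is simplified; Bitcoin derives the left/right positions from the transaction's index in the block):

def verify_merkle_branch(tx_hash: bytes, branch: list[tuple[bytes, str]], root: bytes) -> bool:
    """Walk up the tree, combining the running hash with each sibling on the stated side."""
    node = tx_hash
    for sibling, side in branch:
        node = sha256d(sibling + node) if side == "left" else sha256d(node + sibling)
    return node == root

A light client holding only block headers can run this check against the Merkle root stored in a header, without downloading the rest of the block.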
Although it would be possible to handle coins individually, it would be unwieldy to make a separate transaction for every cent in a transfer. To allow value to be split and combined, transactions contain multiple inputs and outputs. Normally there will be either a single input from a larger previous transaction or multiple inputs combining smaller amounts, and at most two outputs: one for the payment, and one returning the change, if any, back to the sender.
It should be noted that fan-out, where a transaction depends on several transactions, and those transactions depend on many more, is not a problem here. There is never the need to extract a complete standalone copy of a transaction's history.
A transaction typically references previous transaction outputs as new transaction inputs and dedicates all input Bitcoin values to new outputs.
An Unspent Transaction Output (UTXO) defines an output of a transaction that has not been spent, i.e. can be used as an input in a new transaction.
A bitcoin wallet balance is actually the sum of the UTXOs controlled by the wallet's private keys.
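A minimal illustration of the input/output structure (the field names are simplified placeholders, not Bitcoin's actual transaction format):

from dataclasses import dataclass

@dataclass
class TxOutput:
    value: int          # amount in satoshis
    owner_pubkey: str   # placeholder for the real locking script

@dataclass
class TxInput:
    prev_txid: str      # transaction whose output (a UTXO) is being spent
    output_index: int   # which output of that transaction

@dataclass
class Transaction:
    inputs: list[TxInput]
    outputs: list[TxOutput]

# Spend a single 100,000-satoshi UTXO: 60,000 to the payee and 39,000 back to the
# sender as change; the 1,000 not assigned to any output is the transaction fee.
tx = Transaction(
    inputs=[TxInput(prev_txid="some-earlier-txid", output_index=0)],
    outputs=[TxOutput(60_000, "payee_key"), TxOutput(39_000, "sender_change_key")],
)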
The traditional banking model achieves a level of privacy by limiting access to information to the parties involved and the trusted third party. The necessity to announce all transactions publicly precludes this method, but privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone. This is similar to the level of information released by stock exchanges, where the time and size of individual trades, the "tape", is made public, but without telling who the parties were.
As an additional firewall, a new key pair should be used for each transaction to keep them from being linked to a common owner. Some linking is still unavoidable with multi-input transactions, which necessarily reveal that their inputs were owned by the same owner. The risk is that if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.
While Bitcoin can support strong privacy, many common ways of using it are not very private.
With proper understanding, Bitcoin can be used in a very private and anonymous way.
Transactions alone can't identify a person since addresses and transaction IDs are random numbers.
However, if any of the addresses in a transaction's past or future can be tied to an identity, it might be possible to deduce who owns the other addresses.
The encouraged practice of using a new address for every transaction is intended to make this attack more difficult.
We consider the scenario of an attacker trying to generate an alternate chain faster than the honest chain. Even if this is accomplished, it does not throw the system open to arbitrary changes, such as creating value out of thin air or taking money that never belonged to the attacker. Nodes are not going to accept an invalid transaction as payment, and honest nodes will never accept a block containing them. An attacker can only try to change one of his own transactions to take back money he recently spent.
The race between the honest chain and an attacker chain can be characterized as a Binomial Random Walk. The success event is the honest chain being extended by one block, increasing its lead by +1, and the failure event is the attacker's chain being extended by one block, reducing the gap by -1.
The probability of an attacker catching up from a given deficit is analogous to a Gambler's Ruin problem. Suppose a gambler with unlimited credit starts at a deficit and plays potentially an infinite number of trials to try to reach breakeven. We can calculate the probability he ever reaches breakeven, or that an attacker ever catches up with the honest chain, as follows [8]:
p = probability an honest node finds the next block
q = probability the attacker finds the next block
q_z = probability the attacker will ever catch up from z blocks behind
q_z = 1 if p <= q, and q_z = (q/p)^z if p > q
Given our assumption that p > q, the probability drops exponentially as the number of blocks the attacker has to catch up with increases. With the odds against him, if he doesn't make a lucky lunge forward early on, his chances become vanishingly small as he falls further behind.
We now consider how long the recipient of a new transaction needs to wait before being sufficiently certain the sender can't change the transaction. We assume the sender is an attacker who wants to make the recipient believe he paid him for a while, then switch it to pay back to himself after some time has passed. The receiver will be alerted when that happens, but the sender hopes it will be too late.
The receiver generates a new key pair and gives the public key to the sender shortly before signing. This prevents the sender from preparing a chain of blocks ahead of time by working on it continuously until he is lucky enough to get far enough ahead, then executing the transaction at that moment. Once the transaction is sent, the dishonest sender starts working in secret on a parallel chain containing an alternate version of his transaction.
The recipient waits until the transaction has been added to a block and z blocks have been linked after it. He doesn't know the exact amount of progress the attacker has made, but assuming the honest blocks took the average expected time per block, the attacker's potential progress will be a Poisson distribution with expected value:
λ = z * (q/p)
To get the probability the attacker could still catch up now, we multiply the Poisson density for each amount of progress he could have made by the probability he could catch up from that point:
sum over k = 0 to ∞ of (λ^k * e^-λ / k!) * ((q/p)^(z-k) if k <= z, otherwise 1)
Rearranging to avoid summing the infinite tail of the distribution...
= 1 − sum over k = 0 to z of (λ^k * e^-λ / k!) * (1 − (q/p)^(z-k))
Converting to C code...
#include <math.h>

// Probability that an attacker controlling a fraction q of the hash power can
// still catch up once a transaction is buried under z blocks.
double AttackerSuccessProbability(double q, int z)
{
    double p = 1.0 - q;              // honest network's share of hash power
    double lambda = z * (q / p);     // expected attacker progress (Poisson mean)
    double sum = 1.0;
    int i, k;
    for (k = 0; k <= z; k++)
    {
        // Poisson density for the attacker having already mined k blocks.
        double poisson = exp(-lambda);
        for (i = 1; i <= k; i++)
            poisson *= lambda / i;
        // Subtract the probability of never catching up from a deficit of z - k.
        sum -= poisson * (1 - pow(q / p, z - k));
    }
    return sum;
}
Running some results, we can see the probability drop off exponentially with z.
q=0.1
z=0 P=1.0000000
z=1 P=0.2045873
z=2 P=0.0509779
z=3 P=0.0131722
z=4 P=0.0034552
z=5 P=0.0009137
z=6 P=0.0002428
z=7 P=0.0000647
z=8 P=0.0000173
z=9 P=0.0000046
z=10 P=0.0000012
q=0.3
z=0 P=1.0000000
z=5 P=0.1773523
z=10 P=0.0416605
z=15 P=0.0101008
z=20 P=0.0024804
z=25 P=0.0006132
z=30 P=0.0001522
z=35 P=0.0000379
z=40 P=0.0000095
z=45 P=0.0000024
z=50 P=0.0000006
Solving for P less than 0.1%...
P < 0.001
q=0.10 z=5
q=0.15 z=8
q=0.20 z=11
q=0.25 z=15
q=0.30 z=24
q=0.35 z=41
q=0.40 z=89
q=0.45 z=340
In mathematics a random walk is a random process that describes a path that consists of a succession of random steps on some mathematical space such as the integers.
The term gambler's ruin is a statistical concept, most commonly expressed as the fact that a gambler playing a game with negative expected value will eventually go broke, regardless of their betting system.
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events in a fixed interval of time or space if these events occur with a constant mean rate and independently of the time since the last event.
A mining pool is the pooling of resources by miners, who share their processing power over a network and split the reward according to the amount of work each contributed towards finding a block.
Any mining pool that achieves a majority (51%) of the hashing power can effectively reverse recent transactions, enabling double spending.
A model of the attack can be simulated via Monte Carlo methods.
A common rule of thumb is that 6 blocks (roughly 1 hour) is enough to make reversal computationally impractical.
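A rough Monte Carlo sketch of that simulation (an approximation of the paper's model rather than something taken from it; the trial count and give_up cut-off are arbitrary choices):

import random

def simulate_attack(q: float, z: int, trials: int = 100_000, give_up: int = 200) -> float:
    """Estimate the probability that an attacker with hash-power share q can
    replace a transaction buried under z confirmations. Each new block belongs
    to the attacker with probability q and to the honest network with
    probability p = 1 - q; runs are abandoned once the attacker falls give_up
    blocks behind, which slightly underestimates the true probability."""
    successes = 0
    for _ in range(trials):
        honest = attacker = 0
        # Secret mining race while the recipient waits for z confirmations.
        while honest < z:
            if random.random() < q:
                attacker += 1
            else:
                honest += 1
        deficit = honest - attacker
        # Keep racing until the attacker draws level or falls hopelessly behind.
        while 0 < deficit < give_up:
            deficit += -1 if random.random() < q else 1
        if deficit <= 0:
            successes += 1
    return successes / trials

print(simulate_attack(0.1, 6))  # should land in the same ballpark as the table above (~0.0002)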
We have proposed a system for electronic transactions without relying on trust. We started with the usual framework of coins made from digital signatures, which provides strong control of ownership, but is incomplete without a way to prevent double-spending. To solve this, we proposed a peer-to-peer network using proof-of-work to record a public history of transactions that quickly becomes computationally impractical for an attacker to change if honest nodes control a majority of CPU power. The network is robust in its unstructured simplicity. Nodes work all at once with little coordination. They do not need to be identified, since messages are not routed to any particular place and only need to be delivered on a best effort basis. Nodes can leave and rejoin the network at will, accepting the proof-of-work chain as proof of what happened while they were gone. They vote with their CPU power, expressing their acceptance of valid blocks by working on extending them and rejecting invalid blocks by refusing to work on them. Any needed rules and incentives can be enforced with this consensus mechanism.
The word consensus comes from Latin meaning "agreement, accord".
A fundamental problem in distributed computing is to coordinate processes to reach consensus on some data value that is needed during computation.
Proof-of-work is an example of a permissionless consensus protocol that allows anyone in the network to join dynamically and participate without prior permission.
A number of permissionless consensus protocols exist, including proof-of-stake, which selects validators in proportion to their holdings of the associated cryptocurrency and does not incentivise extreme amounts of energy consumption.
[1] W. Dai, "b-money," http://www.weidai.com/bmoney.txt, 1998.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timestamping service with minimal trust requirements," In 20th Symposium on Information Theory in the Benelux, May 1999.
[3] S. Haber, W.S. Stornetta, "How to time-stamp a digital document," In Journal of Cryptology, vol 3, no 2, pages 99-111, 1991.
[4] D. Bayer, S. Haber, W.S. Stornetta, "Improving the efficiency and reliability of digital time-stamping," In Sequences II: Methods in Communication, Security and Computer Science, pages 329-334, 1993.
[5] S. Haber, W.S. Stornetta, "Secure names for bit-strings," In Proceedings of the 4th ACM Conference on Computer and Communications Security, pages 28-35, April 1997.
[6] A. Back, "Hashcash - a denial of service counter-measure," http://www.hashcash.org/papers/hashcash.pdf, 2002.
[7] R.C. Merkle, "Protocols for public key cryptosystems," In Proc. 1980 Symposium on Security and Privacy, IEEE Computer Society, pages 122-133, April 1980.
[8] W. Feller, "An introduction to probability theory and its applications," 1957.
Topic modelling is a useful approach for automatically organising a large collection of documents into topics. Automatic is the key here - we don’t require predefined document labels nor even predefined topics to perform the organisation. Topic modelling falls under a broader collection of methods within Natural Language Processing and Machine Learning.
For example, thousands of New York Times articles (our collection of documents, also known as a corpus) can be organised into a number of topics. This could enable readers to visually explore stories within a particular topic or enable the publisher to recommend articles based on reader preferences.
Many methods have been developed for topic modelling over the past few decades. Most methods rely on the fact that a document is typically about a particular subject and has words that frequently occur together. For example, an email planning your next holiday will likely contain the words ‘flight’ and ‘hotel’ more often than emails about other subjects. It is this statistical regularity that allows latent topics to be discovered, sometimes as if by magic.
The only data required for topic modelling is the text within each document. In machine learning terminology, this is an unsupervised learning method - meaning that we don’t require labelled data as is the case with supervised learning methods. The number of topics to automatically discover typically needs to be specified, which can be tricky and may depend on how the results will be used. Interpretation of the discovered topics can also be tricky; while they can often be summarised by their most important words, they do not necessarily correspond to topics we expected to find.
The following sections will describe (in roughly chronological order of discovery) the intuitions behind a few different methods of applying topic modelling.
In order to discover topics within a corpus, the contents of all documents must be converted into a numerical format that becomes the input for a topic modelling method or algorithm. How this conversion is performed depends on a number of factors, for example, the chosen topic modelling method, the content of the documents themselves and also which language modelling simplifications are acceptable while still achieving the desired outcome.
One numerical format, which we will use below, is called a term-document matrix (where a term is simply a word from the vocabulary over all documents.) This format discards word order and simply counts the number of occurrences of each term within a document. This simplification of ignoring word order (also known as a bag-of-words model) greatly reduces the complexity of representing the original documents and can still produce good results.
The term-document matrix is a sparse matrix containing mainly zeros (since documents typically don’t use all words) and represents each document as a column. Each entry is a count of the number of word occurrences in the document. While term-document matrices are common in information retrieval literature, for large datasets a document-term matrix having documents as rows can be easier to work with and is more common in software libraries.
The steps described here are very simple. There are many extensions that can be applied, for example, an entry in the term-document matrix can be extended to include the notion of term frequency–inverse document frequency (TF-IDF) which is a term weighting scheme. Regarding the vocabulary, in order to reduce the size of the term-document matrix, and also remove information that is often not useful, we can exclude common stop words (e.g. the, and, on) and likewise rare words that may only appear in a few documents. Additionally, word stemming can be applied such that only root words are included for analysis (e.g. fished and fishes become fish.)
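As a small illustration of these steps (using scikit-learn, which is mentioned later in the post; the toy documents are made up), we can build a document-term matrix with stop words removed:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "booked a flight and a hotel for the summer holiday",
    "the hotel pool was closed for the season",
    "the model was trained on labelled data",
]

# Documents as rows, terms as columns; English stop words are dropped.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)   # sparse count matrix, shape (n_docs, n_terms)
print(vectorizer.get_feature_names_out())
print(dtm.toarray())

Swapping CountVectorizer for TfidfVectorizer would apply the TF-IDF weighting mentioned above.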
Latent Semantic Analysis is a method that was developed in the late 1980s by Deerwester, Dumais and others 1. It takes as input a term-document matrix and factorises it into 3 new matrices which are able to jointly reconstruct the original matrix. This is achieved by applying a linear algebra method called Singular Value Decomposition (SVD) where the factorisation can be represented as M = UΣV’.
Each factored matrix gives us insight into potential topics latent in the original term-document matrix. Specifically, matrix U maps words to a topic space, matrix V maps documents to the same topic space, and matrix Σ is a diagonal matrix that tells us the ‘strength’ of each topic - whereby strong topics capture more variance in the underlying data.
Knowing the strength (or variance) of the different topics is key. Strong topics allow us to model much of the original data while weak topics can be likened to modelling noise. Given this, we can choose to keep the top K topics and discard the rest. Specifically, a variant of SVD known as Truncated SVD does precisely this and is an instance of dimensionality reduction such that we find the most important aspects of the data to enable the best approximate reconstruction of the original term-document matrix. These important aspects are the topics we are seeking to model.
Choosing an appropriate number of topics to keep can be difficult and may be dependent on the problem domain you are modelling. There are several approaches, one of the simplest being to choose a cut-off value within the strength matrix Σ, either automatically or after plotting the values to identify where the values plateau.
Given we have our truncated matrix factorisation with K topics, we also have a vector representation of each document over the topics, namely the columns of ΣV’. Documents with similar topics tend to be close within this reduced latent space. The final step to topic modelling with LSA is to apply a clustering method (for example K-means or a hierarchical clustering algorithm) over these resulting vectors.
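A minimal sketch of this pipeline, assuming the document-term matrix dtm from the earlier sketch (documents as rows, so the transformed output gives each document's coordinates in the reduced topic space):

from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

K = 2  # number of topics to keep; in practice chosen by inspecting the singular values

svd = TruncatedSVD(n_components=K)
doc_topic = svd.fit_transform(dtm)     # each row is a document in topic space
print(svd.singular_values_)            # the 'strength' of each retained topic

# Cluster the documents in the reduced latent space.
labels = KMeans(n_clusters=K, n_init=10).fit_predict(doc_topic)
print(labels)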
While LSA is simple to construct, a common criticism is that the model doesn’t explain why the matrix factorisation should generate good topics. The next method, PLSA, takes a different approach and attempts to solve this issue.
Probabilistic Latent Semantic Analysis is a probabilistic version of LSA and was developed in 1999 by Hofmann 2. While PLSA and LSA attempt to solve a similar problem regarding topic modelling of a corpus, their approaches are very different. PLSA is a probabilistic generative model, compared to LSA which is a deterministic model that uses linear algebra. Like LSA, PLSA works with a term-document matrix constructed from the collection of documents.
A probabilistic generative model describes a process of how data came to be via a sequence of probabilistic steps - in our case this is the story of how our documents may have been written or generated. Then, given this model or story, statistical inference is used to infer hidden/unobserved variables - which in this case happen to represent different topics.
The PLSA model assumes the following generative story, in which the actual parameter values will be inferred from the observed documents later:
1) Select a document d with probability p(d).
2) For each word position in the document, select a topic t with probability p(t|d).
3) Select a word w for that position with probability p(w|t).
The steps in this model and its variable dependencies can be shown using ‘plate’ notation in the following form:
In the PLSA model, we can see that choosing a word w is conditionally independent of the document given a topic (i.e. p(w|d,t) = p(w|t), which says that the probability of choosing a word given both a topic and a document is equal to the probability of choosing the word given the topic alone.) The model also allows multiple topics within each document.
The next step is to determine the topic distribution over the documents by solving the model. Statistical inference is used to estimate the parameters of the PLSA model and give us a probability regarding all topics over each document (i.e. the probability of a topic given a document, p(t|d)). This essentially finds parameters that maximise the probability of generating the original observed term-document matrix. In practice, the Expectation Maximization algorithm can be used to estimate the parameters in an iterative manner.
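For intuition, a minimal (and decidedly non-optimised) EM loop for PLSA might look like the following, where counts is a dense documents-by-terms array (e.g. dtm.toarray() from the earlier sketch); no tempering or convergence check is included:

import numpy as np

def plsa(counts: np.ndarray, n_topics: int, n_iter: int = 50, seed: int = 0):
    """Estimate p(t|d) and p(w|t) by EM on a documents-by-terms count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_t_d = rng.random((n_docs, n_topics))
    p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    p_w_t = rng.random((n_topics, n_words))
    p_w_t /= p_w_t.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each (document, word) pair.
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]              # shape (d, t, w)
        resp = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        weighted = counts[:, None, :] * resp                       # n(d, w) * p(t|d, w)
        # M-step: re-estimate the conditional distributions.
        p_w_t = weighted.sum(axis=0)
        p_w_t /= p_w_t.sum(axis=1, keepdims=True)
        p_t_d = weighted.sum(axis=2)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    return p_t_d, p_w_t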
There is a correspondence between the estimated probabilities of PLSA and the factored matrices of LSA, namely: the word-given-topic probabilities p(w|t) correspond to the word-to-topic matrix U, the topic probabilities p(t) correspond to the diagonal strength matrix Σ, and the document-given-topic probabilities p(d|t) correspond to the document-to-topic matrix V.
PLSA learns the probabilities of a topic, p(t), directly from the original data which can be a limitation if extending the model to new documents. In essence, PLSA is only a generative model for the corpus it is estimated on and not new documents.
Another potential issue is that the number of parameters grows linearly with the number of documents, which can make solving the model computationally difficult for large corpora and also lead to overfitting (however other estimation algorithms, for example Tempered EM, can help with this and improve generalisation, as can extensions to the PLSA model itself.)
Even though PLSA can exhibit limitations regarding new documents and overfitting, in many experiments it outperforms LSA. The next method, LDA, extends the PLSA generative model to address these issues.
Latent Dirichlet Allocation was developed in 2003 by Blei, Ng, and Jordan 3. It is a generalisation of PLSA and overcomes some of its shortcomings. LDA is a probabilistic generative model similar to PLSA. However, unlike PLSA, LDA allows for additional information to be encoded with the use of probabilistic distributions over topics and words. Like the other methods presented, LDA operates with a term-document matrix.
The distribution that LDA uses to model topics and words is the Dirichlet distribution. This is a distribution over probability distributions across a fixed number of distinct categories, and is a common prior in Bayesian statistics. When applied to topic modelling, LDA typically uses one Dirichlet prior for the topics-per-document distribution and another for the words-per-topic distribution.
The advantage of this approach is that the model is able to encapsulate the assumption that a document tends to cover only a small number of topics; this sparse topic property helps prevent overfitting of data, especially with a smaller corpus. A Dirichlet prior can also result in better word disambiguation and generally provides better assignment of topics over documents. If an uninformative flat (uniform) Dirichlet prior is used, the LDA model is equivalent to PLSA.
The LDA model assumes the following generative story:
1) For each topic, draw a distribution over words from a Dirichlet prior (with parameter β).
2) For each document, draw a distribution over topics from a Dirichlet prior (with parameter α).
3) For each word position in the document, select a topic t from the document's topic distribution, then select a word w from that topic's word distribution.
The steps in this model and its variable dependencies can be shown using ‘plate’ notation in the following form:
Given that LDA allows for uncertainty through the use of prior distributions, it can be considered a Bayesian version of PLSA. Unfortunately this extra complexity comes at a cost. Estimation of LDA parameters is intractable and requires computationally intensive algorithms to find approximate solutions, for example MCMC methods (such as Gibbs sampling, Metropolis-Hastings) and Variational Bayesian methods (an extension to the EM algorithm used to solve PLSA.)
Not all variables need to be inferred from the data. Some parameters can be set a priori and treated as hyperparameters, for example α and β parameters of the Dirichlet distributions can be fixed with values that depend on the number of topics and vocabulary size. Many extensions to LDA exist where additional variables are added to the model to capture different properties of documents.
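In practice the inference is rarely implemented by hand; a sketch using scikit-learn's variational-Bayes LDA (reusing dtm and vectorizer from the earlier sketch; the hyperparameter values here are arbitrary) might be:

from sklearn.decomposition import LatentDirichletAllocation

# doc_topic_prior and topic_word_prior correspond to the Dirichlet
# hyperparameters α and β discussed above.
lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,
    topic_word_prior=0.01,
    random_state=0,
)
doc_topics = lda.fit_transform(dtm)    # per-document topic proportions
print(doc_topics)

# Summarise each topic by its highest-weight words.
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-5:][::-1]])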
While LDA will often produce better topic models than either LSA or PLSA, it still has shortcomings. The number of discovered topics is still fixed and the term-document matrix still does not capture sentence structure. Inference with LDA is also computationally expensive and tends not to scale well with an increasing number of documents. There are many ongoing areas of research regarding LDA including optimising inference algorithms, extending LDA to new domains, embedding LDA within other graphical models and integration with other NLP approaches like word2vec.
There are several other methods of topic modelling (e.g. Non-negative Matrix Factorization) and many extensions exist to the models detailed in this post. In a future post I plan to cover the empirical side of topic modelling including interpretation and evaluation of models, implementation via various code libraries (e.g. Scikit Learn and Gensim) and visualisation of resulting topics for a corpus (e.g. using pyLDAvis). Additionally, I plan to produce a series of posts applying topic modelling to specific datasets and analysing the results. Thanks for reading and I hope the information presented here is helpful for someone new to topic modelling.