I first came across the word ergodicity in a book about information theory,* which I was reading because I was interested in Claude Shannon, the unicycle-riding, flame-throwing-trumpet-playing mathematician. Then I discovered NN Taleb and Ole Peters talking about ergodicity in a totally different context. Put simply: the concept is about calculating two types of average, the ensemble average and the time average. If the two are the same, the source is ergodic.
Many ways to calculate an average
You would think that the concept of “average” would be simple. Except it isn’t – there are the “mean”, “median”, and “mode” averages, and the mean itself comes in “arithmetic” and “geometric” flavours. So why do we need yet another distinction (“time” versus “ensemble”)? Let me explain.
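As a quick illustration of why the choice of average matters (the growth factors below are made up for the sketch, not taken from any real investment): the arithmetic mean of a series of returns can be positive while the geometric mean – the rate at which wealth actually compounds – is below one.

```python
import statistics

growth = [1.5, 0.6, 1.2, 0.9]  # hypothetical yearly growth factors

arithmetic = statistics.mean(growth)           # 1.05: looks like +5% a year
geometric = statistics.geometric_mean(growth)  # ~0.993: compounding actually shrinks wealth
median = statistics.median(growth)             # 1.05

print(arithmetic, geometric, median)
```

The gap between the two means is the whole ergodicity story in miniature: the arithmetic mean describes an ensemble of parallel bettors, the geometric mean describes one bettor living through the sequence.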
Many of these ideas relate to natural language processing – which I’m interested in because I think analysing financial text can provide valuable signals to investors. Berkshire Hathaway is the most obvious example, Amazon too, and, closer to home, Games Workshop. Clear thought and directness of communication seem to be characteristics of many multi-baggers. I’m still working on this – identifying multi-baggers is one application. More controversially, I’d suggest that both WPP and Burford also have well-written Annual Reports, and both have been multi-baggers in the past, though the consensus at the moment is that they won’t compound so well in the future. Rather than looking for holes in the numbers, perhaps it would be possible to use text analysis to spot warning flags (for instance, changes in accounting policy, or in how management tells the investment story) and so highlight hidden risks – thereby avoiding big blow-ups in previously high-performing multi-baggers. Maybe.
I digress. I was talking about “time” versus “ensemble” averages in the context of analysing text. A time average can be a tricky concept mathematically: letter or word frequencies will converge to an average, but you need enough text for this to happen. Word and letter frequencies also follow a Zipf distribution – the most frequent items (“e” among letters, “the” among words) occur far more often than the least frequent (“x” or “z” among letters, or “picayune” among words).
A text beginning with the sentence “The quick brown fox jumps over the lazy dog” is not representative of English letter frequency. It contains every letter of the alphabet at least once, and, assuming it is part of a longer English text, it will take time for the letter frequencies to converge to their expected averages. But once a text is a few pages long, the word and letter frequencies should settle down, and the statistical properties should not vary over time (stationarity).
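You can check how unrepresentative the pangram is with a few lines of Python (standard library only; the reference figures for running English text – roughly 12.7% for “e” and 0.07% for “z” – are well-known published frequencies, not computed here):

```python
from collections import Counter

def letter_freq(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

pangram = "The quick brown fox jumps over the lazy dog"
freq = letter_freq(pangram)

# 35 letters in total: "z" appears once, "e" three times.
print(freq["z"])  # ~0.029, versus ~0.0007 in running English text
print(freq["e"])  # ~0.086, versus ~0.127 in running English text
```

A few pages of real text would push both numbers toward their long-run values – that convergence is exactly what a time average needs.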
To deal with this problem, information theorists use the “ensemble average”, which is the average over all the texts a source can produce. That is, you average the first letter (or word) across all possible texts in the English language – it doesn’t have to be the first letter; it could be the 2nd, or 3rd, or nth – to generate an ensemble average.
To convince you that a source can be stationary (its statistics do not change over time) while its time average and ensemble average still differ, I’ll use John Pierce’s example. Consider a source which a third of the time produces option 1 (A and B alternating, starting with A), a third of the time option 2 (the same alternation starting with B), and a third of the time option 3 (a string of the same letter, E):
- A B A B A B A B A B A B
- B A B A B A B A B A B A
- E E E E E E E E E E E E
The source is stationary – its properties do not change over time. But it is not ergodic – the time averages and the ensemble average differ.
| Probability of | Time avg, sequence (1) | Time avg, sequence (2) | Time avg, sequence (3) | Ensemble average |
|---|---|---|---|---|
| A | 1/2 | 1/2 | 0 | 1/3 |
| B | 1/2 | 1/2 | 0 | 1/3 |
| E | 0 | 0 | 1 | 1/3 |
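Pierce’s example is easy to verify in a few lines of Python – the time average counts a letter along one sequence, the ensemble average counts it across all three sequences at one fixed position:

```python
# Pierce's three equally likely sequences.
seqs = ["ABABABABABAB", "BABABABABABA", "EEEEEEEEEEEE"]

def time_avg(seq, letter):
    """Frequency of a letter along a single sequence."""
    return seq.count(letter) / len(seq)

def ensemble_avg(seqs, letter, position=0):
    """Frequency of a letter across all sequences at one fixed position."""
    return sum(s[position] == letter for s in seqs) / len(seqs)

for letter in "ABE":
    print(letter,
          [time_avg(s, letter) for s in seqs],  # differs per sequence
          ensemble_avg(seqs, letter))           # always 1/3
```

Because the sequences repeat, the ensemble average is 1/3 at every position – stationarity – yet no single sequence has a time average of 1/3 for any letter.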
Assuming an ergodic source is an important assumption of information theory. It means that statistics gathered from one message apply equally well to all messages the source generates, and that probabilities and word frequencies won’t change over time. This is a useful assumption for cryptography too. But it is a clearly labelled assumption that texts are only approximately ergodic. Naturally there is some variation, and for my work comparing how company communications differ both over time and between companies, I’m assuming that where the time and ensemble averages diverge, the difference might signal something interesting.
But what has this to do with Taleb and Peters?
Taleb is concerned (among other things) with problems of gambler’s ruin. He explains ergodicity with a totally different example: imagine 100 gamblers in a casino in Monte Carlo. Some lose, some win, and one (gambler number 28) goes bust. His ruin is irrelevant to all the other gamblers. This is the ensemble probability of ruin.
Now instead imagine one person – the same person, whom Taleb calls Theodorus Ibn Warqa for some reason – going to the casino 100 days in a row. Except he doesn’t, because on day 28 he goes bust. He can’t come back to the casino to gamble, because he has no money left. There will be no day 29.
The latter is the time probability of ruin. The two are not equivalent – the situation is non-ergodic. The key concept is that if you keep making favourable bets that carry a small chance of ruin, over time you will be ruined. Stated like this, it’s obvious.
Positive-expectation bets don’t make sense if the risk of ruin grows with time.
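To make the difference concrete, here is a minimal simulation of the casino story. The parameters – a 55% win probability, a £10 stake, a £100 starting bankroll, 100 days – are my own assumptions for illustration, not Taleb’s:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def gamble(days=100, bankroll=100, stake=10, p_win=0.55):
    """One gambler's time series: favourable odds, but ruin is absorbing."""
    for _ in range(days):
        if bankroll <= 0:
            return 0  # ruined – there is no day 29 for this gambler
        bankroll += stake if random.random() < p_win else -stake
    return max(bankroll, 0)

# Ensemble view: simulate many independent gamblers.
results = [gamble() for _ in range(10_000)]
ruined = sum(r == 0 for r in results) / len(results)
avg_wealth = sum(results) / len(results)

# The ensemble average is comfortably above the starting bankroll,
# yet a meaningful fraction of individual gamblers are wiped out for good.
print(f"fraction ruined: {ruined:.1%}, average final wealth: {avg_wealth:.0f}")
```

The point is not the exact numbers but the shape: a positive-expectation bet still leaves an absorbing-barrier minority for whom the ensemble expectation is irrelevant.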
I think this is something most successful quant funds (Winton, Renaissance, Ed Thorp) would agree with, but which many other smart people (much of the economics profession) miss. The economists assume ergodicity when it is dangerous to do so. Successful hedge funds must not only make money; they must survive through time, where the time series of prices follows a wild log-Lévy distribution. There are other useful applications of this (apart from making money!) because price time series are statistically similar to word frequencies in texts – Peter Brown and Robert Mercer, who made fortunes at Renaissance, had backgrounds in the mathematics of statistical machine translation.
Prices over time are not the only place that wild distributions occur. In stock market indices, very often the bulk of the wealth is created by a small number of companies (the Berkshire Hathaways, Amazon, and to a lesser extent Games Workshop). It is a dirty little secret of portfolio management that a few investments lose a ton of money, a few are responsible for the bulk of the gains, and the 80–90% in between don’t do much at all.
Here’s a link to a study of compound returns from nearly 62,000 global common stocks over the 1990 to 2018 period. It should be easy to imagine 62,000 stocks as an ensemble. And, perhaps counter to what most people believe, most did no better than cash. The top-performing 1.3% of firms account for the $US 44.7 trillion in global stock market wealth creation from 1990 to 2018. Outside the US, less than one per cent of firms account for the $US 16.0 trillion in net wealth creation.
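A toy model shows how a multiplicative return process concentrates wealth in a few winners. The drift and volatility below are my own made-up parameters, not the study’s data – the point is the shape of the outcome, not the numbers:

```python
import math
import random

random.seed(42)  # fixed seed for reproducibility

def final_wealth(years=29, drift=0.06, vol=0.35):
    """Compound one stock's lognormal annual returns over 1990-2018 (29 years)."""
    w = 1.0
    for _ in range(years):
        w *= math.exp(random.gauss(drift - vol ** 2 / 2, vol))
    return w

wealth = sorted(final_wealth() for _ in range(62_000))
mean = sum(wealth) / len(wealth)
median = wealth[len(wealth) // 2]
top_share = sum(wealth[int(0.987 * len(wealth)):]) / sum(wealth)  # top 1.3%

# The ensemble mean is dragged up by a handful of extreme winners,
# while the median stock ends up roughly where it started.
print(f"mean {mean:.2f}, median {median:.2f}, top 1.3% share {top_share:.0%}")
```

Even with every stock drawn from the same distribution, the typical (median) stock does about as well as cash while the ensemble mean looks healthy – the same time-versus-ensemble gap as the casino.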
A different way to beat the odds
What I find fascinating is that Claude Shannon, the founder of information theory, had a totally different approach to investing from all the hedge funds above. He obviously understood ergodicity, and the difference between the time and ensemble perspectives. But he did not go down the route of setting up a hedge fund with a high-frequency trading system. Instead he made few investments, holding a concentrated buy-and-hold portfolio of companies (Hewlett-Packard, Teledyne, and Motorola made up 98% of his portfolio in 1981) that were multi-baggers on the positive side of this distribution.** Normally such a concentrated portfolio would be seen as “high risk”. But one tenet of ruin problems is that in games with unfavourable odds the gambler should reduce the number of bets he makes (common sense), which logically means increasing the amount staked on each bet (uncommon sense).
History shows that Shannon picked companies that multi-bagged and didn’t blow up*** – all the more remarkable because Shannon invested in technology hardware companies, a sector that Warren Buffett and Charlie Munger steered clear of for decades because they thought it was too hard.
Theory versus Reality
I once had dinner with an economics professor who told me that it wasn’t possible to identify such companies, because doing so would contradict Efficient Market Theory. I politely pointed out that when EMT is contradicted by reality, it is the theory that needs to change, not reality. He then ignored me. The economics professor was too intelligent to admit he might be wrong.
Shannon’s electrical engineering background (which meant he understood both hardware and probability) suggests his investment performance was probably not luck. He was the equivalent of the (very rare) individual gambler who goes into the casino at Monte Carlo and identifies when the odds are occasionally in his favour. And, in fact, this is not just an analogy: Shannon had earlier invented wearable technology to help him play roulette – he really was able to do that.
I’ve had to disable my comments – too much Russian spam – but I’m very interested in what others make of this. Do let me know your thoughts via Twitter!
Photo by Stephen Dawson
* An Introduction to Information Theory – John Pierce https://lesen.amazon.de/kp/embed?asin=B008TVLR0O&preview=newtab&linkCode=kpe&ref_=cm_sw_r_kb_dp_x1HJDbMRKVA2M
** Fortune’s Formula – William Poundstone
*** OK – yes, I know that several decades later Hewlett-Packard bought Autonomy, and Motorola has had a mixed couple of decades. If you’d bought Motorola at its peak in 2000, you’d still be underwater – but the shares have ten-bagged from their trough in the financial crisis. Shannon died in 2001 – most people tend not to worry about the risk of ruin after their funeral.
100 Baggers: Stocks That Return 100-to-1 – Christopher Mayer
Skin in the Game – NN Taleb https://lesen.amazon.de/kp/embed?asin=B077QY23RV&preview=newtab&linkCode=kpe&ref_=cm_sw_r_kb_dp_i3HJDb0NAM111
Ole Peters https://ergodicityeconomics.com/