Wood from the Trees

So large technology companies look at what we reveal about ourselves and use it for their gain? I’m trying the opposite: looking at what companies reveal about themselves and trying to use it for my own gain. By analysing the text of company financial statements, perhaps I can find patterns and relationships that are not obvious at first glance, but that text mining approaches can nonetheless pick up as valuable signals. Hunting for signals in vast forests of information.

I started with simple strategies: identifying words which might contain some signal, like “difficult” and “opportunities”, and simple heuristics (counting the frequency of the word “growth” relative to the frequency of the word “margin”). Now I’m moving on to more sophisticated machine learning techniques. My main source is Matthew Jockers’ book: Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences). http://amzn.eu/cqbbMuq

Text mining

In his book, Jockers shows how to take 50 texts by known authors, then see if the computer can identify the author of an anonymous text. Each author has a distinctive style, and one of the best ways of identifying this is comparing the usage of high frequency words like “the”, “of”, “and” and “to”. It’s then possible to use a measure known as Euclidean distance to show every text’s distance from the other texts. Two documents which are close to each other have more in common. Two documents which are further away from each other have less in common. A friend who watches “Manhunt” on Netflix tells me this is how they caught the Unabomber. Come to think of it, Euclidean distance is probably how Netflix’s recommendation system works too. If she likes “Manhunt”, she probably likes “Suits” too. That sort of thing.

As an example, assume that you have three financial reports, from BT, Unilever and National Grid. Then assume that there are two words that you have counted: let’s say “growth” and “margin”.


Comparing frequency of “growth” versus “margin(s)” in Financial Reports

Company         “growth”   “margin”
BT                    10          1
Unilever              34         26
National Grid          6          1

These can be plotted as X, Y co-ordinates on a two dimensional graph. Once plotted, you can measure the distance between the different companies. In this case BT is much closer (as measured by the frequency of these two words) to National Grid than it is to Unilever.
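The ruler measurement can also be done in a few lines of code. This is a minimal Python sketch rather than the R I actually used; the `counts` dictionary and the `pairwise_distances` helper are names made up for this illustration, and the numbers are the ones from the table above.

```python
import math
from itertools import combinations

# Word counts from the table above: (count of "growth", count of "margin").
counts = {
    "BT": (10, 1),
    "Unilever": (34, 26),
    "National Grid": (6, 1),
}

def pairwise_distances(vectors):
    """Return {(name_a, name_b): Euclidean distance} for every pair of rows."""
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }

distances = pairwise_distances(counts)
# Print the pairs from closest to furthest apart.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a} to {b}: {d:.2f}")
```

BT and National Grid come out 4.00 apart, while Unilever is more than 34 away from either of them. `math.dist` (Python 3.8 or later) accepts points of any length, so the same helper should work unchanged when each company has 24 counts instead of 2.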

With only two dimensions this is straightforward; you could do it by hand with a ruler if you really wanted to. It’s that simple. The complexity comes from adding more companies (58 in my case) and more words (24 in my case) – see the appendices at the bottom for the list of words that I used and the list of companies. I deliberately avoided high frequency words like “the” and “to” because I’m not interested in who writes the text; I’m more interested in signal words like cold hard “cash”. I also chose smaller companies, because large fund managers with lots of machine learning firepower are unlikely to be interested in smaller companies. Whereas if I can find a small company that will increase in value by more than 1,000%, I’m very interested.

Comparing the distance between all these words/features in higher dimensional space gives an idea of how similar or different the reports are. There’s a good article on Euclidean distance in Wikipedia if you are interested. https://en.wikipedia.org/wiki/Euclidean_distance
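For reference, the two-word ruler measurement generalises directly to more words. Writing x_{A,k} for the count of word k in company A's report (my own notation for this note, not taken from Jockers' book), the distance between companies A and B over all 24 words would be:

```latex
d(A, B) = \sqrt{\sum_{k=1}^{24} \left( x_{A,k} - x_{B,k} \right)^{2}}
```

With only “growth” and “margin” counted, the sum has two terms and this is just Pythagoras' theorem on the graph.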

Tree Diagram / Dendrogram

The trouble is that the human mind isn’t very good at imagining information displayed in more than 3 dimensions. I suppose I could display the results in a 58×24 matrix, but that wouldn’t be much better. Fortunately this is what R is better at than Excel. I won’t bore you with too many details, but I had to learn how to use “for” loops and “regex” too. Which means that when surrounded by young programmers with beards in Berlin bars, I can sound like I fit into the conversation. Anyway, using a couple of functions, dist() and hclust(), I can plot the results on a dendrogram for inspection.
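To give a feel for what dist() and hclust() are doing between them, here is a rough Python sketch of hierarchical clustering: start with every company in its own cluster, then repeatedly merge the two closest clusters and record the height at which they join. Two assumptions to flag: it uses "complete" linkage (the default method for hclust() in R, though I haven't said above which method I used), and it runs on the three-company toy example rather than the real 58×24 matrix. The function names are invented for this illustration.

```python
import math

def complete_linkage(cluster_a, cluster_b, vectors):
    """Distance between two clusters: the largest distance between any two members."""
    return max(
        math.dist(vectors[a], vectors[b])
        for a in cluster_a
        for b in cluster_b
    )

def agglomerate(vectors):
    """Merge the two closest clusters until one is left.

    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        height, i, j = min(
            (complete_linkage(clusters[i], clusters[j], vectors), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy data from the "growth" / "margin" example, not the real 58 x 24 matrix.
counts = {
    "BT": (10, 1),
    "Unilever": (34, 26),
    "National Grid": (6, 1),
}

for left, right, height in agglomerate(counts):
    print(f"{' + '.join(left)}  joins  {' + '.join(right)}  at height {height:.2f}")
```

On the toy data BT and National Grid join first, at a height of 4.00, and Unilever only joins that pair much higher up. A dendrogram is a drawing of exactly that merge order, with the join heights on the vertical axis.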


The first thing to notice is that I’m not clever enough to label each company with a ticker and its share price performance over the following 5 years. I did try. Instead the numbers represent company tickers, in alphabetical order. So company number 1 is AFS (Amiad Water Systems, a supplier of water filtration systems), number 2 is AIEA (Airea, which makes carpets), and so on to number 58, which is ZOO Digital Group (adds subtitles to films). See Appendix II.

One caveat I should note: this is the same sample of companies and share price performance that I previously tried sentiment analysis on. However, this time I took the entire text from the financial reports, not just the outlook statements. I’m going to need some new samples eventually. It’s just that web scraping financial data and share price performance is not my strong point, so I wanted to try this new clustering technique on what I already had, to see if it was worth pursuing before I spent lots of time building a new sample.


So – let’s play “bagger” bingo, match up the numbers with the best performing companies:

The first blue arrow points to ID number 5. This was the best performing share from July 2012 to July 2017: BOTB, up 1884%. Interestingly the cluster analysis dendrogram thinks this was closest to ID number 25, NMD, the sixth best performing stock, up 201% over the same time period.

The second best performing stock, CRL (+1415%), didn’t work. The analysis thought it was most similar to PEG, which went nowhere over 5 years.

The second blue arrow points to the third best performing stock, TRD (ID 47), +837%, which was matched with the fourth best performing, MWE (ID 24), +374%.



Ranking performance with text similarity

ID   Company   Performance   Rank   Closest to   Rank of closest
 5   BOTB      +1884%        1      25 NMD       6
 6   CRL       +1415%        2      28 PEG       low
47   TRD       +837%         3      24 MWE       4
24   MWE       +374%         4      47 TRD       3
 2   AIEA      +215%         5      39 STE       low
25   NMD       +201%         6       5 BOTB      1

Share price performance from 27 Jul 2012 to 27 Jul 2017

I’m amazed, but it actually seems to work. Particularly given that I’m much more interested in the tails of the distribution, i.e. are companies that increase 5 to 10 times in value giving signals that others are not? I should probably do some statistical test to confirm this: coincidences are more common than intuition suggests, and if you get even a small number of people in a room, there’s a surprisingly high probability two of them will share a birthday. If anyone knows the statistical test to use and how to apply it, please leave a comment. Also I should say that I had trouble web scraping some of the companies’ share price performance data, so although there are 58 company statements, I have share price performance for under 50 of them. That said, to the naked eye it looks impressive. Interestingly number 48, out there all on its own in a separate space on the left, was TRE (Trading Emissions), which fell 89% over the following 5 years.
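One rough way to put a number on the birthday-style worry is to ask how often purely random matching would do as well as the table, where 4 of the 6 best performers (BOTB, TRD, MWE and NMD) are paired with another top-6 company. The Python sketch below is a back-of-envelope check under assumptions of my own, not a proper test: it assumes 50 companies with price data (I only know it is "under 50"), and it treats each company's nearest neighbour as an independent random pick from the others, which lets it use a simple binomial tail.

```python
from math import comb

def binomial_tail(n_trials, p_success, at_least):
    """P(X >= at_least) for X ~ Binomial(n_trials, p_success)."""
    return sum(
        comb(n_trials, k) * p_success**k * (1 - p_success) ** (n_trials - k)
        for k in range(at_least, n_trials + 1)
    )

N_COMPANIES = 50   # assumption: the post only says "under 50" have price data
TOP = 6            # the six best performers listed in the table
HITS = 4           # BOTB, TRD, MWE and NMD are matched with another top-6 name

# Chance that one top-6 company's nearest neighbour is another top-6 company
# if neighbours were picked at random from the other N_COMPANIES - 1 reports.
p_hit = (TOP - 1) / (N_COMPANIES - 1)

p_value = binomial_tail(TOP, p_hit, HITS)
print(f"P(at least {HITS} of {TOP} matched by chance) = {p_value:.4f}")
```

Treat the answer with suspicion. BOTB/NMD and TRD/MWE are mutual pairs, so the four hits are not independent, and the top-6 cut-off was chosen after looking at the results. Shuffling the performance figures across the companies many times and recounting the matches (a permutation test) would avoid the independence assumption.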

The downside of the machine learning approach is that I can’t identify any particular heuristics which I can use.  That is, the approach is something of a “black box”. But I can read through the statements which the computer thinks are similar.  I put links to the 2012 results from BOTB and NMD, TRD and MWE below – so you can read through them too if you’d like.

The most obvious similarity is how downbeat they are.  It rather confirms previous research I’ve done, which suggests there is an inverse relationship between positive sentiment words like “confident” and later performance.  This makes the approach harder for Chief Executives and PR companies to game.  Companies in 2012 with strong share price rises over the following five years seem to be much more willing to talk about their losses and their failures, yet still point to reasons to remain optimistic about the future while acknowledging their problems.  Companies that are always “upbeat” are rarely the best performers.  I wonder if this is true of people’s posts on social media that technology companies are analysing.













Appendix I: list of words that I used

“profit*”, “grow*”, “margin*”, “adjusted”, “statutory”, “future”, “past”, “lower”, “higher”, “opportunit*”, “loss”, “strong*”, “organic*”, “weak*”, “difficult*”, “product*”, “service*”, “cash”, “cashflow”, “goodwill”, “intangible*”, “writedown*”, “broadly”, “marginally”

For words followed by an asterisk, I decided to include any derivative words from the simple stem. So “profit*” would include both “profitS” (plural) and “profitABILITY” (abstract noun). “strong*” would include the adjective “strong” but also the adverb “strongly”.
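The asterisk convention maps quite naturally onto a regular expression. Here is a small Python sketch of the idea (again, not my actual R code): `pattern_for` and `count_terms` are illustrative names, and the sample sentence is made up rather than taken from any company's report.

```python
import re

def pattern_for(term):
    """Turn a term like "profit*" into a whole-word, case-insensitive regex."""
    if term.endswith("*"):
        # Stem followed by any run of letters: profit, profits, profitability...
        body = re.escape(term[:-1]) + r"[a-z]*"
    else:
        body = re.escape(term)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def count_terms(text, terms):
    """Count how many times each term (or its derivatives) appears in text."""
    return {term: len(pattern_for(term).findall(text)) for term in terms}

# A subset of the word list above, and a made-up sentence for illustration.
terms = ["profit*", "grow*", "margin*", "strong*", "cash", "loss"]
sample = (
    "Profits grew strongly and profitability improved, "
    "but margins were lower and cash was tight."
)

print(count_terms(sample, terms))
```

On the sample sentence “profit*” is counted twice (“Profits” and “profitability”), but “grow*” scores zero because “grew” does not start with the stem. That is a limitation of simple stem matching worth bearing in mind when choosing the words. Running `count_terms` over each report gives one row of counts per company, which is the matrix that the distance calculation works on.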

Appendix II: small companies that I used.  Ticker and ID

1 AFS 2 AIEA 3 AVG 4 BMTO 5 BOTB 6 CRL 7 CTO 8 DXSP 9 FIH 10 FLK 11 FSD 12 GEEC 13 HAIK 14 HSM 15 HYDG 16 HYNS 17 IND 18 ITQ 19 JIM 20 LDSG 21 LFI 22 LRM 23 MSI 24 MWE 25 NMD 26 NMRP 27 ODX 28 PEG 29 PEN 30 PIL 31 PRP 32 SDI 33 SDM 34 SIXH 35 SMJ 36 SND 37 SPSY 38 SSY 39 STE 40 STM 41 SUN 42 TCA 43 THAL 44 TJI 45 TMMG 46 TND 47 TRD 48 TRE 49 TSG 50 TXH 51 UNG 52 V22O 53 VEN2 54 VLE 55 VNET 56 WALG 57 WOR 58 ZOO

Photo by Valeriy Andrushko on Unsplash

7 Responses

  1. ian

    Great article as usual Bruce. I’d never seen a “dendrogram” before so thanks for the learning!

    Your comment towards the end, “Companies in 2012 with strong share price rises over the following five years seem to be much more willing to talk about their losses and their failures, yet still point to reasons to remain optimistic about the future while acknowledging their problems”, is very interesting. This is exactly what Jim Collins refers to as the Stockdale Paradox in “Good to Great”. The best companies are brutally honest about what’s not going well today, but at the same time never lose faith in the mission.

    So in terms of a take-away heuristic – it’s surely to look for similar financial reports which adopt this same approach?

    1. Bruce Packard

      Yes – I think that is right. This approach is a “black box” – we don’t really know why the computer thinks that BOTB is similar to NMD. Or why it clusters TRD with MWE.
      So I had to read reports with my own eyes, to see if I can recognise the patterns that the computer sees. To me it does seem like there is an honesty and realism in the way these companies communicate with investors.
      I’ve put the links at the bottom of the post so that people can have a look for themselves.

  2. John Barnard

    Bruce – this type of analysis has been used a lot in chemistry, to identify potentially useful drugs (e.g. if you have a good drug, “similar” molecules might also be useful). I was involved in some of this work, and basically it works a bit, but not spectacularly. It’s probably most often been used as a filtering technique to reduce the number of molecules that need to be looked at in more detail. Much of the methodology is applicable in many fields, and lots of variants have been tried. There are many descriptors that can be used, such as occurrence of particular groups of atoms (analogous to keywords in Annual Reports) or basic properties like solubility (cf accounts data). There are many distance/similarity metrics other than Euclidean distance, and many clustering methods yielding dendrograms. I don’t know how good your chemistry is, but you could have a look at Barnard, Downs & Willett “Chemical similarity searching”, Journal of Chemical Information and Computer Sciences, 1998, 38, 983-996, or Downs & Barnard “Clustering methods and their uses in computational chemistry”, Reviews in Computational Chemistry, 18, 1-40 (ed Lipkowitz & Boyd, Wiley 2002).

    1. Bruce Packard

      Very interesting John. To be honest I was rather surprised this approach worked at all. Was there anything in your experience that worked better that you would recommend trying?

      1. John Barnard

        Generally (and somewhat counter-intuitively) people found that very simple descriptors worked as well as more complex ones (e.g. presence or absence of chemical “functional groups” worked as well as sophisticated descriptors of 3-dimensional structural features). This seems to accord with your experience. We found that Ward’s (hierarchical) and K-means (non-hierarchical) clustering methods worked best, but cluster analysis can be a tricky thing – it is perfectly possible to cluster random data points, but the results are rather meaningless and may be misleading. E.g. an object may be less similar to objects in the same cluster than it is to those in different ones – this is a problem with the dataset (which may not lie in “natural” distinct clusters), rather than with the analysis technique. There are some methods for choosing the “best” number of clusters, and for “cluster significance analysis”, which are discussed in our review.

    1. Bruce Packard

      Thanks! That’s a really interesting article. I thought the observation that companies were making their communication longer, and more repetitive was also fascinating: for instance the “number of words quadrupled, to more than 60,000 in 2017 from about 15,000 in 1995. Over roughly the same period, the amount of boilerplate and repetition increased; for the median firm, redundant sentences also quadrupled.”