So large technology companies look at what we reveal about ourselves and use it for their gain? I’m trying the opposite: looking at what companies reveal about themselves…and trying to use it for my own gain. Maybe the text of company financial statements contains patterns and relationships that are not obvious at first glance, but that text mining approaches can nonetheless pick up as valuable signals. I’m hunting for signals in vast forests of information.
I started with simple strategies, like identifying words which might contain some signal (“difficult”, “opportunities”), and simple heuristics (counting the frequency of the word “growth” relative to the frequency of the word “margin”). Now I’m moving on to more sophisticated machine learning techniques. My main source is Matthew Jockers’ book, Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences). http://amzn.eu/cqbbMuq
In his book, Jockers shows how to take 50 texts by known authors, then see if the computer can identify the author of an anonymous text. Each author has a distinctive style, and one of the best ways of identifying it is to compare the usage of high frequency words like “the”, “of”, “and” and “to”. It’s then possible to use a measure known as Euclidean distance to show each text’s distance from every other text. Two documents which are close to each other have more in common. Two documents which are further away from each other have less in common. A friend who watches “Manhunt” on Netflix tells me this is how they caught the Unabomber. Come to think of it, Euclidean distance is probably how Netflix’s recommendation system works too. If she likes “Manhunt”, she probably likes “Suits” too. That sort of thing.
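The “growth” relative to “margin” heuristic can be sketched in a few lines. This is only an illustration in Python (the analysis in this post was done in R); the function names and the sample sentence are invented, and a real report would need more careful tokenising.

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count each run of letters as one word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def growth_to_margin_ratio(text):
    """Frequency of "growth" relative to "margin"; None if "margin" never appears."""
    counts = word_counts(text)
    if counts["margin"] == 0:
        return None
    return counts["growth"] / counts["margin"]

# Made-up sentence standing in for a real financial report.
sample = "Revenue growth was strong. Growth in margin lagged, but margin growth should follow."
print(growth_to_margin_ratio(sample))
```

Returning None when “margin” is absent avoids a division by zero; a report that never mentions margins is probably worth flagging separately anyway.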
As an example, assume that you have three financial reports, from BT, Unilever and National Grid. Then assume that there are two words that you have counted. Let’s say “growth” and “margin”.
Table: Comparing frequency of “growth” versus “margin(s)” in financial reports
These can be plotted as X, Y co-ordinates on a two-dimensional graph. Once plotted, you can measure the distance between the different companies. In this case BT is much closer (as measured by the frequency of these two words) to National Grid than it is to Unilever.
With only two dimensions this is straightforward; you could do it by hand with a ruler if you really wanted to. It’s that simple. The complexity comes from adding more companies (58 in my case) and more words (24 in my case). See the appendices at the bottom for the list of words that I used and the list of companies. I deliberately avoided high frequency words like “the” and “to” because I’m not interested in who writes the text; I’m more interested in signal words like cold hard “cash”. I also chose smaller companies, because large fund managers with lots of machine learning firepower are unlikely to be interested in them. Whereas if I can find a small company that will increase in value by more than 1,000%, I’m very interested.
Comparing the distance between all these words/features in higher dimensional space gives an idea of how similar or different the reports are. There’s a good section on Euclidean distance on Wikipedia if you are interested. https://en.wikipedia.org/wiki/Euclidean_distance
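To make the distance idea concrete: the Euclidean distance between two reports is the square root of the sum of the squared differences in each word count. Here is a minimal Python sketch, not the R code used for this post. The counts are invented for illustration and are not taken from the real BT, National Grid or Unilever reports.

```python
import math

# Hypothetical counts per report, in the order (growth, margin).
# These numbers are invented for illustration, not taken from the real reports.
counts = {
    "BT": (10, 8),
    "National Grid": (12, 9),
    "Unilever": (40, 25),
}

def euclidean(a, b):
    """Square root of the summed squared differences; works for any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_table(vectors):
    """Distance between every pair of named vectors."""
    names = sorted(vectors)
    return {
        (p, q): euclidean(vectors[p], vectors[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }

for pair, d in distance_table(counts).items():
    print(pair, round(d, 2))
```

The same `euclidean` function works unchanged with 24 counts per report instead of 2, which is the whole point: the arithmetic does not care how many dimensions there are, even if my imagination does.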
Tree Diagram / Dendrogram
The trouble is the human mind isn’t very good at imagining information displayed in more than 3 dimensions. I suppose I could display the results in a 58×24 matrix, but that wouldn’t be much better. Fortunately this is what R is better at than Excel. I won’t bore you with too many details, but I had to learn how to use “for” loops and “regex” too. Which means that when surrounded by young programmers with beards in Berlin bars, I can sound like I fit into the conversation. Anyway, using a couple of functions, dist() and hclust(), I can plot the results on a dendrogram for inspection.
First thing to notice is that I’m not clever enough to label each company with a ticker and its share price performance over the following 5 years. I did try. Instead the numbers are IDs assigned to the company tickers in alphabetical order. So company number 1 is AFS (Amiad Water Systems, a supplier of water filtration systems), number 2 is AIEA (Airea, which makes carpets), and so on to number 58, which is ZOO Digital Group (adds subtitles to films). See Appendix II.
One caveat I should note: this is the same sample of companies and share price performance that I previously tried sentiment analysis on. However, this time I took the entire text from the financial reports, not just the outlook statements. I’m going to need some new samples eventually. It’s just that web scraping financial data and share price performance is not my strong point, so I wanted to try this new clustering technique on what I already had, to see if it was worth pursuing before I spent lots of time building a new sample.
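In R, dist() produces the table of pairwise distances and hclust() builds the tree from it (complete linkage by default). For anyone curious what happens inside, below is a rough pure-Python sketch of the same idea: keep merging the two closest clusters until only one is left. The four labelled points are invented, the function names are my own, and this is not a substitute for the R functions.

```python
import math
from itertools import combinations

def cluster_gap(a, b, points):
    """Complete linkage: the largest distance between any member of a and any member of b."""
    return max(math.dist(points[x], points[y]) for x in a for y in b)

def complete_linkage(points):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges, in the order they happen, as (cluster_a, cluster_b, height).
    """
    clusters = [(name,) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_gap(pair[0], pair[1], points))
        merges.append((a, b, cluster_gap(a, b, points)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges

# Invented word-count vectors for four made-up reports (three words each).
points = {"A": (1, 1, 0), "B": (2, 1, 0), "C": (9, 8, 3), "D": (11, 8, 3)}

for a, b, height in complete_linkage(points):
    print(a, "+", b, "at height", round(height, 2))
```

Read from the top, the printed merges are the dendrogram from the leaves upwards: the earliest merges are the pairs of reports the method considers most alike.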
So – let’s play “bagger” bingo, match up the numbers with the best performing companies:
The first blue arrow points to ID number 5. This was the best performing share from July 2012 to July 2017: BOTB, up 1884%. Interestingly the cluster analysis dendrogram thinks this was closest to ID number 25, NMD, the sixth best performing stock, up 201% over the same time period.
The second best performing stock, CRL (+1415%), didn’t work. The analysis thought it was most similar to PEG, which went nowhere over 5 years.
The second blue arrow points to the third best performing stock, TRD (ID 47, +837%), which was matched with the fourth best performing, MWE (ID 24, +374%).
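Ranking performance with text similarity
| ID and ticker | Share price change | Performance rank | Closest text match (ID and ticker) | Match’s performance rank |
|---|---|---|---|---|
| 5 BOTB | +1884% | 1 | 25 NMD | 6 |
| 6 CRL | +1415% | 2 | 28 PEG | low |
| 47 TRD | +837% | 3 | 24 MWE | 4 |
| 24 MWE | +374% | 4 | 47 TRD | 3 |
| 2 AIEA | +215% | 5 | 39 STE | low |
| 25 NMD | +201% | 6 | 5 BOTB | 1 |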
Share price performance from 27 Jul 2012 to 27 Jul 2017
I’m amazed, but it actually seems to work. That matters because I’m much more interested in the tails of the distribution, i.e. are companies that increase 5 to 10 times in value giving signals that others are not? I should probably do some statistical test to confirm this, because coincidences are more common than intuition suggests: if you get even a small number of people in a room, there’s a surprisingly high probability two of them will share a birthday. If anyone knows the statistical test to use and how to apply it, please leave a comment. Also I should say that I had trouble web scraping some of the companies’ share price performance data, so although there are 58 company statements, I have share price performance for under 50 of them. That said, to the naked eye it looks impressive. Interestingly, number 48, out there all on its own in a separate space on the left, was TRE (Trading Emissions), which fell 89% over the following 5 years.
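In the meantime, here is one very rough back-of-envelope check, sketched in Python. It is not a proper test. It asks: if each top-6 company’s closest match were just a random pick from the other companies, how often would at least 4 of the 6 land on another top-6 company, as happened in the table? The sample size of 50 is an assumption (I only have price data for “under 50”). The independence assumption is also wrong, because matches come in mutual pairs (BOTB with NMD, TRD with MWE), so this overstates the evidence. Shuffling the performance ranks across the real dendrogram many times would be sounder.

```python
from math import comb

def chance_of_hits(n_companies, n_top, hits):
    """P(at least `hits` of the top performers are matched with another top performer),
    assuming each match is an independent, uniformly random other company."""
    p = (n_top - 1) / (n_companies - 1)
    return sum(
        comb(n_top, k) * p**k * (1 - p) ** (n_top - k)
        for k in range(hits, n_top + 1)
    )

# 4 of the 6 best performers in the table were matched with another top-6 company.
# 50 is an assumed sample size, not a figure from the data.
print(chance_of_hits(n_companies=50, n_top=6, hits=4))
```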
The downside of the machine learning approach is that I can’t identify any particular heuristics which I can use. That is, the approach is something of a “black box”. But I can read through the statements which the computer thinks are similar. I put links to the 2012 results from BOTB and NMD, TRD and MWE below – so you can read through them too if you’d like.
The most obvious similarity is how downbeat they are. It rather confirms previous research I’ve done, which suggests there is an inverse relationship between positive sentiment words like “confident” and later performance. This makes the approach harder for Chief Executives and PR companies to game. Companies in 2012 with strong share price rises over the following five years seem to be much more willing to talk about their losses and their failures, while still pointing to reasons to remain optimistic about the future. Companies that are always “upbeat” are rarely the best performers. I wonder if this is true of people’s posts on social media that technology companies are analysing.
Appendix I: list of words that I used
“profit*”, “grow*”, “margin*”, “adjusted”, “statutory”, “future”, “past”, “lower”, “higher”, “opportunit*”, “loss”, “strong*”, “organic*”, “weak*”, “difficult*”, “product*”, “service*”, “cash”, “cashflow”, “goodwill”, “intangible*”, “writedown*”, “broadly”, “marginally”
For words followed by an asterisk, I decided to include any derivative words from the simple stem. So “profit*” would include both “profitS” (plural) and “profitABILITY” (abstract noun). “strong*” would include the adjective “strong” but also the adverb “strongly”.
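The asterisk convention maps neatly onto regular expressions. Below is a minimal Python sketch of how a stem such as “profit*” could be turned into a pattern and counted. The helper names and the sample sentence are invented for illustration, and this is not the R code behind the results above.

```python
import re

def stem_to_regex(pattern):
    """Turn "profit*" into a regex that also matches "profits" and "profitability".

    Patterns without an asterisk match the whole word only.
    """
    if pattern.endswith("*"):
        return re.compile(r"\b" + re.escape(pattern[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)

def count_features(text, patterns):
    """One count per pattern: the row this report would contribute to the matrix."""
    return {p: len(stem_to_regex(p).findall(text)) for p in patterns}

# Made-up text standing in for a real financial report.
sample = "Profits fell, but profitability should recover. Cash is strong; cashflow strongly improved."
print(count_features(sample, ["profit*", "strong*", "cash", "cashflow"]))
```

Because “cash” has no asterisk it is matched as a whole word, so “cashflow” is counted separately rather than twice.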
Appendix II: small companies that I used. Ticker and ID
1 AFS 2 AIEA 3 AVG 4 BMTO 5 BOTB 6 CRL 7 CTO 8 DXSP 9 FIH 10 FLK 11 FSD 12 GEEC 13 HAIK 14 HSM 15 HYDG 16 HYNS 17 IND 18 ITQ 19 JIM 20 LDSG 21 LFI 22 LRM 23 MSI 24 MWE 25 NMD 26 NMRP 27 ODX 28 PEG 29 PEN 30 PIL 31 PRP 32 SDI 33 SDM 34 SIXH 35 SMJ 36 SND 37 SPSY 38 SSY 39 STE 40 STM 41 SUN 42 TCA 43 THAL 44 TJI 45 TMMG 46 TND 47 TRD 48 TRE 49 TSG 50 TXH 51 UNG 52 V22O 53 VEN2 54 VLE 55 VNET 56 WALG 57 WOR 58 ZOO