Cluster One

Last year I was playing around with Euclidean distance (https://brucepackard.com/wood-from-the-trees/), taking a sample of companies from 5 years ago and seeing if the best performing companies used similar words with similar frequency. The technique comes from Matthew Jockers’s book, and has been used in the past for authorship attribution where the authorship of a document is disputed. Each document is treated as a point whose coordinates are its word frequencies. Two documents that are close to each other have more in common; two documents that are further apart have less in common.
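To make the distance idea concrete, here is a minimal Python sketch. It is not the code used for this analysis: the three snippets are invented, the three stems are a small subset of the keyword list, and using relative frequency (matches divided by total words) with simple prefix matching is an assumption about how the profiles might be built.

```python
import math
import re

def keyword_profile(text, stems):
    """Relative frequency of each stem: words starting with it / total words."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [sum(w.startswith(s) for w in words) / total for s in stems]

def euclidean(p, q):
    """Straight-line distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical snippets standing in for full results statements.
stems = ["profit", "grow", "cash"]
doc_a = "Profit growth was strong and cash generation was strong"
doc_b = "Profits are growing and cash flow improved"
doc_c = "The loss widened in a difficult and weak market"

a, b, c = (keyword_profile(d, stems) for d in (doc_a, doc_b, doc_c))

# The two upbeat snippets should sit closer together than a and c.
print(f"a-b: {euclidean(a, b):.3f}  a-c: {euclidean(a, c):.3f}")
```

With real results statements the vectors would have one entry per keyword (24 in the appendix list), but the distance calculation is the same.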
With the help of my tame programmer friend (thanks Tom!) I’ve finally got round to webscraping a new sample of companies to repeat the text analysis technique on.
My new sample comes from Stockopedia: I took 37 stocks below £400m market cap, went back 5 years and dug out their financial results from 2014. The best performer was Sopheon (an IT company), up +1424%. See the table below:

Ticker  Name                   Mkt cap (£m)  Price chg 5y (%)
EKT     Elecktron Technology   98            +1005
AAZ     Anglo Asian Mining     145           +694
TSG     Trans Siberian Gold    83            +617
IPX     Impax Asset Mngt       357           +425

The worst performers were EnQuest and SOCO International (both oil and gas exploration), both down 83% over the same time period.

I did wonder if I should exclude resources stocks from my text analysis, because it’s one area where presumably randomness plays a larger role than management skill. I can’t name any billionaires who got rich by investing consistently well in small mining or oil and gas companies. For now I’ve kept them in. Finally, after a discussion with Richard Beddard on Twitter about Judges Scientific (which was in the sample), I decided to add SDI (which wasn’t in the original sample because it was just too small). I kept the keywords the same as in my original analysis last year, but changed the companies: this time the sample was 37 companies versus 58 last time.
Running the new sample was relatively quick, as the code still worked. But then I spent hours messing around trying to label the dendrogram. I eventually figured it out.
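For anyone wrestling with the same labelling problem, one way to avoid it is to keep the ticker attached to each document from the start, so every merge in the tree is already labelled. The sketch below is a plain-Python illustration of agglomerative clustering with single linkage. It is an assumption, not the method used here: the tickers and two-keyword profiles are made up, and the linkage rule behind the original dendrogram is not stated in the post.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(profiles):
    """Agglomerative clustering; returns merges as (labels_a, labels_b, distance).

    `profiles` maps a ticker label to its frequency vector. Each cluster is
    stored as a tuple of tickers, so labels stay attached through every merge.
    """
    clusters = [(ticker,) for ticker in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up two-keyword profiles, for illustration only (not the real data).
profiles = {
    "AAA": [0.10, 0.02],
    "BBB": [0.11, 0.02],
    "CCC": [0.30, 0.20],
    "DDD": [0.31, 0.22],
}
for left, right, dist in single_linkage(profiles):
    print(f"{'+'.join(left)} joins {'+'.join(right)} at {dist:.3f}")
```

Each printed line corresponds to one join in a dendrogram, read from the bottom up. If you plot with SciPy instead, `scipy.cluster.hierarchy.dendrogram` takes a `labels` argument for the same purpose.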


The results were a bit mixed, to be honest. SPE (+1400%) and EKT (+1000%) are both technology groups, but the computer didn’t think they were very similar. Instead SPE (Sopheon) was seen as much more similar to JDG (Judges Scientific) +67% and SDI (Scientific Digital Imaging) +257%. Start from the right of the dendrogram and look 5-10 companies in to see the JDG, SPE, SDI cluster.

I reckon this is not as disappointing as it first looks, because JDG has been a high performing “multi-bagger”, up from £1 ten years ago to £36 today. It’s just that the vast bulk of the performance came in the first five years of this decade. SDI has also been a multi-bagger. I should note that most of Sopheon’s growth has been organic, whereas JDG and SDI have “buy and build” strategies. But the computer clusters these three high performers together based on their text alone.
The next two best performers were resources stocks, Anglo Asian Mining (AAZ) +693% and Trans-Siberian Gold (TSG) +618%. Despite both being high performing resources stocks, the computer didn’t think that they were very similar from a text analysis point of view. In fact, it thought that AAZ had the most in common with ENQ -83% (the worst performing stock). Next time I’ll probably remove oil & gas and mining stocks from the analysis.
The next two best performing stocks were Impax Asset Management (IPX) +424% and Liontrust Asset Management (LIO) +185%. You can probably guess from their names that these are both asset managers. The computer did think these were similar, although bizarrely it also thought that Telford Homes (TEF) +2% was a close fit.


I still think that this is interesting – especially as no one else seems to do this sort of analysis. There is a paradox, in that if everyone can see an “obvious” opportunity, it probably isn’t an obvious opportunity. It’s hard to have high conviction ideas about stocks that are 10 baggers, because if everyone could see the potential then they would bid up the price. So I’m fascinated that the computer has made a cluster with SDI, JDG and SPE.
SDI makes scientific instruments and so does Judges Scientific. But Judges Scientific’s strategy is to buy and operate scientific instrument manufacturers with debt financing, whereas SDI has mostly issued shares. As Richard Beddard points out (https://knowledge.sharescope.co.uk/2019/07/10/how-to-work-out-whether-a-firm-is-good-acquirer/):

“SDI is buying companies that complement the ones it already owns whereas Judges generally does not care. As long as a business is worth acquiring on its own merits, Judges Scientific is interested, whereas many of SDI’s businesses, broadly focused on imaging and microbiology, supply other SDI businesses. SDI believes the firms it buys are more valuable within SDI, because outside it they are too small to achieve ‘critical mass’.”

I’d be interested in what he makes of Sopheon. Perhaps this little cluster reveals that there is no “right answer” to questions of value creation:

  • “Should management make acquisitions or grow organically?”,
  • “Should they buy companies by issuing equity or taking on debt?”,
  • “Expand in existing areas or diversify into new ones?”

Instead, there is only clear thinking, successful execution and good communication. If my text clustering analysis can identify the first two via good communication, it could be extremely valuable.

I’ve got some ideas about what I might do in future. Jockers’s book goes on to do some work with Support Vector Machines, a machine-learning technique for classification. I’m also thinking about how I might extend the analysis to see how companies’ communication changes over time: whether there might be signals that a company is beginning to struggle (“cautiously optimistic”? “second half weighting”?) or perhaps has turned a corner. I’m also reading Tim Steer’s book, and thought that adding words like “accruals” and “other receivables” might expand the vocabulary of the computer enough to see if it could detect companies where the management were playing games with accounting assumptions. Finally, I’ve just ordered a book on Keras and deep learning. Maybe it will be way above my head, but I thought I wouldn’t know until I’d at least tried to read it.

I’ve had to disable my comments section because too much Russian spam was coming in. Feel free to contact me on Twitter with suggestions though.

Appendix : list of words that I used
“profit”, “grow”, “margin”, “adjusted”, “statutory”, “future”, “past”, “lower”, “higher”, “opportunit”, “loss”, “strong”, “organic”, “weak”, “difficult”, “product”, “service”, “cash”, “cashflow”, “goodwill”, “intangible”, “writedown”, “broadly”, “marginally”

For words followed by an asterisk, I decided to include any derivative words from the simple stem. So “profit*” would include both “profitS” (plural) and “profitABILITY” (abstract noun). “Strong*” would include the adjective “strong” but also the adverb “strongly”.
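The asterisk rule described above can be sketched with a regular expression. This is an illustration only, not the code used for the analysis: `compile_keyword` and `count_keyword` are hypothetical helper names, and treating a keyword without an asterisk as an exact whole-word match is an assumption.

```python
import re

def compile_keyword(keyword):
    """Turn 'profit*' into a whole-word regex that allows any suffix;
    a keyword without '*' matches only the exact word (an assumption)."""
    if keyword.endswith("*"):
        pattern = r"\b" + re.escape(keyword[:-1]) + r"\w*\b"
    else:
        pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.compile(pattern, re.IGNORECASE)

def count_keyword(keyword, text):
    """Number of matches for one keyword in one document."""
    return len(compile_keyword(keyword).findall(text))

# Made-up sentence, for illustration only.
sample = "Profits rose and profitability improved; profit margins were strong."
print(count_keyword("profit*", sample))  # 3: Profits, profitability, profit
print(count_keyword("profit", sample))   # 1: only the exact word "profit"
```

A stem such as “opportunit*” would pick up both “opportunity” and “opportunities” in the same way.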
Appendix : Companies used in the sample