Saturday, May 9, 2015

A Forest of Sentences

Machine learning as a service is awesome!

I really like the idea of the Google Prediction API. As in really like it. I especially like that it supports both numbers and strings for the training inputs, out of the box.

I quickly found that it's a bit fiddly to set up just for playing around with ideas, though, and you need to pay for some of their cloud storage to hold your training data.

That led me down the rabbit hole of whether I could use RandomForest algorithms (currently regarded as pretty awesome for minimal-configuration machine learning) to perform the same sort of basic machine learning tasks that suit Google's service.

I decided to start with Google's "Hello Prediction" example, which classifies sentences as English, Spanish, or French.
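The training data for that example is a simple two-column layout of label and sentence, roughly like this (illustrative rows, not Google's actual file):

"English","Hello, how are you today?"
"Spanish","Hola, ¿cómo estás hoy?"
"French","Bonjour, comment allez-vous aujourd'hui?"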

The obvious issue here is that these algorithms expect arrays of floats as training input, not my random assortment of sentences - cue rabbit hole number two.

I'd been interested in fuzzy grouping of similar sentences/strings for a long time, and had had some small successes (and epic failures) trialling an idea where I would use a set of three "landmark" sentences to "localise" a new sentence in three-dimensional space. The position in each dimension would be calculated as the Levenshtein distance (or similar) from each landmark. I had hoped this would allow for sentence grouping and, of course, cool three-dimensional scatter diagrams. As I said, this didn't really work.
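For illustration, the localisation step looked roughly like this (the landmark sentences here are invented for the example, and python-Levenshtein is just one library choice):

import Levenshtein

# Three fixed "landmark" sentences; purely illustrative picks.
LANDMARKS = (
    'The quick brown fox jumps over the lazy dog',
    'Je ne parle pas français',
    'No hablo español',
)

def localise(sentence):
    # One coordinate per landmark: the edit distance from the new
    # sentence to that landmark, placing it in three-dimensional space.
    return tuple(Levenshtein.distance(sentence, landmark)
                 for landmark in LANDMARKS)

point = localise('Hello mate, how are you?')  # a 3-tuple of distances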

That work did give me an idea for creating vector representations of strings though:
  1. Randomly select a small (100-200 item) subset of all the training inputs for a feature. In the case of Hello Prediction, this was 200 of the sentences in the training set. These sentences become the "landmarks" for the input array generator.
  2. For each sentence in the training data, measure the distance (I used the FuzzyWuzzy token set ratio, divided by the training sentence length) between the training sentence and each landmark. In my example, this creates a 200-element array of distances per training sentence.
  3. Train a RandomForestRegressor using the languages (mapped to integers) as the targets and the 200-element arrays as input features (see the sketch after this list).
  4. Profit?
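Here's a minimal sketch of steps 1-3, assuming scikit-learn and the fuzzywuzzy package; the training_data.csv file name, its two-column layout, and the label-to-integer mapping are illustrative, not the exact code from the repo:

import csv
import random
from fuzzywuzzy import fuzz
from sklearn.ensemble import RandomForestRegressor

NUM_LANDMARKS = 200
LANGUAGES = {'English': 0, 'French': 1, 'Spanish': 2}

def load_training_data(path):
    # Each row: language label, sentence (layout assumed above).
    with open(path) as f:
        return [(row[0], row[1]) for row in csv.reader(f)]

def sentence_to_vector(sentence, landmarks):
    # Step 2: the FuzzyWuzzy token set ratio between the sentence and
    # each landmark, divided by the sentence length.
    return [fuzz.token_set_ratio(sentence, landmark) / float(len(sentence))
            for landmark in landmarks]

training = load_training_data('training_data.csv')
sentences = [sentence for _, sentence in training]

# Step 1: randomly select landmark sentences from the training inputs.
landmarks = random.sample(sentences, NUM_LANDMARKS)

# Step 2: vectorise every training sentence against the landmarks.
X = [sentence_to_vector(sentence, landmarks) for sentence in sentences]
y = [LANGUAGES[label] for label, _ in training]

# Step 3: train the forest on the 200-element arrays.
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)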
For each new sentence, perform the same translation to a 200-element array and pass that to the model's predict() method. This seems to work remarkably well, though it sometimes classifies Spanish sentences as French:

python forest_of_sentences.py

Loading training data... Done!
Preparing vectoriser... Done!
Preparing training data... Done!
Training RandomForest... Done!
Testing...

Hello mate, how are you? I need to find the toilet. ('English', 99)
Bonjour compagnon , comment êtes-vous? Je dois trouver les toilettes . ('French', 87)
Hola amigo, ¿cómo estás? Necesito encontrar el inodoro. ('Spanish', 89)

Done!
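Continuing the sketch above, the prediction step might look like this; the numbers alongside each label in the output come from the full script, so this stripped-down version only recovers the label:

def classify(sentence):
    # Map the regressor's float prediction back to the nearest language.
    labels = {number: language for language, number in LANGUAGES.items()}
    prediction = model.predict([sentence_to_vector(sentence, landmarks)])[0]
    return labels[int(round(prediction))]

print(classify('Hello mate, how are you? I need to find the toilet.'))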

This worked-example code is available on my GitHub, and I'll attempt to apply it to some more difficult problems in future posts.