Post-training quantization in FastText (or How to shrink your FastText model by almost 90%)

In one experiment with a very large text corpus, training with train_supervised() in FastText left me with a serialized model of more than 1 GB.
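
For context, here is a minimal sketch of how such a model gets trained; 'train.txt' and the hyperparameters below are placeholders, not the exact setup from my experiment:

import fastText

# Train a supervised classifier ('train.txt' and the hyperparameters are placeholders)
model = fastText.train_supervised(input='train.txt',
                                  dim=100,
                                  wordNgrams=3,
                                  epoch=25,
                                  lr=0.5)

# Save the full (non-quantized) model
model.save_model('model.bin')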

This happens because FastText embeds all of its computation in the model itself: label encoding, parsing, TF-IDF transformation, word embeddings, word n-gram computation via the bag-of-tricks approach, fitting, probability calculation, and the re-application of the label encoding.

As you can imagine, with a corpus of more than 200,000 words and wordNgrams > 3, this can escalate very quickly in terms of storage.
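
Most of that space goes to the input (embedding) matrix: fastText hashes every word n-gram into a fixed number of buckets (2,000,000 by default) and stores a dense vector for each slot. A back-of-the-envelope estimate, with an illustrative vocabulary size and embedding dimension, looks like this:

# Rough estimate of the input matrix size (float32 = 4 bytes per value).
# fastText stores one 'dim'-sized vector per word plus one per n-gram bucket.
nwords = 200_000        # vocabulary size (illustrative)
bucket = 2_000_000      # default n-gram hashing bucket size
dim = 100               # embedding dimension (illustrative)

size_bytes = (nwords + bucket) * dim * 4
print(f'Approximate input matrix size: {size_bytes / 1024 ** 3:.2f} GB')
# ~0.82 GB for the input matrix alone, before the output matrix and metadata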

As I wrote before, it's nice to have a good model, but the real value comes when you put that model into production; and productionizing machine learning is the barrier that separates amateurs from professionals.

With a large storage and memory footprint it's nearly impossible to ship production-ready machine learning models, and for high-performance APIs a model with a huge memory footprint can be a big blocker in any serious ML project.

To solve this kind of problem, FastText provides a good way to compress the size of the model with little impact on performance. This is called post-training quantization.

The main idea of quantization is to reduce the size of the original model by compressing the embedding vectors, using techniques that range from simple truncation and hashing to product quantization. The paper by Shu and Nakayama ("Compressing word embeddings via deep compositional code learning") is probably one of the best references on this kind of technique.
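
To make the idea concrete, here is a toy sketch of product quantization (the technique behind fastText's quantize()) applied to a random embedding matrix, using NumPy and scikit-learn's KMeans. It illustrates the principle only and is not fastText's internal code:

import numpy as np
from sklearn.cluster import KMeans

# Toy product quantization: split each embedding into sub-vectors of size dsub
# and replace each sub-vector by the index of the nearest of 256 centroids,
# so it can be stored as a single byte instead of dsub floats.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(10_000, 100)).astype(np.float32)

dsub = 2                                   # sub-vector size (same as in the quantize() call below)
n_sub = embeddings.shape[1] // dsub        # number of sub-quantizers
codebooks, codes = [], []

for i in range(n_sub):
    block = embeddings[:, i * dsub:(i + 1) * dsub]
    km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)       # 256 x dsub floats shared by all vectors
    codes.append(km.labels_.astype(np.uint8))   # one byte per vector per block

# Storage: 400 bytes per vector originally (100 floats),
# versus n_sub bytes per vector (plus small shared codebooks) after PQ.
print('original:', embeddings.nbytes, 'bytes')
print('quantized codes:', sum(c.nbytes for c in codes), 'bytes')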

This is the performance metric of the vanilla (full-size) model: `Recall: 0.79`

I used the following commands in Python for quantization, model saving and reloading:

# Quantize the model
model.quantize(input=None,      # training file, only needed when retrain=True
               qout=False,      # also quantize the output (classifier) matrix
               cutoff=0,        # 0 = keep the full vocabulary/feature table
               retrain=False,   # no fine-tuning after quantization
               epoch=None,
               lr=None,
               thread=None,
               verbose=None,
               dsub=2,          # size of each sub-vector for product quantization
               qnorm=False,     # quantize the vector norms separately
               )

# Save Quantized model
model.save_model('model_quantized.bin')

# Model Quantized Load
model_quantized = fastText.load_model('model_quantized.bin')
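
If you can spend a little extra training time, the same quantize() call also supports a more aggressive setup: cutoff prunes the feature table down to the most important words and n-grams, and retrain=True fine-tunes the model after quantization. Here 'train.txt' is again a placeholder for the original training file:

# More aggressive compression: keep only the 100,000 most important
# words/n-grams and fine-tune the model after quantization.
model.quantize(input='train.txt',
               cutoff=100000,
               retrain=True,
               qnorm=True,
               dsub=2)
model.save_model('model_quantized_cutoff.bin')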

I retrained using the quantized model and got the following results:

# Training Time: 00:02:46
# Recall: 0.78
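
The recall above comes from evaluating on a held-out set; with the fastText Python API that evaluation looks roughly like this, where 'test.txt' is a placeholder for my validation file:

# Evaluate the quantized model; test() returns (samples, precision@1, recall@1)
n_samples, precision, recall = model_quantized.test('test.txt')
print(f'Samples: {n_samples}  Precision@1: {precision:.2f}  Recall@1: {recall:.2f}')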

import os

# Compare the on-disk size of the original and quantized models (in KB)
info_old_model = os.path.getsize('model.bin') / 1024.0
info_new_model = os.path.getsize('model_quantized.bin') / 1024.0

print(f'Old Model Size (KB): {round(info_old_model, 0)}')
print(f'New Model Size (KB): {round(info_new_model, 0)}')

# Old Model Size (KB): 1125236.0
# New Model Size (KB): 157190.0

As we can see, after shrinking the vanilla model with quantization we get a recall of 0.78 against 0.79, with a model roughly 7x lighter (about an 86% reduction) in terms of storage and memory footprint when we need to put it in production.