# Post-training quantization in FastText (or How to shrink your FastText model by nearly 90%)

2019 Mar 22

In one experiment using a very large text database, training with `train_supervised()` in FastText left me with a serialized model of more than 1 GB.

This happens because FastText embeds everything it needs in the model file itself: the label encoding, the vocabulary used for parsing, the word embeddings, the hashed word n-gram features from the bag-of-tricks approach (`wordNgrams`), and the classifier weights used to compute probabilities and map them back to labels.

As you may have noticed, in a corpus with more than 200,000 words and `wordNGrams > 3` this can escalate very quickly in terms of storage.
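To get a feeling for where the size comes from, here is a back-of-the-envelope estimate. It assumes the model is dominated by two float32 matrices: an input matrix with one row per vocabulary word plus one row per n-gram hash bucket, and an output matrix with one row per label. The `dim=100` and `bucket=2_000_000` values are the library defaults as I understand them, and `n_labels=50` is a made-up number for illustration; none of them come from my experiment.

```python
def estimate_model_bytes(n_words, n_labels, dim=100, bucket=2_000_000,
                         bytes_per_float=4):
    """Rough size of an unquantized supervised model, in bytes.

    Assumption: size ~ (input rows + output rows) * dim * 4 bytes, where
    input rows = vocabulary words + hashed n-gram buckets. Dictionary
    strings and file headers are ignored.
    """
    input_rows = n_words + bucket
    output_rows = n_labels
    return (input_rows + output_rows) * dim * bytes_per_float


# 200,000 words comes from the text above; n_labels is hypothetical.
size = estimate_model_bytes(n_words=200_000, n_labels=50)
print(f"Estimated size: {size / 1024 ** 2:.0f} MiB")
```

The vocabulary is the smaller part of the estimate: the n-gram buckets account for most of the rows, which is why turning on word n-grams inflates the file so much.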

As I wrote before, it’s really nice when we have a good model, but the real value comes **when you put this model in production**; and productionizing machine learning is the barrier that separates the girls/boys from the women/men.

With a large storage and memory footprint it’s **nearly impossible to make production-ready machine learning models**, and for high-performance APIs a huge model can be a big blocker in any decent ML project.

To solve this kind of problem, FastText provides a good way to compress the model with little impact on performance. This is called post-training quantization.

The main idea of quantization is to reduce the size of the original model by compressing the embedding vectors, using techniques that range from simple truncation to hashing. Probably **this paper** (**Shu, Raphael, and Hideki Nakayama. “Compressing word embeddings via deep compositional code learning.”**) is one of the best references for this kind of technique.
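As far as I understand it, the `dsub` argument of `quantize()` refers to product quantization: each embedding is cut into sub-vectors of `dsub` dimensions, and each sub-vector is replaced by the index of its nearest centroid in a small codebook. The toy sketch below shows only that idea in pure Python. It is not FastText's implementation: the centroids are sampled from the data instead of being learned with k-means, and all function names and sizes are mine.

```python
import random


def split(vec, dsub):
    """Cut a vector into consecutive sub-vectors of length dsub."""
    return [tuple(vec[i:i + dsub]) for i in range(0, len(vec), dsub)]


def sq_dist(a, b):
    """Squared Euclidean distance between two sub-vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_codebooks(vectors, dsub, k, seed=0):
    """One codebook of k centroids per sub-vector position.

    Simplification: centroids are sampled from the data, standing in
    for the k-means step a real product quantizer would run.
    """
    rng = random.Random(seed)
    n_sub = len(vectors[0]) // dsub
    return [rng.sample([split(v, dsub)[s] for v in vectors], k)
            for s in range(n_sub)]


def encode(vec, books, dsub):
    """Replace each sub-vector by the index of its nearest centroid."""
    return bytes(
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vec, dsub), books)
    )


def decode(code, books):
    """Rebuild an approximate vector from the stored centroid indices."""
    out = []
    for idx, book in zip(code, books):
        out.extend(book[idx])
    return out


rng = random.Random(42)
dim, dsub, k = 8, 2, 16  # toy sizes, chosen only for the demo
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
books = build_codebooks(vectors, dsub, k)

code = encode(vectors[0], books, dsub)
approx = decode(code, books)
print(f"{len(code)} bytes of codes instead of {dim * 4} bytes of float32")
```

The saving comes from storing one small integer per sub-vector instead of `dsub` floats; the price is that `approx` is only close to the original vector, which is where the small drop in recall comes from.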

This is the performance metric of the vanilla (full, unquantized) model: `Recall: 0.79`

I used the following command in Python for the quantization, model saving and reload:

```
import fasttext  # older builds exposed the module as `fastText`

# `model` is the classifier returned by train_supervised()
# Quantize the model
model.quantize(
    input=None,
    qout=False,
    cutoff=0,
    retrain=False,
    epoch=None,
    lr=None,
    thread=None,
    verbose=None,
    dsub=2,
    qnorm=False,
)
# Save the quantized model
model.save_model('model_quantized.bin')
# Load the quantized model
model_quantized = fasttext.load_model('model_quantized.bin')
```

I repeated the experiment using the quantized model and got the following results:

```
import os

# Training Time: 00:02:46
# Recall: 0.78
# os.path.getsize() returns bytes, so dividing by 1024 gives kilobytes
info_old_model = os.path.getsize('model.bin') / 1024.0
info_new_model = os.path.getsize('model_quantized.bin') / 1024.0
print(f'Old Model Size (KB): {round(info_old_model, 0)}')
print(f'New Model Size (KB): {round(info_new_model, 0)}')
# Old Model Size (KB): 1125236.0
# New Model Size (KB): 157190.0
```

As we can see, after shrinking the vanilla model with quantization we got `Recall: 0.78` against `0.79`, with a **model about 7x lighter** in terms of storage and memory footprint when we need to put this **model in production**.
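For reference, the compression ratio can be worked out directly from the two file sizes printed above. The snippet treats those figures as kilobytes, since they are byte counts divided by 1024; the variable names are mine.

```python
# File sizes reported above: bytes / 1024, i.e. kilobytes.
old_kb = 1125236.0
new_kb = 157190.0

ratio = old_kb / new_kb            # how many times smaller the new file is
reduction = 1 - new_kb / old_kb    # fraction of the original size removed

print(f"Compression ratio: {ratio:.1f}x")
print(f"Size reduction: {reduction:.1%}")
print(f"Old: {old_kb / 1024:.0f} MB, new: {new_kb / 1024:.0f} MB")
```

So the quantized file is roughly 7 times smaller than the original, a reduction of about 86%, at the cost of one point of recall.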