Reproducibility in FastText
2019 Mar 22

A few days ago I wrote about fastText, and one thing that is not clear in the docs is how to make experiments reproducible in a deterministic way.
In my default settings for the `train_supervised()` method, I use the `thread` parameter with `multiprocessing.cpu_count() - 1` as its value, as in the sketch below.
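A minimal sketch of that setup (the training file name `data.train.txt` is hypothetical, and the standard fastText supervised format of one `__label__<label> <text>` example per line is assumed):

```python
import multiprocessing

import fasttext  # official fastText Python bindings

# Hypothetical training file in fastText's supervised format:
# one "__label__<label> <text>" example per line.
model = fasttext.train_supervised(
    input="data.train.txt",
    thread=multiprocessing.cpu_count() - 1,  # all but one available core
)
```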
This means that we are using all but one of the available CPUs for training, which considerably shortens training time on multicore servers or machines.
However, this makes the result completely non-deterministic: because of the optimization algorithm used by fastText (asynchronous stochastic gradient descent, or Hogwild; paper here), the obtained vectors will differ between runs, even if they are initialized identically.
This very gentle guide to FastText with Gensim states that:
for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
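Following that guidance, a deterministic Gensim run might look like the sketch below (the toy corpus and the seed value are mine, for illustration only; note that `PYTHONHASHSEED` must be set before the interpreter starts, not inside the script):

```python
# Run as: PYTHONHASHSEED=0 python train.py
from gensim.models import FastText

# Toy corpus for illustration only.
sentences = [["hello", "world"], ["fasttext", "reproducibility"]]

# workers=1 removes ordering jitter from OS thread scheduling;
# seed fixes the random initialization of the vectors.
model = FastText(sentences, workers=1, seed=42, min_count=1)
```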
For that particular reason, the main assumption here is that, even when experimenting in a very stochastic environment, we can consider only the impact of the data volume itself and abstract this noise away from the results, since the stochasticity affects both experiments equally.
To make the experiments reproducible, the only thing needed is to change the value of the `thread` parameter from `multiprocessing.cpu_count() - 1` to `1`.
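A quick way to convince yourself (again assuming the hypothetical `data.train.txt` file) is to train twice with `thread=1` and check that the resulting vectors match:

```python
import fasttext
import numpy as np

def train_once():
    return fasttext.train_supervised(input="data.train.txt", thread=1)

m1, m2 = train_once(), train_once()

# With a single thread the gradient updates are applied in a fixed
# order, so both runs should produce identical word vectors.
word = m1.get_words()[0]
assert np.allclose(m1.get_word_vector(word), m2.get_word_vector(word))
```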
For the sake of reproducibility, though, training will take much longer (in my experiments I'm facing an increase of roughly 8000% in training time).