Facebook FastText - Automatic Hyperparameter Optimization with Autotune

Disclaimer: some of the information in this blog post might be incorrect, and since FastText is very fast-paced in correcting and adjusting things, parts of this post may become out of date very soon. If you have any corrections or feedback, feel free to comment.

I’m finishing some experiments with Autotune, the new FastText feature for hyperparameter optimization within a training-time budget.

What is Autotune?

From the press release, the description of Autotune is:

[…]This feature automatically determines the best hyperparameters for your data set in order to build an efficient text classifier[…].

[…]FastText then uses the allotted time to search for the hyperparameters that give the best performance on the validation set.[…].

[…]Our strategy to explore various hyperparameters is inspired by existing tools, such as Nevergrad, but tailored to fastText by leveraging the specific structure of models. Our autotune explores hyperparameters by sampling, initially in a large domain that shrinks around the best combinations found over time[…]

Autotune Strategy

Checking the code, we can see that Autotune’s search strategy works as follows:

For each parameter, the Autotuner has an updater (the updateArgGauss() method) that draws a random number (coeff) from a Gaussian distribution whose standard deviation lies between the parameters startSigma and endSigma, and based on this value the parameter is updated.

Each parameter has a specific range for startSigma and endSigma, which is fixed in the updateArgGauss method.

Updates for each parameter can be linear (i.e. updateCoeff + val) or power-based (i.e. pow(2.0, coeff), applied as updateCoeff * val), depending on the first Gaussian random number drawn within the standard deviation.

After each validation run (each using a different combination of parameters), a score (F1-score only) is stored, and the best combination of parameters is then used to train the full model.
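To make the idea more concrete, here is a minimal Python sketch of the sampling behavior described above. This is my simplification, not the actual C++ implementation: the function name, the way the standard deviation shrinks with the elapsed time budget, and the clipping to the allowed range are assumptions for illustration only.

```python
import math
import random


def update_arg_gauss(val, min_val, max_val, start_sigma, end_sigma,
                     elapsed_fraction, power_update, rng=random):
    # The standard deviation shrinks from start_sigma towards end_sigma as the
    # time budget is consumed, so the search domain narrows over time around
    # the best combinations found so far.
    sigma = start_sigma + (end_sigma - start_sigma) * elapsed_fraction
    coeff = rng.gauss(0.0, sigma)

    if power_update:
        # Power update: scale the current value by 2^coeff.
        new_val = val * math.pow(2.0, coeff)
    else:
        # Linear update: shift the current value by coeff.
        new_val = val + coeff

    # Keep the candidate inside the allowed range for this argument.
    return min(max(new_val, min_val), max_val)


# Example: propose a new learning rate around 0.5, halfway through the budget.
new_lr = update_arg_gauss(0.5, 0.01, 5.0, start_sigma=1.5, end_sigma=0.1,
                          elapsed_fraction=0.5, power_update=True)
```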

Argument Ranges

  • epoch: 1 to 100
  • learning rate: 0.01 to 5.00
  • dimensions: 1 to 1000
  • wordNgrams: 1 to 5
  • loss: only softmax
  • bucket size: 10,000 to 10,000,000
  • minn (min length of char ngram): 1 to 3
  • maxn (max length of char ngram): 1 to minn + 3
  • dsub (size of each sub-vector): 1 to 4

These ranges were clarified in an issue in the FastText project.

In terms of optimization metrics, there are only the f1score and labelf1score metrics.
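In the Python bindings these correspond, if I’m reading the docs right, to the autotuneMetric argument: "f1" for the overall score and "f1:__label__<name>" for a single label’s score. A minimal example (file names and label are placeholders):

```python
import fasttext

# Optimize the overall F1-score (the default behavior).
model = fasttext.train_supervised(
    input="train.txt",
    autotuneValidationFile="valid.txt",
    autotuneMetric="f1",
)

# Optimize the F1-score of one specific label instead.
model_label = fasttext.train_supervised(
    input="train.txt",
    autotuneValidationFile="valid.txt",
    autotuneMetric="f1:__label__positive",
)
```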

Advantages

  • In some domains where the FastText models are not critical in terms of accuracy/recall/precision, the time-boxed optimization can be very useful
  • Extreme simplicity of use: it only takes a few extra arguments to train_supervised() (see the sketch after this list)
  • The source code is transparent, so we can check some of the behaviors ourselves
  • The search strategy is simple and has boundaries that cut off extreme training parameters (e.g. learning rate = 10.0, epoch = 10000, wordNgrams = 70, etc.)
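As a reference for the simplicity point above, a typical call looks roughly like this: autotuneDuration (in seconds) sets the time box and autotuneModelSize additionally constrains the size of the final quantized model. The file names are of course placeholders.

```python
import fasttext

model = fasttext.train_supervised(
    input="train.txt",                   # training set
    autotuneValidationFile="valid.txt",  # validation set used to score each trial
    autotuneDuration=600,                # time box for the search, in seconds
    autotuneModelSize="2M",              # optional: also constrain the model size
)

print(model.test("valid.txt"))           # (N, precision@1, recall@1)
```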

Disadvantages

  • FastText still doesn’t provide any log about convergence. A log for each model tested would be nice.
  • The search strategy could be clarified a bit in terms of boundaries, parameter initialization, and so on
  • The boundary parameters `startSigma` and `endSigma` control the Gaussian distribution used for sampling, and I think this could be explained in the docs
  • The same goes for the hardcoded values that define the boundaries for each parameter. Something like _Based on some empirical tests we got these values; however, you can test a number of combinations and open a PR if you find some good intervals._
  • Autotune may run through several combinations of not-so-good parameters before it starts a good optimization sequence (i.e. in a search budget of 100 combinations, the first 70 can be not very useful). The main idea of Autotune is to be “automatic”, but it could be useful to have some option/configuration for a broader or more focused search.

The Jupyter Notebook can be found on my GitHub.