Depending on the cost function F that we select, we might face different problems. When the Sum of Squared Errors is selected as the cost function, the value of ∂F(Wj)/∂Wj gets larger and larger as we increase the size of the training dataset. Thus λ must be adapted to significantly smaller values.
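To make this concrete, here is a minimal sketch (using a simple one-dimensional linear model and synthetic data that are my own assumptions, not part of the original post) showing how the magnitude of the SSE gradient grows with the number of training examples N:

import numpy as np

# Hypothetical setup: a linear model y_hat = w * x with SSE cost
# F(w) = sum_i (w * x_i - y_i)^2, whose gradient is dF/dw = sum_i 2 * (w * x_i - y_i) * x_i.
rng = np.random.default_rng(0)
w = 0.5  # current weight, away from the true value of 2.0

for N in (100, 1_000, 10_000):
    x = rng.normal(size=N)
    y = 2.0 * x + rng.normal(scale=0.1, size=N)
    sse_grad = np.sum(2.0 * (w * x - y) * x)  # grows roughly linearly with N
    print(f"N={N:>6}  SSE gradient: {sse_grad:>12.2f}")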
One way to resolve this problem is to divide λ by N, where N is the size of the training dataset. So the update step of the algorithm can be rewritten as:
Wj = Wj - (λ/N) * ∂F(Wj)/∂Wj
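As a rough illustration, a batch gradient descent step with this λ/N scaling might look like the sketch below (the linear model, synthetic data, and learning rate are assumptions for the example, not taken from the post):

import numpy as np

def sse_gradient(w, X, y):
    """Gradient of the Sum of Squared Errors F(w) = sum_i (X_i . w - y_i)^2."""
    return 2.0 * X.T @ (X @ w - y)

def gradient_descent(X, y, lam=0.1, n_steps=100):
    """Batch gradient descent with the learning rate divided by N."""
    N = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - (lam / N) * sse_gradient(w, X, y)  # Wj = Wj - (λ/N) * ∂F(Wj)/∂Wj
    return w

# Hypothetical usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1_000)
print(gradient_descent(X, y))  # should approach true_w

Because the gradient is divided by N, the same λ keeps working as the dataset grows, which is exactly the point of the rescaled update.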
You can read more about this in the paper by Wilson et al., “The general inefficiency of batch training for gradient descent learning”.
Finally, another way to resolve this problem is to select a cost function that is not affected by the number of training examples that we use, such as the Mean Squared Error.
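For intuition, here is a brief sketch (again with assumed synthetic data) showing that the Mean Squared Error gradient, which averages over the examples, stays at roughly the same scale regardless of N:

import numpy as np

# MSE cost: F(w) = (1/N) * sum_i (w * x_i - y_i)^2, so its gradient already
# contains the 1/N factor and does not blow up as the dataset grows.
rng = np.random.default_rng(0)
w = 0.5

for N in (100, 1_000, 10_000):
    x = rng.normal(size=N)
    y = 2.0 * x + rng.normal(scale=0.1, size=N)
    mse_grad = np.mean(2.0 * (w * x - y) * x)  # stays roughly constant with N
    print(f"N={N:>6}  MSE gradient: {mse_grad:>8.3f}")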
This technique was used in the online gradient descent code by tingrtu in the Criteo Ad Click Competition organized by Kaggle.
Reference: http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/