Learning theory provides probabilitic guarantees on the deviation between training error of a certain size and true errors. I don’t think people actually use these bounds literally, rather these bounds are used to get some sense of trends. For example, the wisdom that the number of samples needed for learning grows linearly with the number of parameters in the model which can be justfied because the sample complexity of ERM increases linearly with the VC dimension (Ng & Jordan, 2002) (Shalev-Shwartz & Ben-David, 2014)

This formula comes with a large number of conditions. This is for 0-1 loss on a binary classification problem for the ERM algorithm in the agnostic setting. In this post I have put up an online script that calculates the sample complexities for some hypothesis spaces. Most useful case perhaps is the setting when “is_validation” is set to 1. This case can basically help us design the number of samples to keep in our dataset for good generalization to the test data. Of course if the test data has a different distribution then these bounds won’t be useful.


  1. Ng, A., & Jordan, I. M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in NIPS 14.
  2. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.