understanding_logistic_regression module
'''
================================================
 Filename: understanding_logistic_regression.py
================================================
 Description: This script demonstrates the following:
 1. The SGDClassifier class in sklearn
 2. The use of progressive validation
 3. The fact that accurate recovery of the true parameters of the
    statistical model that generated the data is not necessary for good
    predictive performance.
 4. The fact that conditional log-likelihood maximization (as opposed
    to joint MLE) can still recover the true parameters of the
    conditional distribution that generated the data.
 Author: Pushpendre Rastogi
 Created: Wed Jul 22 17:55:17 2015 (-0400)
 Last-Updated:
           By:
     Update #: 248
The choice of a log-linear distribution can be justified by the fact
that the log-linear distribution achieves maximum entropy subject to
constraints on the means of the feature values. It turns out that
doing so amounts to MLE over the conditional log-likelihood (so not
really MLE, but close enough).
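As a minimal sketch of this connection (independent of the script; the
function here mirrors the loglinear_prob defined below), the log-linear
model and the moment-matching gradient of its conditional log-likelihood
look like:

```python
import numpy as np

def loglinear_prob(x, coeff):
    # p(y = 1 | x) under the log-linear (logistic) model.
    return 1.0 / (1.0 + np.exp(-np.dot(x, coeff)))

def grad_conditional_loglik(x, y, coeff):
    # The gradient of log p(y | x) w.r.t. coeff is (y - p(y=1|x)) * x.
    # Setting its expectation to zero is exactly the maximum-entropy
    # moment constraint: model feature means match empirical means.
    return (y - loglinear_prob(x, coeff)) * x
```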
In this script, at each iteration we sample the LogLinearBinaryRV, and
by finding the parameters that maximize the regularized conditional
log-likelihood through stochastic gradient descent we try to recover
the true parameters of the probability distribution that generated
the data.
Since this is a problem of recovering the parameters of a parametric
model, it is definitely a statistical problem. Note that, in the sense
that a statistical method is a correct algorithm for figuring out the
true parameters of a distribution from data, a statistician is a
computer scientist creating an algorithm.
For example, the fact that MLE is a consistent estimator is a proof
that an algorithm that solves the MLE problem is the correct
algorithm for figuring out the parameters of a distribution.
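To make the consistency claim concrete, here is a toy sketch (not part
of the script; the Bernoulli setting, seed, and sample sizes are
arbitrary choices) where the MLE is just the sample mean and its error
shrinks as the sample grows:

```python
import numpy as np

# For Bernoulli(p) the MLE of p is the sample mean; consistency means
# the estimate converges to the true p as the sample size grows.
rng = np.random.RandomState(0)
true_p = 0.3
errors = {}
for n in (100, 1000000):
    mle = rng.binomial(1, true_p, size=n).mean()
    errors[n] = abs(mle - true_p)
# errors[1000000] is far smaller than errors[100] with overwhelming
# probability.
```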
Aside from these stated goals, we also answer the following questions.
1. Was training even necessary?
   That is, could I have attained the same level of performance even
   without training the SGD classifier? This is answered by simply
   measuring the performance of the untrained classifier.
2. Can I improve the performance further?
   One way to tell is to find the performance of the Bayes-optimal
   predictor based on the true generating distribution. If the
   performance of the fitted method on the holdout set is close to
   that of the Bayes-optimal predictor, then we know that we have done
   the best we could.
3. How sensitive is the performance of the ideal Bayes-optimal
   predictor to the distribution of the true data?
   (Very sensitive; most importantly, it is sensitive to the bandwidth.)
4. How sensitive is it to changes in magnitude?
5. What is the performance of the naive Bayes estimator?
   Ng and Jordan already gave theorems about the performance of
   generative-discriminative pairs.
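For question 2, the Bayes-optimal predictor simply thresholds the true
conditional probability at 0.5. A toy one-dimensional sketch (the
coefficient, seed, and sample size are arbitrary choices, not values
from this script):

```python
import numpy as np

rng = np.random.RandomState(1)
w = 2.0                                 # true coefficient
x = rng.randn(50000)                    # features
p = 1.0 / (1.0 + np.exp(-w * x))        # true p(y = 1 | x)
y = (rng.rand(50000) < p).astype(int)   # labels drawn from p
bayes_pred = (p > 0.5).astype(int)      # Bayes-optimal decision rule
bayes_acc = np.mean(bayes_pred == y)
# No classifier can beat bayes_acc in expectation; a fitted model whose
# holdout accuracy approaches it has done essentially as well as possible.
```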
============================
NOTE: Progressive validation
============================
Progressive validation (PV) is a method for estimating the test-set
performance of a hypothesis learnt by a learning algorithm using just
the training set.
This technique was originally presented by John Langford, in a rather
different form, in his thesis, but one sentence from the thesis
captures it well:
"The progressive validation protocol has a learner repeatedly commit
to a hypothesis before it is given a new example."
This is exactly what we do with SGD-based learning. At every point
during training the learner has formed a hypothesis based on the
examples presented to it so far; we ask the learner to predict the
outcome for each example before using that example for training, and
we keep a running average of the loss incurred by the learner over the
entire training process.
We can compare this to standard cross-validation-based and
holdout-based estimates of performance.
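The protocol can be sketched without sklearn (a hand-rolled SGD
logistic learner stands in for SGDClassifier; the dimension, learning
rate, seed, and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(0)
dim, n = 5, 20000
true_w = rng.randn(dim)   # true parameters of p(y = 1 | x)
w = np.zeros(dim)         # learner's current hypothesis
lr = 0.1                  # SGD learning rate
prog_valid = 0.0          # running progressive-validation accuracy
for i in range(n):
    x = rng.randn(dim)
    p_true = 1.0 / (1.0 + np.exp(-np.dot(x, true_w)))
    y = int(rng.rand() < p_true)
    # Commit to a prediction BEFORE training on this example.
    y_hat = int(np.dot(x, w) > 0)
    prog_valid = prog_valid * (i / (i + 1)) + (y_hat == y) / (i + 1)
    # One SGD step on the conditional log-likelihood.
    p_hat = 1.0 / (1.0 + np.exp(-np.dot(x, w)))
    w += lr * (y - p_hat) * x
# prog_valid now estimates held-out accuracy using only training data.
```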

'''
from __future__ import division
from __future__ import print_function
from __future__ import with_statement
from sklearn import linear_model
import argparse
import numpy
import numpy as np
from typecheck import typecheck


def loglinear_prob(X, coeff):
    ''' The probability p(y = 1 | X) given the feature and coeff values.
    '''
    return 1 / (1 + np.exp(-np.dot(X, coeff)))
class LogLinearBinaryRV(object):
    ''' This class represents a log-linear conditional distribution
    p(y | X) that can be initialized with certain coefficients,
    can be sampled, and can tell us the probability of the output.
    '''

    def __init__(self, coeff):
        self.coeff = coeff
        self.independent_var = len(coeff)

    def sample(self):
        noise = (
            np.random.random(
                (self.independent_var,)) - 0.5) * args.noise_level
        coeff_alignment_sign = (1 if np.random.random() > 0.5 else -1)
        coeff_alignment = self.coeff * \
            args.coeff_strength * coeff_alignment_sign
        X = (noise + coeff_alignment)
        # Having two classes of coefficients is like having double the
        # number of features.
        positive_class_prob = self.probability(X)
        sample = (1
                  if numpy.random.random() < positive_class_prob
                  else 0)
        return (sample, X)

    def probability(self, X):
        ''' Return the true probability of class 1 given the input. '''
        return loglinear_prob(X, self.coeff)

    def predict(self, X):
        return 1 if self.probability(X) > 0.5 else 0
class PngPlotter(object):

    def __init__(
            self,
            X=None,
            Y=None,
            marker='o',
            title='',
            xlabel='',
            ylabel=''):
        ''' X and Y are lists of X and Y values of points. '''
        self.X = [] if X is None else X
        self.Y = [] if Y is None else Y
        self.marker = marker
        self.title = title
        self.xlabel = xlabel
        self.ylabel = ylabel

    def add_data(self, x, y):
        self.X.append(x)
        self.Y.append(y)

    def render(self):
        pass
def log_loss(y, pyeq1):
    return -numpy.log(pyeq1
                      if y == 1
                      else 1 - pyeq1)
def orthogonal_portion(v, e):
    ''' Return the portion of v that is not along e.
    '''
    e = e / np.linalg.norm(e)
    return v - e * np.dot(v, e)
@typecheck(list, (linear_model.SGDClassifier, LogLinearBinaryRV))
def get_holdout_accuracy(holdout_set, predictor_obj):
    if isinstance(predictor_obj, linear_model.SGDClassifier):
        prediction_fnc = lambda X: predictor_obj.predict(X)[0]
    elif isinstance(predictor_obj, LogLinearBinaryRV):
        prediction_fnc = predictor_obj.predict
    else:
        raise NotImplementedError
    return np.mean([y == prediction_fnc(X) for (y, X) in holdout_set])
if __name__ == '__main__':
    arg_parser = argparse.ArgumentParser(
        description='Understanding Logistic Regression')
    arg_parser.add_argument('--loss', default='log', type=str,
                            help='Objective that is optimized, [log|perceptron|hinge]')
    arg_parser.add_argument('--penalty', default='l2', type=str,
                            help='Penalty type, [l1|l2|elasticnet]')
    arg_parser.add_argument('--alpha', default=0.1, type=float,
                            help='The weight of the regularizer')
    arg_parser.add_argument('--train_sample_count', default=10000, type=int)
    arg_parser.add_argument(
        '--validation_sample_count', default=10000, type=int)
    arg_parser.add_argument('--coeff_strength', default=1.0, type=float,
                            help='How close to the true coeff is p(X)?')
    arg_parser.add_argument('--noise_level', default=2.0, type=float,
                            help='The strength of the noise to add.')
    arg_parser.add_argument(
        '--bandwidth', default=1.0, type=float, help='Default={1.0}')
    arg_parser.add_argument(
        '--prob_dim', default=100, type=int, help='Default={100}')
    args = arg_parser.parse_args()
    # The true coefficients that define the log-linear distribution p(y|x)
    coeff = np.random.random_sample(
        size=(args.prob_dim,)) / np.sqrt(args.prob_dim / 3)
    coeff *= args.bandwidth
    LogLinearConditionalPDFObject = LogLinearBinaryRV(coeff)
    clsfr = linear_model.SGDClassifier(
        loss=args.loss,        # log, perceptron, hinge
        penalty=args.penalty,  # l1, l2, elasticnet
        alpha=args.alpha,      # Alpha is the weight of the regularizer
        fit_intercept=False,   # Fit the bias/intercept term or not
        warm_start=False)      # When set to True, reuse the solution of the previous call
    prog_valid_plotter = PngPlotter(title='Progressive Validation')
    coeff_error_plotter = PngPlotter(
        title='Relative Squared Error of distribution parameters')
    prog_loss_plotter = PngPlotter(title='Progressive Log Loss')
    prog_valid, coeff_error, prog_regret_log_loss = 0.0, 0.0, 0.0
    # Create a holdout set based on samples from the true distribution.
    holdout_set = [
        LogLinearConditionalPDFObject.sample() for _ in range(
            args.validation_sample_count)]
    # Find the holdout accuracy of the Bayes-optimal classifier.
    holdout_accuracy_bayes_optimal = get_holdout_accuracy(
        holdout_set, LogLinearConditionalPDFObject)
    # Performance of the majority baseline.
    mbp = np.mean([y for (y, X) in holdout_set])
    mbp = max(mbp, 1 - mbp)
    print('\n mbp', mbp, )
    # Fit once before starting progressive validation.
    (y, X) = LogLinearConditionalPDFObject.sample()
    clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    clsfr.coef_ = np.zeros(clsfr.coef_.shape)
    # Holdout accuracy before starting training.
    holdout_accuracy_before_training = get_holdout_accuracy(holdout_set, clsfr)
    for idx in range(args.train_sample_count):
        (y, X) = LogLinearConditionalPDFObject.sample()
        y_hat = clsfr.predict(X)[0]
        prog_valid *= float(idx) / float(idx + 1)
        prog_valid += (1.0 if y_hat == y else 0.0) / (idx + 1)
        coeff_error = np.linalg.norm(
            clsfr.coef_ - coeff) / np.linalg.norm(coeff)
        true_log_prob = log_loss(
            y,
            LogLinearConditionalPDFObject.probability(X))
        estimated_log_prob = log_loss(
            y,
            loglinear_prob(
                X,
                clsfr.coef_.squeeze()))
        prog_regret_log_loss *= float(idx) / float(idx + 1)
        prog_regret_log_loss += numpy.abs(true_log_prob -
                                          estimated_log_prob) / float(idx + 1)
        prog_valid_plotter.add_data(idx, prog_valid)
        coeff_error_plotter.add_data(idx, coeff_error)
        prog_loss_plotter.add_data(idx, prog_regret_log_loss)
        clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    # Now test the performance of the fitted logistic classifier.
    holdout_accuracy_after_training = get_holdout_accuracy(holdout_set, clsfr)
    print(
        '\n prog_valid', prog_valid,
        '\n holdout_accuracy_bayes_optimal', holdout_accuracy_bayes_optimal,
        '\n holdout_accuracy_before_training', holdout_accuracy_before_training,
        '\n holdout_accuracy_after_training', holdout_accuracy_after_training,
    )
    # Compare the recovered parameters to the true parameters.
    # NOTE: clsfr.coef_ is the recovered parameter and coeff was the
    # true parameter.
    erroneous_vec_fraction = np.linalg.norm(
        orthogonal_portion(clsfr.coef_, coeff)) / np.linalg.norm(clsfr.coef_)
    print(
        '\n erroneous_vec_fraction', erroneous_vec_fraction,
        '\n prog_regret_log_loss', prog_regret_log_loss
    )
    prog_valid_plotter.render()
    coeff_error_plotter.render()
    prog_loss_plotter.render()
    # Log loss of the optimal predictor.
    log_loss_optimal = np.mean(
        [log_loss(y, LogLinearConditionalPDFObject.probability(X))
         for (y, X)
         in holdout_set])
    log_loss_estimated = np.mean(
        [log_loss(y, loglinear_prob(X, clsfr.coef_.squeeze()))
         for (y, X)
         in holdout_set])
    print('\n log_loss_optimal', log_loss_optimal,
          '\n log_loss_estimated', log_loss_estimated, )
'''
Look at how the log loss truly is minimized (only with the log loss
objective), yet the validation error is basically the same even when
we use the hinge loss.
log 0.885 0.237379757873 0.0708640472617
log 0.894 0.128038074657 0.0550941428819
log 0.874 0.133218984176 0.0914931968861
hinge 0.895 0.473717543562 0.215030860105
hinge 0.89 0.434286919844 0.216065130295
hinge 0.87 0.462364313401 0.240906415183
perceptron 0.81 1.00405480928 0.619301226865
perceptron 0.767 1.00245422266 0.6231903911
perceptron 0.819 0.999211001065 0.601749697036
Note that I am only able to recover the
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 0.0 --train_sample_count 80000
0.519575 0.684376979344 0.0391865540893
0.5225 0.519575
0.692045918504 0.692365926667
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 1.0 --train_sample_count 80000
0.7309375 0.0683319657749 0.0253992530515
0.7264 0.7309375
0.585110828186 0.585081212155
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 8.0 --train_sample_count 80000
0.9997 0.151844644062 0.000821186056188
0.9994 0.9997
0.00516971678716 0.00518374450427
Really what this tells me is that there are limits to recovering the
true parameters of the true system, and that if we only try to build a
good predictor we can often succeed more spectacularly than if we try
to estimate the parameters of the system that generated the data.
mbp 0.5174
prog_valid 0.727
holdout_accuracy_bayes_optimal 0.7546
holdout_accuracy_before_training 0.5174
holdout_accuracy_after_training 0.7432
erroneous_vec_fraction 0.546445772363
prog_regret_log_loss 0.216014496595
log_loss_optimal 0.530312347206
log_loss_estimated 0.548764473949
'''