# understanding_logistic_regression module

## Filename: understanding_logistic_regression.py

Description: This script demonstrates the following:
1. The SGDClassifier class in sklearn
2. The use of progressive validation
3. The fact that accurate recovery of the true parameters of the
statistical model that generated the data is not necessary for good
predictive performance.
4. The fact that conditional loglikelihood maximization (as opposed
to joint MLE) can still recover the true parameters of the
conditional distribution that generated the data.
Author: Pushpendre Rastogi
Created: Wed Jul 22 17:55:17 2015 (-0400)
Last-Updated:
By:
Update #: 248

The choice of a log-linear distribution can be justified by the fact that the log-linear distribution achieves maximum entropy subject to constraints on the means of the feature values. It turns out that doing so amounts to MLE over the conditional log-likelihood (so not really MLE, but close enough).
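Concretely, the log-linear conditional probability used throughout the script is just the logistic sigmoid of a linear score; this mirrors the module's own `loglinear_prob`:

```python
import numpy as np

def loglinear_prob(X, coeff):
    # p(y = 1 | X) under a log-linear (logistic) model with weights coeff.
    return 1.0 / (1.0 + np.exp(-np.dot(X, coeff)))

# A zero score leaves the two classes equally likely.
print(loglinear_prob(np.zeros(3), np.ones(3)))  # 0.5
```

Large positive scores push the probability toward 1, large negative scores toward 0, which is what makes the maximum-entropy and logistic-regression views coincide.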

In this script, at each iteration we sample from the LogLinearBinaryRV. By finding the parameters that maximize the regularized conditional log-likelihood through stochastic gradient descent, we try to recover the true parameters of the probability distribution that generated the data.

Since this is a problem of recovering the parameters of a parametric model, it is definitely a statistical problem. Note that, in the sense that a statistical method is a correct algorithm for figuring out the true parameters of a distribution from data, a statistician is a computer scientist creating an algorithm.

For example, the fact that the MLE is a consistent estimator is a proof that an algorithm that solves the MLE problem is a correct algorithm for figuring out the parameters of a distribution.

Aside from these stated goals, we also answer the following questions.

1. Was training even necessary? That is, could I have attained the same level of performance without training the SGD classifier? This is answered by simply measuring the performance of the untrained classifier.
2. Can I improve the performance further? One way to check is to compute the performance of the Bayes-optimal predictor based on the true generating distribution. If the performance of the fitted method on the holdout set is close to that of the Bayes-optimal predictor, then we know we have done the best we could.
3. How sensitive is the performance of the ideal Bayes-optimal predictor to the distribution of the true data? (Very sensitive; most importantly, it is sensitive to the bandwidth.)
4. How sensitive is it to changes in magnitude?
5. What is the performance of the naive Bayes estimator? Ng and Jordan already gave theorems about the performance of generative-discriminative pairs.
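On question 2, the Bayes-optimal predictor thresholds the true conditional probability at 0.5, so its expected accuracy on a sample is the mean of max(p, 1 - p). A minimal sketch, where `bayes_optimal_accuracy` is a hypothetical helper (the script itself estimates this quantity empirically via `get_holdout_accuracy`):

```python
import numpy as np

def bayes_optimal_accuracy(true_probs):
    # The Bayes-optimal predictor outputs 1 iff p(y=1|x) > 0.5, so its
    # chance of being right on an input is max(p, 1 - p); average over
    # the sample of inputs.
    p = np.asarray(true_probs)
    return float(np.mean(np.maximum(p, 1.0 - p)))

print(bayes_optimal_accuracy([0.9, 0.2, 0.5]))  # (0.9 + 0.8 + 0.5) / 3 ≈ 0.733
```

This is the ceiling against which the fitted classifier's holdout accuracy is compared.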

## NOTE: Progressive validation

PV is a method for estimating the test-set performance of a hypothesis learnt by a learning algorithm using just the training set. The technique was originally presented by John Langford in his thesis, in a rather different form, but one sentence from the thesis captures it well: “The progressive validation protocol has a learner repeatedly commit to a hypothesis before it is given a new example.” This is exactly what we do with SGD-based learning: at every point during training the learner has formed a hypothesis based on the examples presented to it so far. We ask the learner to predict the outcome for an example before using that example for training, and we keep a running average of the loss incurred by the learner during the entire training process. We can compare this estimate to standard cross-validation and holdout based estimates of performance.
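A minimal sketch of the progressive-validation bookkeeping, with a toy majority-vote learner standing in for the SGD classifier (the stream and the learner here are purely illustrative):

```python
# Toy progressive-validation loop: the learner commits to a prediction
# BEFORE it sees the label, and we keep a running average of 0/1 accuracy.
stream = [1, 1, 0, 1, 1]   # labels arriving one at a time
counts = [0, 0]            # label counts seen so far (ties predict 0)
prog_valid = 0.0           # running progressive-validation accuracy

for idx, y in enumerate(stream):
    y_hat = 1 if counts[1] > counts[0] else 0  # predict first
    # Incremental running-average update of the 0/1 accuracy.
    prog_valid *= idx / (idx + 1)
    prog_valid += (1.0 if y_hat == y else 0.0) / (idx + 1)
    counts[y] += 1                             # only now "train" on the label

print(prog_valid)  # ≈ 0.6 on this stream (3 of 5 predictions correct)
```

The same incremental-average update appears in the training loop of the script, applied to both the 0/1 accuracy and the log-loss regret.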

class `understanding_logistic_regression.LogLinearBinaryRV(coeff)`

Bases: `object`

This class represents a log-linear conditional distribution p(y | X). It is initialized with a coefficient vector, can be sampled, and can report the true probability of an output.

`predict(X)`

`probability(X)`

Return the true probability of class 1 given the input

`sample()`

class `understanding_logistic_regression.PngPlotter(X=None, Y=None, marker='o', title='', xlabel='', ylabel='')`

Bases: `object`

`add_data(x, y)`

`render()`

`understanding_logistic_regression.log_loss(y, pyeq1)`

`understanding_logistic_regression.loglinear_prob(X, coeff)`

Return the probability p(y = 1 | X) given the input X and the coefficient values coeff.

`understanding_logistic_regression.orthogonal_portion(v, e)`

Return the portion of v that is not along e.
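For intuition, a small worked example of this projection (the function body below mirrors the module's own implementation):

```python
import numpy as np

def orthogonal_portion(v, e):
    # Subtract from v its projection onto the unit vector along e.
    e = e / np.linalg.norm(e)
    return v - e * np.dot(v, e)

print(orthogonal_portion(np.array([3.0, 4.0]), np.array([1.0, 0.0])))  # [0. 4.]
```

The script uses this to measure what fraction of the recovered coefficient vector points away from the true coefficient direction (`erroneous_vec_fraction`).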

```'''
================================================
| Filename: understanding_logistic_regression.py
================================================

| Description: This script demonstrates the following:
|  1. The SGDClassifier class in sklearn
|  2. The use of progressive validation
|  3. The fact that accurate recovery of true parameters of the
|     statistical model that generated data is not necessary for good
|     predictive performance.
|  4. The fact that conditional loglikelihood maximization (as opposed
|     to joint MLE) can still recover the true parameters of the
|     conditional distribution that generated the data.
| Author: Pushpendre Rastogi
| Created: Wed Jul 22 17:55:17 2015 (-0400)
| Last-Updated:
|           By:
|     Update #: 248

The choice of Loglinear distribution can be justified through the fact
that the log linear distribution
achieves maximum entropy subject to constraints over means of the
feature values. It turns out that doing so amounts to MLE over the
conditional loglikelihood (so not really MLE but close enough).

In this script at each iteration we would sample the LogLinearBinaryRV.
And by finding the parameters that maximize the regularized conditional
log-likelihood through stochastic gradient descent we would try to
recover the true parameters of the probability distribution that
generate the data.

Since this is a problem of recovering the parameters of a parametric
model, it is definitely a statistical problem. Note that, in the sense
that a statistical method is a correct algorithm for figuring out
the true parameters of a distribution from data, a statistician
is a computer scientist creating an algorithm.

For example the fact that MLE is a consistent estimator is a proof
that an algorithm that solves the MLE problem is the correct
algorithm for figuring out the parameters of a distribution.

Aside from these stated goals we also answer the following questions.

1. Was training even necessary?
As in could I have attained the same level of performance even without
training the SGD classifier. This is answered by just finding the
performance of the untrained classifier.
2. Can I improve the performance further?
One way is to find out the performance of the bayes optimal
predictor based on the true generating distribution. If the performance
of the fitted method on the holdout set is close to the bayes optimal
predictor then we know that we have done the best we could.
3. How sensitive is the performance of the ideal bayes optimal predictor to the
distribution of the true data?
(Very sensitive, most importantly it is sensitive to the bandwidth)
4. How sensitive is it to changes in magnitude?
5. What is the performance of the naive bayes estimator?
Ng and Jordan already gave theorems about the performance of
generative-discriminative pairs.

============================
NOTE: Progressive validation
============================

PV is a method for
estimating the test set performance of a hypothesis learnt by a
learning algorithm using just the training set.
This technique was originally presented by John Langford in a very
different way in his thesis. But one sentence from his thesis works well:
"The progressive validation protocol has a learner repeatedly commit
to a hypothesis before it is given a new example."
This is exactly what we are doing with SGD based learning. At every
point during training the learner has formed a hypothesis based on
the examples that were presented to it so far. We ask the learner
to predict the outcome for an example before using that example for
training, and we keep a running average of the loss incurred by our
learner during the entire training process.
We can compare it to the standard cross-validation based estimates
of performance and holdout based estimates of performance.

------------------------------------------------------------------
'''
from __future__ import division
from __future__ import print_function
from __future__ import with_statement
from sklearn import linear_model
import argparse
import numpy
import numpy as np
from typecheck import typecheck

def loglinear_prob(X, coeff):
    '''Return the probability p(y = 1 | X) given X and the coeff values.
    '''
    return 1 / (1 + np.exp(-np.dot(X, coeff)))

class LogLinearBinaryRV(object):

    ''' This class represents a loglinear conditional distribution
    p(y | X) that can be initialized with certain coefficients
    and which can be sampled and it can tell us the probability of
    the output.
    '''

    def __init__(self, coeff):
        self.coeff = coeff
        self.independent_var = len(coeff)

    def sample(self):
        # NOTE: relies on the module-level `args` parsed in __main__.
        noise = (
            np.random.random(
                (self.independent_var,)) - 0.5) * args.noise_level
        coeff_alignment_sign = (1 if np.random.random() > 0.5 else -1)
        coeff_alignment = (self.coeff *
                           args.coeff_strength * coeff_alignment_sign)
        X = (noise + coeff_alignment)
        # Having two classes of coefficients is like having double the
        # number of features.
        positive_class_prob = self.probability(X)
        sample = (1
                  if numpy.random.random() < positive_class_prob
                  else 0)
        return (sample, X)

    def probability(self, X):
        ''' Return the true probability of class 1 given the input '''
        return loglinear_prob(X, self.coeff)

    def predict(self, X):
        return 1 if self.probability(X) > 0.5 else 0

class PngPlotter(object):

    def __init__(
            self,
            X=None,
            Y=None,
            marker='o',
            title='',
            xlabel='',
            ylabel=''):
        ''' X and Y are lists of X and Y values of points '''
        self.X = [] if X is None else X
        self.Y = [] if Y is None else Y
        self.marker = marker

    def add_data(self, x, y):
        self.X.append(x)
        self.Y.append(y)

    def render(self):
        pass

def log_loss(y, pyeq1):
    return - numpy.log(pyeq1
                       if y == 1
                       else 1 - pyeq1)

def orthogonal_portion(v, e):
    ''' Return the portion of v that is not along e.
    '''
    e = e / np.linalg.norm(e)
    return v - e * np.dot(v, e)

@typecheck(list, (linear_model.SGDClassifier, LogLinearBinaryRV))
def get_holdout_accuracy(holdout_set, predictor_obj):
    if isinstance(predictor_obj, linear_model.SGDClassifier):
        prediction_fnc = lambda X: predictor_obj.predict(X)[0]
    elif isinstance(predictor_obj, LogLinearBinaryRV):
        prediction_fnc = predictor_obj.predict
    else:
        raise NotImplementedError
    return np.mean([y == prediction_fnc(X) for (y, X) in holdout_set])

if __name__ == '__main__':
    arg_parser = argparse.ArgumentParser(
        description='Understanding Logistic Regression')
    # NOTE: Argument names below are recovered from their uses later in
    # this script; defaults marked "assumed" are reconstructions.
    arg_parser.add_argument(
        '--loss', default='log',  # assumed default
        help='Objective that is optimized, [log|perceptron|hinge]')
    arg_parser.add_argument(
        '--penalty', default='l2',  # assumed default
        help='Penalty type, [l1|l2|elasticnet]')
    arg_parser.add_argument(
        '--alpha', default=0.01, type=float,  # assumed default
        help='The weight of the regularizer')
    arg_parser.add_argument(
        '--train_sample_count', default=1000, type=int)  # assumed default
    arg_parser.add_argument(
        '--validation_sample_count', default=10000, type=int)
    arg_parser.add_argument(
        '--coeff_strength', default=1.0, type=float,  # assumed default
        help='How close to true coeff is p(X)?')
    arg_parser.add_argument(
        '--noise_level', default=1.0, type=float,  # assumed default
        help='The strength of noise to add.')
    arg_parser.add_argument(
        '--bandwidth', default=1.0, type=float, help='Default={1.0}')
    arg_parser.add_argument(
        '--prob_dim', default=100, type=int, help='Default={100}')
    args = arg_parser.parse_args()
    # The true coefficients that define the log-linear distribution p(y|x)
    coeff = np.random.random_sample(
        size=(args.prob_dim,)) / np.sqrt(args.prob_dim / 3)
    coeff *= args.bandwidth
    LogLinearConditionalPDFObject = LogLinearBinaryRV(coeff)
    clsfr = linear_model.SGDClassifier(
        loss=args.loss,        # log, perceptron, hinge
        penalty=args.penalty,  # l1, l2, elasticnet
        alpha=args.alpha,      # Alpha is the weight of the regularizer
        fit_intercept=False,   # Fit the bias/intercept term or not
        warm_start=False)      # When set to True, reuse the solution of the previous call
    prog_valid_plotter = PngPlotter(title='Progressive Validation')
    coeff_error_plotter = PngPlotter(
        title='Relative Squared Error of distribution parameters')
    prog_loss_plotter = PngPlotter(title='Progressive Log Loss')
    prog_valid, coeff_error, prog_regret_log_loss = 0.0, 0.0, 0.0
    # Create a holdout set based on samples from the true distribution.
    holdout_set = [
        LogLinearConditionalPDFObject.sample() for _ in range(
            args.validation_sample_count)]

    # Find the holdout accuracy of the bayes optimal classifier
    holdout_accuracy_bayes_optimal = get_holdout_accuracy(
        holdout_set, LogLinearConditionalPDFObject)
    # Performance of the majority baseline
    mbp = np.mean([y for (y, X) in holdout_set])
    mbp = max(mbp, 1 - mbp)
    print('\n mbp', mbp, )
    # Fit once before starting progressive validation.
    (y, X) = LogLinearConditionalPDFObject.sample()
    clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    clsfr.coef_ = np.zeros(clsfr.coef_.shape)
    # Holdout accuracy before starting training
    holdout_accuracy_before_training = get_holdout_accuracy(holdout_set, clsfr)
    for idx in range(args.train_sample_count):
        (y, X) = LogLinearConditionalPDFObject.sample()
        y_hat = clsfr.predict(X)[0]
        prog_valid *= float(idx) / float(idx + 1)
        prog_valid += (1.0 if y_hat == y else 0.0) / (idx + 1)
        coeff_error = np.linalg.norm(
            clsfr.coef_ - coeff) / np.linalg.norm(coeff)
        true_log_prob = log_loss(
            y,
            LogLinearConditionalPDFObject.probability(X))
        estimated_log_prob = log_loss(
            y,
            loglinear_prob(
                X,
                clsfr.coef_.squeeze()))
        prog_regret_log_loss *= float(idx) / float(idx + 1)
        prog_regret_log_loss += numpy.abs(true_log_prob -
                                          estimated_log_prob) / float(idx + 1)
        clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    # Now test the performance of the fitted logistic classifier
    holdout_accuracy_after_training = get_holdout_accuracy(holdout_set, clsfr)
    print(
        '\n prog_valid', prog_valid,
        '\n holdout_accuracy_bayes_optimal', holdout_accuracy_bayes_optimal,
        '\n holdout_accuracy_before_training', holdout_accuracy_before_training,
        '\n holdout_accuracy_after_training', holdout_accuracy_after_training,
    )
    # Find out the recovered parameters versus the true parameters
    # NOTE: clsfr.coef_ is the recovered parameter and coeff was the
    # true parameter
    erroneous_vec_fraction = np.linalg.norm(
        orthogonal_portion(clsfr.coef_, coeff)) / np.linalg.norm(clsfr.coef_)
    print(
        '\n erroneous_vec_fraction', erroneous_vec_fraction,
        '\n prog_regret_log_loss', prog_regret_log_loss
    )
    prog_valid_plotter.render()
    coeff_error_plotter.render()
    prog_loss_plotter.render()
    # Log Loss of the optimal predictor.
    log_loss_optimal = np.mean(
        [log_loss(y, LogLinearConditionalPDFObject.probability(X))
         for (y, X)
         in holdout_set])
    log_loss_estimated = np.mean(
        [log_loss(y, loglinear_prob(X, clsfr.coef_.squeeze()))
         for (y, X)
         in holdout_set])
    print('\n log_loss_optimal', log_loss_optimal,
          '\n log_loss_estimated', log_loss_estimated, )
'''
Look at how the log-loss is truly minimized but the validation error
is basically the same even when we use the hinge loss.
log 0.885 0.237379757873 0.0708640472617
log 0.894 0.128038074657 0.0550941428819
log 0.874 0.133218984176 0.0914931968861
hinge 0.895 0.473717543562 0.215030860105
hinge 0.89 0.434286919844 0.216065130295
hinge 0.87 0.462364313401 0.240906415183
perceptron 0.81 1.00405480928 0.619301226865
perceptron 0.767 1.00245422266 0.6231903911
perceptron 0.819 0.999211001065 0.601749697036

Note that I am only able to recover the true parameters when the
coefficient strength is large:
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 0.0 --train_sample_count 80000
0.519575 0.684376979344 0.0391865540893
0.5225 0.519575
0.692045918504 0.692365926667
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 1.0 --train_sample_count 80000
0.7309375 0.0683319657749 0.0253992530515
0.7264 0.7309375
0.585110828186 0.585081212155
python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 8.0 --train_sample_count 80000
0.9997 0.151844644062 0.000821186056188
0.9994 0.9997
0.00516971678716 0.00518374450427
Really what this tells me is that there are limits to recovering the
true parameters of the true system, and if we only try to build a
good predictor we can often succeed more spectacularly than if
we try to estimate the parameters of the system that generated the
data.

mbp 0.5174

prog_valid 0.727
holdout_accuracy_bayes_optimal 0.7546
holdout_accuracy_before_training 0.5174
holdout_accuracy_after_training 0.7432

erroneous_vec_fraction 0.546445772363
prog_regret_log_loss 0.216014496595

log_loss_optimal 0.530312347206
log_loss_estimated 0.548764473949
'''
```