'''
================================================
| Filename: understanding_logistic_regression.py
================================================

| Description: This script demonstrates the following:
|  1. The SGDClassifier class in sklearn
|  2. The use of progressive validation
|  3. The fact that accurate recovery of true parameters of the
|     statistical model that generated data is not necessary for good
|     predictive performance.
|  4. The fact that conditional loglikelihood maximization (as opposed
|     to joint MLE) can still recover the true parameters of the
|     conditional distribution that generated the data.
| Author: Pushpendre Rastogi
| Created: Wed Jul 22 17:55:17 2015 (-0400)
| Last-Updated:
|           By:
|     Update #: 248

The choice of a log-linear distribution can be justified by the fact
that it achieves maximum entropy subject to constraints on the means
of the feature values. It turns out that fitting such a distribution
amounts to maximizing the conditional log-likelihood (so not really
MLE, but close enough).
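
Concretely, in the notation of loglinear_prob defined below (a sketch
of the standard derivation, not a new result):

    p(y = 1 | X) = 1 / (1 + exp(-<X, coeff>))
    log p(y | X) = y * <X, coeff> - log(1 + exp(<X, coeff>))
    d/d(coeff) log p(y | X) = (y - p(y = 1 | X)) * X

so SGD on the regularized negative conditional log-likelihood, which
is what SGDClassifier(loss='log') runs, follows exactly this gradient
plus the gradient of the regularizer.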

In this script, at each iteration we sample from the LogLinearBinaryRV.
By finding the parameters that maximize the regularized conditional
log-likelihood through stochastic gradient descent, we try to recover
the true parameters of the probability distribution that generated the
data.

Since this is a problem of recovering the parameters of a parametric
model, it is definitely a statistical problem. Note that, in the sense
that a statistical method is a correct algorithm for figuring out the
true parameters of a distribution from data, a statistician is a
computer scientist creating an algorithm.

For example, the fact that MLE is a consistent estimator is a proof
that an algorithm solving the MLE problem is a correct algorithm for
figuring out the parameters of a distribution.


Aside from these stated goals, we also answer the following questions.

1. Was training even necessary?
   That is, could I have attained the same level of performance without
   training the SGD classifier at all? This is answered by measuring
   the performance of the untrained classifier.
2. Can I improve the performance further?
   One way is to measure the performance of the Bayes-optimal predictor
   based on the true generating distribution. If the performance of the
   fitted model on the holdout set is close to that of the
   Bayes-optimal predictor, then we know we have done the best we could
   (the Bayes-optimal rule is restated just after this list).
3. How sensitive is the performance of the ideal Bayes-optimal
   predictor to the distribution of the true data?
   (Very sensitive; most importantly, it is sensitive to the
   bandwidth.)
4. How sensitive is it to changes in magnitude?
5. What is the performance of the naive Bayes estimator?
   Ng and Jordan already gave theorems about the performance of
   generative-discriminative pairs.
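
For reference, the Bayes-optimal rule restated (this is what
LogLinearBinaryRV.predict below implements; the notation is a sketch):
predict y = 1 iff p(y = 1 | X) > 1/2, and its expected accuracy is

    E_X[ max(p(y = 1 | X), 1 - p(y = 1 | X)) ]

which get_holdout_accuracy estimates empirically on the holdout set.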

============================
NOTE: Progressive validation
============================

PV is a method for estimating the test-set performance of a hypothesis
learnt by a learning algorithm, using just the training set.
The technique was originally presented by John Langford in a very
different way in his thesis, but one sentence from the thesis sums it
up well:
"The progressive validation protocol has a learner repeatedly commit
to a hypothesis before it is given a new example."
This is exactly what we do with SGD-based learning. At every point
during training the learner has formed a hypothesis based on the
examples presented to it so far; we ask the learner to predict the
outcome for each example before using that example for training, and
we keep a running average of the loss incurred by the learner over the
entire training process. We can compare this estimate to the standard
cross-validation and holdout based estimates of performance.
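
A minimal sketch of the protocol (stream_of_samples and model are
hypothetical names; any model exposing the partial_fit/predict
interface of SGDClassifier would do):

    correct = 0
    for t, (y, X) in enumerate(stream_of_samples, start=1):
        # Commit to a prediction BEFORE training on this example ...
        correct += int(model.predict(X.reshape(1, -1))[0] == y)
        # ... and only then train on it.
        model.partial_fit(X.reshape(1, -1), [y], classes=[0, 1])
    progressive_accuracy = correct / t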

------------------------------------------------------------------
'''
from __future__ import division
from __future__ import print_function
from __future__ import with_statement
from sklearn import linear_model
import argparse
import numpy
import numpy as np
from typecheck import typecheck


def loglinear_prob(X, coeff):
    '''Return the log-linear model's probability p(y = 1 | X) for the
    given coeff values, i.e. the logistic sigmoid of <X, coeff>.
    '''
    return 1 / (1 + np.exp(-np.dot(X, coeff)))
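
# Illustrative check of loglinear_prob (assumed values, not part of the
# experiment): a zero score gives even odds, e.g.
# loglinear_prob(np.zeros(3), np.ones(3)) == 0.5.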


class LogLinearBinaryRV(object):

    ''' A log-linear conditional distribution p(y | X) that is
    initialized with fixed coefficients, can be sampled from, and can
    report the true probability of either output.
    '''

    def __init__(self, coeff):
        self.coeff = coeff
        self.independent_var = len(coeff)

    def sample(self):
        # NOTE: relies on the module-level ``args`` parsed in __main__.
        noise = (
            np.random.random(
                (self.independent_var,)) - 0.5) * args.noise_level
        # Flip a fair coin to align X with either +coeff or -coeff.
        coeff_alignment_sign = (1 if np.random.random() > 0.5 else -1)
        coeff_alignment = self.coeff * \
            args.coeff_strength * coeff_alignment_sign
        X = (noise + coeff_alignment)
        # Having two classes of coefficients is like having double the
        # number of features.
        positive_class_prob = self.probability(X)
        # Draw the label from Bernoulli(p(y = 1 | X)).
        sample = (1
                  if numpy.random.random() < positive_class_prob
                  else 0)
        return (sample, X)

    def probability(self, X):
        ''' Return the true probability of class 1 given the input '''
        return loglinear_prob(X, self.coeff)

    def predict(self, X):
        return 1 if self.probability(X) > 0.5 else 0


class PngPlotter(object):

    def __init__(
            self,
            X=None,
            Y=None,
            marker='o',
            title='',
            xlabel='',
            ylabel=''):
        ''' X and Y are lists of X and Y values of points '''
        self.X = [] if X is None else X
        self.Y = [] if Y is None else Y
        self.marker = marker
        self.title = title
        self.xlabel = xlabel
        self.ylabel = ylabel

    def add_data(self, x, y):
        self.X.append(x)
        self.Y.append(y)

    def render(self):
        # Rendering is a no-op stub in this script; the collected data
        # and labels are kept so a plotting backend could be added.
        pass


def log_loss(y, pyeq1):
    '''Negative log-likelihood of label y given p(y = 1) = pyeq1.'''
    return - numpy.log(pyeq1
                       if y == 1
                       else 1 - pyeq1)
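
# Illustrative check of log_loss (assumed values): a maximally
# uncertain prediction costs log(2), e.g. log_loss(1, 0.5) == numpy.log(2).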


def orthogonal_portion(v, e):
    ''' Return the portion of v that is not along e.
    '''
    # Normalize e, then subtract the projection of v onto it.
    e = e / np.linalg.norm(e)
    return v - e * np.dot(v, e)
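
# Illustrative check of orthogonal_portion (assumed values): with
# v = np.array([1.0, 1.0]) and e = np.array([1.0, 0.0]) the result is
# [0.0, 1.0]. Its norm relative to ||v|| is how erroneous_vec_fraction
# below measures the part of the learnt coefficients that points away
# from the true ones.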


@typecheck(list, (linear_model.SGDClassifier, LogLinearBinaryRV))
def get_holdout_accuracy(holdout_set, predictor_obj):
    if isinstance(predictor_obj, linear_model.SGDClassifier):
        # sklearn expects a 2D array of shape (n_samples, n_features).
        prediction_fnc = lambda X: predictor_obj.predict(X.reshape(1, -1))[0]
    elif isinstance(predictor_obj, LogLinearBinaryRV):
        prediction_fnc = predictor_obj.predict
    else:
        raise NotImplementedError
    return np.mean([y == prediction_fnc(X) for (y, X) in holdout_set])

if __name__ == '__main__':
    arg_parser = argparse.ArgumentParser(
        description='Understanding Logistic Regression')
    arg_parser.add_argument('--loss', default='log', type=str,
                            help='Objective that is optimized, [log|perceptron|hinge]')
    arg_parser.add_argument('--penalty', default='l2', type=str,
                            help='Penalty type, [l1|l2|elasticnet]')
    arg_parser.add_argument('--alpha', default=0.1, type=float,
                            help='The weight of the regularizer')
    arg_parser.add_argument('--train_sample_count', default=10000, type=int)
    arg_parser.add_argument(
        '--validation_sample_count', default=10000, type=int)
    arg_parser.add_argument('--coeff_strength', default=1.0, type=float,
                            help='How strongly X aligns with the true coeff.')
    arg_parser.add_argument('--noise_level', default=2.0, type=float,
                            help='The strength of noise to add.')
    arg_parser.add_argument(
        '--bandwidth', default=1.0, type=float, help='Default=1.0')
    arg_parser.add_argument(
        '--prob_dim', default=100, type=int, help='Default=100')
    args = arg_parser.parse_args()
    # The true coefficients that define the log-linear distribution p(y|x)
    coeff = np.random.random_sample(
        size=(args.prob_dim,)) / np.sqrt(args.prob_dim / 3)
    coeff *= args.bandwidth
    LogLinearConditionalPDFObject = LogLinearBinaryRV(coeff)
    clsfr = linear_model.SGDClassifier(
        loss=args.loss,        # log, perceptron, hinge
        penalty=args.penalty,  # l1, l2, elasticnet
        alpha=args.alpha,      # Alpha is the weight of the regularizer
        fit_intercept=False,   # Fit the bias/intercept term or not
        warm_start=False)      # When set to True, reuse the solution of the previous call
    prog_valid_plotter = PngPlotter(title='Progressive Validation')
    coeff_error_plotter = PngPlotter(
        title='Relative error of distribution parameters')
    prog_loss_plotter = PngPlotter(title='Progressive Log Loss')
    prog_valid, coeff_error, prog_regret_log_loss = 0.0, 0.0, 0.0
    # Create a holdout set of samples from the true conditional distribution.
    holdout_set = [
        LogLinearConditionalPDFObject.sample() for _ in range(
            args.validation_sample_count)]

    # Find the holdout accuracy of the Bayes-optimal classifier.
    holdout_accuracy_bayes_optimal = get_holdout_accuracy(
        holdout_set, LogLinearConditionalPDFObject)
    # Performance of the majority baseline: always predict the class
    # that is more frequent in the holdout set.
    mbp = np.mean([y for (y, X) in holdout_set])
    mbp = max(mbp, 1 - mbp)
    print('\n mbp', mbp, )
    # Fit once before starting progressive validation: this single call
    # to partial_fit gives clsfr.coef_ its shape and registers the
    # classes; the coefficients are then zeroed so that the "before
    # training" baseline below comes from a genuinely untrained model.
    (y, X) = LogLinearConditionalPDFObject.sample()
    clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    clsfr.coef_ = np.zeros(clsfr.coef_.shape)
    # Holdout accuracy before starting training
    holdout_accuracy_before_training = get_holdout_accuracy(holdout_set, clsfr)
    for idx in range(args.train_sample_count):
        (y, X) = LogLinearConditionalPDFObject.sample()
        # Progressive validation: predict before training on the
        # example, folding the 0/1 outcome into a running average.
        y_hat = clsfr.predict(X.reshape(1, -1))[0]
        prog_valid *= float(idx) / float(idx + 1)
        prog_valid += (1.0 if y_hat == y else 0.0) / (idx + 1)
        coeff_error = np.linalg.norm(
            clsfr.coef_ - coeff) / np.linalg.norm(coeff)
        true_log_prob = log_loss(
            y,
            LogLinearConditionalPDFObject.probability(X))
        estimated_log_prob = log_loss(
            y,
            loglinear_prob(
                X,
                clsfr.coef_.squeeze()))
        # Running average of the absolute log-loss regret w.r.t. the
        # true distribution.
        prog_regret_log_loss *= float(idx) / float(idx + 1)
        prog_regret_log_loss += numpy.abs(true_log_prob -
                                          estimated_log_prob) / float(idx + 1)
        prog_valid_plotter.add_data(idx, prog_valid)
        coeff_error_plotter.add_data(idx, coeff_error)
        prog_loss_plotter.add_data(idx, prog_regret_log_loss)
        clsfr.partial_fit(numpy.expand_dims(X, 0), [y], classes=[0, 1])
    # Now test the performance of the fitted logistic classifier
    holdout_accuracy_after_training = get_holdout_accuracy(holdout_set, clsfr)
    print(
        '\n prog_valid', prog_valid,
        '\n holdout_accuracy_bayes_optimal', holdout_accuracy_bayes_optimal,
        '\n holdout_accuracy_before_training', holdout_accuracy_before_training,
        '\n holdout_accuracy_after_training', holdout_accuracy_after_training,
    )
    # Find out the recovered parameters versus the true parameters
    # NOTE: clsfr.coef_ is the recovered parameter and coeff was the
    # true parameter
    erroneous_vec_fraction = np.linalg.norm(
        orthogonal_portion(clsfr.coef_, coeff)) / np.linalg.norm(clsfr.coef_)
    print(
        '\n erroneous_vec_fraction', erroneous_vec_fraction,
        '\n prog_regret_log_loss', prog_regret_log_loss
    )
    prog_valid_plotter.render()
    coeff_error_plotter.render()
    prog_loss_plotter.render()
    # Log Loss of the optimal predictor.
    log_loss_optimal = np.mean(
        [log_loss(y, LogLinearConditionalPDFObject.probability(X))
         for (y, X)
         in holdout_set])
    log_loss_estimated = np.mean(
        [log_loss(y, loglinear_prob(X, clsfr.coef_.squeeze()))
         for (y, X)
         in holdout_set])
    print('\n log_loss_optimal', log_loss_optimal,
          '\n log_loss_estimated', log_loss_estimated, )
    '''
    Look at how the log-loss is truly minimized yet the validation error
    is basically the same even when we use the hinge loss.
    log 0.885 0.237379757873 0.0708640472617
    log 0.894 0.128038074657 0.0550941428819
    log 0.874 0.133218984176 0.0914931968861
    hinge 0.895 0.473717543562 0.215030860105
    hinge 0.89 0.434286919844 0.216065130295
    hinge 0.87 0.462364313401 0.240906415183
    perceptron 0.81 1.00405480928 0.619301226865
    perceptron 0.767 1.00245422266 0.6231903911
    perceptron 0.819 0.999211001065 0.601749697036

    Note that I am only able to recover the
    python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 0.0 --train_sample_count 80000
    0.519575 0.684376979344 0.0391865540893
    0.5225 0.519575
    0.692045918504 0.692365926667
    python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 1.0 --train_sample_count 80000
    0.7309375 0.0683319657749 0.0253992530515
    0.7264 0.7309375
    0.585110828186 0.585081212155
    python understanding_logistic_regression.py --alpha 0.01 --coeff_strength 8.0 --train_sample_count 80000
    0.9997 0.151844644062 0.000821186056188
    0.9994 0.9997
    0.00516971678716 0.00518374450427
    Really what this tells me is that there are limits to recovering the
    true parameters of the true system, and that if we only try to build
    a good predictor we can often succeed more spectacularly than if we
    try to estimate the parameters of the system that generated the
    data.

    mbp 0.5174

    prog_valid 0.727
    holdout_accuracy_bayes_optimal 0.7546
    holdout_accuracy_before_training 0.5174
    holdout_accuracy_after_training 0.7432

    erroneous_vec_fraction 0.546445772363
    prog_regret_log_loss 0.216014496595

    log_loss_optimal 0.530312347206
    log_loss_estimated 0.548764473949
    '''