The trigger for this post was basically just this wish that wouldn’t it have been nice if the neural network parameter optimization problem was itself convex? Although large scale convex optimization is not a “solved” problem and a lot of research is going on in this area (Yuan, Chang, Hsieh, & Lin, 2010) but at least it is reasonable to hope that a “convex neural network optimization problem” could lead to improvements. Now I have come across the following two papers that deal with “Convex neural networks”.

1. Breaking the Curse of Dimensionality with Convex Neural Networks. (Bach, 2014)

2. Convex Neural Networks. (Bengio, Roux, Vincent, Delalleau, & Marcotte, 2005)

Both of these papers are on my reading list, but the techniques that they use are at least a few months beyond me if not more. Instead of discussing these papers, I want to present a way for possible “convexifying” the optimization objectives with respect to a neural network’s parameters.

Consider a neural network with $\text{relu}$ activation units, where $\text{relu}(x) \triangleq max(0, x)$. Let $y_i$ be the activation(intermediate output) after the $i$th layer in a feed-forward neural network.

Now notice that:

1. $\text{relu}$ is a convex function.
2. If $f$ is a convex function over scalars then $f(\kv{a}\uv{x})$ is also convex wrt to $\uv{x}$ as long as $\kv{a} \ge 0$. The notation $\kv{v} \ge 0$ means that every element of the vector $\kv{v}$ is greater than or equal to 0.
3. If two differentiable functions $f$ and $g$ are convex then their composition $f g$ (I am using Haskell/Ocaml syntax for function composition) is convex if $f’ (g x) > 0$ for all $x \in \text{domain}(g)$.

Convexity is not closed under composition of functions in general. Let $f(x) = x^2$ and $g(x) = x^2-1$. Then $f (g x) = (x+1)^2(x-1)^2$ which is not convex. The real reason why $h = f g$ is non-convex is that $f’ (g x) = 2(x^2 -1)$ is negative at the neighborhood of 0. The proof of point 3 comes from the characterization of a convex function in terms of its gradients. Let $h = f g$ then we need to prove that a) $h y - h x \ge h’ (y -x)$ where $h’ = (f’(g x)) \times (g’ x)$. We also know that b) $g y - g x \ge g’(x) (y -x)$ c) $f (g y) - f (g x) \ge (f’ (g x)) \times (g y - g x)$. Once we impose the constraint that $f’ (g x) > 0$ then b) and c) are sufficient to prove a).

Naively these 3 pieces of information suggest that if I constrain $x$, $y_i$, $W_i$ to be non-negative then the output $y_{\text{N}}$ would be a convex function of the parameters $W_i$, where $N$ is the number of transformations/layers in the network. $y_i$ is anyway guaranteed to be non-negative because of $\text{relu}$. Now assume that we are just solving binary classification, so that the last layer $y_N$ contains only two nodes and we are using the cross-entropy loss and we only have one sample in our corpus and the outcome was 1 for that example. then we would basically have to minimize

$\log(1 + \exp(y_{N,0} - y_{N,1})) = \log(1 + \exp([1, -1]^T [y_{N,0}; y_{N,1}]))$

I need to do more analysis, but I think this is where our nice trail of convexity disappears. Even though the function $\log(1 + \exp(.)$ by itself is convex, the composition is not because of $[1, -1]$ but honestly I am not sure about this at this point. There are two things in my mind as I approach this problem

1. Someone would have seen this way of convexifying neural networks, so I have a strong prior belief that I would be obstructed from actually creating a convex loss function.

2. I also believe that $\log(1 + \exp(\langle \kv{\theta}, \kv{x} \rangle))$ is convex in general, so I feel like pont 2 above may not be general enough.

In any case, I have written down a fairly coherent chain of steps of a construction that I can systematically check, so I should know the answer soon enough.

###Bibliography###

1. Yuan, G.-X., Chang, K.-W., Hsieh, C.-J., & Lin, C.-J. (2010). A comparison of optimization methods and software for large-scale l1-regularized linear classification. The Journal of Machine Learning Research, 11, 3183–3234.
2. Bach, F. (2014). Breaking the curse of dimensionality with convex neural networks. ArXiv Preprint ArXiv:1412.8690.
3. Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. In Advances in neural information processing systems (pp. 123–130).