This is an online appendix to Eisner, Goldlust, and Smith (2005), Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language.

Section 4.2 and Figure 3 give a forward-chaining algorithm that computes the values of all items in a semiring-weighted logic program.

Then section 5.2 and Figure 4 give a gradient algorithm that
differentiates the previous computation (when the semiring is
(R,+,*)). It works backwards to rapidly compute the partial
derivatives of the `goal` item with respect to each of the
items. This gradient can be used to adjust the values of items that
correspond to free parameters, within an algorithm like gradient
descent or L-BFGS.

Below is the *derivation* of this gradient algorithm, which
did not fit in the paper (as noted in the Figure 4 caption). It is a
reformatted version of an August 12, 2004 email among the authors
(from Jason Eisner to Eric Goldlust).

Note that the Dyna compiler produces specialized code for both the forward-chaining algorithm and the gradient algorithm, for any logic program in this semiring.

Background reading: A special case of these forward-chaining and back-propagation algorithms (for a particular finite-state model) was explained in detail in Eisner (2001), sections 4.2 and 4.3 respectively. Our derivation below uses the same notation.

You should already be familiar with the general idea of automatic differentiation (website) in the reverse mode (tutorial, history). Most people first see this in the case of back-propagation in neural networks (slides, video, explanation, book chapter, formulas and C++ code, another tutorial with code). Indeed, the algorithm below follows the same pattern as back-propagation through time for recurrent neural networks (Werbos 1989, Williams and Zipser 1989).

We need to start by writing out the timestamped forward pass, then
derive a matching backward pass. I'll use ^{(t)} for a timestamp superscript.

First, here is a picture of a bit of the formula graph given a rule
`c += a_{1} * a_{2} * a_{3}` where `a_{2}` is the only driver. You can
see that agenda^{(t-1)}[a_{2}] is used to update the chart^{(t-1)}
values into chart^{(t)} values and to define a new value for
agenda^{(t)}[c]. Note that for a given time t, the chart values feed
forward into the agenda values.

                     ,---------------------- agenda^{(t)}[c]
                    /              *      |      *         \
                   /                      |                 \
  chart^{(t)}[a_{1}]    chart^{(t)}[a_{2}]      |                  \    chart^{(t)}[a_{3}]
       |                /   +   \         |                   \        |
       |               /         \        |                    \       |
 chart^{(t-1)}[a_{1}]  chart^{(t-1)}[a_{2}]  agenda^{(t-1)}[a_{2}]  chart^{(t-1)}[a_{3}]

Here is the forward pass pseudocode.

[Note that we explicitly push current values onto a "tape" (stack) before we update them. Later, during the backward pass, this tape will be read backwards (popped) to help us unwind the computation. From section 5.2 of our paper: "At line 3, a stack is needed to remember the sequence of (a, old, Δ) triples from the original computation. It is a more efficient version of the 'tape' usually used in automatic differentiation."]

initialize the agenda
t := 0
while agenda is not empty
    t := t+1
    // choose an update from agenda
    choose a such that agenda^{(t-1)}[a] != 0   // this a could permanently be called a^{(t-1)}
    push (a, chart^{(t-1)}[a], agenda^{(t-1)}[a]) onto tape
    // compute chart^{(t)}
    chart^{(t)} := chart^{(t-1)}
    chart^{(t)}[a] += agenda^{(t-1)}[a]         // overrides (adds to) previous line, for [a]
    // compute agenda^{(t)}
    agenda^{(t)} := agenda^{(t-1)}              // plus other terms to be added below
    agenda^{(t)}[a] := 0                        // overrides previous line, for [a]
    if (chart^{(t)}[a] != chart^{(t-1)}[a])     // only propagate actual changes
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        foreach inference rule c += a_{1}*a_{2}*...*a_{k}
            for i from 1 to k
                foreach way (if any) of instantiating the rule's variables such that
                        a_{i}=a and (all j) chart^{(t)}[a_{j}] != 0 or chart^{(t-1)}[a_{j}] != 0
                    agenda^{(t)}[c] += prod_{j=1}^{i-1} chart^{(t)}[a_{j}]
                                       * agenda^{(t-1)}[a]
                                       * prod_{j=i+1}^{k} chart^{(t-1)}[a_{j}]
                    // has the same effect as updating chart[a_{1}], chart[a_{2}], ...
                    // one at a time in that order; the only ones that
                    // actually change are the ones that equal a.
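As a concrete rendering of the pseudocode above, here is a minimal Python sketch of the forward pass for the real semiring. It assumes (as a simplification not made in the paper) that the program is given as a list of *ground* rule instances `(c, [a_1, ..., a_k])`, read as `c += a_1*...*a_k`, and that the axioms seed the agenda; the names `forward`, `rules`, and `axioms` are illustrative.

```python
from collections import defaultdict

def forward(rules, axioms, max_pops=10000):
    """Forward-chain to convergence on the (R, +, *) semiring.

    rules:  ground rule instances (c, [a_1, ..., a_k]), read as  c += a_1*...*a_k
    axioms: dict mapping axiom items to values; these seed the agenda
    Returns (chart, tape); tape records one (a, old chart value, delta) per pop.
    """
    chart = defaultdict(float)
    agenda = defaultdict(float, axioms)
    tape = []
    for _ in range(max_pops):
        # choose an update from the agenda (any nonzero item will do)
        a = next((x for x, v in agenda.items() if v != 0), None)
        if a is None:
            break                              # agenda empty: we have converged
        delta, old = agenda[a], chart[a]
        tape.append((a, old, delta))           # push the ^{(t-1)} values
        chart[a] += delta
        agenda[a] = 0.0
        if chart[a] != old:                    # only propagate actual changes
            for c, ants in rules:
                for i, ai in enumerate(ants):
                    if ai != a:
                        continue               # a drives at position i
                    # prod_{j<i} chart^{(t)}[a_j] * delta * prod_{j>i} chart^{(t-1)}[a_j]
                    prod = delta
                    for j, aj in enumerate(ants):
                        if j < i:
                            prod *= chart[aj]
                        elif j > i:
                            prod *= old if aj == a else chart[aj]
                    agenda[c] += prod
    return chart, tape
```

For instance, with the two-rule program used as an example later in this appendix (`a += a*a. a += r.`), `forward([('a', ['a','a']), ('a', ['r'])], {'r': 0.1})` iterates chart[a] toward the smaller root of x = r + x^2.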

Okay, now let's see what the backward pass must do. We'll just differentiate the defining computations above, basically in reverse order.

// t and tape are as they were left at the end of the forward pass
initialize the gradient computation
    (e.g., set gagenda^{(t)} := 0, gchart^{(t)} := 0, then gchart^{(t)}[goal] := 1)
while tape is not empty
    pop (a, chart^{(t-1)}[a], agenda^{(t-1)}[a])
        // In the timestamped version, popping the old
        // chart and agenda values isn't actually necessary,
        // since we remember those values from the forward pass.
    gagendaold := 0        // holds accumulands into gagenda^{(t-1)}[a]
    gchart^{(t-1)} := 0    // to be accumulated below
    if (chart^{(t)}[a] != chart^{(t-1)}[a])
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        foreach inference rule c += a_{1}*a_{2}*...*a_{k}
            for i from 1 to k
                foreach way (if any) of instantiating the rule's variables such that
                        a_{i}=a and (all j) chart^{(t)}[a_{j}] != 0 or chart^{(t-1)}[a_{j}] != 0
                    for j' := 1 to i-1
                        gchart^{(t)}[a_{j'}] += gagenda^{(t)}[c]
                                                * prod_{j=1 to i-1 except j'} chart^{(t)}[a_{j}]
                                                * agenda^{(t-1)}[a]
                                                * prod_{j=i+1}^{k} chart^{(t-1)}[a_{j}]
                    gagendaold += prod_{j=1}^{i-1} chart^{(t)}[a_{j}]
                                  * gagenda^{(t)}[c]
                                  * prod_{j=i+1}^{k} chart^{(t-1)}[a_{j}]
                    for j' := i+1 to k
                        gchart^{(t-1)}[a_{j'}] += prod_{j=1}^{i-1} chart^{(t)}[a_{j}]
                                                  * agenda^{(t-1)}[a]
                                                  * prod_{j=i+1 to k except j'} chart^{(t-1)}[a_{j}]
                                                  * gagenda^{(t)}[c]
    // now gchart^{(t)} is defined
    gagenda^{(t-1)} := gagenda^{(t)}
    gagenda^{(t-1)}[a] := gagendaold + gchart^{(t)}[a]
        // overrides above since gagenda^{(t-1)}[a] does not feed forward
        // into gagenda^{(t)}[a], but into other things
    gchart^{(t-1)} += gchart^{(t)}
        // note that in forward pass, chart^{(t-1)} influences agenda^{(t)}
        // both directly and via chart^{(t)}.
    t := t-1

As a check, the loop over inference rules simplifies as follows in the binary-rule case:

foreach inference rule c += a_{1}*a_{2}
    foreach way (if any) of instantiating the rule's variables such that
            a_{1}=a and chart^{(t)}[a_{2}] != 0 or chart^{(t-1)}[a_{2}] != 0
        gagenda^{(t-1)}[a] += gagenda^{(t)}[c] * chart^{(t-1)}[a_{2}]
        gchart^{(t-1)}[a_{2}] += agenda^{(t-1)}[a] * gagenda^{(t)}[c]
        // can get these by specializing the general version above,
        // or by directly differentiating
        //     agenda^{(t)}[c] += agenda^{(t-1)}[a] * chart^{(t-1)}[a_{2}]
    foreach way of instantiating the rule's variables such that
            a_{2}=a and chart^{(t)}[a_{1}] != 0 or chart^{(t-1)}[a_{1}] != 0
        gchart^{(t)}[a_{1}] += gagenda^{(t)}[c] * agenda^{(t-1)}[a]
        gagenda^{(t-1)}[a] += chart^{(t)}[a_{1}] * gagenda^{(t)}[c]
        // can get these by specializing the general version above,
        // or by directly differentiating
        //     agenda^{(t)}[c] += chart^{(t)}[a_{1}] * agenda^{(t-1)}[a]

Now our next job is to simplify these routines by eliminating the timestamps [and using the tape instead]. The forward pass simplifies as already shown in the paper [Figure 3]. Here's how the backward pass simplifies [Figure 4 in paper]:

initialize the gradient computation
    (e.g., set gagenda := 0, gchart := 0, then gchart[goal] := 1)
while tape is not empty
    // chart currently holds chart^{(t)}
    // gchart and gagenda currently hold gchart^{(t)} and gagenda^{(t)}
    pop (a, oldchart, oldagenda)   // "old" refers to the ^{(t-1)} values
    // During the main loop below, chart and gagenda still refer to ^{(t)},
    // but gchart now refers to ^{(t-1)}, as if we had done
    // gchart^{(t-1)} := 0 as in the timestamped version but then had done
    // the gchart^{(t-1)} += gchart^{(t)} step early.
    // Because the latter step was done early, before gchart^{(t)} was modified,
    // we'll need to ensure that any modifications to gchart^{(t)} (which is
    // no longer stored) are passed through so that they appropriately affect
    // gchart^{(t-1)} and gagenda^{(t-1)}[a].
    //
    // As in the paper, we use chart(a_{j},oldchart) to refer to
    // a_{j}==a ? oldchart : chart[a_{j}], that is, chart^{(t-1)}[a_{j}].
    gagendaold := 0          // holds accumulands into gagenda^{(t-1)}[a] (while gagenda refers to ^{(t)})
    gchartnew := gchart[a]   // holds gchart^{(t)}[a] (while gchart refers to ^{(t-1)})
    if (chart[a] != oldchart)   // in other words, if update actually accomplished something
        // the following updates may be done in any convenient order
        // (e.g., by pattern matching on a)
        foreach inference rule c += a_{1}*a_{2}*...*a_{k}
            for i from 1 to k
                foreach way (if any) of instantiating the rule's variables such that
                        a_{i}=a and (all j) chart[a_{j}] != 0 or a_{j}=a
                    for j' := 1 to i-1
                        // modifying gchart^{(t)}, must pass through
                        gchart[a_{j'}] += gagenda[c]
                                          * prod_{j=1 to i-1 except j'} chart[a_{j}]
                                          * oldagenda
                                          * prod_{j=i+1}^{k} chart(a_{j},oldchart)
                        if (a_{j'}=a) then gchartnew += (same thing)
                    gagendaold += prod_{j=1}^{i-1} chart[a_{j}]
                                  * gagenda[c]
                                  * prod_{j=i+1}^{k} chart(a_{j},oldchart)
                    for j' := i+1 to k
                        gchart[a_{j'}] += prod_{j=1}^{i-1} chart[a_{j}]
                                          * oldagenda
                                          * prod_{j=i+1 to k except j'} chart(a_{j},oldchart)
                                          * gagenda[c]
    // now we roll gagenda and chart back to refer to ^{(t-1)}
    gagenda[a] := gagendaold + gchartnew
    chart[a] := oldchart
    t := t-1
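As a concrete sanity check on the untimestamped pseudocode, here is a minimal Python sketch of both passes, again over ground rule instances in the real semiring. The `forward`/`backward` names and the tuple encoding of rules are illustrative, not from the paper; the backward pass plays the tape in reverse and rolls the chart back as it goes, mirroring the pseudocode above.

```python
from collections import defaultdict

def forward(rules, axioms, max_pops=10000):
    """Forward pass over the (R, +, *) semiring; returns (chart, tape)."""
    chart = defaultdict(float)
    agenda = defaultdict(float, axioms)
    tape = []
    for _ in range(max_pops):
        a = next((x for x, v in agenda.items() if v != 0), None)
        if a is None:
            break                              # agenda empty: converged
        delta, old = agenda[a], chart[a]
        tape.append((a, old, delta))           # remember the ^{(t-1)} values
        chart[a] += delta
        agenda[a] = 0.0
        if chart[a] != old:                    # only propagate actual changes
            for c, ants in rules:
                for i, ai in enumerate(ants):
                    if ai != a:
                        continue               # a drives at position i
                    prod = delta
                    for j, aj in enumerate(ants):
                        if j < i:
                            prod *= chart[aj]                      # chart^{(t)}
                        elif j > i:
                            prod *= old if aj == a else chart[aj]  # chart^{(t-1)}
                    agenda[c] += prod
    return chart, tape

def backward(rules, chart, tape, goal):
    """Untimestamped backward pass: play the tape in reverse, accumulating
    gchart and gagenda, and rolling chart back to its initial state."""
    gchart = defaultdict(float)
    gagenda = defaultdict(float)
    gchart[goal] = 1.0
    for a, oldchart, oldagenda in reversed(tape):
        oldval = lambda x: oldchart if x == a else chart[x]  # chart^{(t-1)}[x]
        gagendaold = 0.0          # accumulands into gagenda^{(t-1)}[a]
        gchartnew = gchart[a]     # gchart^{(t)}[a]
        if chart[a] != oldchart:  # the update actually accomplished something
            for c, ants in rules:
                k = len(ants)
                for i, ai in enumerate(ants):
                    if ai != a:
                        continue
                    def prod_except(skip):     # product over the passengers
                        p = 1.0
                        for j, aj in enumerate(ants):
                            if j == i or j == skip:
                                continue
                            p *= chart[aj] if j < i else oldval(aj)
                        return p
                    for jp in range(i):        # passengers left of the driver
                        inc = gagenda[c] * prod_except(jp) * oldagenda
                        gchart[ants[jp]] += inc
                        if ants[jp] == a:      # modified gchart^{(t)}: pass through
                            gchartnew += inc
                    gagendaold += gagenda[c] * prod_except(None)
                    for jp in range(i + 1, k): # passengers right of the driver
                        gchart[ants[jp]] += gagenda[c] * prod_except(jp) * oldagenda
        gagenda[a] = gagendaold + gchartnew
        chart[a] = oldchart                    # roll chart back to ^{(t-1)}
    return gchart, gagenda
```

On the example program below (`a += a*a. a += r.`), the forward pass converges to the fixed point x = r + x^2, and the final gagenda[r] should then match the implicit derivative dx/dr = 1/(1-2x).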

Again, let's consider what the main loop looks like in the binary-rule case:

foreach inference rule c += a_{1}*a_{2}
    foreach way (if any) of instantiating the rule's variables such that
            a_{1}=a and chart[a_{2}] != 0 or a_{2}=a
        gagendaold += gagenda[c] * chart(a_{2},oldchart)
        gchart[a_{2}] += oldagenda * gagenda[c]
        if (a_{2}=a) then gchartnew += oldagenda * gagenda[c]
    foreach way (if any) of instantiating the rule's variables such that
            a_{2}=a and chart[a_{1}] != 0 or a_{1}=a
        gchart[a_{1}] += gagenda[c] * oldagenda
        gagendaold += chart[a_{1}] * gagenda[c]

I suggest you check this with a simple example. Suppose we have the program

a += a*a.  % note: equivalent to a = a*a since only one summand
a += r.    % where r is an axiom

Then successive updates will look like this, I think:

chart[a]=0,                        agenda[a]=r
chart[a]=r,                        agenda[a]=r^{2}-0^{2}
chart[a]=r+r^{2},                  agenda[a]=(r+r^{2})^{2}-r^{2}
chart[a]=r+(r+r^{2})^{2},          agenda[a]=(r+(r+r^{2})^{2})^{2}-(r+r^{2})^{2}
chart[a]=r+(r+(r+r^{2})^{2})^{2},  agenda[a]=(new chart[a])^{2}-(old chart[a])^{2}

The convergence condition is chart[a]=r+chart[a]^{2}.

I suggest you run the forward algorithm by hand for 2 steps (early
stopping!), and see what formula you get for chart[a] in terms of r,
presumably r+r^{2}. Then run the backward algorithm by hand, and make
sure that gradient[r] ends up holding the correct derivative of that.
In both cases, exactly follow the pseudocode in the EMNLP paper, with
the modifications listed above. Let me know if it works!
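That hand check can also be mechanized. The sketch below hard-codes the two-rule program above (it is illustrative, not the paper's pseudocode), runs the forward pass for a fixed number of pops (early stopping), and compares a central finite-difference estimate of d chart[a]/dr against the analytic derivative of r+r^{2}, namely 1+2r. Counting the initial pop of r, three pops give chart[a] = r+r^{2}.

```python
def run_forward(r, num_pops):
    """Early-stopped forward pass, hand-specialized to the program
         a += a*a.   a += r.
    Returns chart[a] after at most num_pops pops (r is popped first)."""
    chart_r = chart_a = 0.0
    agenda_r, agenda_a = r, 0.0
    for _ in range(num_pops):
        if agenda_r != 0:                        # pop r
            chart_r += agenda_r
            agenda_a += agenda_r                 # rule a += r
            agenda_r = 0.0
        elif agenda_a != 0:                      # pop a
            old = chart_a
            chart_a += agenda_a
            agenda_a = chart_a ** 2 - old ** 2   # rule a += a*a (see trace above)
        else:
            break                                # converged
    return chart_a
```

A central difference (f(r+h)-f(r-h))/2h of the early-stopped chart[a] then recovers 1+2r, which is what the backward pass should compute for gradient[r] under the same early stopping.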

As [another] overly simple check, the back-relaxation code in my thesis (pp. 123-124) should fall out by trying the [backward algorithm] on the following program:

stop(I) += edgezero(I)*pathto(I).
pathto(J) += edge(I,J)*pathto(I).

when only pathto items are popped. Note that a [the popped item] is then always some pathto(I):

gagendaold := 0                  // in thesis, z := 0
gchartnew := gchart[pathto(I)]   // in thesis, doesn't appear; is effectively 0
                                 // since chart[pathto(...)] is never used:
                                 // pathto items are never passengers

// from stop(I) += edgezero(I)*pathto(I).
gchart[edgezero(I)] += gagenda[stop(I)] * oldagenda   // in thesis, gP_{i0} += gp_{i} * J_{i}
gagendaold += chart[edgezero(I)] * gagenda[stop(I)]   // in thesis, z += P_{i0} * gp_{i}

// from pathto(J) += edge(I,J)*pathto(I).
gchart[edge(I,J)] += gagenda[pathto(J)] * oldagenda   // in thesis, gP_{ij} += gJ_{j} * J_{i}
gagendaold += chart[edge(I,J)] * gagenda[pathto(J)]   // in thesis, z += P_{ij} * gJ_{j}

// after loop
gagenda[pathto(I)] := gagendaold + gchartnew   // in thesis, gJ_{i} := z
                                               // (note gchartnew is effectively 0)
chart[pathto(I)] := oldchart   // in thesis, doesn't appear since chart[pathto(...)]
                               // is never stored: pathto items are never passengers
                               // (hence, no need to store oldchart on tape, either)

This seems to match, as shown in the comments.

Some further optimizations seem possible though not urgent:

Notice that we are now maintaining *two* gradients per item, gchart and gagenda. agenda[a] is only used on the forward pass, and gagenda[a] is only used on the backward pass, so the implementation could probably store them in the same place. Indeed, I suspect that agenda and gagenda are perfect complements, in the sense that gagenda^{(t)}[a] must be stored on the backward pass (for a given t, a) if and only if agenda^{(t)}[a] was stored on the forward pass.

I think maintaining two gradients per item is probably necessary in the general case. Why? Because these gradients may be arbitrarily different. In the case of early stopping, chart[a] may have more influence on the objective function than agenda[a] does (it may participate as a passenger before a is popped).

We should try to figure out whether we can get by with one gradient, but I'm skeptical. However, in specific cases, it may be possible to get away with just one:

If a will never participate as a passenger, then we don't have to store either chart[a] or gchart[a], since the former will never be used and the latter is therefore zero. That's the case above.

(Probably the safest way to handle this is to treat pathto as a kind of transient item that is nonetheless processed on the agenda (rather than immediately). That way, we'll be able to cope if a *does* participate as a passenger, e.g., because the user asserts a new edge axiom in the middle of the computation.)

If we know that we're *not* doing early stopping, then we can probably combine the gradients into a single field, since whatever happens to chart[a] in future will ultimately also happen to agenda[a] -- incrementing either one will ultimately have the same effect. (Should work out the improved pseudocode for this case. Confirm that the idea really works, even though chart[a] is used for passenger a and agenda[a] for driver a. Note: We do still want to use the tape even without early stopping, I think, to avoid the need for backward chaining and managing a backward-pass priority queue. Maybe there are also non-tape approaches that save memory.)

By static analysis, we may be able to determine at compile time that a_{1}=a_{2} is impossible in some rule. That lets us skip two tests in the backward pass code (one is explicit, the other is hidden inside the construct "chart(a_{j},oldchart)"). We can also avoid using the temp variable gchartnew in this case, since it will not be modified.

The pseudocode suggests that the tape has to contain triples of the form (a, chart^{(t-1)}[a], agenda^{(t-1)}[a]). However, we don't really have to store both those ^{(t-1)} values. Given one, we can get the other upon popping by subtracting from chart^{(t)}[a]. (Since we're running a backward pass, we're presumably in the real or log-real semiring, so we can assume subtraction exists.)

Which is better to store? That requires a little thought. It depends on which way will give us better numerical precision, and how bad it is to have precision errors in agenda[a] vs. gradient[a]. (Storing agenda[a] means that the immediately needed gradient computation will always be accurate, using a as a driver, but we may *accumulate* error in chart[a] that will hurt us when and if a is used as a passenger as we continue on the backward pass. Storing chart[a] means that agenda[a] might be fuzzy, but we will always have chart[a] correct if we need to use a as a passenger.)

In the forward pass, we can skip pushing (a, ...) onto the tape if chart[a] didn't change. This saves space. It also saves time: the backward pass never has to see (and skip over) agenda changes that didn't make any difference. The only reason I didn't make these improvements in the pseudocode above was that it would have made the timestamped code uglier. (It's nice just to do ++t every time we pop something.)
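For instance, if the tape stores only pairs (a, agenda^{(t-1)}[a]), the missing chart value can be recovered on the backward pass by subtraction, roughly as follows (Python sketch; `pop_entry` is an illustrative name, and it assumes chart currently holds the ^{(t)} values):

```python
# The forward pass would push only (a, delta), where delta = agenda^{(t-1)}[a]:
#     tape.append((a, agenda[a]))

def pop_entry(tape, chart):
    """Recover the full (a, oldchart, oldagenda) triple from a 2-tuple tape
    entry, assuming we are in a semiring with subtraction (e.g., the reals)."""
    a, delta = tape.pop()
    oldchart = chart[a] - delta   # chart^{(t-1)}[a] = chart^{(t)}[a] - delta
    return a, oldchart, delta
```

Storing chart^{(t-1)}[a] instead, and recovering the agenda value by subtraction, is the symmetric choice; the precision trade-off is as discussed above.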

This page online:

`http://cs.jhu.edu/~jason/papers/eisner+goldlust+smith.emnlp05-appendix.html`