The second-order-diff Git trick

Tags: git, diff

One of the things that Git makes safe and practical is bulk editing by way of “sledgehammer-and-review”:

  1. Apply some powerful (but potentially dangerous) sledgehammer to a bunch of files.
  2. Review the sledgehammer’s effects using git diff to make sure there was no collateral damage.
  3. If everything looks right, commit the effects; otherwise, reset them, go back to step 1, and try again using a more-refined sledgehammer.

That technique is handy, but it’s not the trick I want to share with you.

The trick I want to share with you is a refinement of sledgehammer-and-review, one that allows you to adjust the sledgehammer iteratively without having to re-review changes you’ve already reviewed. Instead, you just have to review the changes to the changes, hence the “second-order-diff trick.”

The idea is to replace git reset with git stash whenever you need to back up and try again with an adjusted sledgehammer. Using stash not only resets your working tree to the starting point but also stashes the result of applying the previous sledgehammer. Then you can apply your new sledgehammer and use git diff stash to review how its effect on the original working tree differs from the previous sledgehammer’s. In this way, you can more easily verify that adjusting your sledgehammer worked as expected. Once everything looks good, git commit as usual.

Example

Recently, I moved my blog from the Typo system to the Hakyll system. The hardest part was translating a decade’s worth of posts from Typo-flavored Textile to Hakyll-flavored Markdown. I automated what I could using Pandoc, but a lot of Textile markup evaded translation.

In particular, over 100 links spread across 30 posts remained in Textile form. In Textile, a link looks like this:

"Link Title":http://example.com/

My challenge was to convert links like that one, across all posts, into Markdown:

[Link Title](http://example.com/)

Rather than convert them by hand, I reached for a sledgehammer. In my head, the sledgehammer took shape:

  1. Use git grep to locate posts that probably contained Textile-flavored links. This I did by looking for a double quote followed by a colon, which I figured unlikely to occur in normal text.
  2. Pipe the matching posts into a Perl one-liner that would translate probable Textile links into Markdown links.
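
The first step can be sketched on its own. The example below builds a hypothetical scratch repository with two made-up posts and shows that git grep with -l and -F lists only the file containing the quote-colon sequence; the file names and contents are assumptions, not the author's actual posts.

```shell
#!/bin/sh
# Hypothetical scratch repo with two posts; only one contains the
# quote-colon sequence that marks a probable Textile link.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

mkdir posts
printf '%s\n' 'See "Try Ruby":http://tryruby.hobix.com/ today' > posts/old.md
printf '%s\n' 'Already [converted](http://example.com/)' > posts/new.md
git add posts
git commit -q -m 'Clean slate'

# -l prints only file names; -F treats '":' as a fixed string, not a regex.
candidates=$(git grep -lF '":' -- posts)
printf '%s\n' "$candidates"
```

If file names might contain spaces, pairing git grep -z with xargs -0 is a common way to keep the pipeline safe, where the local xargs supports -0.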

Because the sledgehammer makes changes that are “probably” right, I had the obligation to review its effects for collateral damage. Thus began the process of sledgehammer-and-review.

Sledgehammer-and-review

At first, I used plain sledgehammer-and-review to home in on a regular expression that matched Textile links semi-reliably. The first iteration went like this:

git status  # verify that we're starting w/ a clean slate

# Iteration 1:  take an educated guess at a good sledgehammer

# 1.  Try out the candidate sledgehammer
git grep -lF '":' -- posts |
  xargs perl -i -pe's{"(.+?)":([^.,:\s]+)}{[$1]($2)}g'

# 2.  Review results and judge
git diff  # => judgment: not right

# 3.  Roll back to clean slate in preparation for the next attempt
git checkout -- posts  # (also could have used git reset --hard)

The next iterations took only a few seconds. For each, I just hit the up-arrow key from the command line to recall and refine the previous iteration:

# Iteration 2:  refine the sledgehammer
git grep -lF '":' -- posts |
  xargs perl -i -pe's{"(.+?)":([^.,!\)\]\s]+)}{[$1]($2)}g'
git diff  # => judgment: still not right
git checkout -- posts

# Iteration 3:  refine the sledgehammer more
git grep -lF '":' -- posts |
  xargs perl -i -pe's{"(.+?)":([^,!\)\]\s]+)\b}{[$1]($2)}g'
git diff  # => judgment: still not right
git checkout -- posts

But on iteration four I hit on a regular expression that worked well:

# Iteration 4:  refine the sledgehammer just a bit more
git grep -lF '":' -- posts |
  xargs perl -i -pe's{"(.+?)":([^,!\)\]\s]+)}{[$1]($2)}g'
git diff  # => judgment: hey! looking pretty good...

This time, I spent a good ten minutes in git diff because most of the changes looked good and I ended up reviewing them all.

But there was one tiny problem. If a link occurred at the end of a sentence, the sentence-ending period got absorbed into the link. Here’s one of the dozen such errors I spotted during the review:

-Do check it out: "Try Ruby":http://tryruby.hobix.com/.
+Do check it out: [Try Ruby](http://tryruby.hobix.com/.)

So I needed to adjust my sledgehammer to not break sentence-ending links. And, of course, after adjustment I would need to verify that everything now worked as expected.

But I didn’t want to re-do the entire review. Why review all 100+ links, when the adjustment should affect only a dozen?

Time for the second-order-diff trick!

The second-order-diff trick

This time, instead of blowing away the previous sledgehammer’s results, I stashed them:

git stash save "link fix: almost there"  # newer Git spells this: git stash push -m "..."

Now I had my original clean slate back, but I also had the previous results stashed away for later comparison. Time to refine the sledgehammer and try again:

git grep -lF '":' -- posts |
  xargs perl -i -pe'
    s{"(.+?)":([^,!\)\]\s]+)}
     {my($t,$l)=($1,$2); $l=~/(.*?)([.:])?$/; "[$t]($1)$2"}eg
  '

This time for review, however, I didn’t compare the sledgehammer’s results to the original clean slate but to the previous sledgehammer’s results, as stashed away:

git diff 'stash@{0}'

And, sure enough, there were only a dozen changes to review, and they were the expected end-of-sentence fixes:

-Do check it out: [Try Ruby](http://tryruby.hobix.com/.)
+Do check it out: [Try Ruby](http://tryruby.hobix.com/).

Mission accomplished! All that was left was to commit the reviewed changes:

git add posts
git commit -m 'Fix links that evaded initial translation'

And that’s the second-order-diff Git trick.

Questions or comments? Post them to the thread on Hacker News.

Follow-up questions and answers

On Hacker News, I’m seeing some common questions, so I’ll try to answer them here.

Why not use git add -p to incrementally accept the good changes while you refine the sledgehammer?

The reason is that the sledgehammer is not incremental: it expects to be applied to the original clean slate. Every time. Using git stash will get me back to that clean slate. Using git add -p will not, for both the approved and unapproved changes will remain in the working tree, where the sledgehammer expects neither of them to be.

Instead of using the stash, can’t you just create a temporary branch?

Using the stash is much like creating a temporary branch. Each stash entry is an ordinary commit, and the stash as a whole is a stack of such commits that lives only in your local repository. But, unlike a hand-made branch, the stash has a more-convenient interface for work that follows a review-reset-retry cycle.
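
The comparison with a branch can be checked directly: a stash entry is an ordinary commit, reachable through the ref refs/stash rather than through anything under refs/heads. The sketch below assumes a scratch repository with a made-up file, post.txt, and a recent enough Git for git stash push.

```shell
#!/bin/sh
# Inspect a stash entry in a hypothetical scratch repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

printf 'original\n' > post.txt
git add post.txt
git commit -q -m 'Clean slate'

printf 'edited\n' > post.txt
git stash push -q -m 'attempt 1'

kind=$(git cat-file -t 'stash@{0}')     # object type of the newest entry
stash_ref=$(git rev-parse refs/stash)   # the ref that holds the stack top
stash_top=$(git rev-parse 'stash@{0}')
branches=$(git for-each-ref --format='%(refname)' refs/heads | wc -l)

printf 'stash@{0} is a %s\n' "$kind"
git stash list
```

Because the entry is a commit, anything that accepts a commit (git diff, git show, git log) accepts stash@{0} as well, which is what makes the second-order diff possible.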

Update 2013-02-21

I expanded my explanation of what was happening during each step of the sledgehammer-and-review process.