Some less-uninformed speculation about the Google-Bing squabble

Posted by Tom Moertel Fri, 04 Feb 2011 03:32:00 GMT

Yesterday, I offered some completely uninformed speculation about why Google may have accused Microsoft of using Google’s search results for Bing. My hunch was that Google’s real concern wasn’t copy-catting but something else: Bing’s increasingly competitive search results, powered in good part by “clickstream” data that Microsoft captures by monitoring users as they surf the web with its software.

I wrote:

[M]aybe what’s really going on is that Google is trying to focus the public’s attention on how Bing is getting its relevance evidence. If it turns out that Bing is watching over people’s shoulders as they surf, I can imagine a lot of people, including citizens’ groups, raising a fuss over it. Maybe some politicians take notice. You see where I’m going?

Today, Google’s Matt Cutts offered his thoughts on the Google-Bing debate. After the expected rehash of the copy-catting evidence, he introduces something new:

I don’t think an average consumer realizes that if they say “yes, show me suggested sites” that they’re granting Microsoft permission to send their queries and clicks on Google to Microsoft, which will then be used in Bing’s ranking. I think my Mom would be confused that saying “Yes” to that dialog will send what she searches for on Google and what she clicks on to Microsoft. I don’t think that IE8′s disclosure is clear and conspicuous enough that a reasonable consumer could make an informed choice and know that IE8 will send their Google queries/clicks to Microsoft.

I’d now say that my speculation isn’t completely uninformed. (I’m counting the above as +15 dB of evidence toward my hypothesis being true. Of course, I’m not telling you what my prior probability was. ;-)

Posted in
Tags , , , ,
no comments
no trackbacks
Reddit Delicious

The Google-Microsoft squabble over Bing results: some completely uninformed speculation

Posted by Tom Moertel Wed, 02 Feb 2011 15:08:00 GMT

The recent spat between Google and Microsoft over Bing’s search results is entertaining as it is, but what it could really use is some completely uninformed speculation. Like this.

The story so far is that Google accused Microsoft of using Google’s search results for Bing. As evidence of this claim, Google seeded its search index with nonsense terms that appeared nowhere else on the Internet. For each of these terms, Google provided one search result, pointing to some legitimate page on the Internet. Then some Googlers did Google searches on those terms using new Windows laptops with IE8 and the Bing Toolbar enabled. Later, when searching on Bing for those same nonsense terms, Bing provided the same search results, results that could have come only from Google’s search engine.

Busted!

Or could there be another explanation? Another possibility, one that Google doesn’t seem to have considered (more on this later), is that Bing uses IE8 and the Toolbar to observe user behavior and, from this observational evidence, infer page relevance in general, not just when users are surfing Google’s search results. That is, if Microsoft sees you reading a page and notes that you typed “potato chips” into a form a bit earlier, it could take the relationship as a tiny piece of evidence that the current page is relevant to potato chips.

Now, I don’t know if Bing actually infers page relevance from user behavior, but if you’re watching what users are doing as they surf, it seems obvious that you could mine those observations for relevance evidence. Google uses its celebrated PageRank algorithm to mine relevance evidence from the web’s link graph, but Bing needs some other way. Why not watch users and see what they think is relevant?

If Bing does mine user behavior for relevance evidence, that puts Google’s sting operation in a new light.

In Google’s experiment, the only evidence of what was relevant to its nonsense search terms was the Googlers’ own surfing behavior. When they clicked on their own fake search results and surfed the pages those results pointed to, they sent Microsoft evidence that those pages were relevant to the earlier nonsense terms. And Bing recorded this evidence.

But, because those terms were nonsense, nobody on the Internet was searching for them, let alone visiting pages that might be “relevant” to them – except for the Googlers. Therefore, the only thing Bing had learned about those terms was what the Googlers had told it through their surfing behavior. It was denied all other evidence.

So, when somebody searched Bing for those terms, Bing tried to figure out what results were relevant and found those tiny pieces of evidence gathered from the Googlers’ surfing. And, in absence of other evidence, those tiny pieces won out, and Bing coughed up the most relevant pages, which just happened to be the fake results from Google’s own search engine.

The point is that Google’s sting operation would have had the same outcome, regardless of whether Bing was trying to borrow Google’s search results specifically or trying to mine user behavior in general. To put it in Bayesian terms, the following likelihood ratio is pretty close to one:

P(Google’s sting “busts” Bing | Bing tries to reuse Google results)
divided by
P(Google’s sting “busts” Bing | Bing mines user behavior in general)

But that’s not the interesting thing.

The interesting thing is that there are folks at Google who know how to use Bayes’ rule. And my guess is that they pointed out to their fellow Googlers that the sting, as evidence of Bing deliberately trying to use Google results, wasn’t persuasive. But Google went public anyway.

Why?

Here comes the completely uninformed speculation I promised.

Maybe Google is worried that Bing’s relevance evidence, if it comes directly from users, might in time become more reliable than Google’s own PageRank-derived evidence, now that SEO and content farms have thrown so much noise into the calculations.

So – more speculation – maybe what’s really going on is that Google is trying to focus the public’s attention on how Bing is getting its relevance evidence. If it turns out that Bing is watching over people’s shoulders as they surf, I can imagine a lot of people, including citizens’ groups, raising a fuss over it. Maybe some politicians take notice. You see where I’m going?

If monitoring user behavior lets Bing do an end-run around PageRank, Google might want to shut that play down by appealing to the referees. But Google can’t be seen as trying to take a competitive advantage away from Microsoft. So, what if Google put Microsoft into a position where it had to defend Bing’s results, and the only way it could make a credible defense was by admitting, in the public spotlight, it was spying on its users?

Anyway, it’s just completely uninformed speculation. What else have you got?

Posted in
Tags , , , ,
6 comments
no trackbacks
Reddit Delicious

Solving the Google Code Jam "countPaths" problem in Perl

Posted by Tom Moertel Thu, 17 Aug 2006 06:21:00 GMT

As promised, here’s a Perl version of a dynamic-programming-based solver for the Google Code Jam “countPaths” problem. It is a straight translation of my improved Ruby implementation. As you might expect, the Perl version was pretty fast. It proved faster than the other scripting-language implementations I tried (in this rather unscientific benchmark, not to be taken seriously):

ImplementationRun time (s)
Haskell 0.9
Perl (code below) 1.7
Python 2.8
Ruby 4.2

All timings were taken while solving the maximum-size, all-the-same-letter problem on my 1.8-GHz Opteron box.

Here’s the Perl implementation:

#!/usr/bin/perl

# Tom Moertel <tom@moertel.com>
# 2006-08-16
#
# Perl-based solution to the Google Code Jam problem "countPaths".
# See http://www.cs.uic.edu/~hnagaraj/articles/code-jam/ for more.

use strict;
use warnings;

use List::Util 'sum';
use Math::BigInt;

sub count_paths {

  my ($grid, $word) = @_;

  my $rword  = reverse $word;
  my $rowmax = $#$grid;
  my $colmax = length($grid->[0]);
  my ($slab, $sum);

  for my $i (0 .. length($rword) - 1) {
    my $char = substr $rword, $i, 1;
    ($slab, my $previous_slab) = ([], $slab);
    for my $r (0 .. $rowmax) {
      my ($row, $line) = ($grid->[$r], $slab->[$r] ||= []);
      for my $c (0 .. $colmax) {
        $line->[$c] = $char ne substr($row,$c,1) ? 0 : $i == 0 ? 1 : do {
          $sum = 0;
          my $clo = $c > 0 ? $c - 1 : $c;
          my $chi = $c < $colmax ? $c + 1 : $c;
          for my $nr (($r>0 ? $r-1 : $r) .. ($r<$rowmax ? $r+1 : $r)) {
            for my $nc ($clo .. $chi) {
              $sum += $previous_slab->[$nr][$nc]
                if $nr != $r || $nc != $c;
            }
          }
          $sum;
        }
      }
    }
  }

  sum map @$_, @$slab;
}

print count_paths([("A"x50)x50], "A"x50), $/;
# 3.03835410591851e+47
Update: I simplified the code a whisper by removing an unnecessary variable $counts. Here’s a diff if you’re curious about what’s changed:
--- countpaths.pl.orig  2006-08-18 00:16:56.000000000 -0400
+++ /countpaths.pl      2006-08-18 00:19:30.000000000 -0400
@@ -19,11 +19,11 @@
   my $rword  = reverse $word;
   my $rowmax = $#$grid;
   my $colmax = length($grid->[0]);
-  my ($counts, $slab, $sum);
+  my ($slab, $sum);

   for my $i (0 .. length($rword) - 1) {
     my $char = substr $rword, $i, 1;
-    ($slab, my $previous_slab) = ($counts->[$i] ||= [], $slab);
+    ($slab, my $previous_slab) = ([], $slab);
     for my $r (0 .. $rowmax) {
       my ($row, $line) = ($grid->[$r], $slab->[$r] ||= []);
       for my $c (0 .. $colmax) {

Update 2: Augmented the introductory paragraph with a parenthetical comment that reminds readers that these single-fuzzy-data-point-style timings should not be taken seriously. Also removed the word “bested,” which might suggest that there is an optimization contest in play. Please, no wagering.

Update 3: Stripped another variable ($j), which was completely unused and leftover from previous implementation. (See why you shouldn’t code late at night?)

Posted in , ,
Tags , , , , , ,
2 comments
no trackbacks
Reddit Delicious

Solving the Google Code Jam "countPaths" problem in Ruby

Posted by Tom Moertel Wed, 16 Aug 2006 22:54:00 GMT

Here’s a Ruby version of a dynamic-programming-based solver for the Google Code Jam “countPaths” problem. It is essentially the same as my earlier Haskell-based solution (see Update 2), but much slower. Whereas the Haskell version solves the maximum-size, all-the-same-letter problem in about 0.9 second, the Ruby version requires about 71 seconds. Maybe somebody who understands Ruby’s internals better than I do can come up with some optimizations.

Here’s the code:

# Tom Moertel <tom@moertel.com>
# 2006-08-16
#
# Ruby-based solution to the Google Code Jam problem "countPaths" 
# See http://www.cs.uic.edu/~hnagaraj/articles/code-jam/ for more.

class WordPath

  include Enumerable

  def initialize(grid, word)
    @grid, @rword, @counts = grid, word.reverse, {}
  end

  def self.count_paths(grid, word)
    new(grid, word).solve
  end

  def solve
    final_index = @rword.length - 1
    inject(0) { |sum, rc| sum + count_from(final_index, *rc) }
  end

  private

  def count_from(i, r, c)
    @counts[[r, c, i]] ||= begin
      match = @rword[i] == @grid[r][c]
      case
        when i == 0 && match then 1
        when match then subsum_of_neighbors(r, c, i - 1)
        else 0
      end
    end
  end

  def subsum_of_neighbors(r, c, i)
    sum = 0
    rowlen = @grid[0].size
    for nr in [r - 1, r, r + 1]
      next if nr < 0 or nr >= @grid.size
      for nc in [c - 1, c, c + 1]
        next if nc < 0 || nc >= rowlen
        next unless r != nr || c != nc
        if count = count_from(i, nr, nc)
          sum += count
        end
      end
    end
    sum
  end

  def each
    @grid.each_index do |r|
      @grid[0].size.times { |c| yield([r, c]) }
    end
  end

end

# TESTS

if ENV["TEST"] || ENV["BIG_TEST"]

  require "test/unit" 

  class TestWordPath < Test::Unit::TestCase

    if ENV["BIG_TEST"]
      def test_big_problem
        assert_equal \
          303835410591851117616135618108340196903254429200,
          WordPath.count_paths(["A"*50] * 50, "A"*50)
      end
    end

    if ENV["TEST"]
      def test_count_paths
        w = WordPath
        assert_equal 1,
          w.count_paths(%w{ABC FED GHI}, "ABCDEFGHI")
        assert_equal 2,
          w.count_paths(%w{ABC FED GAI}, "ABCDEA")
        assert_equal 0,
          w.count_paths(%w{ABC DEF GHI}, "ABCD")
        assert_equal 108,
          w.count_paths(%w{AA AA}, "AAAA")
        assert_equal 56448,
          w.count_paths(%w{ABABA BABAB ABABA BABAB ABABA}, "ABABABBA")
        assert_equal 2745564336,
          w.count_paths(%w{AAAAA AAAAA AAAAA AAAAA AAAAA}, "AAAAAAAAAAA")
        assert_equal 0,
          w.count_paths(%w{AB CD}, "AA" )
        assert_equal 1,
          w.count_paths(%w{A}, "A")
      end
    end

  end

end

Set the BIG_TEST and/or TEST environment variables to run the test suites. For example:

$ TEST=1 ./countpaths.rb
Loaded suite countpaths
Started
.
Finished in 0.02062 seconds.

1 tests, 8 assertions, 0 failures, 0 errors

Unless somebody beats me to it, I’ll whip up a Perl version for comparison.

Update: I managed to speed up my code by a factor of 17. Now the execution time for the maximum-size, all-the-same-letter problem is down to 4.2 seconds, which is comparable with implementations in other languages. Ivan Peev’s Python implementation, for example, is only slightly faster at 2.8 seconds.

A performance killer in the previous version was using a single big hash for my cache. Now I use a 3D array:


counts[[i,r,c]]   # one big hash (slower)
counts[i][r][c]   # 3D-array (faster)

An additional advantage of the 3D-array is that I can peel off slabs as I descend the outer layers of nested loops. For instance, instead of writing:

for i in 0 .. 10
  for j in 0 .. 10
    sum += counts[i][j]
  end
end

I can lift the counts[i] slab out of the inner loop to eliminate j array-indexing operations:

for i in 0 .. 10
  slab = counts[i]
  for j in 0 .. 10
    sum += slab[j]
  end
end

Here’s the new code (sans the unit tests, which haven’t changed):

class WordPath

  A = Array

  def self.count_paths(grid, word)

    rword  = word.reverse
    rowmax = grid.size - 1
    colmax = grid.first.size - 1

    for i in 0 .. rword.size - 1
      letter = rword[i]
      previous_slab, slab = slab, A.new(rowmax+1) { A.new(colmax+1) }
      for r in 0 .. rowmax
        row, line = grid[r], slab[r]
        for c in 0 .. colmax
          line[c] = unless letter == row[c]
            0
          else
            if i == 0
              1
            else
              sum = 0
              clo = c > 0 ? c - 1 : c
              chi = c < colmax ? c + 1 : c
              for nr in (r > 0 ? r - 1 : r) .. (r < rowmax ? r + 1 : r)
                for nc in clo .. chi
                  sum += previous_slab[nr][nc] if nr != r || nc != c
                end
              end
              sum
            end
          end
        end
      end
    end

    sum = 0
    for r in 0 .. rowmax
      for c in 0 .. colmax
        sum += slab[r][c]
      end
    end

    sum

  end
end

Update 2: I tweaked the code snippet above to remove a variable that I just noticed wasn’t actually doing anything.

Posted in ,
Tags , , , ,
3 comments
no trackbacks
Reddit Delicious

Solving the Google Code Jam "countPaths" problem in Haskell

Posted by Tom Moertel Tue, 15 Aug 2006 21:01:00 GMT

Via the article on this year’s Google Code Jam on Slashdot earlier today, I found Hareesh Nagarajan’s write-up of a previous year’s Code-Jam problem. Since Google often comes up with interesting problems, I decided to give this one a go.

The problem: count the ways to find a word by walking on a grid

You are given a rectangular grid of letters and a word to find. You must compute the number of ways to find the word within the grid using the following rules:

  • start at any cell within the grid
  • from there, move to any of the cell’s eight neighboring cells
  • continue moving from that neighbor to its neighbors, and so on, until you have spelled out the word
  • you may visit cells more than once, but you cannot visit the same cell twice in a row (i.e., you must move for each turn)

For instance, consider the following grid, taken from the examples in the problem statement:

ABC
FED
GAI

If you were asked to find the word “AEA” on this grid, you could do it in four ways:

Way  --Move---
     1   2   3

1:  *BC ABC *BC
    FED F*D FED
    GAI GAI GAI

2:  *BC ABC ABC
    FED F*D FED
    GAI GAI G*I

3:  ABC ABC *BC
    FED F*D FED
    G*I GAI GAI

4:  ABC ABC ABC
    FED F*D FED
    G*I GAI G*I

If you were asked to find “ABCD”, you could do it in only one way:

Way  --Move-------- 
     1   2   3   4 

1:  *BC A*C AB* ABC
    FED FED FED FE*
    GAI GAI GAI GAI

If you were asked to find “AAB”, you could not: there are no “A” cells on the grid that have other “A” cells as neighbors.

The tricksy nature of the problem

As you might expect from Google, this puzzle was designed to see whether your solution can scale. A simple search will quickly bog down because each step in the search can expand into vastly more possibilities, as searching for “AAAA” on a seemingly harmless 2×2 grid of all “A” cells shows – there are 108 solutions.

The problem statement says that the grid may be up to 50×50 in size and the word to find may be up to 50 letters long. Imagine, then, that you are asked to find a word composed of 50 “A” letters within a 50×50 grid of “A” cells. All of the cells will be valid starting points, and each will have, on average, slightly less than 8 valid neighbors. Thus there will be about 50 × 50 × 8^49 = 4.5e47 ways to find the word1. Tracing them all would take forever.

The trick is figuring out a more efficient way to solve the problem. Since that’s the fun part of this problem, I won’t spoil it for you by telling you how I did it. (If you truly want spoilers, you can study my code.)

My solution

Here is what I came up with. I’ll present the code first and then discuss how to use it.

Note: The code below is out of date but printed here for continuity. See Update 5 for the most-recent revision.

{-

Tom Moertel <tom@moertel.com>
2006-08-15

Haskell-based solution to the Google Code Jam problem "countPaths";
see http://www.cs.uic.edu/~hnagaraj/articles/code-jam/ for more.

-}

module Main (main) where

import Control.Monad
import Data.Array
import qualified Data.Map as M

main = do
    word:gridspec <- liftM words getContents
    print $ (countPaths word (toGridArray gridspec) :: Integer)

countPaths word@(p:_) gridArray =
    sum . M.elems $ foldl step state0 (zip word (tail word))
  where
    state0 = M.fromList [(cell, 1) | (cell, q) <- assocs gridArray, p == q]
    neighbors = toNeighborMap gridArray
    step state fromto = M.fromListWith (+) $ do
        steps <- M.lookup fromto neighbors
        (start, count) <- M.assocs state
        cells <- M.lookup start steps
        cell <- cells
        return (cell, count)

toGridArray gridspec@(l1:_) =
    listArray ((1,1), (length gridspec, length l1)) (concat gridspec)

toNeighborMap gridArray =
    M.fromListWith (M.unionWith (flip (++))) $ do
        (cell, p) <- assocs gridArray
        cell' <- neighbors8 cell
        guard $ inRange (bounds gridArray) cell'
        return ((p, gridArray!cell'), M.singleton cell [cell'])

neighbors8 (r,c) =
    [(r+h, c+v) | h <- [-1..1], v <- [-1..1], h /= 0 || v /= 0]

-- Local Variables:  ***
-- compile-command: "ghc -O2 -o wordpath --make WordPath.hs" ***
-- End: ***

My solution generalizes upon the problem statement in a few ways:

  • the grid can be any size and the word any length
  • the grid and word can be composed of any comparable data type, not just A–Z letters (if you use the stdin interface, the code will use Unicode characters)
  • the code will compute exact counts instead of returning -1 for counts greater than 1e9

You can enter problems from the command line. Enter the word first and then the grid, each row separated by whitespace. For example:

$ ./wordpath
AAAAAAAAAAA

AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
^D

2745564336

Give it a try

This was a fun problem to solve. If you have a little spare time, give it a try. I would love to compare results and talk about strategies.

Update: Fixed typo: Finding “AAAA” – not “AA” – on a 2×2 grid of all “A” letters results in a count of 108. Thanks to Joshua Volz for pointing out my mistake.

Update 2: Here’s a dynamic-programming-based implementation of countPaths that is about six times faster than my original implementation when solving the maximum-size, all-the-same-letter problem:

countPaths word gridArray =
    sum [counts ! (length word, cell) | cell <- cells]
  where
    counts = listArray ((1, (1, 1)), (length word, gridSize)) $
             [countFrom i cell | i <- [1..length word], cell <- cells]

    countFrom i cell
        | i == 1 && match = 1
        | match           = sum [counts!((i-1),n) | n <- neighbors!cell]
        | otherwise       = 0
      where
        match = rword ! i == gridArray ! cell

    neighbors = listArray (bounds gridArray) $
        [filter (inRange (bounds gridArray)) (neighbors8 cell)
            | cell <- cells ]

    rword    = listArray (1, length word) (reverse word)
    cells    = indices gridArray
    gridSize = snd (bounds gridArray)

See the thread started by ‘psykotic’ on reddit.com for more.

Update 3: Ivan Peev has solved the problem in Python: Solving the Google Code Jam ‘countPaths’ problem in Python. Because his implementation uses the same algorithm that my implementation in Update 2 does, it makes a good vehicle for Haskell-versus-Python speed comparisons, an interesting topic in light of the warning Google provides about using Python in the Google Code Jam:

NOTE: All submissions have a maximum of 2 seconds of runtime per test case. This limit is used in harder problems to force submissions to be of a certain complexity. Because of the inherent speed differences between Python and the other offered languages is large, some problems may require extra optimization or not be solvable using the Python language.

Ivan reports that his Python implementation solves the maximum-size, all-the-same-letter problem in about 8 seconds on an old 1-GHz AMD Athlon. The Haskell version comes in somewhat faster at 0.9 second on a 1.8-GHz AMD Opteron. (On the same Opteron, Ivan’s code clocks in at 2.8 seconds, which is impressive.)

Update 4: I have added a Ruby implementation and a Perl implementation and timings, too. On the the maximum-size, all-the-same-letter problem, Ruby clocks in at 4.2 seconds; Perl in 1.7 seconds. See the Perl implementation for a summary table of the timings.

Update 5: As I promised reader Kartik in a comment, here is a further-simplified, yet 25-percent-faster, version of my implementation in Update 2. This version eliminates the cache in favor of a current-state array that is folded through the successive letters of the target word. The result of the fold operation is the final state array, whose elements are summed to yield the final result. Here’s the complete code:

{-

Tom Moertel <tom@moertel.com>
2006-08-15 (revised 2006-09-01)

Haskell-based solution to the Google Code Jam problem "countPaths" 
See http://www.cs.uic.edu/~hnagaraj/articles/code-jam/ for more.

This implementation is based on the dynamic-programming strategy
mentioned by reddit.com user "psykotic":
http://programming.reddit.com/info/dni1/comments/cdp59.

-}

module Main (main) where

import Control.Monad
import Data.Array

main = do
    word:gridspec <- liftM words getContents
    print $ (countPaths word (toGridArray gridspec) :: Integer)

countPaths word grid =
    sum . elems $ foldl move counts0 (tail (reverse word))
  where
    move counts c  = step c $ sum . map (counts!) . neighbors
    counts0        = step (last word) (const 1)
    step c f       = listArray (bounds grid) $ map (match c f) cells
    match c f cell = if c == grid!cell then f cell else 0
    neighbors cell = filter (inRange (bounds grid)) (neighbors8 cell)
    cells          = indices grid

toGridArray gridspec@(l1:_) =
    listArray ((1,1), (length gridspec, length l1)) (concat gridspec)

neighbors8 (r,c) =
    [(h, v) | h <- [r-1..r+1], v <- [c-1..c+1], h /= r || v /= c]

-- Local Variables:  ***
-- compile-command: "ghc -O2 -o wordpathdp --make WordPathDP.hs" ***
-- End: ***

1 I believe that the exact count is 303 835 410 591 851 117 616 135 618 108 340 196 903 254 429 200 (approx. 3.04e47). It takes about six seconds 0.75 second to compute on a 1.8-GHz AMD64 box running Linux.

Posted in , ,
Tags , , , , , ,
10 comments
no trackbacks
Reddit Delicious