Chaos theory and Google’s crawler

I’ve been moderately perplexed by the recent spike in traffic on basically unrelated keywords. Apparently this site is currently the #5 result for “fedora 15 beta download” despite my having never written about Fedora 15. In an attempt to funnel people to a useful page I created the previous post with links to the Fedora 15 ISOs. I feel bad if people come here looking for an answer that’s not to be found.

In looking into this issue I searched Google for the keywords and saw this:

evanhoffman.com is blockable

There’s a “Block all evanhoffman.com results” link under my site, but there’s none under any of the other sites. What the hell? Does my site somehow qualify as a spammer or content farm? Why do I get this dubious distinction? Ugh.

Thank you Cabinetparts.com!

A couple of weeks ago the hinge on one of the cabinets in our kitchen broke. I took the door off, unscrewed the hinge, and went to Home Depot to try and find a replacement. No luck. I went to a local hardware store, same deal. I was annoyed, and worried I’d never be able to find a replacement. I headed home.

I looked at the broken hinge and found “32.260-01” imprinted on it in tiny numerals. It also had “blum” imprinted on it, which I assumed was the brand name. It was a long shot, but I entered “blum 32.260-01” in Google. I was thrilled to see that while there were no organic results, there was a paid link for CabinetParts.com. I clicked through and found the exact part I needed. It was a little more than I’d hoped to spend, but since I had no idea where else I could go to find it, I was happy to pay. I got the part a few days later and it was exactly what I needed. Cabinet: repaired. So, hooray for them!

Relaying through Google Apps using Sendmail to bypass EC2 spam blockage

Update 3 May 2011: I’ve subsequently modified our EC2 systems to relay SMTP mail through Amazon’s SES which doesn’t have the 500 messages per day limit that Google Apps does.

A few months ago I moved a site into EC2. I didn’t want to move the existing IMAP server (ugh) so I moved the email to Google Apps. There are only about 10 mailboxes so we went with “Standard” edition (free). Once we completed the move to EC2 we discovered that emails from our webserver were bouncing due to our EC2 IP address being listed in a spam RBL. This sucked, so I looked into relaying the mail from the EC2 webserver through our Google Apps account. Fortunately this turned out to be pretty easy.

This wiki page on scalix.com has a procedure for setting up SMTP relaying in Ubuntu with TLS & auth. I’m not running Ubuntu so the paths were different but it was basically the same procedure:

  • Create the file /etc/mail/client-info with these contents: AuthInfo:smtp.gmail.com "U:bounces@example.com" "I:bounces@example.com" "P:superpassword", where “example.com” is your Google Apps domain, “bounces” is a valid account, and the password is the account’s password. Mail relayed with these credentials will show “bounces@example.com” in the From: field of the message.
  • In /etc/mail, run makemap hash client-info < client-info
  • Edit /etc/mail/sendmail.mc, adding or uncommenting these lines:
    define(`SMART_HOST', `smtp.gmail.com')dnl
    define(`confAUTH_MECHANISMS', `EXTERNAL GSSAPI DIGEST-MD5 CRAM-MD5 LOGIN PLAIN')dnl
    FEATURE(`authinfo', `hash /etc/mail/client-info')dnl
    
  • Recompile sendmail.cf: m4 sendmail.mc > sendmail.cf . I got this error: “/etc/mail/sendmail.mc:10: m4: Cannot open /usr/share/sendmail-cf/m4/cf.m4” when running the command, but I resolved it by doing yum install sendmail-cf
  • Restart sendmail.

Once this was done I sent myself a test message from the command line and received it; I checked the SMTP headers and sure enough it went through Google’s mail server. One nice side effect is that all the mail sent by the webserver appears in the “Sent” folder for the Google Apps username provided in the client-info file. Hopefully this will resolve the spam issues, since the mail is now coming from Google’s IP block.
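The header check described above can be scripted. This is a minimal sketch, assuming you've saved the raw source of the test message as text; the function names and the sample message are hypothetical, and the hostnames in the sample are invented for illustration rather than copied from a real delivery.

```python
from email import message_from_string


def received_hops(raw_message: str) -> list:
    """Return every Received: header value, in the order it appears (newest hop first)."""
    msg = message_from_string(raw_message)
    return msg.get_all("Received", [])


def went_through(raw_message: str, domain: str = "google.com") -> bool:
    """True if any relay hop mentions the given domain."""
    return any(domain in hop.lower() for hop in received_hops(raw_message))


# Hypothetical message source; real headers will look different.
SAMPLE = (
    "Received: by mail-example.google.com with SMTP id abc123;\n"
    " Tue, 3 May 2011 10:00:00 -0700 (PDT)\n"
    "Received: from web01.example.com (localhost [127.0.0.1])\n"
    " by web01.example.com with ESMTP id xyz789\n"
    "From: bounces@example.com\n"
    "To: someone@example.org\n"
    "Subject: relay test\n"
    "\n"
    "test body\n"
)

if __name__ == "__main__":
    print(went_through(SAMPLE))
```

A substring match on the Received: lines is crude, but it is enough for a quick sanity check that the smart host is actually being used.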

How does paid blogging work?

I’ve been hearing for years about paid bloggers. If people are getting paid to write their crap down in an ad-supported industry, it seemed like it might make sense to throw some ads up on this very site to see what happens. I’ve had Adsense running on this site for a few months now and the short answer is a whole lot of nothing. Here’s what the earnings look like since 1/1/2009 (my Adsense account is much older than this site; I put the banner ads up around Fall of 2009):

Basically, in a year I’ve “earned” under $20. That doesn’t even pay for domain registration & DNS for a year. And since Google doesn’t actually pay you until you have $100 in earnings, this is fake money anyway.

Now, I didn’t have any illusions about making money from this site; I just put the ads up as an experiment to see if this is a realistic way to earn a dependable income. From what I can tell, it can be, but only in certain cases, which basically come down to how much traffic you can generate.

  1. You’re already famous. If you’re already a “celebrity” in your field (whatever that field is) then people already probably want to hear what you say.
  2. Your subject matter has mass appeal. If you write about discoveries in quantum physics, you may have a decent following, but it’s still only going to be the people who care about quantum physics. If you write about Jersey Shore you have a much larger pool of possible readers, because everybody loves watching a train wreck.
  3. What you say actually matters. This is related to the first point. If Joe Shmoe (or Evan Hoffman) rants at the top of his lungs, it’s just some guy complaining. If Ben Bernanke makes an offhand comment about interest rates the stock market tanks.

I’m sure there are some other cases, but as far as I can tell a tech guy writing about things that annoy him doesn’t fit any of these criteria. I’m tempted to remove the ads altogether, but it’s too interesting seeing what ads Google puts up on some of these pages. The first few months, the ads were all for some rabbi’s circumcision service. Not sure what that was about.

A less insidious way to use Facebook?

I deactivated my Facebook account a couple of months ago. I just kind of got tired of seeing silly updates from friends and “friends” – people I’d friended but wasn’t really friends with. I was also frustrated by the privacy implications of using such a service: you tell it about yourself, you tell it about who you know and how you know them, you keep adding more information about you and your friends to its huge brain that it’s free to use or abuse however it wants.

I don’t know if I’m anti-“Social” or just antisocial but most of the info streaming into my Facebook feed was just not interesting to me. I could have hidden those people, but then it seemed like it would make more sense simply to remove the connection to them, if I didn’t want to see their updates. I actually went through my list of connections and started removing people – people I knew from high school and hadn’t spoken to since then until they added me on Facebook, and then continued not talking to, and other people who I knew but didn’t really interact with, online or offline. I didn’t really care about what they had to say and it occurred to me that they didn’t care what I had to say. Why did we even friend each other in the first place? Well, the friend suggester (suggestor?) makes it easy to friend people who are only tangentially related, since its whole purpose is to find new people for you to add.

I remember there was one person from school whom I hadn’t spoken to since probably 4th grade. This person attempted to friend me 5 times on FB (Soandso wants to be your friend…) and each time I clicked “Ignore,” but on the 6th time I finally relented. After 2 weeks of inane updates I unfriended the person. Within a month I was getting requests to refriend. Why? I don’t know you, you don’t know me, what’s to be gained by us pretending to be e-friends?

So I had some fundamental problems with Facebook. In addition to the friending of barely-friends, the feeding of so much information into the Facebook brain was starting to bother me. This is pretty similar to my worries about Google’s reach; basically every bit of information you post to Facebook to share with “friends” is also being added to Facebook’s marketing profile about you and your friends. The more you use the service, the more they know about you. And all those “Like” buttons all over the internet – a way for you to inform your Facebook friends that you like a blog post or news story – those are just a way for Facebook to know what sites you’re visiting. Whether you click the “like” button or not, your browser is loading the button off their servers, which means Facebook is reading your cookie and knows that YOU visited the page. This annoyed me so much that I edited my /etc/hosts file to point www.facebook.com at 127.0.0.1 (my own computer) where I’m running Apache, so the Like buttons just render as 404 errors now:

But I’m fine with that. I’ve also set my browser to reject all cookies from *.facebook.com. I realize this is just a drop in the ocean of data for Facebook, but screw them. Even with my account disabled they were collecting data about me, and that just pissed me off. But much like Google, Facebook’s tracking ability transcends browsers and computers, since in order to use their service you need to log in, and thus your movements around the internet can be tracked regardless of which computer or device you’re using.
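For anyone wanting to check their own setup, here is a rough sketch that reads hosts-file text and reports which hostnames are pointed at the local machine. The function name is made up for illustration, and EXAMPLE_HOSTS is an assumption about what such an entry would look like, not a copy of my actual file.

```python
def loopback_hosts(hosts_text: str) -> set:
    """Return the hostnames that a hosts file maps to the loopback address."""
    blocked = set()
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if address in ("127.0.0.1", "::1"):
            blocked.update(names)
    return blocked


# Hypothetical hosts-file contents for illustration.
EXAMPLE_HOSTS = """\
127.0.0.1   localhost
127.0.0.1   www.facebook.com   # Like buttons now hit the local web server
"""

if __name__ == "__main__":
    print(sorted(loopback_hosts(EXAMPLE_HOSTS)))
```

Note that a hosts entry only covers the exact hostname listed, so buttons served from any other Facebook hostname would still load.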

Facebook wasn’t a completely worthless service for me. I found the photo album feature very useful. It was a great way to upload pictures and share them instantly with whoever wanted to see them. In my case this was usually my family plus a few friends. I doubt anything will top Facebook for this because these people are already on Facebook, and for something to come along that’s better at this than Facebook, these people would need to move to the new platform, which as of today doesn’t seem likely.

Photo sharing is the one thing I miss. I haven’t stopped taking pictures but it’s a much clumsier process now to share them with people. I put them in an album in Picasa, upload it to PicasaWeb, set the permissions on the album, send out the invitations. The recipients then have to click on a private link to get to the pictures, and if they want to see them again in the future, they need to dig through their inbox to find the link and click on it again. Not everybody uses Gmail, and even for those who do, this is just a clunky process. With Facebook albums, if the album is shared with someone, all they have to do is click on me and then click on my list of albums to see the pictures. Easy. I’m considering returning to Facebook just to get the photo album back.

So I was thinking that if I could restrict myself to using only the Facebook iPhone app, I’d still be able to take the occasional picture with the phone, upload it for people to see, and not fall prey to the tracking cookie problems I described above, since (I’m assuming) the Facebook app and Safari don’t share data. At least, not yet.

That idea prompted me to write this post in the first place, but as I’ve been writing it, it occurred to me that it’s not really a workable plan. If I’m using it I’ll eventually feel the need to log in via browser, meaning I’ll have to tear down all the walls I’ve erected – the hosts file entry, the cookie blocking – and I’ll be right back where I was, feeding them all my info and letting them track me everywhere I go. So I guess it’s going to come down to a question of whether or not the costs outweigh the benefits, as it always does.

Unless I can just write a browser plugin to strip the “Like” button from non-Facebook websites. Maybe AdBlock can do this. Hmm… The dog woke me up early today and everyone else is asleep still, and this all sounded a lot better in my head before I started writing it down.

More thoughts on Google's tracking abilities

It all comes down to the cookie.

The Wall Street Journal recently began a series of articles called What They Know, detailing the different pieces of data that online marketing companies have about people as they traverse the web. None of this is really new, especially not to me, since I work in that industry. But I was surprised at some of the data that was present in the cookies right in plaintext:

Now, I don’t know if the above image of a cookie was presented as it was because the reporters didn’t realize that all that was needed to “decode” that cookie was a couple of runs through PHP’s urldecode() and those %25255Es would be converted from their hexcodes to plain old ASCII – %25255E0 -> %255E0 -> %5E0 -> ^0 (caret). Maybe they didn’t know, or maybe they knew but they left it all computery so it looked “scarier” to readers… that green text on black background is usually reserved for movies like The Matrix. Anyway, like I said, what was surprising to me wasn’t that there was that much data being collected, but rather that the data was right there in the body of the cookie, readable by anyone. Even a simple base64_encode would have hidden the contents of the cookie from the casual snooper.
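The decoding chain above is easy to reproduce. I mentioned PHP’s urldecode(); the sketch below does the same thing with Python’s urllib.parse.unquote, applied repeatedly until the string stops changing. The helper name fully_unquote is made up for illustration.

```python
import base64
from urllib.parse import unquote


def fully_unquote(value: str, max_rounds: int = 10) -> list:
    """Return every intermediate form until percent-decoding stops changing the string."""
    steps = [value]
    for _ in range(max_rounds):
        decoded = unquote(steps[-1])
        if decoded == steps[-1]:
            break
        steps.append(decoded)
    return steps


if __name__ == "__main__":
    # Each round strips one layer of percent-encoding.
    print(" -> ".join(fully_unquote("%25255E0")))

    # base64 is just as reversible; it only keeps the value from being
    # readable at a glance.
    hidden = base64.b64encode(b"^0")
    print(hidden, base64.b64decode(hidden))
```

Neither encoding is any kind of protection. Triple URL-encoding and base64 both come apart with one library call, which is why I only said base64 would stop the casual snooper.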

For a while I’ve been thinking about Google’s vast troves of data that go far, far beyond what the average marketer knows about the average web user. Let’s assume you’re… me. You use Gmail, Google, and YouTube on a pretty frequent basis. Google has single sign-on — as it should — so to use any of these services you can (and in many cases, have to) be logged in with your Google Account. This is logical and convenient for the user, but it unlocks huge amounts of information about you to Google. By having you sign in to any of their services, Google’s ability to track you online transcends cookies.

Cookies are small bits of data set by the server on your browser to allow information to persist between sessions. Since they’re stored in the browser, it’s implicitly impossible for cookies set in one browser to be used in another browser. This means that if you start Firefox and click around the internet for a while, you’ll accumulate some cookies. If you then exit Firefox and start Safari, and click around to those same sites, you’ll get completely different cookies than those you got in Firefox – from a “tracking” perspective, the person using Firefox and the person using Safari are different people (even though they both happen to be you) [1]. Also, because cookies are tied to browsers, this implies that cookies set on one computer are bound to that browser on that computer – i.e., cookies in Firefox on computer A have no bearing on what happens in Firefox (or any other browser) on computer B.

Single sign-on knocks down these implicit privacy walls. Assume, again, that you’re me, and you have a Linux laptop at work. At home you have a Linux desktop, a Mac mini hooked up to the TV in the living room, and a Windows laptop. You also have an iPhone. Single sign-on enables Google to track what you’re doing across all of these devices. It’s really quite simple: on each machine you use, if you want to read your email (Gmail) you log in with your Google Account. At that point, Google knows that it’s you using the browser. The value inside the cookie they set in your particular browser may differ, but they know that you’re you. They know what you’re searching for in Google; where you go (by IP address; or, if you allow it, by GPS on most modern smart phones – Google’s Latitude service lets you relay your GPS coordinates to your friends), what kind of email you receive, who you correspond with. And let’s not forget that Google has plastered the internet with ads – over 90% of their revenue comes from advertising, and they bought DoubleClick a few years ago, so any time you go to a site with Google ads on it (which is pretty much all of them), they know it. They own YouTube, so they know every video you’ve watched on YouTube, which ones you’ve “Liked” and which ones you’ve “Favorited.” And, as I mentioned in my previous crazy-guy post, Google is amassing a huge facial-recognition database, so they’ll know everything about you – interests, income, travel habits, friends, what you look like, likes & dislikes. They can probably give a pretty good guess as to where your home is and where your office is just by seeing that between 9:00 AM and 6:00 PM you commonly access the internet from IP 1.2.3.4 and the rest of the time you usually come from IP 2.3.4.5, and simple IP-geo databases can tell them where those IPs are (admittedly, with widely varying accuracy).
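The home/office guess at the end of that paragraph takes very little code. Here is a toy sketch, assuming the only input is a list of (timestamp, IP) pairs for one logged-in account. The function name and the sample data are hypothetical, and this says nothing about how Google actually does it.

```python
from collections import Counter
from datetime import datetime


def guess_office_and_home(hits, work_start=9, work_end=18):
    """Given (datetime, ip) pairs, return (likely_office_ip, likely_home_ip).

    An IP counts toward "office" if the hit falls between work_start and
    work_end (9:00 AM to 6:00 PM by default), and toward "home" otherwise.
    """
    office, home = Counter(), Counter()
    for when, ip in hits:
        bucket = office if work_start <= when.hour < work_end else home
        bucket[ip] += 1

    def most_common(counter):
        return counter.most_common(1)[0][0] if counter else None

    return most_common(office), most_common(home)


# Hypothetical access log for one account.
SAMPLE_HITS = [
    (datetime(2010, 8, 4, 10, 0), "1.2.3.4"),
    (datetime(2010, 8, 4, 14, 30), "1.2.3.4"),
    (datetime(2010, 8, 4, 21, 15), "2.3.4.5"),
    (datetime(2010, 8, 5, 7, 45), "2.3.4.5"),
    (datetime(2010, 8, 5, 11, 0), "9.9.9.9"),
]

if __name__ == "__main__":
    print(guess_office_and_home(SAMPLE_HITS))
```

Feed the two winning IPs into an IP-geo lookup and you have a rough home and office location, with the accuracy caveat mentioned above.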

The trove of information they have on the average person is actually frightening. The only thing keeping them from completely exploiting this data (assuming they aren’t, for argument’s sake) is their “Don’t Be Evil” philosophy and the shitstorm of bad press (and, one would assume, legal action) that would ensue if they were to do so. I’m not really convinced they aren’t already using all of this data, probably to make ultra-targeted advertising decisions, which seems relatively benign on its face. But the real risk comes when this all falls into someone else’s hands. Google could get hax0red – it’s already happened. Google could get subpoenaed – I’m sure it’s happened hundreds of times already. A new batch of idiots in the Senate could just redefine terrorism and require all Google’s data be handed over daily.

This isn’t strictly a problem with Google, but there aren’t many companies I can think of that have massive ad platforms that also provide services you’re willing to log in to, and the logging in is what allows them to track you across browsers, across computers, across devices, and ultimately in real life.

Oh well. Whatever. I’m a big hypocrite because I can’t imagine not using Gmail or any of Google’s services that I use daily. Sucks to be me, I guess. Even if you “trust” Google, you may not trust what Google becomes 10 years from now, but by then they already know all about you.

[1] This isn’t completely accurate, because even without cookies there are pieces of data that will be the same regardless of your browser, for example your IP address, which in general is a pretty good proxy for uniqueness, but I’m just thinking about cookies for now.

The sinister side of Google's Picasa face tagging

So, let me start by saying that I love Picasa, Google’s photo organization tool. It automatically finds new photos as you add them to your hard drive. It lets you crop pictures, remove red-eye, adjust colors and make a few other basic edits that cover probably 95% of what most people need to do when editing photos. It lets you select a few photos from your library and email them to anyone with just a couple of clicks. It also integrates with Google Earth and Google Maps to show you on a map where a particular photo was taken (for those unaware, GPS-enabled cameras, including many mobile phone cameras [e.g., iPhone] embed your GPS coordinates within the EXIF metadata of the photo, so any person, program or website with access to the image will know the location at which it was taken).
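To make the EXIF point concrete: the GPS tags store latitude and longitude as degrees, minutes and seconds (each a rational number) plus a hemisphere reference letter. The sketch below converts that form into the decimal coordinates a map expects. Pulling the tags out of an image file needs an EXIF-reading library, which isn’t shown; the function name and sample values are made up for illustration.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF-style GPS coordinates to signed decimal degrees.

    dms is three (numerator, denominator) pairs for degrees, minutes and
    seconds; ref is the hemisphere letter: "N", "S", "E" or "W".
    """
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    if ref in ("S", "W"):
        value = -value
    return float(value)


if __name__ == "__main__":
    # Hypothetical tag values, not taken from a real photo.
    lat = dms_to_decimal(((40, 1), (42, 1), (46, 1)), "N")
    lon = dms_to_decimal(((74, 1), (0, 1), (2160, 100)), "W")
    print(lat, lon)
```

The resulting pair of numbers can be pasted straight into a map search, which is all it takes for anyone holding the file to see where the picture was taken.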

It also has a nifty feature called face tagging. How this works, basically, is Picasa analyzes all of the photos in your library and looks for faces. There’s some algorithm in the program that can recognize that two eyes, a nose, a mouth and maybe some hair is a face. So if you use the face-tagging feature, Picasa shows you a page of faces extracted from your photo library. Initially these photos have no names, but Picasa does some basic grouping of them. For example, it doesn’t know who your Uncle Bob is, but it does know that these 14 photos are all of the same person. The grouping feature isn’t perfect, but it is very helpful when you decide to apply a name to the group of photos – tagging 14 photos instead of one is a great time-saver.

This feature only really becomes useful if you start tagging faces with real names — i.e. if you tag the photos of Uncle Bob by telling Picasa “these are photos of Uncle Bob.” If you facetag enough photos, Google will start “guessing” the name for a particular face, and tagging it automatically. This feature is also not perfect, but I imagine they’re working on improving it all the time.

So, this all happens on your computer, within Picasa. I’m not so much of a tinfoil hat type as to suggest Google’s doing anything in particular with the data on your computer itself. The “problem” as I see it is that when you tag a photo of Uncle Bob, Picasa pulls Uncle Bob’s contact info out of your Gmail contacts. So essentially, you’re tying a face to an email address. As I said, I don’t think Google’s surreptitiously going to use the info that resides on your computer.

But in addition to Picasa, the photo organization tool you run on your computer, Google offers an online photo album service called Picasa Web Albums. This is similar to other services, Flickr being the largest, that offer a simple way to upload photos and share them with others. All users get 1 GB of free storage, and you can buy more pretty cheaply (as of today you can get 20 GB for $5/year). As you might expect from the names, Picasa and Picasa Web Albums integrate very well. If you create an album within Picasa, all you have to do to upload it to Picasa Web Albums is click the “Sync This Album” button. It will then upload all the photos in the album to Picasaweb.

Here’s where the potential creepy part starts. Let’s say you have a photo in Picasa, that you took on August 4th, 2010, at 10:00 AM, and you’ve tagged 2 faces in it: Aunt Alice (alice@gmail.com) and Uncle Bob (bob@gmail.com). Let’s further say that you took this photo with your iPhone, so the GPS coordinates are embedded in the photo metadata. You upload the photo to Picasa Web Albums. Well, now you’ve just told Google the following:

  • What alice@gmail.com looks like.
  • What bob@gmail.com looks like.
  • Where alice@gmail.com and bob@gmail.com physically were (via GPS coordinates) on 8/4/2010 at 10:00 AM

There’s lots of other information you’ve probably also told them, but these are the data that are creeping me out lately. If your album has 20 or 30 photos of Alice and Bob that you’ve tagged with their contact info then Google’s got a pretty good idea what they look like – if the Picasa desktop app is able to guess who people in your photos are based on some algorithm inside it, imagine what Google’s billion-dollar datacenters can do?

In all likelihood, you aren’t the only one uploading photos of Alice and Bob. Other people at other events tag photos of Alice and Bob and upload them to Google, further “teaching” this massive computer brain what Alice and Bob look like (since email addresses are basically internet-wide unique IDs, two photos tagged with the same email address can generally be assumed to be the same person). Alice and Bob may never use Picasa, may not even own a camera themselves, and may not even use Google at all. But at this point Google knows what they look like and where they’ve gone – completely apart from their computer-based activities.
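To show how little work the aggregation takes once email addresses act as the join key, here is a toy sketch. The Photo record, its field names and the sample data are all hypothetical, and none of this describes Google’s actual systems. It only shows that photos from unrelated uploaders collapse into one timeline per tagged address.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    uploader: str          # account that uploaded the photo
    taken_at: str          # timestamp from the photo metadata
    lat: float             # GPS latitude from EXIF
    lon: float             # GPS longitude from EXIF
    tagged_emails: tuple   # addresses attached to the tagged faces


def sightings_by_person(photos):
    """Group (time, lat, lon, uploader) sightings under each tagged email address."""
    sightings = defaultdict(list)
    for photo in photos:
        for email in photo.tagged_emails:
            sightings[email.lower()].append(
                (photo.taken_at, photo.lat, photo.lon, photo.uploader)
            )
    return dict(sightings)


# Hypothetical uploads from two different people.
SAMPLE_PHOTOS = [
    Photo("you@gmail.com", "2010-08-04T10:00", 40.71, -74.00,
          ("alice@gmail.com", "bob@gmail.com")),
    Photo("cousin@gmail.com", "2010-08-07T15:30", 41.88, -87.63,
          ("Bob@gmail.com",)),
]

if __name__ == "__main__":
    for email, seen in sightings_by_person(SAMPLE_PHOTOS).items():
        print(email, seen)
```

Bob never uploaded anything in this example, yet he ends up with two dated, geotagged entries supplied by two different people.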

I think facial recognition is going to become huge for marketers over the next decade or so. Picasa offers users a useful feature that seems like it has this sinister other side to it – basically building an enormous crowdsourced facial recognition database, so they’ll be able to identify millions of people right out of the gate. If New York City ever gives Google access to its street cams, Google will be able to track the activities of millions more people without their knowledge or consent. Combine that with the existing knowledge Google has – if your iPhone checks your Gmail account, they know your general location at any given time anyway, just based on IP address – and they can create a pretty accurate (in advertising terms) picture of you. And with facial recognition, it will actually BE a picture of you.

Much is made of Google’s “Don’t Be Evil” motto (and I couldn’t write this without throwing those 3 words in), and I tend to be somewhat of a Google fanboy myself. However, much like government, what you have to worry about isn’t always what the current regime is doing with its power, but what the regime 10 or 20 or 100 years from now will do with it. I’m sure Google has rules about how these data are used, but rules change; rules are broken. If there’s one rule that seems inviolate throughout human history it’s that power corrupts. Knowledge is power. Or something.

Well, whatever. I still love Picasa, it just gives me this creepy feeling sometimes. This stuff is all completely voluntary, nobody is being forced to use any of these features, but like I said, Uncle Bob and Aunt Alice were tagged in a photo by someone else – you don’t need to do anything to have your face added to the Great Google Face Database In The Sky. This is something I’ve been thinking about for a while, but I was prompted to write it down based on Eric Schmidt’s recent comment, “Show us 14 photos of yourself and we can identify who you are.”

The sinister side of Google’s Picasa face tagging

So, let me start by saying that I love Picasa, Google’s photo organization tool. It automatically finds new photos as you add them to your hard drive. It lets you crop pictures, remove red-eye, adjust colors and make a few other basic edits that cover probably 95% of what most people need to do when editing photos. It lets you select a few photos from your library and email them to anyone with just a couple of clicks. It also integrates with Google Earth and Google Maps to show you on a map where a particular photo was taken (for those unaware, GPS-enabled cameras, including many mobile phone cameras [e.g., iPhone] embed your GPS coordinates within the EXIF metadata of the photo, so any person, program or website with access to the image will know the location at which it was taken).

It also has a nifty feature called face tagging. How this works, basically, is Picasa analyzes all of the photos in your library and looks for faces. There’s some algorithm in the program that can recognize that two eyes, a nose, a mouth and maybe some hair is a face. So if you use the face-tagging feature, Picasa shows you a page of faces extracted from your photo library. Initially these photos have no names, but Picasa does some basic grouping of them. For example, it doesn’t know who your Uncle Bob is, but it does know that these 14 photos are all of the same person. The grouping feature isn’t perfect, but it is very helpful when you decide to apply a name to the group of photos – tagging 14 photos instead of one is a great time-saver.
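I have no idea what algorithm Picasa actually uses for this grouping, but the general idea can be sketched with a toy example. Assume each detected face has already been boiled down to a short list of numbers (a made-up two-number "fingerprint" here), and that faces of the same person land close together. A simple greedy pass then puts each face into the first group whose first member is within some distance threshold. The names `group_faces`, `threshold` and the sample data are all hypothetical.

```python
import math


def group_faces(faces, threshold=0.5):
    """Greedily group (photo_name, vector) pairs: a face joins the first
    group whose first member is within `threshold`, else starts a new group."""
    groups = []
    for name, vec in faces:
        for group in groups:
            if math.dist(vec, group[0][1]) <= threshold:
                group.append((name, vec))
                break
        else:
            # No existing group was close enough.
            groups.append([(name, vec)])
    return groups


# Made-up fingerprints: two photos each of two different people.
faces = [
    ("img_001.jpg", (0.10, 0.20)),
    ("img_002.jpg", (0.90, 0.80)),
    ("img_003.jpg", (0.12, 0.22)),
    ("img_004.jpg", (0.88, 0.79)),
]
groups = group_faces(faces)
for group in groups:
    print([name for name, _ in group])
```

Note that nothing here knows who anyone is. The groups are anonymous until a human attaches a name, which is exactly the step described next.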

This feature only really becomes useful if you start tagging faces with real names – i.e., if you tag the photos of Uncle Bob by telling Picasa “these are photos of Uncle Bob.” If you face-tag enough photos, Picasa will start “guessing” the name for a particular face and tagging it automatically. This feature is also not perfect, but I imagine they’re working on improving it all the time.

So, this all happens on your computer, within Picasa. I’m not so much of a tinfoil hat type as to suggest Google’s doing anything in particular with the data on your computer itself. The “problem” as I see it is that when you tag a photo of Uncle Bob, Picasa pulls Uncle Bob’s contact info out of your Gmail contacts. So essentially, you’re tying a face to an email address. As I said, I don’t think Google’s surreptitiously going to use the info that resides on your computer.

But in addition to Picasa, the photo organization tool you run on your computer, Google offers an online photo album service called Picasa Web Albums. This is similar to other services, Flickr being the largest, that offer a simple way to upload photos and share them with others. All users get 1 GB of free storage, and you can buy more pretty cheaply (as of today you can get 20 GB for $5/year). As you might expect from the names, Picasa and Picasa Web Albums integrate very well. If you create an album within Picasa, all you have to do to upload it to Picasa Web Albums is click the “Sync This Album” button. It will then upload all the photos in the album to Picasaweb.

Here’s where the potentially creepy part starts. Let’s say you have a photo in Picasa that you took on August 4th, 2010, at 10:00 AM, and you’ve tagged 2 faces in it: Aunt Alice (alice@gmail.com) and Uncle Bob (bob@gmail.com). Let’s further say that you took this photo with your iPhone, so the GPS coordinates are embedded in the photo metadata. You upload the photo to Picasa Web Albums. Well, now you’ve just told Google the following:

  • What alice@gmail.com looks like.
  • What bob@gmail.com looks like.
  • Where alice@gmail.com and bob@gmail.com physically were (via GPS coordinates) on 8/4/2010 at 10:00 AM
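Put another way, one tagged, geotagged photo boils down to a handful of "who, where, when" records. The sketch below is purely illustrative: the `Sighting` class, the function name and the coordinates are my own inventions, not anything Google has published about how it stores this.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sighting:
    """One person placed at one location at one time (hypothetical record)."""
    email: str
    latitude: float
    longitude: float
    taken_at: datetime


def sightings_from_photo(tagged_emails, latitude, longitude, taken_at):
    """Every face tag on a geotagged photo yields one sighting."""
    return [Sighting(email, latitude, longitude, taken_at)
            for email in tagged_emails]


photo_sightings = sightings_from_photo(
    ["alice@gmail.com", "bob@gmail.com"],
    40.75, -73.98,                # made-up GPS coordinates
    datetime(2010, 8, 4, 10, 0),  # the timestamp from the example above
)
for sighting in photo_sightings:
    print(sighting)
```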

There’s lots of other information you’ve probably also told them, but these are the data that are creeping me out lately. If your album has 20 or 30 photos of Alice and Bob that you’ve tagged with their contact info then Google’s got a pretty good idea what they look like – if the Picasa desktop app is able to guess who people in your photos are based on some algorithm inside it, imagine what Google’s billion-dollar datacenters can do.

In all likelihood, you aren’t the only one uploading photos of Alice and Bob. Other people at other events tag photos of Alice and Bob and upload them to Google, further “teaching” this massive computer brain what Alice and Bob look like (since email addresses are basically internet-wide unique IDs, two photos tagged with the same email address can generally be assumed to be the same person). Alice and Bob may never use Picasa, may not even own a camera themselves, and may not even use Google at all. But at this point Google knows what they look like and where they’ve gone – completely apart from their computer-based activities.
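The reason the email address matters so much is that it works as a join key. Here is a small Python sketch, with invented uploaders and filenames, of how tags from people who have never met each other collapse into a single per-person index once they share that key. It is a guess at the general shape of the problem, not a description of anything Google actually runs.

```python
from collections import defaultdict

# (uploader, photo, tagged email addresses); all hypothetical.
uploads = [
    ("you", "beach.jpg", ["alice@gmail.com", "bob@gmail.com"]),
    ("cousin_carol", "wedding.jpg", ["alice@gmail.com"]),
    ("neighbor_dave", "bbq.jpg", ["bob@gmail.com", "alice@gmail.com"]),
]


def photos_by_email(uploads):
    """Group every tagged photo under the email address it was tagged with."""
    index = defaultdict(list)
    for uploader, photo, emails in uploads:
        for email in emails:
            index[email].append((uploader, photo))
    return dict(index)


index = photos_by_email(uploads)
for email, photos in index.items():
    print(email, photos)
```

Alice never uploaded anything, yet she ends up with three photos under her address from three different people.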

I think facial recognition is going to become huge for marketers over the next decade or so. Picasa offers users a useful feature that seems like it has this sinister other side to it – basically building an enormous crowdsourced facial recognition database, so they’ll be able to identify millions of people right out of the gate. If New York City ever gives Google access to its street cams, Google will be able to track the activities of millions more people without their knowledge or consent. Combine that with the existing knowledge Google has – if your iPhone checks your Gmail account, they know your general location at any given time anyway, just based on IP address – and they can create a pretty accurate (in advertising terms) picture of you. And with facial recognition, it will actually BE a picture of you.

Much is made of Google’s “Don’t Be Evil” motto (and I couldn’t write this without throwing those 3 words in), and I tend to be somewhat of a Google fanboy myself. However, much like government, what you have to worry about isn’t always what the current regime is doing with its power, but what the regime 10 or 20 or 100 years from now will do with it. I’m sure Google has rules about how these data are used, but rules change; rules are broken. If there’s one rule that seems inviolate throughout human history it’s that power corrupts. Knowledge is power. Or something.

Well, whatever. I still love Picasa; it just gives me this creepy feeling sometimes. This stuff is all completely voluntary and nobody is being forced to use any of these features, but like I said, Uncle Bob and Aunt Alice were tagged in a photo by someone else – you don’t need to do anything to have your face added to the Great Google Face Database In The Sky. This is something I’ve been thinking about for a while, but I was prompted to write it down by Eric Schmidt’s recent comment, “Show us 14 photos of yourself and we can identify who you are.”