More thoughts on Google's tracking abilities

It all comes down to the cookie.

The Wall Street Journal recently began a series of articles called What They Know, detailing the different pieces of data that online marketing companies have about people as they traverse the web. None of this is really new, especially not to me, since I work in that industry. But I was surprised at some of the data that was present in the cookies right in plaintext:

Now, I don’t know if the above image of a cookie was presented as it was because the reporters didn’t realize that all it takes to “decode” that cookie is a few runs through PHP’s urldecode(), which converts those %25255Es from percent-encoded hex codes to plain old ASCII: %25255E0 -> %255E0 -> %5E0 -> ^0 (a caret followed by a literal zero). Maybe they didn’t know, or maybe they knew but they left it all computery so it looked “scarier” to readers… that green text on black background is usually reserved for movies like The Matrix. Anyway, like I said, what was surprising to me wasn’t that there was that much data being collected, but rather that the data was right there in the body of the cookie, readable by anyone. Even a simple base64_encode would have hidden the contents of the cookie from the casual snooper.
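To make the decoding concrete, here is a minimal sketch in Python rather than PHP. It assumes `urllib.parse.unquote` as a stand-in for PHP's urldecode() (the two differ on `+` handling, which doesn't matter for this string). The `sample_value` payload is made up for illustration and is not taken from the Journal's cookie.

```python
import base64
from urllib.parse import unquote


def decode_layers(value, max_rounds=10):
    """Percent-decode repeatedly until the value stops changing.

    Returns every intermediate form, starting with the input.
    """
    steps = [value]
    for _ in range(max_rounds):
        decoded = unquote(steps[-1])
        if decoded == steps[-1]:
            break  # nothing left to decode
        steps.append(decoded)
    return steps


# The triple-encoded caret from the cookie, followed by a literal "0".
steps = decode_layers("%25255E0")
print(" -> ".join(steps))

# Base64 is obfuscation, not encryption: it hides the value from a casual
# glance, but anyone can reverse it.
sample_value = "age=25-34^0"  # hypothetical cookie payload
obscured = base64.b64encode(sample_value.encode("utf-8")).decode("ascii")
recovered = base64.b64decode(obscured).decode("utf-8")
print(obscured, "->", recovered)
```

Each pass strips exactly one layer of percent-encoding, which is why it takes three passes to get from %25255E0 to ^0.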

For a while I’ve been thinking about Google’s vast troves of data that go far, far beyond what the average marketer knows about the average web user. Let’s assume you’re… me. You use Gmail, Google, and YouTube on a pretty frequent basis. Google has single sign-on — as it should — so to use any of these services you can (and in many cases, have to) be logged in with your Google Account. This is logical and convenient for the user, but it unlocks huge amounts of information about you to Google. By having you sign in to any of their services, Google’s ability to track you online transcends cookies.

Cookies are small bits of data set by the server on your browser to allow information to persist between sessions. Since they’re stored in the browser, cookies set in one browser can’t be used in another. This means that if you start Firefox and click around the internet for a while, you’ll accumulate some cookies. If you then exit Firefox, start Safari, and click around to those same sites, you’ll get completely different cookies than those you got in Firefox. From a “tracking” perspective, the person using Firefox and the person using Safari are different people (even though they both happen to be you) [1]. Also, because cookies are tied to browsers, cookies set on one computer are bound to that browser on that computer; i.e., cookies in Firefox on computer A have no bearing on what happens in Firefox (or any other browser) on computer B.
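The browser-isolation point can be sketched as a toy simulation. Everything here is hypothetical: `TrackingServer`, the `visitor_id` cookie name, and the plain dicts standing in for each browser's cookie jar are illustrative only, not how any real ad server is built.

```python
import uuid


class TrackingServer:
    """Toy server that identifies visitors purely by a cookie it sets."""

    def __init__(self):
        self.visits = {}  # visitor_id -> list of pages seen

    def handle(self, cookie_jar, page):
        visitor_id = cookie_jar.get("visitor_id")
        if visitor_id is None:
            # First visit from this browser: mint an ID and store it in the jar.
            visitor_id = uuid.uuid4().hex
            cookie_jar["visitor_id"] = visitor_id
        self.visits.setdefault(visitor_id, []).append(page)
        return visitor_id


server = TrackingServer()
firefox_jar = {}  # each browser keeps its own, separate cookie store
safari_jar = {}

id_ff_1 = server.handle(firefox_jar, "/news")
id_ff_2 = server.handle(firefox_jar, "/sports")
id_safari = server.handle(safari_jar, "/news")

print("Firefox recognized on return visit:", id_ff_1 == id_ff_2)
print("Safari looks like a different person:", id_ff_1 != id_safari)
```

The server ends up with two visitor records for one human, which is exactly the wall that logging in removes.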

Single sign-on knocks down these implicit privacy walls. Assume, again, that you’re me, and you have a Linux laptop at work. At home you have a Linux desktop, a Mac mini hooked up to the TV in the living room, and a Windows laptop. You also have an iPhone. Single sign-on enables Google to track what you’re doing across all of these devices. It’s really quite simple: on each machine you use, if you want to read your email (Gmail) you log in with your Google Account. At that point, Google knows that it’s you using the browser. The value inside the cookie they set in your particular browser may differ, but they know that you’re you. They know what you’re searching for in Google; where you go (by IP address or, if you allow it, by GPS on most modern smartphones; Google’s Latitude service lets you relay your GPS coordinates to your friends); what kind of email you receive; and who you correspond with.

And let’s not forget that Google has plastered the internet with ads. Over 90% of their revenue comes from advertising, and they bought DoubleClick a few years ago, so any time you go to a site with Google ads on it (which is pretty much all of them), they know it. They own YouTube, so they know every video you’ve watched on YouTube, which ones you’ve “Liked” and which ones you’ve “Favorited.” And, as I mentioned in my previous crazy-guy post, Google is amassing a huge facial-recognition database, so they’ll know everything about you: interests, income, travel habits, friends, what you look like, likes & dislikes. They can probably give a pretty good guess as to where your home is and where your office is just by seeing that between 9:00 AM and 6:00 PM you commonly access the internet from IP 1.2.3.4 and the rest of the time you usually come from IP 2.3.4.5, and simple IP-geo databases can tell them where those IPs are (admittedly, with widely varying accuracy).
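Here is a rough sketch of how a login could tie separate cookies together, and how the 9-to-6 IP heuristic might work. This is an assumption-laden toy, not a description of Google's systems: `AccountTracker`, the cookie names, and the account address are all invented, and the IPs are the placeholder ones from the paragraph above.

```python
from collections import Counter, defaultdict


class AccountTracker:
    """Hypothetical server-side view: cookie IDs get linked to an account at login."""

    def __init__(self):
        self.cookie_to_account = {}
        self.activity = defaultdict(list)  # account -> [(hour, ip, cookie_id)]

    def login(self, cookie_id, account):
        # From here on, anything seen with this cookie belongs to the account.
        self.cookie_to_account[cookie_id] = account

    def record(self, cookie_id, hour, ip):
        account = self.cookie_to_account.get(cookie_id)
        if account is not None:
            self.activity[account].append((hour, ip, cookie_id))

    def guess_locations(self, account, work_start=9, work_end=18):
        """Most common IP during work hours vs. the rest of the day."""
        work, other = Counter(), Counter()
        for hour, ip, _ in self.activity[account]:
            bucket = work if work_start <= hour < work_end else other
            bucket[ip] += 1
        office = work.most_common(1)[0][0] if work else None
        home = other.most_common(1)[0][0] if other else None
        return office, home


tracker = AccountTracker()
account = "me@example.com"  # made-up account
for cookie in ("cookie-work-laptop", "cookie-home-desktop", "cookie-iphone"):
    tracker.login(cookie, account)

tracker.record("cookie-work-laptop", 10, "1.2.3.4")
tracker.record("cookie-work-laptop", 14, "1.2.3.4")
tracker.record("cookie-iphone", 12, "1.2.3.4")
tracker.record("cookie-home-desktop", 20, "2.3.4.5")
tracker.record("cookie-home-desktop", 22, "2.3.4.5")
tracker.record("cookie-iphone", 7, "2.3.4.5")

devices = {cookie for _, _, cookie in tracker.activity[account]}
office_ip, home_ip = tracker.guess_locations(account)
print(len(devices), "cookies, one person; office:", office_ip, "home:", home_ip)
```

Three different cookie values collapse into one activity log the moment each browser logs in, and the time-of-day split falls out of a simple count.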

The trove of information they have on the average person is actually frightening. The only thing keeping them from completely exploiting this data (assuming they aren’t, for argument’s sake) is their “Don’t Be Evil” philosophy and the shitstorm of bad press (and, one would assume, legal action) that would ensue if they were to do so. I’m not really convinced they aren’t already using all of this data, probably to make ultra-targeted advertising decisions, which seems relatively benign on its face. But the real risk comes when this all falls into someone else’s hands. Google could get hax0red (it’s already happened). Google could get subpoenaed (I’m sure it’s happened hundreds of times already). A new batch of idiots in the Senate could just redefine terrorism and require all Google’s data be handed over daily.

This isn’t strictly a problem with Google, but there aren’t many companies I can think of that have massive ad platforms that also provide services you’re willing to log in to, and the logging in is what allows them to track you across browsers, across computers, across devices, and ultimately in real life.

Oh well. Whatever. I’m a big hypocrite because I can’t imagine not using Gmail or any of Google’s services that I use daily. Sucks to be me, I guess. Even if you “trust” Google, you may not trust what Google becomes 10 years from now, but by then they’ll already know all about you.

[1] This isn’t completely accurate. Even without cookies, some pieces of data will be the same regardless of your browser. Your IP address, for example, is in general a pretty good proxy for uniqueness. But I’m just thinking about cookies for now.

The sinister side of Google's Picasa face tagging

So, let me start by saying that I love Picasa, Google’s photo organization tool. It automatically finds new photos as you add them to your hard drive. It lets you crop pictures, remove red-eye, adjust colors and make a few other basic edits that cover probably 95% of what most people need to do when editing photos. It lets you select a few photos from your library and email them to anyone with just a couple of clicks. It also integrates with Google Earth and Google Maps to show you on a map where a particular photo was taken (for those unaware, GPS-enabled cameras, including many mobile phone cameras [e.g., iPhone] embed your GPS coordinates within the EXIF metadata of the photo, so any person, program or website with access to the image will know the location at which it was taken).
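As a small illustration of the EXIF point: the GPS tags store latitude and longitude as degrees, minutes, and seconds plus a hemisphere letter, and turning them into map coordinates is simple arithmetic. The sketch below works on made-up values shaped like those tags. It doesn't read a real image file, and `dms_to_decimal` is a hypothetical helper.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal notation.
    return -value if ref in ("S", "W") else value


# Hypothetical values shaped like EXIF GPS tags (not read from a real photo).
exif_gps = {
    "GPSLatitude": (40, 44, 54.36),
    "GPSLatitudeRef": "N",
    "GPSLongitude": (73, 59, 8.36),
    "GPSLongitudeRef": "W",
}

lat = dms_to_decimal(*exif_gps["GPSLatitude"], exif_gps["GPSLatitudeRef"])
lon = dms_to_decimal(*exif_gps["GPSLongitude"], exif_gps["GPSLongitudeRef"])
print(f"{lat:.6f}, {lon:.6f}")
```

Anything that can read the file can do this conversion, which is why a shared photo is also a shared location.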

It also has a nifty feature called face tagging. How this works, basically, is Picasa analyzes all of the photos in your library and looks for faces. There’s some algorithm in the program that can recognize that two eyes, a nose, a mouth and maybe some hair is a face. So if you use the face-tagging feature, Picasa shows you a page of faces extracted from your photo library. Initially these photos have no names, but Picasa does some basic grouping of them. For example, it doesn’t know who your Uncle Bob is, but it does know that these 14 photos are all of the same person. The grouping feature isn’t perfect, but it is very helpful when you decide to apply a name to the group of photos – tagging 14 photos instead of one is a great time-saver.

This feature only really becomes useful if you start tagging faces with real names — i.e. if you tag the photos of Uncle Bob by telling Picasa “these are photos of Uncle Bob.” If you facetag enough photos, Google will start “guessing” the name for a particular face, and tagging it automatically. This feature is also not perfect, but I imagine they’re working on improving it all the time.
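The grouping and guessing steps can be sketched with a toy nearest-neighbor approach. To be clear, this is not Picasa's algorithm, which I don't know. It assumes each detected face has already been reduced to a numeric feature vector, and the two-dimensional vectors, image names, and threshold below are invented for the example.

```python
import math


def group_faces(faces, threshold=1.0):
    """Greedy grouping: a face joins the first group whose first member is close enough."""
    groups = []
    for face_id, vector in faces.items():
        for group in groups:
            if math.dist(vector, faces[group[0]]) <= threshold:
                group.append(face_id)
                break
        else:
            groups.append([face_id])  # no nearby group, so start a new one
    return groups


def guess_name(vector, labeled, threshold=1.0):
    """Return the name of the nearest labeled face, or None if nothing is close enough."""
    best_name, best_dist = None, threshold
    for name, known in labeled:
        distance = math.dist(vector, known)
        if distance <= best_dist:
            best_name, best_dist = name, distance
    return best_name


# Made-up feature vectors; a real system would derive these from the images.
faces = {
    "img1": (0.1, 0.2),
    "img2": (0.2, 0.1),
    "img3": (5.0, 5.1),
    "img4": (5.2, 4.9),
}
groups = group_faces(faces)

# Once a couple of faces are tagged, new faces can be guessed from them.
labeled = [("Uncle Bob", faces["img1"]), ("Aunt Alice", faces["img3"])]
guess_known = guess_name((0.15, 0.15), labeled)
guess_unknown = guess_name((20.0, 20.0), labeled)
print(groups, guess_known, guess_unknown)
```

The more tagged examples there are per person, the more reference points a guess has to lean on, which is why tagging "enough" photos matters.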

So, this all happens on your computer, within Picasa. I’m not so much of a tinfoil hat type as to suggest Google’s doing anything in particular with the data on your computer itself. The “problem” as I see it is that when you tag a photo of Uncle Bob, Picasa pulls Uncle Bob’s contact info out of your Gmail contacts. So essentially, you’re tying a face to an email address. As I said, I don’t think Google’s surreptitiously going to use the info that resides on your computer.

But in addition to Picasa, the photo organization tool you run on your computer, Google offers an online photo album service called Picasa Web Albums. This is similar to other services, Flickr being the largest, that offer a simple way to upload photos and share them with others. All users get 1 GB of free storage, and you can buy more pretty cheaply (as of today you can get 20 GB for $5/year). As you might expect from the names, Picasa and Picasa Web Albums integrate very well. If you create an album within Picasa, all you have to do to upload it to Picasa Web Albums is click the “Sync This Album” button. It will then upload all the photos in the album to Picasaweb.

Here’s where the potential creepy part starts. Let’s say you have a photo in Picasa, that you took on August 4th, 2010, at 10:00 AM, and you’ve tagged 2 faces in it: Aunt Alice (alice@gmail.com) and Uncle Bob (bob@gmail.com). Let’s further say that you took this photo with your iPhone, so the GPS coordinates are embedded in the photo metadata. You upload the photo to Picasa Web Albums. Well, now you’ve just told Google the following:

  • What alice@gmail.com looks like.
  • What bob@gmail.com looks like.
  • Where alice@gmail.com and bob@gmail.com physically were (via GPS coordinates) on 8/4/2010 at 10:00 AM.
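The three facts above fall straight out of the photo's metadata. Here is a minimal sketch, assuming the upload is represented as a dict with the fields shown. The `Sighting` type, the field names, and the coordinates are hypothetical and not any real Picasa format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sighting:
    """One person, at one place, at one time."""
    email: str
    taken_at: datetime
    latitude: float
    longitude: float


def sightings_from_photo(photo):
    """Every tagged face inherits the photo's timestamp and GPS position."""
    lat, lon = photo["gps"]
    return [
        Sighting(email, photo["taken_at"], lat, lon)
        for email in photo["tagged_emails"]
    ]


photo = {
    "taken_at": datetime(2010, 8, 4, 10, 0),
    "gps": (40.7484, -73.9857),  # made-up coordinates
    "tagged_emails": ["alice@gmail.com", "bob@gmail.com"],
}

sightings = sightings_from_photo(photo)
for sighting in sightings:
    print(sighting)
```

One upload, two people placed at a specific spot at a specific time, and neither of them did anything.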

There’s lots of other information you’ve probably also told them, but these are the data that are creeping me out lately. If your album has 20 or 30 photos of Alice and Bob that you’ve tagged with their contact info, then Google’s got a pretty good idea what they look like. If the Picasa desktop app is able to guess who people in your photos are based on some algorithm inside it, imagine what Google’s billion-dollar datacenters can do.

In all likelihood, you aren’t the only one uploading photos of Alice and Bob. Other people at other events tag photos of Alice and Bob and upload them to Google, further “teaching” this massive computer brain what Alice and Bob look like (since email addresses are basically internet-wide unique IDs, two photos tagged with the same email address can generally be assumed to be the same person). Alice and Bob may never use Picasa, may not even own a camera themselves, and may not even use Google at all. But at this point Google knows what they look like and where they’ve gone – completely apart from their computer-based activities.
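Because the email address acts as a shared key, merging tags from unrelated uploaders is trivial. The sketch below shows the idea on invented data. The uploader addresses, place labels, and `build_profiles` function are assumptions for illustration, not a claim about what Google actually stores.

```python
from collections import defaultdict


def build_profiles(uploads):
    """Merge tags from many uploaders into one profile per tagged email address."""
    profiles = defaultdict(
        lambda: {"photo_count": 0, "uploaders": set(), "places": set()}
    )
    for upload in uploads:
        for email in upload["tagged_emails"]:
            profile = profiles[email]
            profile["photo_count"] += 1
            profile["uploaders"].add(upload["uploader"])
            profile["places"].add(upload["place"])
    return dict(profiles)


# Hypothetical uploads from three different people.
uploads = [
    {"uploader": "you@example.com", "place": "family picnic",
     "tagged_emails": ["alice@gmail.com", "bob@gmail.com"]},
    {"uploader": "cousin@example.com", "place": "wedding",
     "tagged_emails": ["bob@gmail.com"]},
    {"uploader": "coworker@example.com", "place": "office party",
     "tagged_emails": ["alice@gmail.com"]},
    {"uploader": "you@example.com", "place": "family picnic",
     "tagged_emails": ["bob@gmail.com"]},
]

profiles = build_profiles(uploads)
bob = profiles["bob@gmail.com"]
print(bob["photo_count"], sorted(bob["uploaders"]), sorted(bob["places"]))
```

Note that Bob never appears as an uploader: his profile is built entirely from other people's tags.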

I think facial recognition is going to become huge for marketers over the next decade or so. Picasa offers users a useful feature that seems like it has this sinister other side to it – basically building an enormous crowdsourced facial recognition database, so they’ll be able to identify millions of people right out of the gate. If New York City ever gives Google access to its street cams, Google will be able to track the activities of millions more people without their knowledge or consent. Combine that with the existing knowledge Google has – if your iPhone checks your Gmail account, they know your general location at any given time anyway, just based on IP address – and they can create a pretty accurate (in advertising terms) picture of you. And with facial recognition, it will actually BE a picture of you.

Much is made of Google’s “Don’t Be Evil” motto (and I couldn’t write this without throwing those 3 words in), and I tend to be somewhat of a Google fanboy myself. However, much like government, what you have to worry about isn’t always what the current regime is doing with its power, but what the regime 10 or 20 or 100 years from now will do with it. I’m sure Google has rules about how these data are used, but rules change; rules are broken. If there’s one rule that seems inviolate throughout human history it’s that power corrupts. Knowledge is power. Or something.

Well, whatever. I still love Picasa, it just gives me this creepy feeling sometimes. This stuff is all completely voluntary, nobody is being forced to use any of these features, but like I said, Uncle Bob and Aunt Alice were tagged in a photo by someone else – you don’t need to do anything to have your face added to the Great Google Face Database In The Sky. This is something I’ve been thinking about for a while, but I was prompted to write it down based on Eric Schmidt’s recent comment, “Show us 14 photos of yourself and we can identify who you are.”
