Chaos theory and Google’s crawler

I’ve been moderately perplexed by the recent spike in traffic on basically unrelated keywords. Apparently this site is currently the #5 result for “fedora 15 beta download” despite my having never written about Fedora 15. In an attempt to funnel people to a useful page I created the previous post with links to the FC 15 ISOs. I feel bad if people come here looking for an answer that’s not to be found.

In looking into this issue I searched Google for the keywords and saw this:

evanhoffman.com is blockable
evanhoffman.com is blockable

There’s a “Block all evanhoffman.com results” link under my site, but there’s none under any of the other sites. What the hell? Does my site somehow qualify as a spammer or content farm? Why do I get this dubious distinction? Ugh.

Advertisements

Thank you Cabinetparts.com!

A couple of weeks ago the hinge on one of the cabinets in our kitchen broke. I took the door off, unscrewed the hinge, and went to Home Depot to try and find a replacement. No luck. I went to a local hardware store, same deal. I was annoyed, and worried I’d never be able to find a replacement. I headed home.

I looked at the broken hinge and found imprinted on it, in tiny numerals, was “32.260-01”. It also had “blum” imprinted on it, which I assumed was the brand name. A long shot, but I entered “blum 32.260-01” in Google. I was thrilled to see that while there were no organic results, there was a paid link for CabinetParts.com. I clicked through and found the exact part I needed. They were a little more than I’d hoped to spend, but since I had no idea where else I could go to find them, I was happy to pay it. I got the part a few days later and it was exactly what I needed. Cabinet: repaired. So, hooray for them!

Relaying through Google Apps using Sendmail to bypass EC2 spam blockage

Update 3 May 2011: I’ve subsequently modified our EC2 systems to relay SMTP mail through Amazon’s SES which doesn’t have the 500 messages per day limit that Google Apps does.

A few months ago I moved a site into EC2. I didn’t want to move the existing IMAP server (ugh) so I moved the email to Google Apps. There are only about 10 mailboxes so we went with “Standard” edition (free). Once we completed the move to EC2 we discovered that emails from our webserver were bouncing due to our EC2 IP address being listed in a spam RBL. This sucked, so I looked into relaying the mail from the EC2 webserver through our Google Apps account. Fortunately this turned out to be pretty easy.

This wiki page on scalix.com has a procedure for setting up SMTP relaying in Ubuntu with TLS & auth. I’m not running Ubuntu so the paths were different but it was basically the same procedure:

  • Create the file /etc/mail/client-info with these contents: AuthInfo:smtp.gmail.com "U:bounces@example.com" "I:bounces@example.com" "P:superpassword", where “example.com” is your Google Apps domain, “bounces” is a valid account, and the password is the account’s password. Mail relayed with these credentials will show “bounces@example.com” in the From: field of the message.
  • In /etc/mail, run makemap hash client-info < client-info
  • Edit /etc/mail/sendmail.mc, adding or uncommenting these lines:
    define(`SMART_HOST', `smtp.gmail.com')dnl
    define(`confAUTH_MECHANISMS', `EXTERNAL GSSAPI DIGEST-MD5 CRAM-MD5 LOGIN PLAIN')dnl
    FEATURE(`authinfo', `hash /etc/mail/client-info')dnl
    
  • Recompile sendmail.cf: m4 sendmail.mc > sendmail.cf . I got this error: “/etc/mail/sendmail.mc:10: m4: Cannot open /usr/share/sendmail-cf/m4/cf.m4” when running the command, but I resolved it by doing yum install sendmail-cf
  • Restart sendmail.

Once this was done I sent myself a test message from the command line and received it; I checked the SMTP headers and sure enough it went through Google’s mail server. One nice side effect is that all the mail sent by the webserver appears in the “Sent” folder for the Google Apps username provided in the client-info file. Hopefully this will resolve the spam issues, since the mail is now coming from Google’s IP block.

How does paid blogging work?

I’ve been hearing for years about paid bloggers. If people are getting paid to write their crap down in an ad-supported industry, it seemed like it might make sense to throw some ads up on this very site to see what happens. I’ve had Adsense running on this site for a few months now and the short answer is a whole lot of nothing. Here’s what the earnings look like since 1/1/2009 (my Adsense account is much older than this site; I put the banner ads up around Fall of 2009):

Basically, in a year I’ve “earned” under $20. That doesn’t even pay for domain registration & DNS for a year. And since Google doesn’t actually pay you until you have $100 in earnings, this is fake money anyway.

Now I didn’t have any illusions about making money from this site, I just put the ads up as an experiment to see if this is a realistic way to earn a dependable income. From what I can tell, it can be, but only in certain cases, basically coming down to how much traffic you can generate.

  1. You’re already famous. If you’re already a “celebrity” in your field (whatever that field is) then people already probably want to hear what you say.
  2. Your subject matter has mass appeal. If you write about discoveries in quantum physics, you may have a decent following, but it’s still only going to be the people who care about quantum physics. If you write about Jersey Shore you have a much larger pool of possible readers, because everybody loves watching a train wreck.
  3. What you say actually matters. This is related to the first point. If Joe Shmoe (or Evan Hoffman) rants at the top of his lungs, it’s just some guy complaining. If Ben Bernanke makes an offhand comment about interest rates the stock market tanks.

I’m sure there are some other cases, but as far as I can tell a tech guy writing about things that annoy him doesn’t fit any of these criteria. I’m tempted to remove the ads altogether, but it’s too interesting seeing what ads Google puts up on some of these pages. The first few months, the ads were all for some rabbi’s circumcision service. Not sure what that was about.

A less insidious way to use Facebook?

I deactivated my Facebook account a couple of months ago. I just kind of got tired of seeing silly updates from friends and “friends” – people I’d friended but wasn’t really friends with. I was also frustrated by the privacy implications of using such a service: you tell it about yourself, you tell it about who you know and how you know them, you keep adding more information about you and your friends to its huge brain that it’s free to use or abuse however it wants.

I don’t know if I’m anti-“Social” or just antisocial but most of the info streaming into my Facebook feed was just not interesting to me. I could have hidden those people, but then it seemed like it would make more sense simply to remove the connection to them, if I didn’t want to see their updates. I actually went through my list of connections and started removing people – people I knew from high school and hadn’t spoke to since then until they added me on Facebook, and then continued not talking to them, and other people who I knew but didn’t really interact with, online or offline. I didn’t really care about what they had to say and it occurred to me that they didn’t care what I had to say. Why did we friend even each other in the first place? Well, the friend suggester (suggestor?) makes it easy to friend people who are only tangentially related, since its whole purpose is to find new people for you to add.

I remember there was one person from school whom I hadn’t spoken to since probably 4th grade. This person attempted to friend me 5 times on FB (Soandso wants to be your friend…) and each time I clicked “Ignore,” but on the 6th time I finally relented. After 2 weeks of inane updates I unfriended the person. Within a month I was getting requests to refriend. Why? I don’t know you, you don’t know me, what’s to be gained by us pretending to be e-friends?

So I had some fundamental problems with Facebook. In addition to the friending of barely-friends, the feeding of so much information into the Facebook brain was starting to bother me. This is pretty similar to my worries about Google’s reach; basically every bit of information you post to Facebook to share with “friends” is also being added to Facebook’s marketing profile about you and your friends. The more you use the service, the more they know about you. And all those “Like” buttons all over the internet – a way for you to inform your Facebook friends that you like a blog post or news story – those are just a way for Facebook to know what sites you’re visiting. Whether you click the “like” button or not, your browser is loading the button of their servers, which means Facebook is reading your cookie and knows that YOU visited the page. This annoyed me so much that I edited my /etc/hosts file to redirect http://www.facebook.com to 127.0.0.1 (my own computer) where I’m running Apache, so the Like buttons just render as 404 errors now:

But I’m fine with that. I’ve also set my browser to reject all cookies from *.facebook.com. I realize this is just a drop in the ocean of data for Facebook, but screw them. Even with my account disabled they were collecting data about me, and that just pissed me off. But much like Google, Facebook’s tracking ability transcends browsers and computers, since in order to use their service you need to log in, and thus your movements around the internet can be tracked regardless of which computer or device you’re using.

Facebook wasn’t a completely worthless service for me. I found the photo album feature very useful. It was a great way to upload pictures and share them instantly with whomever wanted to see them. In my case this was usually my family plus a few friends. I doubt anything will top Facebook for this because these people are already on Facebook, and for something to come along that’s better at this than Facebook, these people would need to move to the new platform, which as of today doesn’t seem likely.

Photo sharing is the one thing I miss. I haven’t stopped taking pictures but it’s a much clumsier process now to share them with people. I put them in an album in Picasa, upload it to PicasaWeb, set the permissions on the album, send out the invitations. The recipients then have to click on a private link to get to the pictures, and if they want to see them again in the future, they need to dig through their inbox to find the link and click on it again. Not everybody uses Gmail, and even for those who do, this is just a clunky process. With Facebook albums, if the album is shared with someone, all they have to do is click on me and then click on my list of albums to see the pictures. Easy. I’m considering returning to Facebook just to get the photo album back.

So I was thinking that if I could restrict myself to using only the Facebook iPhone app, I’d still be able to take the occasional picture with the phone, upload it for people to see, and not fall prey to the tracking cookie problems I described above, since (I’m assuming) the Facebook app and Safari don’t share data. At least, not yet.

That idea prompted me to write this post in the first place, but as I’ve been writing it it occurred to me that it’s not really a workable plan. If I’m using it I’ll eventually feel the need to login via browser, meaning I’ll have to tear down all the walls I’ve erected – the hosts file entry, the cookie blocking – and I’ll be right back where I was, feeding them all my info and letting them track me everywhere I go. So I guess it’s going to come down to a question of whether or not the costs outweigh the benefits, as it always does.

Unless I can just write a browser plugin to strip the “Like” button from non-Facebook websites. Maybe AdBlock can do this. Hmm… The dog woke me up early today and everyone else is asleep still, and this all sounded a lot better in my head before I started writing it down.

More thoughts on Google's tracking abilities

It all comes down to the cookie.

The Wall Street Journal recently began a series of articles called What They Know, detailing the different pieces of data that online marketing companies have about people as they traverse the web. None of this is really new, especially not to me, since I work in that industry. But I was surprised at some of the data that was present in the cookies right in plaintext:

Now, I don’t know if the above image of a cookie was presented as it was because the reporters didn’t realize that all that was needed to “decode” that cookie was a couple of runs through PHP’s urldecode() and those %25255Es would be converted from their hexcodes to plain old ASCII – %25255E0 -> %255E0 -> %5E0 -> ^0 (caret). Maybe they didn’t know, or maybe they knew but they left it all computery so it looked “scarier” to readers… that green text on black background is usually reserved for movies like The Matrix. Anyway, like I said, what was surprising to me wasn’t that there was that much data being collected, but rather that the data was right there in the body of the cookie, readable by anyone. Even a simple base64_encode would have hidden the contents of the cookie from the casual snooper.

For a while I’ve been thinking about Google’s vast troves of data that go far, far beyond what the average marketer knows about the average web user. Let’s assume you’re… me. You use Gmail, Google, and YouTube on a pretty frequent basis. Google has single sign-on — as it should — so to use any of these services you can (and in many cases, have to) be logged in with your Google Account. This is logical and convenient for the user, but it unlocks huge amounts of information about you to Google. By having you sign in to any of their services, Google’s ability to track you online transcends cookies.

Cookies are small bits of data set by the server on your browser to allow information to persist between sessions. Since it’s set in the browser, it’s implicitly impossible for cookies set in one browser to be used in another browser. This means that if you start Firefox and click around the internet for a while, you’ll accumulate some cookies. If you then exit Firefox and start Safari, and click around to those same sites, you’ll get completely different cookies than those you got in Firefox — from a “tracking” perspective, the person using Firefox and the person using Safari are different people (even though they both happen to be you)1. Also, because cookies are tied to browsers, this implies that cookies set on one computer are bound to that browser on that computer — i.e., cookies in Firefox on computer A have no bearing on what happens in Firefox (or any other browser) on computer B.

Single sign-on knocks down these implicit privacy walls. Assume, again, that you’re me, and you have a Linux laptop at work. At home you have a Linux desktop, a Mac mini hooked up to the TV in the living room, and a Windows laptop. You also have an iPhone. Single sign-on enables Google to track what you’re doing across all of these devices. It’s really quite simple: on each machine you use, if you want to read your email (Gmail) you log in with your Google Account. At that point, Google knows that it’s you using the browser. The value inside the cookie they set in your particular browser may differ, but they know that you’re you. They know what you’re searching for in Google; where you go (by IP address; or, if you allow it, by GPS on most modern smart phones — Google’s Latitude service lets you relay your GPS coordinates to your friends), what kind of email you receive, who you correspond with. And let’s not forget that Google has plastered the internet with ads – over 90% of their revenue comes from advertising, and they bought DoubleClick a few years ago, so any time you go to a site with Google ads on it (which is pretty much all of them), they know it. They own YouTube, so they know every video you’ve watched on YouTube, which ones you’ve “Liked” and which ones you’ve “Favorited.” And, as I mentioned in my previous crazy-guy post, Google is amassing a huge facial-recognition database, so they’ll know everything about you – interests, income, travel habits, friends, what you look like, likes & dislikes. They can probably give a pretty good guess as to where you home is and where your office is just by seeing that between 9:00 AM and 6:00 PM you commonly access the internet from IP 1.2.3.4 and the rest of the time you usually come from IP 2.3.4.5, and simple IP-geo databases can tell them where those IPs are (admittedly, with widely varying accuracy).

The trove of information they have on the average person is actually frightening. The only thing keeping them from completely exploiting this data (assuming they aren’t, for argument’s sake) is their “Don’t Be Evil” philosophy and the shitstorm of bad press (and, one would assume, legal action) that would ensue if they were to do so. I’m not really convinced they aren’t already using all of this data, probably to make ultra-targeted advertising decisions, which seems relatively benign on the face. But the real risk comes when this all falls into someone else’s hands. Google could get hax0red — it’s already happened. Google could get subpoenaed — I’m sure it’s happened hundreds of times already. A new batch of idiots in the Senate could just redefine terrorism and require all Google’s data be handed over daily.

This isn’t strictly a problem with Google, but there aren’t many companies I can think of that have massive ad platforms that also provide services you’re willing to log in to, and the logging in is what allows them to track you across browsers, across computers, across devices, and ultimately in real life.

Oh well. Whatever. I’m a big hypocrite because I can’t imagine not using Gmail or any of Google’s services that I use daily. Sucks to be me, I guess. Even if you “trust” Google, you may not trust what Google becomes 10 years from now, but by then they already know all about you.

1This isn’t completely accurate, because even without cookies there are pieces of data that will be the same regardless of your browser, for example your IP address, which in general is a pretty good proxy for uniqueness, but I’m just thinking about cookies for now.

More thoughts on Google’s tracking abilities

It all comes down to the cookie.

The Wall Street Journal recently began a series of articles called What They Know, detailing the different pieces of data that online marketing companies have about people as they traverse the web. None of this is really new, especially not to me, since I work in that industry. But I was surprised at some of the data that was present in the cookies right in plaintext:

Now, I don’t know if the above image of a cookie was presented as it was because the reporters didn’t realize that all that was needed to “decode” that cookie was a couple of runs through PHP’s urldecode() and those %25255Es would be converted from their hexcodes to plain old ASCII – %25255E0 -> %255E0 -> %5E0 -> ^0 (caret). Maybe they didn’t know, or maybe they knew but they left it all computery so it looked “scarier” to readers… that green text on black background is usually reserved for movies like The Matrix. Anyway, like I said, what was surprising to me wasn’t that there was that much data being collected, but rather that the data was right there in the body of the cookie, readable by anyone. Even a simple base64_encode would have hidden the contents of the cookie from the casual snooper.

For a while I’ve been thinking about Google’s vast troves of data that go far, far beyond what the average marketer knows about the average web user. Let’s assume you’re… me. You use Gmail, Google, and YouTube on a pretty frequent basis. Google has single sign-on — as it should — so to use any of these services you can (and in many cases, have to) be logged in with your Google Account. This is logical and convenient for the user, but it unlocks huge amounts of information about you to Google. By having you sign in to any of their services, Google’s ability to track you online transcends cookies.

Cookies are small bits of data set by the server on your browser to allow information to persist between sessions. Since it’s set in the browser, it’s implicitly impossible for cookies set in one browser to be used in another browser. This means that if you start Firefox and click around the internet for a while, you’ll accumulate some cookies. If you then exit Firefox and start Safari, and click around to those same sites, you’ll get completely different cookies than those you got in Firefox — from a “tracking” perspective, the person using Firefox and the person using Safari are different people (even though they both happen to be you)1. Also, because cookies are tied to browsers, this implies that cookies set on one computer are bound to that browser on that computer — i.e., cookies in Firefox on computer A have no bearing on what happens in Firefox (or any other browser) on computer B.

Single sign-on knocks down these implicit privacy walls. Assume, again, that you’re me, and you have a Linux laptop at work. At home you have a Linux desktop, a Mac mini hooked up to the TV in the living room, and a Windows laptop. You also have an iPhone. Single sign-on enables Google to track what you’re doing across all of these devices. It’s really quite simple: on each machine you use, if you want to read your email (Gmail) you log in with your Google Account. At that point, Google knows that it’s you using the browser. The value inside the cookie they set in your particular browser may differ, but they know that you’re you. They know what you’re searching for in Google; where you go (by IP address; or, if you allow it, by GPS on most modern smart phones — Google’s Latitude service lets you relay your GPS coordinates to your friends), what kind of email you receive, who you correspond with. And let’s not forget that Google has plastered the internet with ads – over 90% of their revenue comes from advertising, and they bought DoubleClick a few years ago, so any time you go to a site with Google ads on it (which is pretty much all of them), they know it. They own YouTube, so they know every video you’ve watched on YouTube, which ones you’ve “Liked” and which ones you’ve “Favorited.” And, as I mentioned in my previous crazy-guy post, Google is amassing a huge facial-recognition database, so they’ll know everything about you – interests, income, travel habits, friends, what you look like, likes & dislikes. They can probably give a pretty good guess as to where you home is and where your office is just by seeing that between 9:00 AM and 6:00 PM you commonly access the internet from IP 1.2.3.4 and the rest of the time you usually come from IP 2.3.4.5, and simple IP-geo databases can tell them where those IPs are (admittedly, with widely varying accuracy).

The trove of information they have on the average person is actually frightening. The only thing keeping them from completely exploiting this data (assuming they aren’t, for argument’s sake) is their “Don’t Be Evil” philosophy and the shitstorm of bad press (and, one would assume, legal action) that would ensue if they were to do so. I’m not really convinced they aren’t already using all of this data, probably to make ultra-targeted advertising decisions, which seems relatively benign on the face. But the real risk comes when this all falls into someone else’s hands. Google could get hax0red — it’s already happened. Google could get subpoenaed — I’m sure it’s happened hundreds of times already. A new batch of idiots in the Senate could just redefine terrorism and require all Google’s data be handed over daily.

This isn’t strictly a problem with Google, but there aren’t many companies I can think of that have massive ad platforms that also provide services you’re willing to log in to, and the logging in is what allows them to track you across browsers, across computers, across devices, and ultimately in real life.

Oh well. Whatever. I’m a big hypocrite because I can’t imagine not using Gmail or any of Google’s services that I use daily. Sucks to be me, I guess. Even if you “trust” Google, you may not trust what Google becomes 10 years from now, but by then they already know all about you.

1This isn’t completely accurate, because even without cookies there are pieces of data that will be the same regardless of your browser, for example your IP address, which in general is a pretty good proxy for uniqueness, but I’m just thinking about cookies for now.