lessons from the ad blocker trenches

Greetings from the beautiful museum district of Berlin, where I’ve been trapped in a small conference room all week for the quarterly meeting of the W3C Technical Architecture group. So far we’ve produced two documents this week that I think are pretty good:

I just realized I have a few more things to say about the latter, based on my experience building and maintaining a semi-popular ad blocker (Privacy Badger Firefox).

  1. Beware of ad blockers that don’t actually block requests to tracking domains. For instance, an ad blocker that simply hides ads using CSS rules is not really useful for preventing tracking. Many users can’t tell the difference.
  2. Third-party cookies are not the only way to track users anymore, which means that browser features and extensions that only block/delete third-party cookies are not as useful as they once were. This 2012 survey paper [PDF] by Jonathan Mayer et. al. has a table of non-cookie browser tracking methods, which is probably out of date by now: Screen Shot 2015-07-17 at 4.32.55 PM
  3. Detecting whether a domain is performing third-party tracking is not straightforward. Naively, you could do this by counting the number of first-party domains that a domain reads high-entropy cookies from in a third-party context. However, this doesn’t encompass reading non-cookie browser state that could be used to uniquely identify users in aggregate (see table above). A more general but probably impractical approach is to try to tag every piece of site-readable browser state with an entropy estimate so that you can score sites by the total entropy that is readable by them in a third-party context. (We assume that while a site is being accessed as a first-party, the user implicitly consents to letting it read information about them. This is a gross simplification, since first parties can read lots of information that users don’t consent to by invisible fingerprinting. Also, I am recklessly using the term “entropy” here in a way that would probably cause undergrad thermodynamics professors to have aneurysms.)
  4. The browser definition of “third-party” only roughly approximates the real-life definition. For instance, dropbox.com and dropboxusercontent.com are the same party from a business and privacy perspective but not from the cookie-scoping or DNS or same-origin-policy perspective.
  5. The hardest-to-block tracking domains are the ones who cause collateral damage when blocked. A good example of this is Disqus, commonly embedded as a third-party widget on blogs and forums; if we block requests to Disqus (which include cookies for logged-in users), we severely impede the functionality of many websites. So Disqus is too usability-expensive to block, even though they can track your behavior from site to site.
  6. The hardest-to-block tracking methods are the ones that cause collateral damage when disabled. For instance, HSTS and HPKP both store user-specific persistent data that can be abused to probe users’ browsing histories and/or mark users so that you can re-identify them after the first time they visit your site. However, clearing HSTS/HPKP state between browser sessions dilutes their security value, so browsers/extensions are reluctant to do so.
  7. Specifiers and implementers sometimes argue that Feature X, which adds some fingerprinting/tracking surface, is okay because it’s no worse than cookies. I am skeptical of this argument for the following reasons:
    a. Unless explicitly required, there is no guarantee that browsers will treat Feature X the same as cookies in privacy-paranoid edge cases. For instance, if Safari blocks 3rd party cookies by default, will it block 3rd party media stream captures (which will store a unique deviceid) by default too?
    b. Ad blockers and anti-tracking tools like Disconnect, Privacy Badger, and Torbutton were mostly written to block and detect tracking on the basis of cookies, not arbitrary persistent data. It’s arguable that they should be blocking these other things as soon as they are shipped in browsers, but that requires developer time.

That’s all. And here’s some photos I took while walking around Berlin in a jetlagged haze for hours last night:

berlin1

berlin2

Update (7/18/15): Artur Janc of Google pointed out this document by folks at Chromium analyzing various client identification methods, including several I hadn’t thought about.