A 1-Million-Site Measurement and Analysis of Online Tracking webtransparency.cs.princeton.edu

Steven Englehardt and Arvind Narayanan of Princeton University measured the third-party tracking scripts on the top million websites as ranked by Alexa. Some findings aren’t surprising — of the top twenty third-party domains, for example, twelve are owned by Google.

But there are some fairly new styles of tracking out there. For example:

Firefox’s third-party cookie blocking is very effective: only 237 sites (0.4%) have any third-party cookies set from a domain other than the landing page of the site. Most of these are for benign reasons, such as redirecting to the U.S. version of a non-U.S. site. We did find a handful of exceptions, including 32 that contained ID cookies. These sites appeared to be deliberately redirecting the landing page to a separate domain before redirecting back to the initial domain.

I’ve previously discussed how Criteo and AdRoll engage in this behaviour.

The HTML Canvas allows web applications to draw graphics in real time, with functions to support drawing shapes, arcs, and text to a custom canvas element. Differences in font rendering, smoothing, anti-aliasing, as well as other device features cause devices to draw the image differently. This allows the resulting pixels to be used as part of a device fingerprint. […]

We found canvas fingerprinting on 14,371 sites, caused by scripts loaded from about 400 different domains.

That’s nearly 1.5% of the top million websites, from about 0.5% of all third-party trackers in the study.
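
To make the canvas mechanism concrete, here is a toy sketch in Python. It is not browser code and not the method used by any script in the study. It only simulates the idea from the quoted passage: two "devices" draw the same line, one with a crude anti-aliasing step and one without, and each hashes its own pixels. The grid size, the `render_line` and `fingerprint` names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

WIDTH, HEIGHT = 16, 16  # arbitrary toy "canvas" size (assumption)


def render_line(antialias: bool) -> bytes:
    """Rasterize the line y = x / 2 onto a small grayscale grid.

    `antialias` is a hypothetical stand-in for any device-specific
    rendering difference (fonts, smoothing, GPU, drivers).
    """
    pixels = bytearray(WIDTH * HEIGHT)
    for x in range(WIDTH):
        y = x / 2
        row = int(y)
        frac = y - row
        if antialias and frac > 0:
            # Split the intensity across the two rows the line touches.
            pixels[row * WIDTH + x] = int(255 * (1 - frac))
            pixels[(row + 1) * WIDTH + x] = int(255 * frac)
        else:
            # Snap the line to a single fully lit pixel.
            pixels[row * WIDTH + x] = 255
    return bytes(pixels)


def fingerprint(pixels: bytes) -> str:
    """Hash the raw pixels into a short identifier."""
    return hashlib.sha256(pixels).hexdigest()[:16]


device_a = fingerprint(render_line(antialias=True))
device_b = fingerprint(render_line(antialias=False))
print(device_a != device_b)  # True: same drawing commands, different IDs
```

The point of the sketch is that nothing needs to be stored on the visitor's machine: the same drawing instructions give a stable identifier on one device and a different one elsewhere, which is why blocking cookies does nothing against it.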

Steven Englehardt followed up on Princeton’s Freedom to Tinker blog with one particularly new way a small number of websites are tracking visitors:

[…] One of our more surprising findings was the discovery of two apparent attempts to use the HTML5 Audio API for fingerprinting.

The figure is a visualization of the audio processing executed on users’ browsers by third-party fingerprinting scripts. We found two different AudioNode configurations in use. In both configurations an audio signal is generated by an oscillator and the resulting signal is hashed to create an identifier. Initial testing shows that the techniques may have some limitations when used for fingerprinting, but further analysis is necessary.

Expedia, Hotels.com, and Travelocity are all prepared to use audio fingerprinting, but have not actively implemented it.
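
The oscillator-then-hash pipeline described in the excerpt can be sketched the same way. This is a simulation under stated assumptions, not the Web Audio API and not either of the two AudioNode configurations the researchers found: the frequency, sample rate, soft-clipping step, and the use of float32 rounding as a stand-in for differences between audio stacks are all invented for illustration, as are the function names.

```python
import hashlib
import math
import struct


def oscillator(freq_hz: float, sample_rate: int, n_samples: int) -> list[float]:
    """Generate a sine wave, playing the role of the oscillator node."""
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def process(samples: list[float], single_precision: bool) -> list[float]:
    """Hypothetical device audio stack: soft clipping, optionally
    rounding every sample to 32-bit float precision."""
    out = []
    for s in samples:
        v = math.tanh(2.0 * s)
        if single_precision:
            # Round-trip through float32 to mimic a lower-precision path.
            v = struct.unpack("<f", struct.pack("<f", v))[0]
        out.append(v)
    return out


def audio_fingerprint(single_precision: bool) -> str:
    """Hash the processed signal into a short identifier."""
    # 10 kHz, 44.1 kHz, and 512 samples are arbitrary choices (assumption).
    signal = process(oscillator(10_000.0, 44_100, 512), single_precision)
    raw = struct.pack(f"<{len(signal)}d", *signal)
    return hashlib.sha256(raw).hexdigest()[:16]


print(audio_fingerprint(True) != audio_fingerprint(False))  # True
```

No sound is ever played and no microphone is involved; the signal exists only as numbers, and tiny numerical differences in how it is processed are what would separate one machine from another.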

It feels like those of us who value a modicum of privacy online are losing a battle against advertising and marketing technology companies. Users overwhelmingly distrust how Google and Facebook handle their personal information; imagine how they’d react if they found out that a bunch of smaller companies they’ve never heard of are also collecting vast amounts of data.

These smaller companies are held to a different set of standards than a giant like Google because almost nobody knows they exist. Which websites they’re on, what information they collect, and how that information is used often remain a complete mystery. These companies will tell critics that users can always opt out, but it’s hard to opt out of something when its existence isn’t disclosed.

We need a stronger set of rules regarding the collection and use of personal information. Automatic opt-in should not be the default, and it ought to be significantly easier to find out what information is collected and how it’s being used.