Kashmir Hill summarizes for Fusion a study by Steven Hill, et al. (PDF):
“In many online communities, it is the norm to redact names and other sensitive text from posted screen shots,” write the researchers, specifically citing Reddit. “Mosaicing and blurring have also been used for the redaction of high-profile government documents and celebrity social media.”
They should probably stop doing that. The UC-San Diego researchers found that they could use statistical models—“so-called hidden Markov models”—to generate the blurring or pixelation of lots of numbers, letters, and words, to the point that their software program could match a known redaction to an unknown redaction to figure out what it says. The biggest challenge is figuring out the font and size of the underlying text which the researchers need for their deciphering. They say it works better than a brute-force technique for deciphering pixelated images discussed by Dheera Venkatraman in 2007.
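To make the matching idea concrete, here is a minimal Python sketch of the simpler brute-force approach: pixelate every candidate string the same way and keep the one closest to the redacted image. It is not the researchers' hidden Markov model method. Everything in it is an assumption for illustration: the 3×5 digit "font" is made up, and the block size and text length are taken as known, which sidesteps the hard part of guessing the real font and size.

```python
import itertools

# Hypothetical 3x5 bitmap font for digits; a real attack has to guess the
# actual font and size used in the screenshot.
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "6": ["111", "100", "111", "101", "111"],
    "7": ["111", "001", "001", "001", "001"],
    "8": ["111", "101", "111", "101", "111"],
    "9": ["111", "101", "111", "001", "001"],
}


def render(text):
    """Draw text as 5 rows of 0/1 pixels, with one blank column after each glyph."""
    rows = [[] for _ in range(5)]
    for ch in text:
        for r in range(5):
            rows[r].extend(int(p) for p in GLYPHS[ch][r])
            rows[r].append(0)
    return rows


def mosaic(image, block):
    """Pixelate: replace each block x block tile with its average brightness."""
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            tile = [
                image[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out


def distance(a, b):
    """Sum of squared differences between two equally sized mosaics."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def recover(redacted, length, block):
    """Try every digit string of the given length; return the closest match."""
    candidates = ("".join(d) for d in itertools.product("0123456789", repeat=length))
    return min(candidates, key=lambda c: distance(mosaic(render(c), block), redacted))


redacted = mosaic(render("907"), block=2)      # all the attacker gets to see
print(recover(redacted, length=3, block=2))    # prints 907
```

The mosaic is deterministic, so anyone who can reproduce the rendering can reproduce the redaction. With this toy font every digit leaves a distinct pattern of tile averages, which is why the search lands on the right answer. Real fonts and coarser blocks produce more ambiguity, and that is the gap the statistical approach is meant to close.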
There’s a great reason why intelligence agencies redact documents by placing an oversized black bar on top of the text in question, then printing and scanning the document to make the covered text unrecoverable. The New York Times skipped the printing and scanning steps in 2014, and it led to the unintentional exposure of sensitive information from a Snowden-leaked NSA document.