How to black out text in a PDF document correctly?

Miłosz Cybowski | Michał Błażejewicz

Document redaction has many faces. Covering the text in a document may look 100% effective at first glance and still turn out to be a trap that exposes company data. What should you look for when redacting documents?

All the officer patients in the ward were forced to censor letters written by all the enlisted-men patients (...). It was a monotonous job, and Yossarian was disappointed to learn that the lives of enlisted men were only slightly more interesting than the lives of officers. (...) To break the monotony he invented games. (...) One time he blacked out all but the salutation 'Dear Mary' from a letter (...).

The above quote from Joseph Heller's novel “Catch-22” shows, in a deliberately mocking context, the classic practice of sanitizing documents: an unwanted fragment of text is simply blacked out with a wide black pen. Blacking out is still one of the methods of data protection today, but documents containing confidential content are now also sent digitally. In digital form this method carries serious risks, and the correct use of the “digital pen” often has to be learned. Documents are blacked out for a variety of purposes, and nobody would laugh if their improperly redacted financial data fell into the wrong hands. Personal data controllers and processors also face severe penalties under the General Data Protection Regulation (GDPR) if anonymity is breached. Blacking out documents properly is therefore crucial.

Blacking out transforms the data in a digital document so that the original content, for example the identity of a person or a specific piece of information, can no longer be read, by placing a black stripe where the text appears. Most often this is done with the options built into popular software, although sometimes special tools that promise professional redaction are chosen instead. As many real-life examples show, redacting is not always as easy as it may seem (consider the American report on the death of General Nicola Calipari, for example).

Where do problems with blacking out electronic documents come from?

Problems caused by insufficient or incompetent redaction fall into two categories. The first is a lack of awareness that many files contain properties (metadata) in addition to the main content we see on the screen. As a result, even if we remove the most obvious and visible information from the content of the document, it may still be possible to recover it from the data stored in the properties of the file itself. This is particularly important when sharing entire files, not just their content. There are, of course, adequate protection methods. One of the options offered by the FORDATA system is to let users only view the content of documents, without being able to download them. In this way, even if confidential information has been saved in the file properties, it will not be available to the viewer.
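To make the first category concrete, the sketch below scans the raw bytes of a PDF for entries of the document information dictionary (keys such as /Author or /Title). It is a minimal illustration using only the Python standard library, not a complete metadata reader: the function name find_info_entries and the sample bytes are made up for this example, and it assumes the metadata is stored as uncompressed, unencrypted literal strings. Real files may also keep metadata in XMP packets or compressed objects, which this sketch does not read.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def find_info_entries(pdf_bytes: bytes) -> dict:
    """Return Info-dictionary entries stored as plain literal strings.

    Assumption: entries look like `/Author (Jane Doe)` with no nested or
    escaped parentheses. Hex strings, XMP and compressed objects are ignored.
    """
    found = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()\\]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical fragment of a PDF file: the visible page may be redacted,
# yet the properties still name the title and the author.
sample = (
    b"1 0 obj\n"
    b"<< /Title (Loan agreement) /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>\n"
    b"endobj\n"
)
entries = find_info_entries(sample)
print(entries)
```

If such a scan returns anything that should not leave the company, the properties need to be cleaned before the file is shared.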

The second category is ineffective blacking out of the text of the document itself. Many tools do not remove the relevant parts of the text but merely cover them. This does not affect the content, which is still there under the black stripe. As a result, selecting the text and copying it into another file is enough to reveal the hidden information. The same applies to other attempts to hide content, such as changing the background color of the text to black or changing the font color to white. The content will be invisible to the eye, but selecting it and using copy and paste is all it takes to read the ineffectively hidden data.
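The difference between covering and removing can be shown on a page content stream, the list of drawing instructions that a PDF page consists of. The sketch below is a simplified illustration, not a redaction tool: the helper names and the sample stream are hypothetical, and it assumes an uncompressed stream with simple literal strings. Drawing a black rectangle only appends instructions; the text instruction underneath remains, and that is what copy and paste reads.

```python
import re

# Hypothetical, uncompressed page content stream with one line of text.
page = b"BT /F1 12 Tf 72 720 Td (IBAN: PL00 1234 5678) Tj ET\n"

# Simplified pattern for "show this literal string" (the Tj operator).
SHOW_TEXT = rb"\(([^()\\]*)\)\s*Tj"


def extract_text(stream: bytes) -> list:
    """Collect literal strings shown with the Tj operator (simplified)."""
    return [m.decode("latin-1") for m in re.findall(SHOW_TEXT, stream)]


def cover_with_black_box(stream: bytes) -> bytes:
    """Ineffective 'redaction': paint a filled black rectangle over the text."""
    return stream + b"0 0 0 rg 70 716 200 16 re f\n"


def remove_text(stream: bytes) -> bytes:
    """Effective for this toy stream: empty the string, then paint the box."""
    stripped = re.sub(SHOW_TEXT, b"() Tj", stream)
    return cover_with_black_box(stripped)


covered = cover_with_black_box(page)
removed = remove_text(page)

print(extract_text(covered))  # the text is still there under the box
print(extract_text(removed))  # only an empty string is left
```

Both variants look identical on screen, which is exactly why a visual check alone says nothing about whether the data is gone.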

What does blacking out (of text) in PDF documents look like?

And this is not all. When we black out text by covering fragments with a black stripe, what we are really doing is adding another layer to the file. This means that even if the viewer does not have access to the original file (e.g. it is made available in the VDR system in read-only mode), the mechanism that loads the document may briefly display content that was meant to be invisible. This can happen because many layers are stacked in one file, and the system loads them from the lowest one (i.e. the original content of the document) to the ones on top (i.e. elements added later, such as the black stripe itself).

PDF anonymization - how to effectively black out documents?

Remember that to black out data in a digital document effectively, it is not enough to obscure the content; the content has to be deleted. Until we are sure that it has been removed, we cannot consider the redaction successful.
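One way to gain that certainty is to check the saved file for the strings that were meant to disappear. The following sketch, using only the Python standard library, searches the raw bytes and any Flate-compressed streams it manages to decompress. The function name leaked_terms and the sample data are hypothetical, and the check is deliberately naive: it assumes an unencrypted file and text stored in a simple single-byte encoding, so an empty result is a hint rather than proof that nothing leaks.

```python
import re
import zlib


def leaked_terms(pdf_bytes: bytes, terms: list) -> list:
    """Return the terms that still occur in the file after redaction.

    Naive check: looks at the raw bytes plus every stream that zlib can
    inflate. Encrypted files and custom font encodings are not handled,
    and terms must be representable in Latin-1.
    """
    haystacks = [pdf_bytes]
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            haystacks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate-compressed; the raw bytes are already searched
    return [t for t in terms if any(t.encode("latin-1") in h for h in haystacks)]


# Hypothetical example: a compressed content stream that still holds the name
# underneath a black rectangle.
content = zlib.compress(b"BT (Jan Kowalski) Tj ET 0 0 0 rg 70 716 200 16 re f")
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + content
    + b"\nendstream\nendobj\n"
)

print(leaked_terms(fake_pdf, ["Jan Kowalski", "Anna Nowak"]))
```

A term that shows up in the result was covered rather than removed, and the file should not be shared in that state.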

Adobe Acrobat Pro DC (the paid version of the most popular PDF viewer) has a built-in feature for redacting PDF content. After we select the appropriate words, fragments of text or entire pages, the program removes this content from the document. Once the file has been saved and reopened, the deleted content can no longer be reached, so the redacted file can be freely shared with third parties. Note, however, that the document properties may still contain additional information about the document, and it is worth deleting that too.

If we do not have Adobe Acrobat Pro DC and / or we have only a small number of documents to redact, we can do it manually by exporting the document to an image, e.g. a jpg file. We then open the resulting graphic file in a program that allows basic image editing (e.g. IrfanView, GIMP, or even Paint) and use the selection tool to cut out or paint over the fragments that contain sensitive data. The modified file is then saved in a graphic format or converted back to the original format. Files prepared in this way are also ready to be securely loaded into the Virtual Data Room system. We write more on how FORDATA VDR can become a company document repository in the article “What is an electronic document repository”.
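Why the image route works can be sketched on a raw pixel buffer: once the page is an image there is no text layer, and painting over a region overwrites the pixel values themselves. The example below is a minimal illustration with a hypothetical black_out helper working on a flat RGB byte buffer; an image editor performs the same operation for us. It assumes the result is saved as a flattened image, since a format that keeps layers would preserve the original pixels.

```python
def black_out(pixels: bytearray, width: int, box: tuple) -> None:
    """Overwrite the rectangle box=(left, top, right, bottom) with black, in place.

    `pixels` is a flat RGB buffer (3 bytes per pixel, stored row by row);
    right and bottom are exclusive.
    """
    left, top, right, bottom = box
    for y in range(top, bottom):
        start = (y * width + left) * 3
        end = (y * width + right) * 3
        pixels[start:end] = bytes(end - start)  # zero bytes are black pixels


# Hypothetical 4x2 image filled with white pixels.
width, height = 4, 2
image = bytearray(b"\xff" * (width * height * 3))

# Cover the two middle columns in both rows.
black_out(image, width, (1, 0, 3, 2))
```

After the call the original values in the covered region no longer exist anywhere in the buffer, which is the property that an overlay inside a PDF lacks.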

Proper redaction of electronic documents can be difficult. Before undertaking this task, we should therefore test the available solutions and make sure that they work flawlessly. After all, even the safest channel for exchanging documentation will not fulfill its role if the content of, for example, a PDF file containing confidential personal information protected by GDPR falls victim to a misplaced “black pen”.

