27 . 02 . 2020

SECURITY How to black out text in a PDF document correctly?

Document redaction has many faces: overwriting the text in a document may look 100% effective at first glance and still be a trap that exposes company data. What should you look for when redacting documents?

“All the officer patients in the ward were forced to censor letters written by all the enlisted-men patients (…). It was a monotonous job, and Yossarian was disappointed to learn that the lives of enlisted men were only slightly more interesting than the lives of officers. (…) To break the monotony he invented games. (…) One time he blacked out all but the salutation ‘Dear Mary’ from a letter (…).”

The above quote from Joseph Heller's novel “Catch-22” shows, in a deliberately mocking context, the classic practice of sanitizing documents: an unwanted fragment of text is simply blacked out with a wide black pen. We still use blackout as a data protection method today, but we now also send documents with confidential content digitally. In digital form this method carries serious risks, and the correct way of using the “digital pen” often has to be learned. Document blackening serves a variety of purposes. No one would laugh if their improperly redacted financial data got into the wrong hands. Data controllers and processors also face severe penalties under the General Data Protection Regulation (GDPR) if anonymization fails. Using blackening properly is therefore crucial.

Blackening transforms data in a digital document so that the original content, for example a person's identity or a specific piece of information, can no longer be read. This is done by applying a black stripe where the text appears. Most often the options built into popular software are used, but sometimes users choose special tools that promise professional redaction instead. As many real-life examples show, redacting is not always as easy as it may seem (consider the American report on the death of General Nicola Calipari, for example).

Where do problems with blackening of electronic documents come from?

Problems related to insufficient or incompetent blackening of documents fall into two categories. The first is a lack of awareness that many files contain properties (metadata) in addition to the main content we see on the screen. As a result, even if we remove the most obvious and visible information from the content of the document, it may still be possible to recover it from the data stored in the properties of the file itself. This is particularly important when sharing entire files, not just their content. There are, of course, adequate protection methods. One of the options offered by the FORDATA system is to let users only view the content of documents, without being able to download them. In this way, even if confidential information has been saved in the file properties, it will not be available to the viewer.
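To make the point about file properties concrete, here is a minimal sketch in Python of how such properties can be spotted in a PDF. It assumes the metadata sits in an uncompressed document information dictionary written as plain literal strings, which is not always the case. The function name and the sample bytes are hypothetical and used only for illustration.

```python
import re

# Common keys of the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def find_info_entries(pdf_bytes: bytes) -> dict[str, str]:
    """Return Info-style entries that are stored as plain literal strings.

    Heuristic only (an assumption of this sketch): it misses entries that
    are hex-encoded, kept in compressed object streams, or stored as XMP
    metadata, and it does not handle nested or escaped parentheses.
    """
    found = {}
    for key in INFO_KEYS:
        # Matches e.g. "/Author (Jane Doe)".
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Made-up fragment of a PDF file, not taken from a real document.
sample = (
    b"1 0 obj\n"
    b"<< /Title (Q3 offer) /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>\n"
    b"endobj\n"
)

print(find_info_entries(sample))
```

An empty result from this check does not prove that a file is free of metadata. It only shows that nothing was found in the simplest storage form.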

The second category of problems associated with redaction is ineffective blackening of the text of the document itself. Many tools do not remove the relevant parts of the text but only cover them. This does not affect the content itself, which is still there under the applied blackening. As a result, simply selecting the text and copying it to another file is enough to reveal the hidden information. The same applies to other attempts to hide content, such as changing the background color of the displayed text to black or changing the font color to white. The content will be invisible to the eye, but selecting it and using copy / paste is all it takes to read the ineffectively hidden data.
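The difference between covering and removing can be sketched with a simplified page content stream. The example assumes an uncompressed stream in which text is drawn with the Tj operator and the black box with the re and f operators. Real files are usually compressed and more varied, and the function name and the sample data are hypothetical.

```python
import re


def shown_strings(content_stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator.

    Simplified on purpose: escaped or nested parentheses and the TJ array
    operator are not handled.
    """
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", content_stream)
    return [m.decode("latin-1") for m in matches]


# The text is drawn first, then a black rectangle is painted over it.
covered = (
    b"BT /F1 12 Tf 72 700 Td (Account: 12 3456 7890) Tj ET\n"
    b"0 g 70 695 200 16 re f\n"
)

# The same page after real redaction: the text operator is gone,
# only the black box remains.
redacted = b"0 g 70 695 200 16 re f\n"

print(shown_strings(covered))   # ['Account: 12 3456 7890']
print(shown_strings(redacted))  # []
```

In the covered variant the account number is still part of the file, so any tool that reads the text operators, including plain copy and paste, can recover it.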

What does blacking out (of text) in PDF documents look like?

And this is not all. When we blacken text by covering fragments with a black stripe, what we are really doing is adding another layer to the file. This means that even if the user does not have access to the original file (e.g. it is made available in the VDR system in read-only mode), the mechanism that loads the document may briefly “display” content that was meant to be invisible. This can happen because many layers are stacked in one file: the system loads them from the “lowest” one (i.e. the original content of the document) up to the ones on top (i.e. elements added later, such as the black stripe itself).

PDF anonymization - how to effectively black out documents?

It’s important to remember that effective redaction of specific data in digital documents requires not just concealing the content but removing it entirely. Until we are certain that the content has been removed, we cannot consider the anonymization successful. Censoring PDFs can serve as an example.
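One way to gain that certainty is to extract the text from the redacted file with any extraction tool and check that none of the sensitive terms are still present. The sketch below covers only the comparison step and assumes the text has already been extracted. The function names and the sample values are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).casefold()


def leaked_terms(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still occur in the extracted text."""
    haystack = _normalize(extracted_text)
    return [term for term in sensitive_terms if _normalize(term) in haystack]


# Made-up text, standing in for the output of a text extraction tool.
extracted = "Invoice for ACME  Ltd.\nTotal: [redacted]"

print(leaked_terms(extracted, ["Acme Ltd.", "4 500 EUR"]))  # ['Acme Ltd.']
```

An empty list is a necessary condition rather than proof of success: text stored as images or in file properties would not be caught by this check.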

Adobe Acrobat Pro DC (the paid version of the most popular PDF document viewer) includes a built-in feature for editing the content of PDF documents. With this feature, you select the appropriate words, text segments, or entire pages, and the program automatically removes that content from the document. After the file is saved and reopened, there is no way to access the removed content, so redacted files can be shared with third parties without any issues. However, additional information about the document can still be present in its properties, and it is advisable to remove it as well. Anonymizing PDF documents while preserving the original format is a task better suited to more advanced users.

Not all company policies allow the installation of additional software. Proper data redaction can undoubtedly pose challenges. Therefore, it’s worth relying on tested solutions that maintain high security standards. An example of such a solution is the Redaction Tool embedded in Fordata’s VDR, which enables the anonymization of .pdf files directly within the VDR, using a range of automation features. Thanks to it, VDR users can retain full control over their documents while eliminating the potential risk of confidential information leaks. Detailed information about the capabilities of the Redaction Tool can be found in the article “Welcome to Our In-Built Redaction Tool?”.

Taking it a step further, it’s worth considering a solution supported by artificial intelligence mechanisms – the AI-Powered Redaction Tool, which reduces time and increases precision in redacting sensitive information. The AI-Powered Redaction Tool embedded in Fordata VDR is designed to automatically detect and redact up to 19 different types of information, including all personal data compliant with GDPR (PII), financial and governmental data in global formats (PHI), as well as words in various grammatical forms in almost 80 languages. As a result, the anonymization process becomes more effective, contributing to maintaining privacy and data security on a wide scale.

The article was updated on 26.02.2024.
