Dynamic Data Masking is a set of techniques that attempt to protect direct identifiers. Common and defensible approaches include the following.
Variable suppression is the removal of direct identifiers from a data set. Suppression is used when data sets are disclosed for research or public health purposes. In these situations the identifying variables are not needed in the data set.
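As a minimal sketch of suppression, the snippet below drops identifier fields from each record before disclosure. The field names and sample records are hypothetical and only illustrate the idea.

```python
# Hypothetical field names; adjust to the data set at hand.
DIRECT_IDENTIFIERS = {"name", "email", "ssn"}

def suppress(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with the direct-identifier fields removed."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]

# Made-up sample data for illustration only.
records = [
    {"name": "Alice Smith", "email": "alice@example.com", "ssn": "000-00-0001",
     "age": 34, "diagnosis": "asthma"},
    {"name": "Bob Jones", "email": "bob@example.com", "ssn": "000-00-0002",
     "age": 51, "diagnosis": "diabetes"},
]

research_set = suppress(records)
```

The original records are left untouched, so the custodian keeps the full data while only the suppressed copy is released.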
Shuffling is a technique that takes a value from one record and replaces it with the value from a different record. The data set still contains real values, but they are assigned to different people.
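Shuffling a single field can be sketched as follows. The function name `shuffle_column` and the sample records are assumptions made for illustration, not part of any particular masking tool.

```python
import random

def shuffle_column(records, field, rng=None):
    """Return copies of the records with the values of one field permuted."""
    if rng is None:
        # SystemRandom draws from the operating system's entropy source.
        rng = random.SystemRandom()
    values = [record[field] for record in records]
    rng.shuffle(values)
    # Reassign the permuted values; every other field stays with its record.
    return [{**record, field: value} for record, value in zip(records, values)]

# Made-up sample data for illustration only.
records = [
    {"name": "Alice Smith", "age": 34},
    {"name": "Bob Jones", "age": 51},
    {"name": "Carol Lee", "age": 47},
]

shuffled = shuffle_column(records, "name")
```

A random permutation can leave some values in their original position, so a production tool would likely need an extra check for that case.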
Methods for Creating Pseudonyms
There are two general methods for creating pseudonyms. Both should be applied to values that are unique to the patient, such as SSNs or medical record numbers. The first method applies a one-way hash to the value using a secret key, and that key must then be protected. A one-way hash function converts a value into another value (the hash) that cannot feasibly be reversed to recover the original. This method has the advantage that the same pseudonym can be recreated later, for example for another data set. The second method assigns a random pseudonym, which cannot be reversed or reproduced in the future. Each approach is suited to different situations.
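Both methods can be sketched with Python's standard library. This is an illustration under stated assumptions: HMAC-SHA256 is used here as one common keyed one-way hash, and the function names and the sample identifier are hypothetical.

```python
import hashlib
import hmac
import secrets

def keyed_pseudonym(value: str, secret_key: bytes) -> str:
    """Method 1: keyed one-way hash. Same key and value give the same pseudonym."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def random_pseudonyms(values):
    """Method 2: assign each distinct value a random pseudonym."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = secrets.token_hex(16)
    return mapping

# The key must be stored securely; anyone holding it can recreate the pseudonyms.
key = secrets.token_bytes(32)
first = keyed_pseudonym("MRN-0001", key)    # hypothetical record number
again = keyed_pseudonym("MRN-0001", key)    # identical to `first`

# Random pseudonyms cannot be recreated unless the mapping itself is kept.
mapping = random_pseudonyms(["MRN-0001", "MRN-0002", "MRN-0001"])
```

The keyed hash is the option to reach for when records from a later extract must link to the same pseudonym. The random mapping fits one-off releases, where discarding the mapping removes any way back.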
Randomization keeps the identifier fields in the data set but replaces their values with fake or random values. When the procedure is carried out correctly, the likelihood of reversing the masked values is very low. A common use for randomization is creating data sets for software testing: data is pulled from production databases and masked before being sent to the development team. Because the data must follow a predetermined format, the fields are kept and should contain realistic-looking values.
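A rough sketch of randomization for test data is shown below. The name pools, the `ddd-dd-dddd` identifier format, and the field names are all assumptions chosen for illustration.

```python
import random

# Hypothetical pools of fake values; real tools ship much larger lists.
FAKE_FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley"]
FAKE_LAST_NAMES = ["Carter", "Nguyen", "Okafor", "Silva"]

def randomize_record(record, rng):
    """Return a copy with identifier fields replaced by format-preserving fakes."""
    masked = dict(record)
    masked["name"] = f"{rng.choice(FAKE_FIRST_NAMES)} {rng.choice(FAKE_LAST_NAMES)}"
    # Keep the ddd-dd-dddd layout so software that validates the format still works.
    masked["ssn"] = "{:03d}-{:02d}-{:04d}".format(
        rng.randrange(1000), rng.randrange(100), rng.randrange(10000)
    )
    return masked

# SystemRandom avoids a predictable seed, so the fake values cannot be replayed.
rng = random.SystemRandom()
production_row = {"name": "Alice Smith", "ssn": "000-00-0001", "visits": 3}
test_row = randomize_record(production_row, rng)
```

Fields that are not identifiers, such as `visits` here, pass through unchanged, so the test data keeps the shape the application expects.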
Masking Tools That Do Not Provide Meaningful Protection
Some masking tools offer techniques that do not provide meaningful protection. These include the following.
Noise addition is applied to continuous variables. The problem is that many methods exist for removing noise from data. An adversary can use a filter to strip out the noise and recover values close to the originals, and many types of filters are available in the signal processing domain.
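The weakness of noise addition can be illustrated with a simple moving-average filter. This is only a sketch: it assumes the masked values form an ordered, smoothly varying series, and the signal, noise level, and window size are made up for the demonstration.

```python
import random

def add_noise(values, sigma, rng):
    """Mask a continuous variable by adding zero-mean Gaussian noise."""
    return [v + rng.gauss(0, sigma) for v in values]

def moving_average(values, window=9):
    """Smooth a series by averaging each point with its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed so the demonstration is repeatable
original = [100 + 0.5 * i for i in range(200)]   # made-up smooth series
noisy = add_noise(original, sigma=5.0, rng=rng)  # the "masked" release
recovered = moving_average(noisy)                # the adversary's estimate

error_before = mean_abs_error(noisy, original)
error_after = mean_abs_error(recovered, original)
```

Comparing `error_after` with `error_before` shows how much of the added noise the filter removes. More capable filters from signal processing would be expected to do better than this basic one.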
Character scrambling rearranges the order of the characters in a field, such as NURSE being scrambled into RSUNE. This is simple to reverse.
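One way to see why scrambling is easy to reverse is that it preserves the letters themselves, so a scrambled value can be matched against a list of known words. The word list below is a hypothetical stand-in for a real dictionary or name list.

```python
def unscramble(scrambled, dictionary):
    """Return dictionary words that use exactly the same letters as the input."""
    key = sorted(scrambled.upper())
    return [word for word in dictionary if sorted(word.upper()) == key]

# Hypothetical list of candidate words an adversary might hold.
KNOWN_WORDS = ["NURSE", "DOCTOR", "SMITH", "JONES"]

candidates = unscramble("RSUNE", KNOWN_WORDS)  # → ["NURSE"]
```

When several words share the same letters the attacker gets a short list rather than a single answer, which is still very little protection.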
Truncation is a variant of character masking in which the last few characters are removed rather than replaced. It carries the same risks as character masking. For example, after the last characters of a surname are removed, roughly 67% of the names may still be unique.
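The effect of truncation on uniqueness can be measured directly. The sketch below uses a small made-up list of surnames, so its numbers say nothing about real populations; the helper names are hypothetical.

```python
from collections import Counter

def truncate(name, n=1):
    """Remove the last n characters (n >= 1), keeping at least one character."""
    return name[:max(len(name) - n, 1)]

def unique_fraction(names):
    """Fraction of entries whose value appears exactly once in the list."""
    counts = Counter(names)
    return sum(1 for name in names if counts[name] == 1) / len(names)

# Made-up sample of surnames for illustration only.
surnames = ["SMITH", "SMYTHE", "JONES", "JONAS", "BROWN", "BROWNE", "CLARK", "CLARKE"]

after_one = unique_fraction([truncate(s, 1) for s in surnames])  # → 1.0
after_two = unique_fraction([truncate(s, 2) for s in surnames])  # → 0.75
```

In this sample, removing one character leaves every name unique, and removing two still leaves three quarters of them unique, so the truncated values remain highly identifying.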
Encoding is the replacement of a value with another, meaningless value. This must be done with care, because it is easy to run a frequency analysis and work out the names from how often they appear. For example, in a multiracial data set the most frequent name is likely to be SMITH. Encoding should therefore be used only when creating pseudonyms for unique values, not as a general masking function.
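A frequency analysis of this kind can be sketched in a few lines. The codes and the name ranking below are hypothetical; an adversary would take the ranking from public statistics on name frequency.

```python
from collections import Counter

def frequency_attack(encoded_values, names_by_rank):
    """Guess the name behind each code by matching frequency ranks."""
    ranked_codes = [code for code, _ in Counter(encoded_values).most_common()]
    return dict(zip(ranked_codes, names_by_rank))

# Made-up encoded surname column: each surname was replaced by a meaningless code.
encoded_column = ["X1"] * 5 + ["Q7"] * 3 + ["Z9"] * 1

# Hypothetical ranking of surnames from most to least common.
public_ranking = ["SMITH", "JOHNSON", "WILLIAMS"]

guesses = frequency_attack(encoded_column, public_ranking)
# guesses["X1"] == "SMITH": the most frequent code maps to the most common name
```

The attack fails when every value is unique, because all frequencies are then equal. That is why encoding is reasonable for pseudonyms on unique values but not as a general masking function.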
Non-Protective Masking Techniques
These techniques should not be used in practice. A data custodian who relies on them puts the security of the data at risk.
Masking techniques that are protective can significantly reduce data utility. Masking should therefore be applied only to fields that will not be used in data analysis. These are usually the direct identifiers, such as names and email addresses, which are not relevant to any analysis. Masking techniques should not be applied to dates or geographic information, because these fields are frequently used in analysis and masking them would make effective analysis very difficult.
Dynamic Data Masking also depends on the characteristics of the different field types and variables. For example, different algorithms are applied to birth dates than to zip codes. Many data sets include both direct identifiers and quasi-identifiers. In such cases, it is recommended to apply both masking and de-identification to protect sensitive data.