Most small European companies store their operational data in a variety of databases. These can be as simple as an Excel file in a physical shop that only needs its suppliers' contact details, or as complex as a MySQL server, a relational database management system that can host many databases at the same time. They can become even more intricate when companies rely on open source software. For example, the popular e-commerce solution “PrestaShop” creates a database with more than 200 tables at installation, and the schema created by “Moodle” (a well-known e-learning platform that supports managing groups of users) is far from simple.
Moreover, small companies usually need to focus on their core business and cannot devote much effort to understanding the data stored in their systems. This is even more worrying for online companies, which cannot manually check the data that users enter into their systems.
Automatic detection
Consequently, identifying personal data, and specifically the special categories of personal data protected by the GDPR, can become a very complex task even for experts, who would need to navigate hundreds of tables and formats to find it. Automating this process can therefore help companies understand the data they are storing.
However, automatically identifying personal data in heterogeneous data silos is a challenging task. Personal data covers very different kinds of information, ranging from a name or an email address to a person's religious beliefs.
Several families of algorithms can be used to identify this kind of data. In particular:
Regular expressions
Regular expressions are patterns used to find specific combinations of characters in text. For example, we know that telephone numbers in Spain usually start with 6, 8 or 9 and are 9 digits long. Likewise, an email address usually has the form UserName@domain.com, so we can write regular expressions to identify both. For instance, a widely used regular expression for finding email addresses is:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Visit https://regexr.com/ to experiment with different regular expressions.
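As a rough illustration of the idea, the following Python sketch scans a piece of text for Spanish-style phone numbers and email addresses. The two patterns are deliberately simplified assumptions made for this example (they are much less thorough than the full email expression shown above), and the function name find_personal_data is hypothetical.

```python
import re

# Simplified patterns for illustration only.
# Spanish phone: exactly 9 digits, first digit 6, 8 or 9 (as described in the text).
PHONE_ES = re.compile(r"(?<!\d)[689]\d{8}(?!\d)")
# Email: rough "user@domain.tld" shape, not a full RFC-compliant check.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_personal_data(text: str) -> dict:
    """Return every phone-like and email-like substring found in text."""
    return {
        "phones": PHONE_ES.findall(text),
        "emails": EMAIL.findall(text),
    }


sample = "Contact Ana at ana.lopez@example.com or call 612345678."
print(find_personal_data(sample))
```

The lookarounds around the phone pattern keep it from matching 9 digits in the middle of a longer number, such as an order identifier. A production system would likely need stricter patterns and per-country variants.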
Dictionaries
Other kinds of data cannot be identified using regular expressions. For example, people's names do not follow any well-known pattern. However, most people in each country share a relatively small number of names and surnames. A dictionary can therefore be created containing the most common names and surnames in a region. That way, if most of the values in a column of a database table appear in the dictionary, it is reasonable to infer that the column contains real names of users.
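The dictionary check can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny COMMON_NAMES set stands in for a real regional dictionary with thousands of entries, and the 0.5 threshold is an arbitrary reading of "most of the elements".

```python
# Hypothetical mini-dictionary; a real one would hold thousands of names
# and surnames for the target region.
COMMON_NAMES = {"maria", "jose", "antonio", "carmen", "garcia", "lopez"}


def name_ratio(values, dictionary=COMMON_NAMES):
    """Fraction of non-empty values whose words are all in the dictionary."""
    cleaned = [str(v).strip().lower() for v in values if str(v).strip()]
    if not cleaned:
        return 0.0
    hits = sum(all(tok in dictionary for tok in v.split()) for v in cleaned)
    return hits / len(cleaned)


def looks_like_names(values, threshold=0.5):
    """Flag a set of values when more than `threshold` of them are known names."""
    return name_ratio(values) > threshold


first_names = ["Maria", "Jose", "Carmen", "Zoltan"]
prices = ["19.99", "5.00", "12.50"]
print(looks_like_names(first_names))  # 3 of the 4 values are in the dictionary
print(looks_like_names(prices))       # none of the values are in the dictionary
```

Using a ratio rather than requiring every value to match tolerates uncommon names, at the cost of occasional false positives on fields that happen to contain name-like words.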
Natural Language Processing
Finally, other types of protected information are more difficult to identify because of their complexity. For instance, the sexual orientation or religious beliefs of users (which may be of interest to certain businesses) can be expressed in completely different ways.
Natural Language Processing (NLP) is a branch of AI focused on understanding text. It includes a wide range of algorithms, from those that extract the most common words for a specific topic to those that classify pieces of text. All of them can be used to identify these kinds of personal data.
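To make text classification more concrete, here is a minimal sketch of one simple approach, a multinomial Naive Bayes classifier written with the Python standard library only. The choice of Naive Bayes is an assumption for this example, not a method named in the article, and the four training sentences and the "religion" / "other" labels are invented toy data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of texts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy training data, for illustration only.
train_texts = [
    "I go to church every Sunday and pray",
    "We fast during Ramadan and visit the mosque",
    "Please ship my order to the new address",
    "The invoice for the laptop is attached",
]
train_labels = ["religion", "religion", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("She will pray at the mosque on Sunday"))
print(model.predict("Where is the invoice for my order"))
```

With this toy data the first sentence is labelled "religion" and the second "other". A realistic detector would need far more training text, and probably a dedicated NLP library, before its predictions could be trusted.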
In conclusion, ensuring full compliance with the GDPR requires automatic systems capable of identifying personal data in datasets. Fortunately, although the task is complex, state-of-the-art technology can help us meet this obligation.
Written by: Roberto Gonzalez