Building a Robust Machine Learning Model for PII Detection

Protect Private Information with this ML Model - Open Source For You

Here’s a simple machine learning model to help detect the personally identifiable information (PII) of your customers and keep it confidential. Data is a valuable resource, and an organisation that has access to a large amount of customer data must be very careful about its privacy and confidentiality.

To ensure that no PII leaves the organisation, one popular strategy is to identify and extract this information in each file and mask it, either with cryptography or with random characters. There are several ways to extract this information. If the data is always in a specific format, methods like regular expressions can be used. If that is not the case, but we have enough data, we can build a machine learning model. Generative AI models, such as those offered by OpenAI, can be used if we don’t have enough data to train a model, but this is an expensive solution.
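The fixed-format case mentioned above can be sketched with Python’s built-in re module. This is only an illustration: the two patterns and the mask_pii helper are assumptions made for this example, and they mask with a fixed character rather than with cryptography or random characters.

```python
import re

# Hypothetical patterns for two fixed formats; real data will need its own.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def mask_pii(text: str, mask_char: str = "*") -> str:
    """Replace each email address or 16-digit card number with mask characters."""
    for pattern in (EMAIL_RE, CARD_RE):
        # Keep the original length so the layout of the text is preserved.
        text = pattern.sub(lambda m: mask_char * len(m.group()), text)
    return text


sample = "Contact jane@example.com, card 4111 1111 1111 1111."
masked = mask_pii(sample)
print(masked)
```

Patterns like these only work while the format stays predictable, which is why the rest of this article builds a model instead.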

Building an ML Model for Detecting PII

Let’s learn how to build an ML model for detecting PII. PII is information that can be used to identify an individual, such as a name, phone number, government identity number or address. One strategy is to build and deploy a separate model for each of these categories. We will follow a simpler approach and build a generic model that detects any kind of PII.

Let’s get started. I am using a Google Colab notebook as the IDE here. You can also use a local Jupyter Notebook, an Anaconda environment, or the Jupyter extension for Visual Studio Code for this experiment.

If you are using Google Colab notebooks like me, open Colab in your browser and click on ‘New notebook’.

The first step is to import the required libraries. We will use the pandas and scikit-learn (sklearn) libraries to build the machine learning model. You can use the following code for this:
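The original import cell is not reproduced here. A plausible version, assuming the model is built with the random forest and decision tree classifiers discussed later and evaluated on a held-out split, might look like this:

```python
# Assumed imports; the exact list depends on which algorithms you try.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

print("pandas version:", pd.__version__)
```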

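If you don’t have the libraries already installed, install them first with pip, for example: pip install pandas scikit-learn (prefix the command with ! when running it inside a notebook cell).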

I have taken a sample dataset that covers a few of these categories, like name, credit card number, phone number, date of birth, etc. You can create one on your own by writing each category in several different formats. Then load the dataset and view it as shown below.
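The sample dataset itself is not included here, so the sketch below builds a tiny stand-in by hand. The column names text, is_pii and pii_type, the email category and every row are assumptions made for illustration; a real dataset would be far larger and would typically be loaded from a file with pd.read_csv.

```python
import pandas as pd

# Hypothetical toy rows: (text, is_pii, pii_type).
rows = [
    ("John Smith", 1, "name"),
    ("Priya Sharma", 1, "name"),
    ("4111 1111 1111 1111", 1, "credit_card"),
    ("5500000000000004", 1, "credit_card"),
    ("9876543210", 1, "phone"),
    ("202-555-0143", 1, "phone"),
    ("12/05/1990", 1, "date_of_birth"),
    ("jane@example.com", 1, "email"),
    ("invoice total", 0, "none"),
    ("blue widget", 0, "none"),
    ("42.50", 0, "none"),
    ("ok", 0, "none"),
]
df = pd.DataFrame(rows, columns=["text", "is_pii", "pii_type"])

# View the first rows and the number of examples per category.
print(df.head())
print(df["pii_type"].value_counts())
```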

In the data science life cycle, the first step after collecting and building the dataset is to extract the features. The first feature we are using here is the length of the string. We can do this using the following code.
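A minimal sketch of the length feature, assuming the strings live in a column named text (a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame({"text": ["John Smith", "9876543210", "jane@example.com"]})

# Number of characters in each string.
df["length"] = df["text"].str.len()
print(df)
```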

Similarly, we can extract a few more features, for example, whether the string contains an ‘@’, contains spaces, contains a ‘/’, is a number, is a decimal, and so on. The following code can be used to do this, and our dataset then looks as shown below.
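One way to derive the features listed above with pandas is sketched here. The column names and the is_decimal helper are assumptions; in particular, "is a decimal" is interpreted as "parses as a float and contains a decimal point".

```python
import pandas as pd


def is_decimal(value: str) -> bool:
    """True when the string parses as a float and contains a decimal point."""
    try:
        float(value)
    except ValueError:
        return False
    return "." in value


df = pd.DataFrame(
    {"text": ["jane@example.com", "John Smith", "12/05/1990", "9876543210", "42.50"]}
)
text = df["text"]

df["length"] = text.str.len()
df["has_at"] = text.str.contains("@", regex=False).astype(int)
df["has_space"] = text.str.contains(" ", regex=False).astype(int)
df["has_slash"] = text.str.contains("/", regex=False).astype(int)
df["is_number"] = text.str.isdigit().astype(int)      # digits only
df["is_decimal"] = text.map(is_decimal).astype(int)

print(df)
```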

Now that the features are extracted and built, the next step is to train the model. We are using the Sklearn library for the machine learning models. The following lines are used for assigning the features:
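Assigning the features usually means separating the input columns (X) from the label column (y). The sketch below does this on a hypothetical toy dataset; add_features, FEATURE_COLUMNS and the is_pii label are names introduced for this example only.

```python
import pandas as pd

# Hypothetical toy data: 8 PII strings followed by 4 non-PII strings.
data = pd.DataFrame({
    "text": [
        "John Smith", "Priya Sharma", "4111 1111 1111 1111", "5500000000000004",
        "9876543210", "202-555-0143", "12/05/1990", "jane@example.com",
        "invoice total", "blue widget", "42.50", "ok",
    ],
    "is_pii": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with simple character-level feature columns."""
    out = frame.copy()
    text = out["text"]
    out["length"] = text.str.len()
    out["has_at"] = text.str.contains("@", regex=False).astype(int)
    out["has_space"] = text.str.contains(" ", regex=False).astype(int)
    out["has_slash"] = text.str.contains("/", regex=False).astype(int)
    out["is_number"] = text.str.isdigit().astype(int)
    return out


FEATURE_COLUMNS = ["length", "has_at", "has_space", "has_slash", "is_number"]

df = add_features(data)
X = df[FEATURE_COLUMNS]  # input features
y = df["is_pii"]         # target label: 1 = PII, 0 = not PII
print(X.shape, y.shape)
```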

For this particular use case, after trying several classification algorithms, I found that the random forest and decision tree algorithms gave the best results, with close to 94% accuracy. The following code is used to train the model using the given dataset and algorithm.
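The training step could look roughly like the sketch below, which fits both classifiers on a hypothetical toy dataset and scores them on a held-out split. The data, the split and the hyperparameters are assumptions, and with so few rows the printed accuracy is not meaningful and will not reproduce the figure quoted above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 8 PII strings followed by 4 non-PII strings.
df = pd.DataFrame({
    "text": [
        "John Smith", "Priya Sharma", "4111 1111 1111 1111", "5500000000000004",
        "9876543210", "202-555-0143", "12/05/1990", "jane@example.com",
        "invoice total", "blue widget", "42.50", "ok",
    ],
    "is_pii": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Same style of character-level features as described earlier.
text = df["text"]
df["length"] = text.str.len()
df["has_at"] = text.str.contains("@", regex=False).astype(int)
df["has_space"] = text.str.contains(" ", regex=False).astype(int)
df["has_slash"] = text.str.contains("/", regex=False).astype(int)
df["is_number"] = text.str.isdigit().astype(int)

X = df[["length", "has_at", "has_space", "has_slash", "is_number"]]
y = df["is_pii"]

# Hold out a quarter of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, "accuracy:", scores[name])
```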

We can save this model as a pickle file or a .pmml file and use it in Python or Java code, respectively. For that, use the following code:
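The pickle route can be sketched with Python’s standard pickle module. The tiny training set and the file name pii_model.pkl are stand-ins chosen for this example. PMML export is not shown because it needs a separate converter library (sklearn2pmml is one such option).

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier

# Stand-in feature rows: [length, has_at, has_space, has_slash, is_number].
X = [[16, 1, 0, 0, 0], [10, 0, 0, 0, 1], [13, 0, 1, 0, 0], [5, 0, 0, 0, 0]]
y = [1, 1, 0, 0]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Hypothetical file name, written to the system temp directory.
model_path = Path(tempfile.gettempdir()) / "pii_model.pkl"

# Save the trained model to disk.
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back, as an application would do before making predictions.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

print(restored.predict([[16, 1, 0, 0, 0]]))
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.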

That’s it! The saved model can now be used to detect whether a given piece of text is PII or not. To detect the type of PII, we can build another model in the same way using the code given below.
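A second model for the type of PII can reuse the same features and simply switch the label to a category column. The sketch below assumes a hypothetical pii_type column and toy rows, and trains a decision tree on them; the prediction it prints for the new string depends on this toy data and should not be read as a result.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data containing only PII strings, labelled by category.
data = pd.DataFrame({
    "text": [
        "John Smith", "Priya Sharma", "4111 1111 1111 1111", "5500000000000004",
        "9876543210", "202-555-0143", "12/05/1990", "1/2/1985",
        "jane@example.com", "raj@mail.in",
    ],
    "pii_type": [
        "name", "name", "credit_card", "credit_card",
        "phone", "phone", "date_of_birth", "date_of_birth",
        "email", "email",
    ],
})

FEATURE_COLUMNS = ["length", "has_at", "has_space", "has_slash", "is_number"]


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with simple character-level feature columns."""
    out = frame.copy()
    text = out["text"]
    out["length"] = text.str.len()
    out["has_at"] = text.str.contains("@", regex=False).astype(int)
    out["has_space"] = text.str.contains(" ", regex=False).astype(int)
    out["has_slash"] = text.str.contains("/", regex=False).astype(int)
    out["is_number"] = text.str.isdigit().astype(int)
    return out


df = add_features(data)

# Multi-class model: the label is the PII category instead of a yes/no flag.
type_model = DecisionTreeClassifier(random_state=42)
type_model.fit(df[FEATURE_COLUMNS], df["pii_type"])

# Classify a new, unseen string.
new = add_features(pd.DataFrame({"text": ["someone@example.org"]}))
predicted = type_model.predict(new[FEATURE_COLUMNS])[0]
print(predicted)
```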

You have now learnt how to build an ML model to identify PII! You can try this out with other algorithms and datasets too, and explore further in this field.