
Exploring Basics of Informatica Data Profiling

Data Profiling is an important step in data quality assurance. By profiling your data, you can identify any data quality issues that need to be addressed. This can help to ensure that your data is accurate, complete, and consistent, which is essential for effective data analysis and decision-making.

It is the process of gathering statistics and data-related information about a dataset. This information can be used to evaluate the quality of the data, identify potential issues, and determine its suitability for different purposes. Reviewing Profile Results helps in identifying deviations in the data and knowing about vulnerabilities in the data.



Why Data Profiling?

Profiling can be used for a variety of purposes, including:

  - Evaluating the quality, completeness, and consistency of a dataset
  - Identifying anomalies, outliers, and unexpected values early
  - Verifying that the data matches its documented structure and metadata
  - Determining whether a dataset is suitable for a given analysis or project

Challenges in Database Metadata and Documentation

Database metadata and documentation rarely provide a precise representation of the actual data, and in many cases documentation does not exist at all. Profiling the data directly is often the only reliable way to learn what it really contains.



Creating a Data Profile in Informatica

You need an appropriate IICS license to access Informatica Data Profiling. Alternatively, you can use the free version of Informatica via Oracle E-Delivery, with the license key provided on the Oracle license codes website.

  1. Select Data Profiling in My Services Page of IICS (Informatica Intelligent Cloud Services).
  2. Click on New.
  3. From the dialog, select Data Profiling Task.
  4. In the profile definition tab, enter the profile name, description, and location to save in the Asset Details Section.
  5. Select the Connection and Source Object information in the Source Details section.
  6. Tune the profile settings to run the profile on all rows or on a limited number of rows, and choose whether to enable or disable drill-down.
  7. Select the columns for which profiling is required.
  8. Click Run at the top right of the IICS page.

Informatica Data Profile Definition

Utilize the “Monitor” feature to check the outcomes of a profile or job run. Users can opt to view results from various runs by selecting “Choose Profile Run” from the menu.

Note: The Rules tab is used to add Data Quality assets to profiling. Additionally, you can configure scheduling, runtime settings, and advanced profiling options in the Schedule tab. Refer to the Informatica documentation for detailed explanations of these tabs, as they are not covered in this article.

Profile Results — Summary View

One of the profile result types is the Summary View, which offers a consolidated perspective of the profiling outcomes, including patterns and data types.

Summary View of Data Profile Result

Below are the Columns that can be found in Profile Results:

Columns, Value Distribution, % Null, # Null, % Distinct, # Distinct, % Non-Distinct, # Non-Distinct, # Patterns, % Top Patterns, Minimum Length, Maximum Length, Minimum Value, Maximum Value, # Blank and % Blank.

Profile Results — Detailed View

The second and final way to view the profile result is the Detailed View, an in-depth breakdown of a specific column that lists the unique values in the column along with their frequencies, percentage of occurrence, and length. It also includes a value frequency graph, datatypes, patterns, the five most frequent values, and more.

Snippet of a section in Detailed View of Results
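The Detailed View's list of unique values with frequencies can be approximated in pandas with `value_counts`. Below is a minimal sketch; the `city` column and its values are hypothetical sample data, not part of the original tool:

```python
import pandas as pd

# Hypothetical sample column standing in for a profiled source column.
city = pd.Series(["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"])

# Unique values with their frequency, percentage of occurrence and length,
# similar to what the Detailed View lists per column.
detail = pd.DataFrame({
    "frequency": city.value_counts(),
    "percent": (city.value_counts(normalize=True) * 100).round(2),
})
detail["length"] = detail.index.str.len()

print(detail)
```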

Python Clone of Data Profiling for Summary Results

We will try to achieve broadly similar results with a Python program, although there are limitations. This data profiling tool is intended for small-scale use, and the supported input formats are Excel and CSV.

We will use Pandas to import the files and create a DataFrame from them; it is also used to compute the minimum, maximum, and average values and to assemble the profile results. Regex is used to identify patterns in a column: the algorithm can recognize URLs, credit card numbers, IP addresses, emails, Aadhaar card numbers, and more. Additionally, we use the math module for rounding and other mathematical operations.
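As a sketch of the import step, the snippet below loads a small in-memory CSV into a DataFrame; in practice you would point `pd.read_csv` or `pd.read_excel` at your file. The sample data here is hypothetical:

```python
import io
import pandas as pd

# In the real workflow the input comes from a CSV or Excel file, e.g.:
#   data = pd.read_csv("customers.csv")
#   data = pd.read_excel("customers.xlsx")  # .xlsx needs the openpyxl engine
# Here a small in-memory CSV stands in for the file (hypothetical data):
csv_text = "name,age\nAsha,34\nRavi,28\n,41\n"
data = pd.read_csv(io.StringIO(csv_text))

print(data.dtypes)       # object for strings, int64/float64 for numbers
print(len(data), "rows")
```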

The code is simple and suited to small-scale analysis of metadata. First, we import the dataset and check its datatypes, since the datatype determines which operations apply. For each column we then compute: the number and percentage of null values; the number and percentage of unique values; the number and percentage of duplicates; the number of pattern matches; the minimum, maximum, and average value for numeric types; the minimum, maximum, and average length for string (object) types; and finally the column's datatype. All of this information is added to a new DataFrame that forms the profile of the dataset.

Below are the regex patterns used to identify patterns in a column:




pattern_url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
pattern_date = r'\d{4}-\d{2}-\d{2}|[0-3]?[0-9]/[0-3]?[0-9]/\d{4}'
pattern_credit_card = r'\b(?:\d{4}-?){3}\d{4}\b'
pattern_ip = r'\b(?:\d{1,3}\.){3}\d{1,3}\b|\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b'
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_phone = r'(?:(?:\+|0{0,2})(91-?)?)[7-9]\d{9}'
pattern_aadhar = r'\b\d{4}\s\d{4}\s\d{4}\b'
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'
  
pattern_list = [pattern_url , pattern_date, pattern_credit_card, pattern_ip , pattern_email , pattern_phone , pattern_aadhar , pattern_ssn]
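As a quick sanity check of these patterns, the helper below returns the index of the first pattern that matches a value, mirroring the first-match-wins logic used in the profiler. The helper name and sample values are my own illustration, not part of the original code, and only two patterns are reproduced here to keep the example short:

```python
import re

# Two of the patterns from the list above.
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'
pattern_list = [pattern_email, pattern_ssn]

def first_matching_pattern(value, patterns):
    """Return the index of the first pattern that matches, or None."""
    for n, pat in enumerate(patterns):
        if re.search(pat, value):
            return n
    return None

print(first_matching_pattern("alice@example.com", pattern_list))  # -> 0
print(first_matching_pattern("123-45-6789", pattern_list))        # -> 1
print(first_matching_pattern("no pattern here", pattern_list))    # -> None
```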

Below is the function that computes all of the column properties:




import re
import math

# `data` is the pandas DataFrame created from the imported file.


def data_profile(i, col_name, pattern_list):

    col = data[data.columns[i]]

    # 1  Total number of records
    Total_Records = len(col)

    # 2  Number of null values in the column
    Number_Null = col.isnull().sum()

    # 3  Percentage null
    Percent_Null = Number_Null / Total_Records * 100

    # 4  Number of unique values
    Num_Unique = col.nunique()

    # 5  Percentage unique
    Percent_Unique = Num_Unique / Total_Records * 100

    # 6  Number of duplicate values
    Non_Distinct = Total_Records - Num_Unique

    # 7  Percentage of duplicate values
    Percent_non_distinct = Non_Distinct / Total_Records * 100

    # 8, 9, 10  Minimum, maximum and average of an integer or float column
    if col.dtype in ["int64", "int32", "int16", "float64", "float32", "float16"]:
        minimum = col.min()
        maximum = col.max()
        average = col.mean()
    else:
        minimum = "NA"
        maximum = "NA"
        average = "NA"

    # 11  Number of values matching one of the known patterns.
    # Non-string columns are cast to str so the regexes still apply.
    Num_Pattern = 0
    for value in col.astype(str):
        for pattern in pattern_list:
            if re.search(pattern, value):
                Num_Pattern += 1
                break  # count each value once, at its first matching pattern

    # 12, 13, 14  Minimum, maximum and average length for string columns
    if col.dtype == 'O':
        min_len = len(min(col, key=len))
        max_len = len(max(col, key=len))
        avg_word = math.ceil(col.apply(len).mean())
    else:
        min_len = "NA"
        max_len = "NA"
        avg_word = "NA"

    # 15  Datatype of the column
    if col.dtype == 'O':
        data_type = "String"
    elif col.dtype == 'bool':
        data_type = "boolean"
    else:
        data_type = str(col.dtype)

    return [col_name, Number_Null, Percent_Null, Num_Unique, Percent_Unique,
            Non_Distinct, Percent_non_distinct, Num_Pattern, minimum, maximum,
            average, min_len, max_len, avg_word, data_type]
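To assemble the profile, each column's result list is collected into a new DataFrame. The sketch below shows the idea with a reduced set of metrics and a hypothetical sample dataset; the real code would call data_profile for every column and use the full column list:

```python
import pandas as pd

# Hypothetical sample dataset standing in for the imported file.
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", None],
    "age": [34, 28, 34, 41],
})

# Reduced set of profile columns for illustration.
header = ["Column", "# Null", "% Null", "# Distinct", "% Distinct"]
rows = []
for col_name in data.columns:
    col = data[col_name]
    total = len(col)
    n_null = col.isnull().sum()
    n_unique = col.nunique()
    rows.append([col_name, n_null, n_null / total * 100,
                 n_unique, n_unique / total * 100])

# One row per source column, mirroring the Summary View layout.
profile = pd.DataFrame(rows, columns=header)
print(profile)
```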

Below is a snapshot of the result:

Profile Result for a simple dataset with around 2000 records

Informatica stands out as a top choice in the market for data profiling and data quality, making it a valuable tool for organizations of any size, whether they are profiling small operational datasets or analyzing extensive ones.
