Exploring Basics of Informatica Data Profiling
Last Updated :
04 Jan, 2024
Data Profiling is an important step in data quality assurance. By profiling your data, you can identify any data quality issues that need to be addressed. This can help to ensure that your data is accurate, complete, and consistent, which is essential for effective data analysis and decision-making.
It is the process of gathering statistics and data-related information about a dataset. This information can be used to evaluate the quality of the data, identify potential issues, and determine its suitability for different purposes. Reviewing Profile Results helps in identifying deviations in the data and knowing about vulnerabilities in the data.
Why Data Profiling?
Profiling can be used for a variety of purposes, including:
- Data quality assessment: Profiling can be used to identify data quality issues, such as missing values, invalid data types, and duplicate records.
- Data exploration: Profiling can be used to gain insight into the data, such as its distribution, patterns, and relationships.
- Data Integration: It helps in identifying data conflicts and inconsistencies across datasets. This is important for creating a unified and accurate view of the data.
- Anomaly detection: Profiling is an essential tool for detecting anomalies and outliers in data. These anomalies can indicate errors, fraud, or unusual behavior, making them valuable for risk assessment and security purposes.
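As a quick illustration of these checks, the pandas sketch below flags missing values, duplicate records, and a simple outlier. The dataset, column names, and outlier threshold are all hypothetical and only illustrate the idea; they are not part of Informatica.

```python
import pandas as pd

# Hypothetical sample dataset with typical quality issues
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, None, "not-an-email", "c@z.com"],
    "amount": [100.0, 120.0, 120.0, 95.0, 9000.0],
})

# Data quality assessment: missing values and duplicate records
print(df["email"].isnull().sum())   # number of missing emails
print(df.duplicated().sum())        # number of fully duplicated rows

# Data exploration: distribution of a numeric column
print(df["amount"].describe())

# Simple anomaly check: values far from the mean (threshold is illustrative)
outliers = df[(df["amount"] - df["amount"].mean()).abs() > 1.5 * df["amount"].std()]
print(len(outliers))
```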
Challenges in Database Metadata and Documentation
Database documentation is a rare find to begin with, and even when it exists, the metadata and documentation do not always represent the data source precisely.
- Column Profiling: The descriptions documented for data elements may differ from how those elements are actually used, and column names may not match the data the columns actually contain. Column profiling avoids this problem by ascertaining a column's characteristics from the data it contains.
- Outdated Documentation: As databases evolve over time, documentation often lags behind. Changes to data structures, column names, or data types may not be adequately reflected in the documentation, leading to confusion.
- Complex Data Models: Databases with complex data models, including multiple relationships, hierarchies, and inheritance structures, can be challenging to document. Ensuring that relationships are correctly documented is crucial for data understanding.
- Documentation Ownership: Determining who is responsible for creating and maintaining database documentation can be unclear. Lack of ownership can lead to gaps in documentation and inconsistencies.
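The column-profiling idea above can be sketched in a few lines: instead of trusting the documented description, check how well the values actually conform to it. The column description and sample values below are made up for illustration.

```python
import re

# What the (hypothetical) data dictionary claims about this column
documented_type = "date"
values = ["2023-01-15", "2023-02-20", "N/A", "2023-03-05", "hello"]

# Measure how many values actually fit the documented format
date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
matches = sum(1 for v in values if date_pattern.match(v))
conformance = matches / len(values) * 100

print(f"Documented as {documented_type}, but only {conformance:.0f}% of values conform")
```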
Creating data profile in Informatica
You should have an appropriate IICS license to access Informatica Data Profiling. Alternatively, a free version of Informatica can be obtained through Oracle e-delivery, with the license key provided on the Oracle license code website.
- Select Data Profiling in My Services Page of IICS (Informatica Intelligent Cloud Services).
- Click on New.
- From the dialog, select Data Profiling Task.
- In the profile definition tab, enter the profile name, description, and location to save in the Asset Details Section.
- Select the Connection and Source Object information in the Source Details section.
- Tune the profile settings to run the profile on all rows or on a limited number of rows, and choose whether to enable or disable drilldown.
- Select the columns for which profiling is required.
- Click Run at the top right of the IICS page.
Informatica Data Profile Definition
Utilize the “Monitor” feature to check the outcomes of a profile or job run. Users can opt to view results from various runs by selecting “Choose Profile Run” from the menu.
Note: The Rules tab is used to add Data Quality assets to profiling. Additionally, you can configure scheduling, runtime settings, and advanced profiling options in the Schedule tab. Please go through the Informatica documentation for detailed explanations of these tabs, as they are not covered in this article.
Profile Results — Summary View
One of the profile result types is the Summary View, which offers a consolidated perspective of profiling outcomes, encompassing patterns and data types.
Summary View of Data Profile Result
Below are the Columns that can be found in Profile Results:
Columns, Value Distribution, % Null, # Null, % Distinct, # Distinct, % Non-Distinct, # Non-Distinct, # Patterns, % Top Patterns, Minimum Length, Maximum Length, Minimum Value, Maximum Value, # Blank and % Blank.
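For intuition, a few of these summary metrics can be reproduced with pandas on a small hypothetical column. Note the distinction the summary view draws between nulls (missing values) and blanks (empty strings).

```python
import pandas as pd

# Hypothetical column used to illustrate the summary-view metrics above
s = pd.Series(["alpha", "beta", "beta", None, ""])

total = len(s)
num_null = s.isnull().sum()          # "# Null": missing values
num_blank = (s == "").sum()          # "# Blank": empty strings, not nulls
num_distinct = s.nunique()           # "# Distinct" (nulls excluded)
pct_null = num_null / total * 100    # "% Null"

print(num_null, num_blank, num_distinct, pct_null)
```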
Profile Results — Detailed View
The second and final way to view the profile result is the Detailed View, which gives an in-depth picture of a specific column: it lists the unique values in the column along with their frequencies, percentage of usage, and length. It also includes a value frequency graph, datatypes, patterns, the five most frequent values, and more.
Snippet of a section in Detailed View of Results
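A rough pandas approximation of the detailed view for one column (unique values with their frequency, percentage, and length) might look like the sketch below; the column data is hypothetical.

```python
import pandas as pd

# Hypothetical column to profile in detail
col = pd.Series(["red", "blue", "red", "green", "red"])

# Unique values with frequency, percentage of rows, and value length
detail = (
    col.value_counts()
       .rename("frequency")
       .to_frame()
       .assign(percent=lambda d: d["frequency"] / len(col) * 100,
               length=lambda d: d.index.str.len())
)
print(detail)
```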
—
Python Clone of Data Profiling for Summary Results
We will try to achieve similar results with a Python program, though with some limitations. This data profiling tool in Python is suited to small-scale use, and the supported input formats are Excel and CSV.
We will use pandas to import the files and create a DataFrame from them; pandas is also used to get the minimum, maximum, and average, and to build the profile results. Regular expressions are used to identify patterns in a column; the algorithm can identify URLs, credit card numbers, IP addresses, emails, Aadhaar card numbers, and more. Additionally, we will use the math module for mathematical operations.
It is simple code that can be used for small-scale analysis of metadata. First, we import the dataset and check its datatypes, so we know which operations apply to each column based on its type. Then, for each column, we find the number of null values, the null percentage, the number and percentage of unique values, the number and percentage of duplicates, and the number of patterns; the minimum value, maximum value, and average for numeric types; the minimum, maximum, and average length for string (object) types; and, in the last column, the datatype itself. Finally, we add all this information to a new DataFrame to create a profile of the dataset.
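The import step described above is assumed by the profiling function, which reads from a global DataFrame named `data`. A minimal loading sketch follows; the file names are hypothetical, and a small inline DataFrame stands in for the real input here so the snippet runs on its own.

```python
import pandas as pd

# Load the input file into the global DataFrame the profiler expects
# (file names are placeholders for your actual input):
# data = pd.read_csv("input.csv")       # for CSV input
# data = pd.read_excel("input.xlsx")    # for Excel input

# Stand-in data so this sketch is self-contained
data = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [34, 29]})

# First, check the datatypes so we know which operations apply per column
print(data.dtypes)
```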
Below are the regex assignments used to identify the patterns in a column:
Python3
import re

pattern_url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
pattern_date = r'\d{4}-\d{2}-\d{2}|[0-3]?[0-9]/[0-3]?[0-9]/\d{4}'
pattern_credit_card = r'\b(?:\d{4}-?){3}\d{4}\b'
pattern_ip = r'\b(?:\d{1,3}\.){3}\d{1,3}\b|\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b'
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_phone = r'(?:(?:\+|0{0,2})(91-?)?)[7-9]\d{9}'
pattern_aadhar = r'\b\d{4}\s\d{4}\s\d{4}\b'
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'
pattern_list = [pattern_url, pattern_date, pattern_credit_card, pattern_ip,
                pattern_email, pattern_phone, pattern_aadhar, pattern_ssn]
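A quick standalone sanity check of two of these patterns (the sample values are made up; the email and IP patterns are repeated here so the snippet runs on its own):

```python
import re

# Same email and IP patterns as in the list above
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_ip = r'\b(?:\d{1,3}\.){3}\d{1,3}\b|\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b'

print(bool(re.search(pattern_email, "user@example.com")))   # matches
print(bool(re.search(pattern_email, "not-an-email")))       # no match
print(bool(re.search(pattern_ip, "192.168.1.10")))          # matches
```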
Below is the code for the function that gets all the column properties:
Python3
import math
import re

# `data` is the pandas DataFrame loaded from the Excel/CSV input
def data_profile(i, col_name, pattern_list):
    col = data[data.columns[i]]
    total_records = len(col)

    # Null, distinct, and non-distinct counts and percentages
    number_null = col.isnull().sum()
    percent_null = number_null / total_records * 100
    num_unique = col.nunique()
    percent_unique = num_unique / total_records * 100
    non_distinct = total_records - num_unique
    percent_non_distinct = non_distinct / total_records * 100

    # Minimum, maximum, and average apply only to numeric columns
    if col.dtype in ["int64", "int32", "int16", "float64", "float32", "float16"]:
        minimum = col.min()
        maximum = col.max()
        average = col.mean()
    else:
        minimum = maximum = average = "NA"

    # Count values that match any known pattern (stop at the first match)
    patterns = []
    for value in col.astype(str):
        for pattern in pattern_list:
            match = re.search(pattern, value)
            if match is not None:
                patterns.append(match)
                break
    num_pattern = len(patterns)

    # Minimum, maximum, and average length apply only to string columns
    if col.dtype == 'O':
        lengths = col.dropna().astype(str).apply(len)
        min_len = lengths.min()
        max_len = lengths.max()
        avg_len = math.ceil(lengths.mean())
    else:
        min_len = max_len = avg_len = "NA"

    # Report the datatype in a readable form
    if col.dtype == 'O':
        data_type = "String"
    elif col.dtype == 'bool':
        data_type = "boolean"
    else:
        data_type = str(col.dtype)

    return [col_name, number_null, percent_null, num_unique, percent_unique,
            non_distinct, percent_non_distinct, num_pattern, minimum, maximum,
            average, min_len, max_len, avg_len, data_type]
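The article's last step, adding everything to a new DataFrame, can be sketched as below. To stay self-contained, this sketch computes a trimmed-down set of the metrics inline rather than calling `data_profile`, and the dataset is hypothetical.

```python
import pandas as pd

# Hypothetical input: in the full program this is the DataFrame `data`
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", None],
    "age": [34, 29, 34, 41],
})

# One profile row per column, using a subset of the metrics above
rows = []
for col_name in data.columns:
    col = data[col_name]
    rows.append({
        "Column": col_name,
        "# Null": col.isnull().sum(),
        "% Null": col.isnull().sum() / len(col) * 100,
        "# Distinct": col.nunique(),
        "Datatype": "String" if col.dtype == "O" else str(col.dtype),
    })

profile = pd.DataFrame(rows)
print(profile)
```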
Below is a snapshot of the result:
Profile Result for a simple dataset with around 2000 records
—
Informatica stands out as a top choice in the market for data profiling and data quality, making it a prominent tool for organizations, whether they are addressing everyday organizational needs or analyzing extensive datasets.