Data profiling is the process of gathering statistics and descriptive information about a dataset. This information can be used to evaluate the quality of the data, identify potential issues, and determine the data's suitability for different purposes.
It is an important step in data quality assurance: by profiling your data, you can identify quality issues that need to be addressed, helping to ensure your data is accurate, complete, and consistent, which is essential for effective data analysis and decision-making. Reviewing profile results also helps in spotting deviations and vulnerabilities in the data.
Why Data Profiling?
Profiling can be used for a variety of purposes, including:
- Data quality assessment: Profiling can be used to identify data quality issues, such as missing values, invalid data types, and duplicate records.
- Data exploration: Profiling can be used to gain insight into the data, such as its distribution, patterns, and relationships.
- Data Integration: It helps in identifying data conflicts and inconsistencies across datasets. This is important for creating a unified and accurate view of the data.
- Anomaly detection: Profiling is an essential tool for detecting anomalies and outliers in data. These anomalies can indicate errors, fraud, or unusual behavior, making them valuable for risk assessment and security purposes.
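To make the data quality assessment point concrete, a couple of these checks take only a few lines of pandas. The dataset below is a hypothetical example invented for illustration:

```python
import pandas as pd

# Hypothetical sample with deliberate quality issues
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],                       # one duplicate id
    "email": ["a@x.com", None, "b@y.com", "b@y.com"],  # one missing value
})

missing_emails = df["email"].isnull().sum()           # count of missing values -> 1
duplicate_ids = df["customer_id"].duplicated().sum()  # count of duplicate ids -> 1
```

Checks like these form the building blocks of the profiling function developed later in this article.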
Challenges in Database Metadata and Documentation
Not all database metadata and documentation accurately represent the underlying data source, and in practice thorough documentation is a rare find. Common challenges include:
- Column Profiling: The documented descriptions of data elements may differ from how those elements are actually used, and there can be disparities between column names and the data they actually contain. Column profiling, in contrast, determines a column's characteristics by examining the data it contains.
- Outdated Documentation: As databases evolve over time, documentation often lags behind. Changes to data structure, column names, or data types may not be adequately reflected in the documentation, leading to confusion.
- Complex Data Models: Databases with complex data models, including multiple relationships, hierarchies, and inheritance structures, can be challenging to document. Ensuring that relationships are correctly documented is crucial for data understanding.
- Documentation Ownership: Determining who is responsible for creating and maintaining database documentation can be unclear. Lack of ownership can lead to gaps in documentation and inconsistencies.
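The first point is easy to demonstrate: a column that documentation claims is numeric may in practice hold something else, and profiling the actual values exposes the gap. The column and values below are a made-up example:

```python
import pandas as pd

# Documentation might claim "zip_code is numeric", but the data says otherwise
df = pd.DataFrame({"zip_code": ["02139", "10001", "N/A", "94105"]})

inferred_dtype = df["zip_code"].dtype                # object (string), not int
non_numeric = (~df["zip_code"].str.isdigit()).sum()  # values breaking the documented type
```

Here profiling reveals both that the column is stored as strings (to preserve leading zeros) and that it contains a placeholder value ("N/A") that no documentation mentioned.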
Creating a data profile in Informatica
You need an appropriate IICS license to access Informatica Data Profiling, or you can use the free version of Informatica obtained through Oracle E-Delivery, with the license key provided on the Oracle license codes website.
- Select Data Profiling in My Services Page of IICS (Informatica Intelligent Cloud Services).
- Click on New.
- From the dialog, select Data Profiling Task.
- In the profile definition tab, enter the profile name, description, and location to save in the Asset Details Section.
- Select the Connection and Source Object information in the Source Details section.
- Tune the profile settings to run the profile on all rows or on a limited number of rows, and choose whether to enable or disable drilldown.
- You can select the columns for which Profiling is required.
- Click Run at the top right of the IICS page.
Use the Monitor feature to check the outcome of a profile or job run. You can view results from different runs by selecting "Choose Profile Run" from the menu.
Note: The Rules tab is used to add Data Quality assets to profiling. You can also configure scheduling, runtime settings, and advanced profiling options in the Schedule tab. Refer to the Informatica documentation for detailed explanations of these tabs, as they are not covered in this article.
Profile Results — Summary View
One of the profile result types is the Summary View, which offers a consolidated perspective of profiling outcomes, including patterns and data types.
Below are the Columns that can be found in Profile Results:
Columns, Value Distribution, % Null, # Null, % Distinct, # Distinct, % Non-Distinct, # Non-Distinct, # Patterns, % Top Patterns, Minimum Length, Maximum Length, Minimum Value, Maximum Value, # Blank and % Blank.
Profile Results — Detailed View
The second and final option for viewing profile results is the Detailed View, an in-depth breakdown of a specific column. It lists the unique values in a column along with their frequencies, percentage of usage, and length, and it includes a value-frequency graph, data types, patterns, the five most frequent values, and more.
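In pandas terms, the heart of such a detailed view is a per-column frequency table, which can be approximated in a few lines. The sample values below are made up for illustration:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red"])

freq = s.value_counts()                     # unique values with their frequencies
pct = s.value_counts(normalize=True) * 100  # percentage of usage per value
lengths = s.str.len()                       # length of each value
top_values = freq.head(5)                   # the five most frequent values
```

Here "red" appears 3 times, i.e. 60% of the values, mirroring the frequency and percentage columns that the Detailed View reports.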
—
Python Clone of Data Profiling for Summary Results
We will try to achieve similar results with a Python program, though with some limitations: this data profiling tool is suited for small-scale use, and the supported input formats are Excel and CSV.
We will use pandas to import the files and create a DataFrame from them, and also to compute the minimum, maximum, and average values and to build the profile results. Regular expressions (the re module) are used to identify patterns in a column; the algorithm can recognize URLs, credit card numbers, IP addresses, emails, Aadhaar card numbers, and more. Additionally, we use the math module for mathematical operations.
It is simple code suitable for small-scale analysis of metadata. First, we import the dataset and check its data types, so we know which operations apply to each column. Then, for each column, we compute the number of null values, null percentage, number and percentage of unique values, number and percentage of duplicates, and number of pattern matches; the minimum, maximum, and average value for numeric types; the minimum, maximum, and average length for string (object) types; and finally the column's data type. All of this information is added to a new DataFrame to create a profile of the dataset.
Below are the regex assignments I used to identify the patterns in a column:
pattern_url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
pattern_date = r'\d{4}-\d{2}-\d{2}|[0-3]?[0-9]/[0-3]?[0-9]/\d{4}'
pattern_credit_card = r'\b(?:\d{4}-?){3}\d{4}\b'
pattern_ip = r'\b(?:\d{1,3}\.){3}\d{1,3}\b|\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b'
pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_phone = r'(?:(?:\+|0{0,2})(91-?)?)[7-9]\d{9}'
pattern_aadhar = r'\b\d{4}\s\d{4}\s\d{4}\b'
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'
pattern_list = [pattern_url, pattern_date, pattern_credit_card, pattern_ip, pattern_email, pattern_phone, pattern_aadhar, pattern_ssn]
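As a quick sanity check, two of these patterns can be tried against sample strings (the strings below are made up):

```python
import re

pattern_email = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
pattern_ssn = r'\b\d{3}-\d{2}-\d{4}\b'

# re.search returns a Match object when the pattern is found, None otherwise
email_hit = re.search(pattern_email, "contact jane.doe@example.com for details")
ssn_hit = re.search(pattern_ssn, "record 123-45-6789 flagged")
```

The profiling function below relies on exactly this behavior: a non-None result from re.search counts as a pattern match for that value.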
Below is the code for the function that computes all the column properties:
import math
import re
import pandas as pd

# `data` is the DataFrame created earlier from the Excel/CSV input
def data_profile(i, col_name, pattern_list):
    col = data[data.columns[i]]
    # 1 Total number of records
    Total_Records = len(col)
    # 2, 3 Number and percentage of null values in the column
    Number_Null = col.isnull().sum()
    Percent_Null = Number_Null / Total_Records * 100
    # 4, 5 Number and percentage of unique values
    Num_Unique = col.nunique()
    Percent_Unique = Num_Unique / Total_Records * 100
    # 6, 7 Number and percentage of duplicate (non-distinct) values
    Non_Distinct = Total_Records - Num_Unique
    Percent_non_distinct = Non_Distinct / Total_Records * 100
    # 8, 9, 10 Minimum, maximum and average for integer or float columns
    if col.dtype.kind in ("i", "f"):
        minimum = col.min()
        maximum = col.max()
        average = col.mean()
    else:
        minimum = maximum = average = "NA"
    # 11 Number of values that match any of the known patterns
    Num_Pattern = 0
    for value in col.dropna().astype(str):
        for pattern in pattern_list:
            if re.search(pattern, value):
                Num_Pattern += 1
                break  # stop at the first pattern that matches this value
    # 12, 13, 14 Minimum, maximum and average length for string columns
    if col.dtype == 'O':
        lengths = col.dropna().astype(str).apply(len)
        min_len = lengths.min()
        max_len = lengths.max()
        avg_len = math.ceil(lengths.mean())
    else:
        min_len = max_len = avg_len = "NA"
    # 15 Human-readable datatype of the column
    if col.dtype == 'O':
        data_type = "String"
    elif col.dtype == 'bool':
        data_type = "boolean"
    else:
        data_type = str(col.dtype)  # int64, int32, float64, etc.
    return [col_name, Number_Null, Percent_Null, Num_Unique, Percent_Unique,
            Non_Distinct, Percent_non_distinct, Num_Pattern, minimum, maximum,
            average, min_len, max_len, avg_len, data_type]
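The per-column results are then assembled into a new profile DataFrame, one row per source column. A simplified, self-contained sketch of that final step, using a tiny made-up dataset and only a subset of the metrics:

```python
import pandas as pd

# Made-up mini dataset; in the article, `data` comes from pd.read_csv / pd.read_excel
data = pd.DataFrame({"name": ["Ana", "Bob", "Ana"], "score": [90, 85, None]})

rows = []
for col in data.columns:
    s = data[col]
    rows.append({
        "Column": col,
        "# Null": int(s.isnull().sum()),
        "% Null": s.isnull().mean() * 100,
        "# Distinct": s.nunique(),
        "Datatype": "String" if s.dtype == "O" else str(s.dtype),
    })

profile = pd.DataFrame(rows)  # one row of statistics per source column
```

With the full data_profile function, each call's returned list becomes one such row, and the resulting DataFrame can be written out with to_csv or to_excel as the final profile report.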
Below is a snapshot of the result:
—
Informatica stands out as a top choice in the market for data profiling and data quality, making it a prominent tool for organizations, whether they are addressing day-to-day organizational needs or analyzing extensive datasets.