Profiling
VDA provides a data profiling feature, which involves reviewing data to better understand its structure and to maintain data quality standards within an organization. Its main purpose is to give insight into data quality by reviewing, summarizing, and evaluating the condition of the data.
Data engineers typically perform this work using a range of business rules and analytical algorithms. VDA's data profiling evaluates data based on factors such as accuracy, consistency, and timeliness, identifying issues like inconsistencies, inaccuracies, or null values.
Navigate to the Data Source tab and click the desired Data Source to find its Data Sets.
From the list of Data Sets, click the required Data Set.
Select the checkbox next to each column name you want to profile.
Click the Graph icon at the bottom-right corner of the screen to run the profiling.
Once profiling is done, the column name is highlighted in blue. Click it to see the parameters available for the column's data type, or go to the Profiling tab to check the profiling details for the whole Data Set.
Text/String Data Type
Definition: The number of unique values present in a text field.
Importance: Provides insights into data variety and uniqueness. A high distinct count indicates diverse textual content, while a low distinct count may suggest repetitive or standardized data entries.
Definition: The count or percentage of records where the text field is empty or null.
Importance: Indicates data completeness. A high number of missing values can affect analysis and decision-making, highlighting potential gaps in data collection or entry processes.
Definition: The number of unique values present in the string field.
Importance: Provides insights into data variety and uniqueness. A high distinct count suggests diverse textual content, while a low distinct count may indicate repetitive or standardized data entries.
Definition: The count or percentage of records where the string field is empty or null.
Importance: Indicates data completeness. High occurrences of missing values can affect analysis and decision-making, highlighting potential gaps in data collection or entry processes.
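The distinct-value and missing-value counts described above can be illustrated with a short Python sketch. This is not VDA's implementation, which is not documented here. The function name `distinct_and_missing` is hypothetical, and the sketch assumes that `None` and empty or whitespace-only strings count as missing.

```python
def distinct_and_missing(values):
    """Return (distinct_count, missing_count, missing_pct) for a text column.

    Assumption: None and empty/whitespace-only strings are treated as missing.
    """
    present = [v for v in values if v is not None and v.strip() != ""]
    missing_count = len(values) - len(present)
    # Guard against an empty column to avoid dividing by zero.
    missing_pct = 100.0 * missing_count / len(values) if values else 0.0
    return len(set(present)), missing_count, missing_pct


sample = ["red", "blue", "red", None, "", "green"]
distinct_count, missing_count, missing_pct = distinct_and_missing(sample)
print(distinct_count, missing_count, round(missing_pct, 1))
```

Whether blank strings should count as missing depends on the data source, so that rule is worth confirming against real profiling output.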
Definition: The total number of characters (including spaces and special characters) across all string values in the field.
Importance: Helps in understanding data volume and storage requirements. Larger total character counts may impact system performance and storage costs.
Definition: The count of values that appear only once within the string field.
Importance: Identifies truly unique entries, which can be critical for data deduplication and ensuring data accuracy.
Definition: The count of values that appear more than once within the string field.
Importance: Indicates data redundancy and potential data quality issues. Identifying and managing duplicates is essential for maintaining data integrity.
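As a rough illustration of the character total and the unique and duplicate counts, the following Python sketch uses `collections.Counter`. It is an assumption-based example rather than VDA's own logic: the function name `string_volume_metrics` is hypothetical, and "duplicate values" is interpreted here as the number of distinct values that occur more than once.

```python
from collections import Counter


def string_volume_metrics(values):
    """Compute total characters, unique values, and duplicate values.

    Assumption: None entries are skipped; a value is "unique" if it occurs
    exactly once and "duplicate" if it occurs more than once.
    """
    present = [v for v in values if v is not None]
    occurrences = Counter(present)
    return {
        "total_characters": sum(len(v) for v in present),
        "unique_values": sum(1 for c in occurrences.values() if c == 1),
        "duplicate_values": sum(1 for c in occurrences.values() if c > 1),
    }


metrics = string_volume_metrics(["ab", "cd", "ab", "xyz", None])
print(metrics)
```

Note that a unique count in this sense differs from a distinct count: "ab" contributes to the distinct count but not to the unique count, because it appears twice.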
Definition: The count or percentage of characters in the string field that are in lower case.
Importance: Provides insights into text normalization and consistency. Monitoring lower case usage helps in standardizing data for analysis and reporting purposes.
Definition: The count or percentage of characters in the string field that are in upper case.
Importance: Similar to lower case letters, tracking upper case usage assists in data standardization and consistency checks.
Definition: The count or percentage of punctuation characters (e.g., periods, commas, exclamation marks) within the string field.
Importance: Helps in analyzing text complexity and identifying patterns in punctuation usage that may influence data processing and analysis.
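The lower case, upper case, and punctuation measures can be sketched in Python as follows. This is a minimal example under stated assumptions, not a description of how VDA computes them: the function name `character_class_counts` is hypothetical, punctuation is taken to mean the ASCII set in `string.punctuation`, and percentages are relative to all characters in the column.

```python
import string


def character_class_counts(values):
    """Count lower case, upper case, and punctuation characters in a column.

    Assumption: punctuation means ASCII punctuation (string.punctuation),
    and None entries are skipped.
    """
    text = "".join(v for v in values if v is not None)
    total = len(text)
    counts = {
        "lower": sum(ch.islower() for ch in text),
        "upper": sum(ch.isupper() for ch in text),
        "punctuation": sum(ch in string.punctuation for ch in text),
    }
    # Percentages are relative to every character, including spaces and digits.
    percentages = {
        name: (100.0 * n / total if total else 0.0) for name, n in counts.items()
    }
    return counts, percentages


counts, percentages = character_class_counts(["Hello, World!", "OK"])
print(counts)
```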
Number Data Type
Definition: The number of unique numeric values present in the field.
Importance: Provides insights into data diversity and granularity. A high distinct count indicates a wide range of data points, while a low distinct count may suggest categorical or heavily aggregated data.
Definition: The count or percentage of records where the numeric field is empty or null.
Importance: Indicates data completeness. High occurrences of missing values can impact analysis and decision-making, necessitating data cleansing or imputation.
Definition: The count or percentage of numeric values that are negative.
Importance: Helps in understanding data trends and distributions. Negative values may be critical in specific contexts such as financial or scientific datasets.
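The distinct, missing, and negative counts for a numeric column could be computed along these lines. This Python sketch is illustrative only: the function name `numeric_basic_profile` is hypothetical, and it assumes that both `None` and NaN count as missing and that the negative percentage is taken over non-missing values.

```python
import math


def numeric_basic_profile(values):
    """Return distinct, missing, and negative counts for a numeric column.

    Assumption: None and float NaN are both treated as missing.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    present = [v for v in values if not is_missing(v)]
    negatives = sum(1 for v in present if v < 0)
    return {
        "distinct": len(set(present)),
        "missing": len(values) - len(present),
        "negative": negatives,
        # Share of non-missing values that are below zero.
        "negative_pct": 100.0 * negatives / len(present) if present else 0.0,
    }


profile = numeric_basic_profile([10, -5, 10, None, float("nan"), -2.5, 0])
print(profile)
```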
Definition: Values that divide a dataset into equal portions, providing insights into data distribution.
Importance: Helps in understanding data spread and variability. Common quantiles include quartiles (dividing data into quarters) and percentiles (dividing data into hundredths).
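To make quartiles and percentiles concrete, here is a small Python sketch using the standard library's `statistics.quantiles`. Which interpolation method VDA uses is not stated, so the default "exclusive" method shown here is an assumption, and results from the product may differ slightly on small datasets.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three cut points that split the data into quarters.
quartiles = statistics.quantiles(data, n=4)

# n=100 returns the 99 cut points that split the data into hundredths.
percentiles = statistics.quantiles(data, n=100)

print("Quartiles (Q1, Q2, Q3):", quartiles)
print("Number of percentile cut points:", len(percentiles))
```

The second quartile is the median, so it should agree with the median reported among the descriptive statistics.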
Definition: Statistical summaries such as mean, median, mode, standard deviation, and variance.
Importance: Provides a comprehensive view of central tendency, dispersion, and shape of the numeric data distribution.
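The summary statistics named above can be reproduced with Python's `statistics` module, as in the sketch below. It uses the population forms of variance and standard deviation; whether VDA reports population or sample values is not specified, so that choice is an assumption.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    # Population forms (divide by N); use stdev/variance for sample forms.
    "std_dev": statistics.pstdev(data),
    "variance": statistics.pvariance(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```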
Definition: The most frequently occurring numeric values and their occurrence count.
Importance: Identifies popular or dominant values within the dataset, highlighting potential data trends or biases.
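A frequency ranking like the one described can be sketched with `collections.Counter`. The number of top values shown (two here) is an arbitrary choice for the example, not a documented VDA setting.

```python
from collections import Counter

values = [3, 1, 3, 2, 3, 2]

# most_common(k) returns the k most frequent values with their counts,
# ordered from most to least frequent.
top_values = Counter(values).most_common(2)

for value, count in top_values:
    print(f"value={value} occurrences={count}")
```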
Definition: The highest and lowest numeric values observed in the dataset.
Importance: Indicates data range and extremes. Examining maximum and minimum values helps in identifying outliers or unusual data points that may require further investigation.
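The following Python sketch shows how minimum and maximum values can point to outliers. The 1.5 × IQR fence rule used here is a common statistical convention swapped in for illustration; it is not stated to be what VDA applies, and the function name `range_and_outliers` is hypothetical.

```python
import statistics


def range_and_outliers(values):
    """Return (min, max, outliers) using the 1.5 * IQR fence rule.

    Assumption: the IQR rule is an illustrative choice for flagging
    extremes, not a documented VDA behavior.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return min(values), max(values), outliers


lowest, highest, outliers = range_and_outliers([10, 12, 11, 13, 12, 95])
print(lowest, highest, outliers)
```

In this sample the maximum is far from the other values, which is the kind of extreme that a profiling report would surface for further investigation.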