Challenges and Approaches in Data Profiling

Data processing and analysis cannot take place without data profiling, that is, reviewing the reliability and accuracy of the source data. As data volumes grow and infrastructure shifts to the cloud, data profiling becomes increasingly important. This blog looks at how to profile large datasets with limited time and resources.

‘Data is the new oil’ captures the idea that data held in on-premises databases over long periods has immense potential to solve business challenges. To obtain meaningful results from that data, its quality is of the utmost importance. Data quality is a measure of the accuracy, validity and completeness of data. In this blog, we highlight some of the challenges we faced in improving the quality of data:

  • The occurrence of duplicate data
  • Null values and outliers in the data
  • Inconsistency of attributes
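These issues can be checked directly with pandas before any profiling tool is involved. The sketch below is illustrative only: the function name `quality_snapshot` and the idea of treating case or whitespace variants as "inconsistent" labels are assumptions, not something taken from the experiment described here.

```python
import pandas as pd


def quality_snapshot(df: pd.DataFrame) -> dict:
    """Count duplicate rows, nulls per column, and case/whitespace label variants."""
    snapshot = {
        "duplicate_rows": int(df.duplicated().sum()),
        "nulls_per_column": df.isna().sum().to_dict(),
        "inconsistent_labels": {},
    }
    for col in df.columns:
        # Only non-numeric columns are checked for inconsistent spellings.
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        raw = df[col].dropna().astype(str)
        normalized = raw.str.strip().str.lower()
        # Labels that collapse together once case and spaces are ignored.
        extra = raw.nunique() - normalized.nunique()
        if extra > 0:
            snapshot["inconsistent_labels"][col] = int(extra)
    return snapshot
```

Outliers are left out of this snapshot because they need a rule of their own, such as an interquartile-range threshold.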

To overcome these challenges, we leveraged Data Profiling.

Data Profiling is the process of examining data from an existing source and summarizing information about that data. The data profiler provides:

  • Details about the attributes, the distribution of data and missing data.
  • Maximum, minimum and average values of each attribute.
  • Relationships between the attributes through a correlation matrix and histogram analysis, and whether each attribute is categorical or numerical.
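To make these outputs concrete, here is a minimal sketch of how a comparable summary could be computed with plain pandas. It is not the profiler's own implementation; the function name `basic_profile` and the rule "numeric dtype means numerical, everything else categorical" are simplifying assumptions.

```python
import pandas as pd


def basic_profile(df: pd.DataFrame):
    """Return a per-column summary table and a correlation matrix."""
    numeric = df.select_dtypes(include="number")

    summary = pd.DataFrame(index=df.columns)
    summary["dtype"] = df.dtypes.astype(str)
    summary["kind"] = [
        "numerical" if col in numeric.columns else "categorical"
        for col in df.columns
    ]
    summary["missing"] = df.isna().sum()
    summary["distinct"] = df.nunique()
    # min/max/mean exist only for numeric columns; others stay NaN.
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["mean"] = numeric.mean()

    # Pairwise Pearson correlation between the numeric columns.
    return summary, numeric.corr()
```

Calling `summary, corr = basic_profile(df)` gives one row per attribute in `summary` and a square matrix in `corr`.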

The following are the insights of Data Profiling on a sample dataset

Different Approaches to Data Profiling

  • Data Profiling with pandas
  • Data Profiling with Spark
  • Data Profiling with pandas on Google Colab

Note: The following approaches were tested on a sample dataset of 250 MB with 40 columns.

Data Profiling with Pandas

In pandas, an appropriate package for data profiling has to be installed and imported.

Replace <df> with the data frame into which the dataset is imported.
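The text does not name the profiling package. The sketch below assumes `ydata-profiling` (published earlier as `pandas-profiling`), which is one commonly used option; the helper names `load_dataset` and `build_report` and the file names are hypothetical.

```python
# Assumption: the profiling package is ydata-profiling.
# Install once with: pip install ydata-profiling
import pandas as pd


def load_dataset(csv_path: str) -> pd.DataFrame:
    """Load the dataset into the data frame that stands in for <df>."""
    return pd.read_csv(csv_path)


def build_report(df: pd.DataFrame, html_path: str = "profile_report.html") -> str:
    """Profile df, write an HTML report, and return the output path."""
    # Imported here so the loader above works without the package.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="Data Profiling Report")
    report.to_file(html_path)
    return html_path


# Usage (hypothetical file name):
# df = load_dataset("sample.csv")
# build_report(df)
```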

Of all the approaches, the data profiler took the longest to run on pandas.

Data Profiling with Spark

In Spark, an appropriate package for data profiling has to be installed and imported.

Replace <df> with the name of the data frame into which the dataset is loaded.
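The Spark profiling package is not named in the text, so the sketch below swaps in PySpark's built-in DataFrame methods (`count`, `dropDuplicates`, `summary`) rather than guessing at a third-party API. The function name `spark_profile` and the file name in the usage comment are hypothetical.

```python
def spark_profile(df):
    """Profile a PySpark DataFrame using built-in DataFrame methods only.

    `df` stands in for <df>. No third-party profiling package is used.
    """
    total_rows = df.count()
    distinct_rows = df.dropDuplicates().count()
    # summary() reports count, mean, stddev, min, quartiles and max.
    stats = df.summary()
    return {
        "rows": total_rows,
        "duplicate_rows": total_rows - distinct_rows,
        "columns": len(df.columns),
        "stats": stats,
    }


# Usage on a local machine (hypothetical file name):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local[*]").appName("profiling").getOrCreate()
# df = spark.read.csv("sample.csv", header=True, inferSchema=True)
# spark_profile(df)["stats"].show()
```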

The data profiler ran considerably faster in Spark (on a local machine) than in pandas.

To make it more efficient, cache the data frame in Spark before profiling it. With caching, the run time was faster than profiling in Spark without caching.
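A rough sketch of the caching step, assuming a PySpark DataFrame and again using the built-in `summary()` in place of an unnamed profiling package. `timed` and `profile_with_cache` are hypothetical helper names, and the sketch does not reproduce the timings reported here.

```python
import time


def timed(label, action):
    """Run action(), print its duration, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


def profile_with_cache(df):
    """Cache a PySpark DataFrame, materialize it, then compute summary stats."""
    df = df.cache()  # lazy: only marks the data frame for caching
    df.count()       # an action, so Spark actually fills the cache
    return df.summary().collect()


# Usage (df is a PySpark DataFrame):
# timed("cached profile", lambda: profile_with_cache(df))
```

The explicit `count()` matters because `cache()` alone does nothing until an action runs; without it, the first profiling pass would still read from the source.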

Data Profiling with pandas on Google Colab

Google Colab is a free cloud service from Google that provides infrastructure with 12 GB of RAM and 1 GPU as standard. It also lets us mount Google Drive.

The appropriate package for data profiling has to be installed and imported.

 

Replace <path-name> with the path where the file is located.
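A minimal sketch of the Colab setup, assuming the dataset is a CSV file stored on Google Drive. `load_from_drive` and its `mount_drive` flag are hypothetical, and the package named in the install comment is an assumption, since the text does not name one.

```python
import pandas as pd


def load_from_drive(path_name: str, mount_drive: bool = True) -> pd.DataFrame:
    """Mount Google Drive (inside Colab) and read the CSV at path_name."""
    if mount_drive:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
        drive.mount("/content/drive")
    return pd.read_csv(path_name)


# In a Colab cell (assumed package name):
# !pip install ydata-profiling
# df = load_from_drive("<path-name>")
```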

Running the data profiler in pandas on Google Colab took around 14 seconds, which was faster than the other approaches.

 

Difficulties faced with Google Colab  

  • External libraries have to be installed and imported again in every new session.
  • File handling errors occurred while importing large datasets from Google Drive.
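One possible way to ease the large-file issue, offered as an assumption rather than something tried in this experiment, is to read the CSV in chunks so that no single read has to pull the whole file at once. The function name `read_csv_in_chunks` and the default chunk size are hypothetical.

```python
import pandas as pd


def read_csv_in_chunks(path_name: str, chunksize: int = 100_000) -> pd.DataFrame:
    """Read a large CSV piece by piece and combine the pieces."""
    # With chunksize set, read_csv yields DataFrames of up to that many rows.
    chunks = pd.read_csv(path_name, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)
```

Whether this helps depends on the actual cause of the errors, which is not described here.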

With the help of data profiling, we identified issues such as null values and duplicate data. While the data profiler is a good way to understand the details of the data, there are certain things it doesn’t provide:

  1. The outliers in the data.
  2. Different plots to visualize different attributes.
  3. A detailed view of the inconsistencies in the data.
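The first gap can be covered with a few lines of pandas. The sketch below uses the common interquartile-range (IQR) rule, which is one choice among several; the function name `iqr_outliers` and the multiplier `k = 1.5` are assumptions.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Usage (hypothetical column name):
# iqr_outliers(df["amount"])
```

For the second gap, pandas' own `df.hist()` and `df.boxplot()` can draw per-attribute plots when matplotlib is installed.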

 

Organizations can make better decisions with data they can trust, and data profiling is an essential first step on this journey. If you wish to explore Data Profiling further for your organization, contact us to schedule a demo session.

 

Written by :

Sreekar Ippili & Umashankar N
