Today, many organisations require a data storage and analytics solution that offers more agility and flexibility than a traditional DBMS. A data lake is now a popular way to store and analyse data, since it allows businesses to keep all their structured and unstructured data from different sources in a centralized repository.
In simple words, it's all about Store Now & Analyze Later.
What is a Data Lake?
A data lake is a central storage repository that holds big data from many sources in a raw, granular format.
It can store both structured and unstructured data, which means data can be kept in a flexible format for future use.
Data ingestion uses connectors to pull data from various sources and load it into the data lake.
Challenges in Data Lake Projects:
- Ingestion
- Data Preparation
- Making Data Ready to be Queried
Data ingestion can be done using various techniques, with the data landing in AWS S3. For example, you can use AWS Database Migration Service (DMS) to ingest data from an existing database, or AWS DataSync to ingest data from an on-premises Network File System.
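As a minimal illustration of the DMS route (the task ARN below is a hypothetical placeholder and is not part of the template described in this article), an already-configured replication task can be started with a few lines of boto3:

```python
import boto3

# Hypothetical example: start an existing AWS DMS replication task that
# copies a source database into the data lake's raw S3 bucket.
dms = boto3.client("dms", region_name="us-east-1")

response = dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder ARN
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```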
In this article, we are going to walk through deploying AWS Data Lake resources using an AWS CloudFormation template and ingesting data from a MySQL data source.
Having an automated, serverless data lake architecture reduces the manual work involved in managing data from its source to its destination.
The AWS CloudFormation template creates the following resources in your AWS account:
- Amazon S3 bucket to store the raw data
- Glue Connection to connect to the source database
- Glue Crawler and Glue Jobs
- IAM roles for accessing AWS Glue and Amazon S3
The CloudFormation template and the script for the AWS Glue job are available, along with deployment steps, in the GitHub repository below.
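As a rough sketch of the deployment (the template file name and parameter key below are placeholders; the actual names are documented in the repository), the stack can be created with boto3:

```python
import boto3

# Minimal sketch: create the data lake stack from the CloudFormation template.
# The template file name and parameter key are placeholders; refer to the
# GitHub repository for the actual values.
cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("datalake-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="serverless-datalake",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
    Parameters=[
        {
            "ParameterKey": "SourceDatabaseJDBCUrl",  # placeholder parameter
            "ParameterValue": "jdbc:mysql://<host>:3306/<database>",
        },
    ],
)

# Wait until the S3 bucket, Glue connection, crawler, jobs and IAM role are ready.
cfn.get_waiter("stack_create_complete").wait(StackName="serverless-datalake")
```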
Process Summary:
- Ingested data is stored in an AWS S3 bucket, referred to as the raw data bucket, so that its schema can be cataloged in the AWS Glue Data Catalog using AWS Glue Crawlers.
- An AWS Glue ETL job converts the data to Apache Parquet format and stores it in the S3 bucket.
- Once the data is ready in Amazon S3, run the AWS Glue crawler on the S3 bucket path; it creates a metadata table with the relevant schema in the AWS Glue Data Catalog.
- Once the Data Catalog table is created, you can run standard SQL queries using Amazon Athena (see the sketch after this list) and visualize the data in Amazon QuickSight.
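For instance, a Data Catalog table can be queried programmatically through the Athena API. In this sketch the database name, table name and query-result location are assumptions for illustration, not values created by the template:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder names: use the Glue database/table created by the crawler and an
# S3 location where Athena is allowed to write query results.
query = athena.start_query_execution(
    QueryString="SELECT * FROM sales_orders LIMIT 10;",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```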
Key Features:
Understanding these features helps you replicate this kind of strategy for other purposes or customize the solution for your needs.
AWS Glue:
AWS Glue has two types of jobs: Python shell and Apache Spark. This template uses Apache Spark jobs to determine which files to process and to maintain the data in S3. AWS Glue has three core components: Data Catalog, Crawler, and ETL Job. It also provides the triggering and scheduling features needed to run the jobs as part of a data processing workflow.
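The actual job script is available in the repository; as an indicative sketch (the database, table and bucket names below are placeholders), a Spark-type Glue job that reads a cataloged table and writes it back to S3 in Parquet format typically looks like this:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue Spark job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table registered by the crawler (placeholder database/table names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="raw_sales_orders",
)

# Write the data back to S3 in Apache Parquet format (placeholder bucket/prefix).
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/processed/sales_orders/"},
    format="parquet",
)

job.commit()
```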
IAM:
This template deploys an IAM role named Glue_Execution_Role. The role can be assumed by AWS services and has AWS managed policies and an inline policy attached.
The role's AssumeRolePolicyDocument (trust policy) allows the AWS Glue service to assume it, and the attached policies grant access to AWS Glue and Amazon S3 so that the jobs have the permissions they need to run.
The IAM role includes AWS managed policies in addition to the inline policy.
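As a rough sketch of what the template sets up (the role name comes from the article; the specific managed policy, bucket name and inline-policy scope shown here are assumptions for illustration), the equivalent role could be created with boto3 like this:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="Glue_Execution_Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# An AWS managed policy for Glue (assumed here; the full set attached by the
# template is defined in the repository).
iam.attach_role_policy(
    RoleName="Glue_Execution_Role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline policy scoping S3 access to the data lake bucket (placeholder name).
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-datalake-bucket",
                "arn:aws:s3:::my-datalake-bucket/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="Glue_Execution_Role",
    PolicyName="DataLakeS3Access",
    PolicyDocument=json.dumps(inline_policy),
)
```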
Outcome:
Once the deployment is successful, you can see the resources created in AWS Glue and S3.
Now you can test the Glue connection created through CloudFormation.
You can also see the folders created in the S3 bucket.
Once the scheduled AWS Glue crawler runs, the AWS Glue Data Catalog lists the tables, and you can query them using Athena.
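For a quick check from code (the crawler and database names below are placeholders for the ones the template creates), the crawler can also be started on demand and the resulting catalog tables listed with boto3:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Trigger the crawler on demand (it also runs on its schedule).
glue.start_crawler(Name="datalake-raw-crawler")  # placeholder crawler name

# After the crawler finishes, list the tables it registered in the Data Catalog.
tables = glue.get_tables(DatabaseName="datalake_db")  # placeholder database name
for table in tables["TableList"]:
    print(table["Name"])
```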
Conclusion:
In this article, we have provided an AWS CloudFormation template that allows you to quickly set up the data lake resources and analyse your data in analytical tools. Reach out to us if you need more details!
Written by: Geetha Pandiyan & Umashankar N
Cloud Solution Architect
1CloudHub