October 26, 2024
In this article, we will discuss the challenges posed by a high volume of data. We will also discuss the scope of data source types you may need to ingest and store. We will end with a discussion of the AWS options available for storing data.
Businesses have been storing data for decades—that is nothing new. What has changed in recent years is the ability to analyze certain types of data.
There are three broad classifications of data source types:
- Structured data is organized and stored in the form of values that are grouped into rows and columns of a table.
- Semistructured data is often stored in a series of key-value pairs that are grouped into elements within a file.
- Unstructured data is not structured in a consistent way. Some of it has a structure similar to semistructured data, while other data carries only metadata.
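To make the three classifications concrete, here is a minimal Python sketch that uses only the standard library. The sample records, field names, and the `describe_blob` helper are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns, as in a table.
structured_source = "customer_id,name,balance\n1,Ana,250.00\n2,Ben,99.50\n"
rows = list(csv.DictReader(io.StringIO(structured_source)))

# Semistructured: key-value pairs grouped into elements.
# The fields may differ from one element to the next.
semistructured_source = (
    '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "ben@example.com"}]'
)
elements = json.loads(semistructured_source)

# Unstructured: opaque bytes (an image, audio, free text). Without further
# processing, often only metadata about the file is available.
def describe_blob(name: str, payload: bytes) -> dict:
    """Return what is known up front about a blob: its metadata."""
    return {"name": name, "size_bytes": len(payload)}

blob_info = describe_blob("photo.jpg", b"\xff\xd8\xff\xe0" + b"\x00" * 12)

print(rows[0]["name"], sorted(elements[1]), blob_info["size_bytes"])
```

Note how the two JSON elements do not share the same keys, which is what separates semistructured data from the fixed columns of the CSV table.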
Many internet articles talk about the huge amount of information sitting within unstructured data. New applications are being released that can catalog this untapped resource and provide incredible insights into it.
But what is unstructured data? It is in every file we store, every picture we take, and every email we send.
Modern data management platforms must capture data from diverse sources at speed and scale. Data needs to be pulled together in manageable, central repositories—breaking down traditional silos. The benefits of the collection and analysis of all business data must outweigh the costs.
Jennifer: "We need a solution to preprocess raw click streams generated by customers using our website. Processing these click streams provides valuable insight into customer preferences, which are passed to our data warehouse. The data warehouse then couples these customer preferences with marketing campaigns and recommendation engines to offer investment suggestions and analysis to consumers."
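A preprocessing step like the one Jennifer describes could look roughly like the sketch below. The event fields, the `/funds/<category>` URL layout, and both helper functions are assumptions made for this example; a real click stream would need its own parsing rules.

```python
from collections import Counter, defaultdict
from typing import Optional

# Hypothetical raw click events; the field names are assumptions.
raw_clicks = [
    {"customer": "c1", "page": "/funds/bonds"},
    {"customer": "c1", "page": "/funds/bonds?ref=home"},
    {"customer": "c1", "page": "/funds/stocks"},
    {"customer": "c2", "page": "/funds/stocks"},
    {"customer": "c2", "page": "/help"},  # not a product page, so ignored
]

def category_of(page: str) -> Optional[str]:
    """Map a URL such as /funds/bonds?ref=home to the category 'bonds'."""
    path = page.split("?", 1)[0]
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "funds":
        return parts[1]
    return None

def summarize_preferences(clicks):
    """Count clicks per category for each customer."""
    preferences = defaultdict(Counter)
    for click in clicks:
        category = category_of(click["page"])
        if category is not None:
            preferences[click["customer"]][category] += 1
    return {customer: dict(counts) for customer, counts in preferences.items()}

preferences = summarize_preferences(raw_clicks)
print(preferences)
```

The compact per-customer summary, rather than the raw events, is what would be passed on to the data warehouse.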
A data lake is a centralized repository that allows you to store structured, semistructured, and unstructured data at any scale.
Data lakes promise the ability to store all data for a business in a single repository. You can leverage data lakes to store large volumes of data instead of persisting that data in data warehouses. Data lakes, such as those built on Amazon S3, are generally less expensive than specialized big data storage solutions. That way, you pay for the specialized solutions only when you use them for processing and analytics, not for long-term storage. Your extract, transform, and load (ETL) and analytics processes can still access this data.
Be careful to learn how to use data in new ways. Don't limit analytics to typical data warehouse-style analytics. AI and machine learning offer significant insights.
Be careful not to let your data lake become a swamp. Enforce proper organization and structure for all data entering the lake.
Be careful to ensure that data within the data lake is relevant and does not go unused. Train users on how to access the data, and set retention policies to ensure the data stays refreshed.
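One common way to enforce organization is to require every object entering the lake to follow a naming convention. The sketch below is one possible convention, not an AWS requirement: the zone names, the `lake_key` helper, and the `year=/month=/day=` partition layout are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical zones that data moves through inside the lake.
ALLOWED_ZONES = {"raw", "cleaned", "curated"}

def lake_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key with date partitions, rejecting unknown zones."""
    if zone not in ALLOWED_ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )

key = lake_key("raw", "clickstream", date(2024, 10, 26), "events-0001.json")
print(key)
```

Rejecting keys that do not fit the convention at write time is far cheaper than reorganizing the lake afterward, and date-based prefixes also make retention policies easier to apply.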
A data warehouse is a central repository of structured data from many data sources. This data is transformed, aggregated, and prepared for business reporting and analysis.
A data warehouse is a central repository of information coming from one or more data sources. Data flows into a data warehouse from transactional systems, relational databases, and other sources. These data sources can include structured, semi-structured, and unstructured data. These data sources are transformed into structured data before they are stored in the data warehouse.
Data is stored within the data warehouse using a schema. A schema defines how data is stored within tables, columns, and rows. The schema enforces constraints on the data to ensure integrity of the data. The transformation process often involves the steps required to make the source data conform to the schema. Following the first successful ingestion of data into the data warehouse, the process of ingesting and transforming the data can continue at a regular cadence.
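The sketch below shows a schema enforcing constraints and a transformation step that makes source data conform to it. It uses Python's built-in SQLite module as a small stand-in for a data warehouse; the `sales` table, its columns, and the `conform` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        region     TEXT NOT NULL,
        amount_usd REAL NOT NULL CHECK (amount_usd >= 0)
    )
    """
)

def conform(raw: dict) -> tuple:
    """Transform one raw source record into the shape the schema expects."""
    return (int(raw["id"]), raw["region"].strip().upper(), float(raw["amount"]))

raw_records = [
    {"id": "1", "region": " emea ", "amount": "120.50"},
    {"id": "2", "region": "apac", "amount": "80"},
]
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)", [conform(r) for r in raw_records]
)

# The schema rejects data that violates its constraints.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (3, "EMEA", -5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

loaded = conn.execute(
    "SELECT order_id, region, amount_usd FROM sales ORDER BY order_id"
).fetchall()
print(loaded, rejected)
```

The negative amount never reaches the table, which is the integrity guarantee the schema provides.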
Business analysts, data scientists, and decision-makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications. Businesses use reports, dashboards, and analytics tools to extract insights from their data, monitor business performance, and support decision-making. These tools are powered by data warehouses, which store data efficiently to minimize I/O and deliver query results quickly to hundreds or thousands of concurrent users.
Data warehouses can be massive. Analyzing these huge stores of data can be confusing. Many organizations need a way to limit the tables to those that are most relevant to the analytics users will be performing.
A subset of data from a data warehouse is called a data mart. Data marts only focus on one subject or functional area. A warehouse might contain all relevant sources for an enterprise, but a data mart might store only a single department’s sources. Because data marts are generally a copy of data already contained in a data warehouse, they are often fast and simple to implement.
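As a rough illustration of a data mart being a copy of a subset of the warehouse, the sketch below again uses SQLite as a stand-in. The `orders` table, the department names, and the `marketing_mart` table are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER, department TEXT, amount_usd REAL)"
)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "marketing", 100.0), (2, "finance", 250.0), (3, "marketing", 40.0)],
)

# The data mart copies only the rows and columns one department needs.
warehouse.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT order_id, amount_usd FROM orders WHERE department = 'marketing'"
)

mart_rows = warehouse.execute(
    "SELECT order_id, amount_usd FROM marketing_mart ORDER BY order_id"
).fetchall()
print(mart_rows)
```

Because the mart is derived with a single query over data that is already cleansed, it is quick to build and can be rebuilt whenever the warehouse is refreshed.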
For analysis to be most effective, it should be performed on data that has been processed and cleansed. This often means implementing an ETL operation to collect, cleanse, and transform the data. This data is then placed in a data warehouse. It is very common for data from many different parts of the organization to be combined into a single data warehouse.
Amazon Redshift is a data warehousing solution specially designed for workloads of all sizes. Amazon Redshift Spectrum even provides the ability to query data that is housed in an Amazon S3 data lake.
Data lakes provide customers a means for including unstructured and semistructured data in their analytics. Analytic queries can be run over cataloged data within a data lake. This extends the reach of analytics beyond the confines of a single data warehouse.
Businesses can securely store data coming from applications and devices in its native format, with high availability and durability, at low cost, and at any scale. They can then access and analyze that data with the tools and frameworks of their choice, in a high-performance and cost-effective way, without having to move large amounts of data between storage and analytics systems.
Setting up a data lake and a data warehouse in AWS can be a great way to deploy a secure, cloud-based storage solution. Data lakes and data warehouses give businesses the flexibility to store large amounts of structured and unstructured data, while also allowing them to easily query and analyze data in real time. AWS offers a variety of services to make this process easier, such as Amazon S3, Amazon Redshift, AWS Lake Formation, and Amazon EMR. With the right configuration, businesses can store and access data quickly and securely, while also taking advantage of the scalability, security, and cost savings that AWS provides.
Traditional data storage and analytic tools can no longer provide the agility and flexibility required to deliver relevant business insights. That’s why many organizations are shifting to a data lake architecture.
A data lake on AWS can help you do the following:
- Collect and store any type of data, at any scale, and at a low cost
- Secure the data and prevent unauthorized access
- Catalog, search, and find the relevant data in the central repository
- Quickly and easily perform new types of data analysis
- Use a broad set of analytic engines for one-time analytics, real-time streaming, predictive analytics, AI, and machine learning
Amazon S3 is storage for the internet. This service is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web service interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. The service gives any developer access to the same highly scalable, reliable, secure, fast, and inexpensive infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize the benefits of scale and pass those benefits on to developers.
The benefits of Amazon S3 include the following:
- Store anything
- Secure object storage
- Natively online, HTTP access
- Unlimited scalability
- 99.999999999% durability
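To get a feel for what eleven nines of durability means, the arithmetic can be worked through in a few lines. This sketch assumes the figure is read as the average chance that a single object survives one year, and the count of ten million stored objects is a hypothetical example.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, to avoid floating-point rounding.
durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_probability = 1 - durability  # per object, per year

stored_objects = 10_000_000  # hypothetical number of objects
expected_losses_per_year = stored_objects * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(expected_losses_per_year, years_per_expected_loss)
```

Under these assumptions, a store of ten million objects would on average lose a single object about once every 10,000 years.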
To get the most out of Amazon S3, you need to understand a few simple concepts. First, Amazon S3 stores data as objects within buckets.
An object is composed of a file and any metadata that describes that file. To store an object in Amazon S3, you upload the file you want to store into a bucket. When you upload a file, you can set permissions on the object and add any metadata.
Buckets are logical containers for objects. You can have one or more buckets in your account and can control access for each bucket individually. You control who can create, delete, and list objects in the bucket. You can also view access logs for the bucket and its objects and choose the geographical region where Amazon S3 will store the bucket and its contents.
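The relationship between buckets, objects, and metadata can be pictured with a toy in-memory model. This is not the Amazon S3 API or an AWS SDK: `ToyBucket`, `StoredObject`, and their methods are hypothetical names used only to illustrate the concepts.

```python
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    """An object is the file's bytes plus metadata describing the file."""
    data: bytes
    metadata: dict = field(default_factory=dict)

@dataclass
class ToyBucket:
    """In-memory stand-in for a bucket: a flat mapping from key to object."""
    name: str
    region: str
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        self.objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> StoredObject:
        return self.objects[key]

    def list_keys(self, prefix: str = "") -> list:
        # In this model there are no folders, only keys sharing a prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

bucket = ToyBucket(name="example-data-lake", region="us-east-1")
bucket.put(
    "raw/clicks/2024-10-26.json",
    b'{"page": "/home"}',
    content_type="application/json",
)
bucket.put("raw/images/logo.png", b"\x89PNG", owner="web-team")

print(bucket.list_keys("raw/clicks/"))
```

Each object is addressed by its key within one bucket, and the bucket carries the settings, such as region, that apply to everything stored in it.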
With Amazon S3, you can cost-effectively store all data types in their native formats. You can then launch as many or as few virtual servers as you need using Amazon Elastic Compute Cloud (Amazon EC2) and use AWS analytics tools to process your data. You can optimize your EC2 instances to provide the correct ratios of CPU, memory, and bandwidth for best performance. Decoupling your processing and storage provides a significant number of benefits, including the ability to process and analyze the same data with a variety of tools.
Amazon S3 makes it easy to build a multi-tenant environment, where many users can bring their own data analytics tools to a common set of data. This improves both cost and data governance over traditional solutions, which require multiple copies of data to be distributed across multiple processing platforms.
Although this may require an additional step to load your data into the right tool, using Amazon S3 as your central data store provides even more benefits over traditional storage options.
Combine Amazon S3 with other AWS services to query and process data. Amazon S3 also integrates with AWS Lambda serverless computing to run code without provisioning or managing servers. Amazon Athena can query Amazon S3 directly using the Structured Query Language (SQL), without the need for data to be ingested into a relational database.
With all of these capabilities, you only pay for the actual amounts of data you process or the compute time you consume.
Representational State Transfer (REST) APIs are programming interfaces commonly used to interact with files in Amazon S3. Amazon S3's RESTful APIs are simple, easy to use, and supported by most major third-party independent software vendors (ISVs), including Apache Hadoop and other leading analytics tool vendors. This allows customers to bring the tools they are most comfortable with and knowledgeable about to help them perform analytics on data in Amazon S3.
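Because the interface is RESTful, every object is addressable by a URL. The sketch below builds a virtual-hosted-style object URL; the `s3_object_url` helper, the bucket name, and the key are hypothetical. It only constructs the address: a real request to a private object would also need a signed authorization header, which is omitted here.

```python
from urllib.parse import quote

def s3_object_url(bucket: str, region: str, key: str) -> str:
    """Build the virtual-hosted-style HTTPS URL for an object."""
    # quote() percent-encodes unsafe characters but keeps "/" so that
    # key prefixes stay readable in the path.
    encoded_key = quote(key, safe="/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{encoded_key}"

url = s3_object_url(
    "example-data-lake", "us-east-1", "raw/clicks/2024 10 26.json"
)
print(url)
```

Any tool that can issue HTTP requests against such URLs can work with the data, which is why so many third-party analytics tools integrate with Amazon S3.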
Businesses have begun realizing the power of data lakes. Businesses can place data within a data lake and use their choice of open-source distributed processing frameworks, such as those supported by Amazon EMR. Apache Hadoop and Spark are both supported by Amazon EMR, which has the ability to help businesses easily, quickly, and cost-effectively implement data processing solutions based on Amazon S3 data lakes.
Data scientists spend 60% of their time cleaning and organizing data and 19% collecting data sets.
Data preparation is a huge undertaking. There are no easy answers when it comes to cleaning, transforming, and collecting data for your data lake. However, there are services that can automate many of these time-consuming processes.
Setting up and managing data lakes today can involve a lot of manual, complicated, and time-consuming tasks. This work includes loading the data, monitoring the data flows, setting up partitions for the data, and tuning encryption. You may also need to reorganize data, deduplicate it, match linked records, and audit data over time.
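To show why deduplication and record matching take effort, here is a minimal hand-written sketch. It matches records on a normalized email address and keeps the most recent one; the record fields and both helpers are assumptions for illustration, and this exact-match rule is much simpler than what a managed service would offer.

```python
def normalize_email(email: str) -> str:
    """Make emails comparable by trimming whitespace and lowercasing."""
    return email.strip().lower()

def deduplicate(records: list) -> list:
    """Keep the most recently updated record for each normalized email."""
    latest = {}
    for record in records:
        key = normalize_email(record["email"])
        current = latest.get(key)
        # ISO 8601 date strings sort correctly as plain strings.
        if current is None or record["updated"] > current["updated"]:
            latest[key] = {**record, "email": key}
    return sorted(latest.values(), key=lambda r: r["email"])

records = [
    {"email": "Ana@Example.com ", "city": "Lisbon", "updated": "2024-01-05"},
    {"email": "ana@example.com", "city": "Porto", "updated": "2024-03-01"},
    {"email": "ben@example.com", "city": "Leeds", "updated": "2024-02-10"},
]
clean = deduplicate(records)
print(clean)
```

Even this small example needs decisions about normalization and about which duplicate wins, and those decisions multiply quickly across real data sets.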
AWS Lake Formation makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and when prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.
AWS Lake Formation makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning. Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon S3 data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and machine learning services. Lake Formation automatically configures underlying AWS services to ensure compliance with your defined policies. If you have set up transformation jobs spanning AWS services, Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the processing of your jobs.
AWS Lake Formation is an optional service that helps simplify and automate many of the complex manual steps required to create a secure data lake, such as data ingestion, data cataloging, data preparation, and security. AWS Lake Formation can also help you manage and audit access to data in your data lake. However, you can set up a data lake in AWS without using AWS Lake Formation. You can directly use AWS services such as Amazon S3, Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, and Amazon Kinesis Data Firehose to create and manage your data lake.
Data lakes built on AWS:
- Are a cost-effective data storage solution. You can durably store a nearly unlimited amount of data using Amazon S3.
- Implement industry-leading security and compliance. AWS uses stringent data security, compliance, privacy, and protection mechanisms.
- Allow you to take advantage of many different data collection and ingestion tools to ingest data into your data lake. These services include Amazon Kinesis for streaming data and AWS Snowball appliances for large volumes of on-premises data.
- Help you to categorize and manage your data simply and efficiently. Use AWS Glue to understand the data within your data lake, prepare it, and load it reliably into data stores. Once AWS Glue catalogs your data, it is immediately searchable, can be queried, and is available for ETL processing.
- Help you turn data into meaningful insights. Harness the power of purpose-built analytic services for a wide range of use cases, such as interactive analysis, data processing using Apache Spark and Apache Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
Amazon Redshift is a cloud-based data warehouse solution that allows for the storage and analysis of large amounts of data. It is designed to handle the demands of large-scale data warehouses and offers a variety of features to make data storing and analysis easier.
Amazon Redshift has a massively parallel processing (MPP) architecture that can scale up to petabytes of data and uses columnar storage to reduce query times. Redshift also provides a number of advanced features such as compression, data replication, and support for data integration. It can also be integrated with other Amazon services such as Amazon S3, Amazon EMR, and Amazon Kinesis. Amazon Redshift is a powerful and cost-effective solution for businesses looking to build large-scale data warehouses.
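The effect of columnar storage on query work can be illustrated with plain Python lists. This is a conceptual sketch only, not how Amazon Redshift is implemented; the sample rows and the `to_columns` helper are hypothetical.

```python
# Row-oriented layout: all fields of a record are stored together.
row_store = [
    {"order_id": 1, "region": "EMEA", "amount": 120},
    {"order_id": 2, "region": "APAC", "amount": 80},
    {"order_id": 3, "region": "EMEA", "amount": 40},
]

def to_columns(rows):
    """Pivot rows into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

column_store = to_columns(row_store)

# An aggregate over one column only has to read that column.
total_amount = sum(column_store["amount"])
values_read_columnar = len(column_store["amount"])

# Scanning whole rows touches every stored value instead.
values_read_row = sum(len(row) for row in row_store)

print(total_amount, values_read_columnar, values_read_row)
```

Analytic queries typically aggregate a few columns over many rows, so reading only the needed columns cuts the amount of data scanned, and values of the same type stored together also tend to compress well.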
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.
The benefits of Amazon Redshift include the following:
- Faster performance
- 10x faster than other data warehouses
- Easy to set up, deploy, and manage
- Secure
- Scales quickly to meet your needs
As the volume of data has increased, so have the options for storing data. Traditional storage methods such as data warehouses are still very popular and relevant. However, data lakes have become more popular recently. These new options can confuse businesses that are trying to be financially wise and technically relevant.
So which is better: data warehouses or data lakes? Neither and both. They are different solutions that can be used together to maintain existing data warehouses while taking full advantage of the benefits of data lakes.
- When storing individual objects or files, we recommend Amazon S3.
- When storing massive volumes of data, both semistructured and unstructured, we recommend building a data lake on Amazon S3.
- When storing massive amounts of structured data for complex analysis, we recommend storing your data in Amazon Redshift.
Each of the AWS processing services we will cover in the next lesson incorporates a temporary storage layer that houses data while it is being processed and analyzed. This data is eventually moved to permanent storage within one of the other solutions we have already discussed.
When many people think of working with a massive volume of fast-moving data, the first thing that comes to mind is Hadoop. Within AWS, Hadoop frameworks are implemented using Amazon EMR and AWS Glue. These services implement the Hadoop framework to ingest, transform, analyze, and move results to analytical data stores.
Hadoop uses a distributed processing architecture, in which a task is mapped to a cluster of commodity servers for processing. Each piece of work distributed to the cluster servers can be run or re-run on any of the servers. The cluster servers frequently use the Hadoop Distributed File System (HDFS) to store data locally for processing. The results of the computation performed by those servers are then reduced to a single output set. One node, designated as the master node, controls the distribution of tasks and can automatically handle server failures.
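The map and reduce steps described above can be sketched as a word count in plain Python. This runs in a single process and only mimics the programming model; it does not use Hadoop, and the function names and input chunks are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# In a cluster, each chunk could be mapped on a different server.
chunks = ["data lake data", "Data warehouse", "lake house"]
mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because `map_phase` depends only on its own chunk, any chunk can be processed or re-processed on any server, which is what allows the cluster to recover from a failed node.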
Hadoop facilitates data navigation, discovery, and one-time data analysis. With Hadoop, you can compensate for unexpected occurrences by analyzing large amounts of data quickly to form a response.
Unlike traditional database systems, Hadoop can process structured, semi-structured, or unstructured data. This includes virtually any data format currently available.
In addition to natively handling many types of data (such as XML, CSV, text, log files, objects, SQL, JSON, and binary), you can use Hadoop to transform data into formats that allow better integration into your existing data sets. Also, you can store data with or without a schema and perform large-scale ETL operations to transform your data.
Because Hadoop is open source, several ecosystem projects are available to help you analyze the multiple types of data Hadoop can process and analyze.
These projects give you tremendous flexibility when you are developing data analytics solutions. Hadoop’s programming frameworks (such as Hive and Pig) can support almost any data analytics use case for your applications.
Because of Hadoop’s distributed architecture, Hadoop clusters can handle tremendous amounts of data affordably. Adding additional data processing capability is as simple as adding additional servers to your cluster (horizontal scaling).
Amazon EMR is the AWS service that implements Hadoop frameworks. The service can ingest data from nearly any data source type at nearly any speed. Amazon EMR can use two different file systems: HDFS or the EMR File System (EMRFS). A file system is a set of organizational rules that govern how files are stored.
To handle massive volumes of data rapidly, the processing system requires a way to distribute the load of reading and writing files across tens or even hundreds of high-powered servers. HDFS is distributed storage that allows files to be read from and written to clusters of servers in parallel. This dramatically reduces the duration of each operation.
It is helpful to understand the inner workings of an HDFS cluster. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
Configure most of the servers as DataNodes, where the actual data will be stored, and a small number of servers as NameNodes, which contain maps of where the data resides.
Clients contact a NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
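The division of labor between the NameNode and the DataNodes can be modeled in a few lines. This toy sketch is not the HDFS API: the classes, the round-robin placement, and the tiny block size are assumptions for illustration, and it leaves out block replication entirely.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose, real HDFS blocks are far larger

class DataNode:
    """Stores the actual block contents."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Stores only metadata: which blocks make up a file and where they are."""
    def __init__(self):
        self.block_map = {}  # filename -> list of (datanode_index, block_id)

def write_file(name_node, data_nodes, filename, data: bytes):
    """Split data into blocks and spread them round-robin over DataNodes."""
    locations = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_number = offset // BLOCK_SIZE
        block_id = f"{filename}#{block_number}"
        node_index = block_number % len(data_nodes)
        data_nodes[node_index].blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        locations.append((node_index, block_id))
    name_node.block_map[filename] = locations

def read_file(name_node, data_nodes, filename) -> bytes:
    """Ask the NameNode where blocks live, then read them from the DataNodes."""
    locations = name_node.block_map[filename]
    return b"".join(data_nodes[i].blocks[block_id] for i, block_id in locations)

name_node = NameNode()
data_nodes = [DataNode(), DataNode(), DataNode()]
write_file(name_node, data_nodes, "log.txt", b"hello hdfs!")
restored = read_file(name_node, data_nodes, "log.txt")
print(restored)
```

The NameNode never holds file contents, so it stays small, while reads and writes are spread across the DataNodes and can proceed in parallel.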
An Amazon EMR process begins by ingesting data from one or more data sources and storing that data within a file system. If you use HDFS, the file system is stored on an Amazon Elastic Block Store (Amazon EBS) volume. This storage volume is ephemeral, meaning that the storage is temporary. Once the data has been copied into the HDFS volume, the transformation and analysis of the data are performed. The results are then sent to an analytical data store, such as an Amazon S3 data lake or an Amazon Redshift data warehouse.
Amazon EMR provides an alternative to HDFS: the EMR File System (EMRFS). EMRFS can help ensure that there is a persistent "source of truth" for HDFS data stored in Amazon S3. When implementing EMRFS, there is no need to copy data into the cluster before transforming and analyzing the data as with HDFS. EMRFS can catalog data within a data lake on Amazon S3. The time that is saved by eliminating the copy step can dramatically improve performance of the cluster.
In this article, we discussed data lakes and data warehouses, along with the challenges posed by a high volume of data. We ended with a discussion of the options available within AWS for setting up your data lake or data warehouse.