Eagar, Gareth. Data engineering with AWS: build and implement complex data pipelines using AWS / Gareth Eagar. — 1 online resource. — <URL:http://elib.fa.ru/ebsco/3108938.pdf>
Record created: 15.10.2021
Subjects: Cloud computing; Big data; Infonuagique; Données volumineuses
Collections: EBSCO
The 'Read' and 'Download' actions become available if you sign in or access the site from a computer on a different network.
Usage rights for the stored object

Access location | User group | Action
---|---|---
Finuniversity local network | All |
Internet | Readers |
Internet | Anonymous users |
Contents
- Cover
- Title page
- Copyright and Credits
- Contributors
- Table of Contents
- Preface
- Section 1: AWS Data Engineering Concepts and Trends
- Chapter 1: An Introduction to Data Engineering
- Technical requirements
- The rise of big data as a corporate asset
- The challenges of ever-growing datasets
- Data engineers – the big data enablers
- Understanding the role of the data engineer
- Understanding the role of the data scientist
- Understanding the role of the data analyst
- Understanding other common data-related roles
- The benefits of the cloud when building big data analytic solutions
- Hands-on – creating and accessing your AWS account
- Creating a new AWS account
- Accessing your AWS account
- Summary
- Chapter 2: Data Management Architectures for Analytics
- Technical requirements
- The evolution of data management for analytics
- Databases and data warehouses
- Dealing with big, unstructured data
- A lake on the cloud and a house on that lake
- Understanding data warehouses and data marts – fountains of truth
- Distributed storage and massively parallel processing
- Columnar data storage and efficient data compression
- Dimensional modeling in data warehouses
- Understanding the role of data marts
- Feeding data into the warehouse – ETL and ELT pipelines
- Building data lakes to tame the variety and volume of big data
- Data lake logical architecture
- Bringing together the best of both worlds with the lake house architecture
- Data lakehouse implementations
- Building a data lakehouse on AWS
- Hands-on – configuring the AWS Command Line Interface tool and creating an S3 bucket
- Installing and configuring the AWS CLI
- Creating a new Amazon S3 bucket
- Summary
- Chapter 3: The AWS Data Engineer's Toolkit
- Technical requirements
- AWS services for ingesting data
- Overview of Amazon Database Migration Service (DMS)
- Overview of Amazon Kinesis for streaming data ingestion
- Overview of Amazon MSK for streaming data ingestion
- Overview of Amazon AppFlow for ingesting data from SaaS services
- Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
- Overview of Amazon DataSync for ingesting from on-premises storage
- Overview of the AWS Snow family of devices for large data transfers
- AWS services for transforming data
- Overview of AWS Lambda for light transformations
- Overview of AWS Glue for serverless Spark processing
- Overview of Amazon EMR for Hadoop ecosystem processing
- AWS services for orchestrating big data pipelines
- Overview of AWS Glue workflows for orchestrating Glue components
- Overview of AWS Step Functions for complex workflows
- Overview of Amazon managed workflows for Apache Airflow
- AWS services for consuming data
- Overview of Amazon Athena for SQL queries in the data lake
- Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
- Overview of Amazon QuickSight for visualizing data
- Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
- Creating a Lambda layer containing the AWS Data Wrangler library
- Creating new Amazon S3 buckets
- Creating an IAM policy and role for your Lambda function
- Creating a Lambda function
- Configuring our Lambda function to be triggered by an S3 upload
- Summary
- Chapter 4: Data Cataloging, Security, and Governance
- Technical requirements
- Getting data security and governance right
- Common data regulatory requirements
- Core data protection concepts
- Personal data
- Encryption
- Anonymized data
- Pseudonymized data/tokenization
- Authentication
- Authorization
- Putting these concepts together
- Cataloging your data to avoid the data swamp
- How to avoid the data swamp
- The AWS Glue/Lake Formation data catalog
- AWS services for data encryption and security monitoring
- AWS Key Management Service (KMS)
- Amazon Macie
- Amazon GuardDuty
- AWS services for managing identity and permissions
- AWS Identity and Access Management (IAM) service
- Using AWS Lake Formation to manage data lake access
- Hands-on – configuring Lake Formation permissions
- Creating a new user with IAM permissions
- Transitioning to managing fine-grained permissions with AWS Lake Formation
- Summary
- Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
- Chapter 5: Architecting Data Engineering Pipelines
- Technical requirements
- Approaching the data pipeline architecture
- Architecting houses and architecting pipelines
- Whiteboarding as an information-gathering tool
- Conducting a whiteboarding session
- Identifying data consumers and understanding their requirements
- Identifying data sources and ingesting data
- Identifying data transformations and optimizations
- File format optimizations
- Data standardization
- Data quality checks
- Data partitioning
- Data denormalization
- Data cataloging
- Whiteboarding data transformation
- Loading data into data marts
- Wrapping up the whiteboarding session
- Hands-on – architecting a sample pipeline
- Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
- Summary
- Chapter 6: Ingesting Batch and Streaming Data
- Technical requirements
- Understanding data sources
- Data variety
- Data volume
- Data velocity
- Data veracity
- Data value
- Questions to ask
- Ingesting data from a relational database
- AWS Database Migration Service (DMS)
- AWS Glue
- Other ways to ingest data from a database
- Deciding on the best approach for ingesting from a database
- Ingesting streaming data
- Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
- Hands-on – ingesting data with AWS DMS
- Creating a new MySQL database instance
- Loading the demo data using an Amazon EC2 instance
- Creating an IAM policy and role for DMS
- Configuring DMS settings and performing a full load from MySQL to S3
- Querying data with Amazon Athena
- Hands-on – ingesting streaming data
- Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
- Configuring Amazon Kinesis Data Generator (KDG)
- Adding newly ingested data to the Glue Data Catalog
- Querying the data with Amazon Athena
- Summary
- Chapter 7: Transforming Data to Optimize for Analytics
- Technical requirements
- Transformations – making raw data more valuable
- Cooking, baking, and data transformations
- Transformations as part of a pipeline
- Types of data transformation tools
- Apache Spark
- Hadoop and MapReduce
- SQL
- GUI-based tools
- Data preparation transformations
- Protecting PII data
- Optimizing the file format
- Optimizing with data partitioning
- Data cleansing
- Business use case transforms
- Data denormalization
- Enriching data
- Pre-aggregating data
- Extracting metadata from unstructured data
- Working with change data capture (CDC) data
- Traditional approaches – data upserts and SQL views
- Modern approaches – the transactional data lake
- Hands-on – joining datasets with AWS Glue Studio
- Creating a new data lake zone – the curated zone
- Creating a new IAM role for the Glue job
- Configuring a denormalization transform using AWS Glue Studio
- Finalizing the denormalization transform job to write to S3
- Create a transform job to join streaming and film data using AWS Glue Studio
- Summary
- Chapter 8: Identifying and Enabling Data Consumers
- Technical requirements
- Understanding the impact of data democratization
- A growing variety of data consumers
- Meeting the needs of business users with data visualization
- AWS tools for business users
- Meeting the needs of data analysts with structured reporting
- AWS tools for data analysts
- Meeting the needs of data scientists and ML models
- AWS tools used by data scientists to work with data
- Hands-on – creating data transformations with AWS Glue DataBrew
- Configuring new datasets for AWS Glue DataBrew
- Creating a new Glue DataBrew project
- Building your Glue DataBrew recipe
- Creating a Glue DataBrew job
- Summary
- Chapter 9: Loading Data into a Data Mart
- Technical requirements
- Extending analytics with data warehouses/data marts
- Cold data
- Warm data
- Hot data
- What not to do – anti-patterns for a data warehouse
- Using a data warehouse as a transactional datastore
- Using a data warehouse as a data lake
- Using data warehouses for real-time, record-level use cases
- Storing unstructured data
- Redshift architecture review and storage deep dive
- Data distribution across slices
- Redshift Zone Maps and sorting data
- Designing a high-performance data warehouse
- Selecting the optimal Redshift node type
- Selecting the optimal table distribution style and sort key
- Selecting the right data type for columns
- Selecting the optimal table type
- Moving data between a data lake and Redshift
- Optimizing data ingestion in Redshift
- Exporting data from Redshift to the data lake
- Hands-on – loading data into an Amazon Redshift cluster and running queries
- Uploading our sample data to Amazon S3
- IAM roles for Redshift
- Creating a Redshift cluster
- Creating external tables for querying data in S3
- Creating a schema for a local Redshift table
- Running complex SQL queries against our data
- Summary
- Chapter 10: Orchestrating the Data Pipeline
- Technical requirements
- Understanding the core concepts for pipeline orchestration
- What is a data pipeline, and how do you orchestrate it?
- How do you trigger a data pipeline to run?
- How do you handle the failures of a step in your pipeline?
- Examining the options for orchestrating pipelines in AWS
- AWS Data Pipeline for managing ETL between data sources
- AWS Glue Workflows to orchestrate Glue resources
- Apache Airflow as an open source orchestration solution
- Pros and cons of using MWAA
- AWS Step Function for a serverless orchestration solution
- Pros and cons of using AWS Step Function
- Deciding on which data pipeline orchestration tool to use
- Hands-on – orchestrating a data pipeline using AWS Step Function
- Creating new Lambda functions
- Creating an SNS topic and subscribing to an email address
- Creating a new Step Function state machine
- Configuring AWS CloudTrail and Amazon EventBridge
- Summary
- Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
- Chapter 11: Ad Hoc Queries with Amazon Athena
- Technical requirements
- Amazon Athena – in-place SQL analytics for the data lake
- Tips and tricks to optimize Amazon Athena queries
- Common file format and layout optimizations
- Writing optimized SQL queries
- Federating the queries of external data sources with Amazon Athena Query Federation
- Querying external data sources using Athena Federated Query
- Managing governance and costs with Amazon Athena Workgroups
- Athena Workgroups overview
- Enforcing settings for groups of users
- Enforcing data usage controls
- Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
- Hands-on – switching Workgroups and running queries
- Summary
- Chapter 12: Visualizing Data with Amazon QuickSight
- Technical requirements
- Representing data visually for maximum impact
- Benefits of data visualization
- Popular uses of data visualizations
- Understanding Amazon QuickSight's core concepts
- Standard versus enterprise edition
- SPICE – the in-memory storage and computation engine for QuickSight
- Ingesting and preparing data from a variety of sources
- Preparing datasets in QuickSight versus performing ETL outside of QuickSight
- Creating and sharing visuals with QuickSight analyses and dashboards
- Visual types in Amazon QuickSight
- Understanding QuickSight's advanced features – ML Insights and embedded dashboards
- Amazon QuickSight ML Insights
- Amazon QuickSight embedded dashboards
- Hands-on – creating a simple QuickSight visualization
- Setting up a new QuickSight account and loading a dataset
- Creating a new analysis
- Summary
- Chapter 13: Enabling Artificial Intelligence and Machine Learning
- Technical requirements
- Understanding the value of ML and AI for organizations
- Specialized ML projects
- Everyday use cases for ML and AI
- Exploring AWS services for ML
- AWS ML services
- Exploring AWS services for AI
- AI for unstructured speech and text
- AI for extracting metadata from images and video
- AI for ML-powered forecasts
- AI for fraud detection and personalization
- Hands-on – reviewing reviews with Amazon Comprehend
- Setting up a new Amazon SQS message queue
- Creating a Lambda function for calling Amazon Comprehend
- Adding Comprehend permissions for our IAM role
- Adding a Lambda function as a trigger for our SQS message queue
- Testing the solution with Amazon Comprehend
- Summary
- Further reading
- Chapter 14: Wrapping Up the First Part of Your Learning Journey
- Technical requirements
- Looking at the data analytics big picture
- Managing complex data environments with DataOps
- Examining examples of real-world data pipelines
- A decade of data wrapped up for Spotify users
- Ingesting and processing streaming files at Netflix scale
- Imagining the future – a look at emerging trends
- ACID transactions directly on data lake data
- More data and more streaming ingestion
- Multi-cloud
- Decentralized data engineering teams, data platforms, and a data mesh architecture
- Data and product thinking convergence
- Data and self-serve platform design convergence
- Implementations of the data mesh architecture
- Hands-on – cleaning up your AWS account
- Reviewing AWS Billing to identify the resources being charged for
- Closing your AWS account
- Summary
- About Packt
- Other Books You May Enjoy
- Index