Electronic Library of the Financial University

Detailed information

Eagar, Gareth. Data engineering with AWS: build and implement complex data pipelines using AWS / Gareth Eagar. — 1 online resource — <URL:http://elib.fa.ru/ebsco/3108938.pdf>.

Record creation date: 15.10.2021

Subjects: Cloud computing; Big data

Collections: EBSCO

Permitted actions:

The 'Read' and 'Download' actions will be available if you sign in or access the site from a computer on a different network.

Group: Anonymous users

Network: Internet

Usage rights for the stored object

Access location                      User group         Actions
Financial University local network   All                Read, Print, Download
Internet                             Readers            Read, Print
Internet                             Anonymous users    (none)

Table of contents

  • Cover
  • Title page
  • Copyright and Credits
  • Contributors
  • Table of Contents
  • Preface
  • Section 1: AWS Data Engineering Concepts and Trends
  • Chapter 1: An Introduction to Data Engineering
    • Technical requirements
    • The rise of big data as a corporate asset
    • The challenges of ever-growing datasets
    • Data engineers – the big data enablers
      • Understanding the role of the data engineer
      • Understanding the role of the data scientist
      • Understanding the role of the data analyst
      • Understanding other common data-related roles
    • The benefits of the cloud when building big data analytic solutions
    • Hands-on – creating and accessing your AWS account
      • Creating a new AWS account
      • Accessing your AWS account
    • Summary
  • Chapter 2: Data Management Architectures for Analytics
    • Technical requirements
    • The evolution of data management for analytics
      • Databases and data warehouses
      • Dealing with big, unstructured data
      • A lake on the cloud and a house on that lake
    • Understanding data warehouses and data marts – fountains of truth
      • Distributed storage and massively parallel processing
      • Columnar data storage and efficient data compression
      • Dimensional modeling in data warehouses
      • Understanding the role of data marts
      • Feeding data into the warehouse – ETL and ELT pipelines
    • Building data lakes to tame the variety and volume of big data
      • Data lake logical architecture
    • Bringing together the best of both worlds with the lake house architecture
      • Data lakehouse implementations
      • Building a data lakehouse on AWS
    • Hands-on – configuring the AWS Command Line Interface tool and creating an S3 bucket
      • Installing and configuring the AWS CLI
      • Creating a new Amazon S3 bucket
    • Summary
  • Chapter 3: The AWS Data Engineer's Toolkit
    • Technical requirements
    • AWS services for ingesting data
      • Overview of Amazon Database Migration Service (DMS)
      • Overview of Amazon Kinesis for streaming data ingestion
      • Overview of Amazon MSK for streaming data ingestion
      • Overview of Amazon AppFlow for ingesting data from SaaS services
      • Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
      • Overview of Amazon DataSync for ingesting from on-premises storage
      • Overview of the AWS Snow family of devices for large data transfers
    • AWS services for transforming data
      • Overview of AWS Lambda for light transformations
      • Overview of AWS Glue for serverless Spark processing
      • Overview of Amazon EMR for Hadoop ecosystem processing
    • AWS services for orchestrating big data pipelines
      • Overview of AWS Glue workflows for orchestrating Glue components
      • Overview of AWS Step Functions for complex workflows
      • Overview of Amazon Managed Workflows for Apache Airflow
    • AWS services for consuming data
      • Overview of Amazon Athena for SQL queries in the data lake
      • Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
      • Overview of Amazon QuickSight for visualizing data
    • Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
      • Creating a Lambda layer containing the AWS Data Wrangler library
      • Creating new Amazon S3 buckets
      • Creating an IAM policy and role for your Lambda function
      • Creating a Lambda function
      • Configuring our Lambda function to be triggered by an S3 upload
    • Summary
  • Chapter 4: Data Cataloging, Security, and Governance
    • Technical requirements
    • Getting data security and governance right
      • Common data regulatory requirements
      • Core data protection concepts
      • Personal data
      • Encryption
      • Anonymized data
      • Pseudonymized data/tokenization
      • Authentication
      • Authorization
      • Putting these concepts together
    • Cataloging your data to avoid the data swamp
      • How to avoid the data swamp
    • The AWS Glue/Lake Formation data catalog
    • AWS services for data encryption and security monitoring
      • AWS Key Management Service (KMS)
      • Amazon Macie
      • Amazon GuardDuty
    • AWS services for managing identity and permissions
      • AWS Identity and Access Management (IAM) service
      • Using AWS Lake Formation to manage data lake access
    • Hands-on – configuring Lake Formation permissions
      • Creating a new user with IAM permissions
      • Transitioning to managing fine-grained permissions with AWS Lake Formation
    • Summary
  • Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
  • Chapter 5: Architecting Data Engineering Pipelines
    • Technical requirements
    • Approaching the data pipeline architecture
      • Architecting houses and architecting pipelines
      • Whiteboarding as an information-gathering tool
      • Conducting a whiteboarding session
    • Identifying data consumers and understanding their requirements
    • Identifying data sources and ingesting data
    • Identifying data transformations and optimizations
      • File format optimizations
      • Data standardization
      • Data quality checks
      • Data partitioning
      • Data denormalization
      • Data cataloging
      • Whiteboarding data transformation
    • Loading data into data marts
    • Wrapping up the whiteboarding session
    • Hands-on – architecting a sample pipeline
      • Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
    • Summary
  • Chapter 6: Ingesting Batch and Streaming Data
    • Technical requirements
    • Understanding data sources
      • Data variety
      • Data volume
      • Data velocity
      • Data veracity
      • Data value
      • Questions to ask
    • Ingesting data from a relational database
      • AWS Database Migration Service (DMS)
      • AWS Glue
      • Other ways to ingest data from a database
      • Deciding on the best approach for ingesting from a database
    • Ingesting streaming data
      • Amazon Kinesis versus Amazon Managed Streaming for Apache Kafka (MSK)
    • Hands-on – ingesting data with AWS DMS
      • Creating a new MySQL database instance
      • Loading the demo data using an Amazon EC2 instance
      • Creating an IAM policy and role for DMS
      • Configuring DMS settings and performing a full load from MySQL to S3
      • Querying data with Amazon Athena
    • Hands-on – ingesting streaming data
      • Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
      • Configuring Amazon Kinesis Data Generator (KDG)
      • Adding newly ingested data to the Glue Data Catalog
      • Querying the data with Amazon Athena
    • Summary
  • Chapter 7: Transforming Data to Optimize for Analytics
    • Technical requirements
    • Transformations – making raw data more valuable
      • Cooking, baking, and data transformations
      • Transformations as part of a pipeline
    • Types of data transformation tools
      • Apache Spark
      • Hadoop and MapReduce
      • SQL
      • GUI-based tools
    • Data preparation transformations
      • Protecting PII data
      • Optimizing the file format
      • Optimizing with data partitioning
      • Data cleansing
    • Business use case transforms
      • Data denormalization
      • Enriching data
      • Pre-aggregating data
      • Extracting metadata from unstructured data
    • Working with change data capture (CDC) data
      • Traditional approaches – data upserts and SQL views
      • Modern approaches – the transactional data lake
    • Hands-on – joining datasets with AWS Glue Studio
      • Creating a new data lake zone – the curated zone
      • Creating a new IAM role for the Glue job
      • Configuring a denormalization transform using AWS Glue Studio
      • Finalizing the denormalization transform job to write to S3
      • Creating a transform job to join streaming and film data using AWS Glue Studio
    • Summary
  • Chapter 8: Identifying and Enabling Data Consumers
    • Technical requirements
    • Understanding the impact of data democratization
      • A growing variety of data consumers
    • Meeting the needs of business users with data visualization
      • AWS tools for business users
    • Meeting the needs of data analysts with structured reporting
      • AWS tools for data analysts
    • Meeting the needs of data scientists and ML models
      • AWS tools used by data scientists to work with data
    • Hands-on – creating data transformations with AWS Glue DataBrew
      • Configuring new datasets for AWS Glue DataBrew
      • Creating a new Glue DataBrew project
      • Building your Glue DataBrew recipe
      • Creating a Glue DataBrew job
    • Summary
  • Chapter 9: Loading Data into a Data Mart
    • Technical requirements
    • Extending analytics with data warehouses/data marts
      • Cold data
      • Warm data
      • Hot data
    • What not to do – anti-patterns for a data warehouse
      • Using a data warehouse as a transactional datastore
      • Using a data warehouse as a data lake
      • Using data warehouses for real-time, record-level use cases
      • Storing unstructured data
    • Redshift architecture review and storage deep dive
      • Data distribution across slices
      • Redshift Zone Maps and sorting data
    • Designing a high-performance data warehouse
      • Selecting the optimal Redshift node type
      • Selecting the optimal table distribution style and sort key
      • Selecting the right data type for columns
      • Selecting the optimal table type
    • Moving data between a data lake and Redshift
      • Optimizing data ingestion in Redshift
      • Exporting data from Redshift to the data lake
    • Hands-on – loading data into an Amazon Redshift cluster and running queries
      • Uploading our sample data to Amazon S3
      • IAM roles for Redshift
      • Creating a Redshift cluster
      • Creating external tables for querying data in S3
      • Creating a schema for a local Redshift table
      • Running complex SQL queries against our data
    • Summary
  • Chapter 10: Orchestrating the Data Pipeline
    • Technical requirements
    • Understanding the core concepts for pipeline orchestration
      • What is a data pipeline, and how do you orchestrate it?
      • How do you trigger a data pipeline to run?
      • How do you handle the failures of a step in your pipeline?
    • Examining the options for orchestrating pipelines in AWS
      • AWS Data Pipeline for managing ETL between data sources
      • AWS Glue Workflows to orchestrate Glue resources
      • Apache Airflow as an open source orchestration solution
      • Pros and cons of using MWAA
      • AWS Step Functions for a serverless orchestration solution
      • Pros and cons of using AWS Step Functions
      • Deciding on which data pipeline orchestration tool to use
    • Hands-on – orchestrating a data pipeline using AWS Step Functions
      • Creating new Lambda functions
      • Creating an SNS topic and subscribing to an email address
      • Creating a new Step Function state machine
      • Configuring AWS CloudTrail and Amazon EventBridge
    • Summary
  • Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
  • Chapter 11: Ad Hoc Queries with Amazon Athena
    • Technical requirements
    • Amazon Athena – in-place SQL analytics for the data lake
    • Tips and tricks to optimize Amazon Athena queries
      • Common file format and layout optimizations
      • Writing optimized SQL queries
    • Federating the queries of external data sources with Amazon Athena Query Federation
      • Querying external data sources using Athena Federated Query
    • Managing governance and costs with Amazon Athena Workgroups
      • Athena Workgroups overview
      • Enforcing settings for groups of users
      • Enforcing data usage controls
    • Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
    • Hands-on – switching Workgroups and running queries
    • Summary
  • Chapter 12: Visualizing Data with Amazon QuickSight
    • Technical requirements
    • Representing data visually for maximum impact
      • Benefits of data visualization
      • Popular uses of data visualizations
    • Understanding Amazon QuickSight's core concepts
      • Standard versus enterprise edition
      • SPICE – the in-memory storage and computation engine for QuickSight
    • Ingesting and preparing data from a variety of sources
      • Preparing datasets in QuickSight versus performing ETL outside of QuickSight
    • Creating and sharing visuals with QuickSight analyses and dashboards
      • Visual types in Amazon QuickSight
    • Understanding QuickSight's advanced features – ML Insights and embedded dashboards
      • Amazon QuickSight ML Insights
      • Amazon QuickSight embedded dashboards
    • Hands-on – creating a simple QuickSight visualization
      • Setting up a new QuickSight account and loading a dataset
      • Creating a new analysis
    • Summary
  • Chapter 13: Enabling Artificial Intelligence and Machine Learning
    • Technical requirements
    • Understanding the value of ML and AI for organizations
      • Specialized ML projects
      • Everyday use cases for ML and AI
    • Exploring AWS services for ML
      • AWS ML services
    • Exploring AWS services for AI
      • AI for unstructured speech and text
      • AI for extracting metadata from images and video
      • AI for ML-powered forecasts
      • AI for fraud detection and personalization
    • Hands-on – reviewing reviews with Amazon Comprehend
      • Setting up a new Amazon SQS message queue
      • Creating a Lambda function for calling Amazon Comprehend
      • Adding Comprehend permissions for our IAM role
      • Adding a Lambda function as a trigger for our SQS message queue
      • Testing the solution with Amazon Comprehend
    • Summary
    • Further reading
  • Chapter 14: Wrapping Up the First Part of Your Learning Journey
    • Technical requirements
    • Looking at the data analytics big picture
      • Managing complex data environments with DataOps
    • Examining examples of real-world data pipelines
      • A decade of data wrapped up for Spotify users
      • Ingesting and processing streaming files at Netflix scale
    • Imagining the future – a look at emerging trends
      • ACID transactions directly on data lake data
      • More data and more streaming ingestion
      • Multi-cloud
      • Decentralized data engineering teams, data platforms, and a data mesh architecture
      • Data and product thinking convergence
      • Data and self-serve platform design convergence
      • Implementations of the data mesh architecture
    • Hands-on – cleaning up your AWS account
      • Reviewing AWS Billing to identify the resources being charged for
      • Closing your AWS account
    • Summary
  • About Packt
  • Other Books You May Enjoy
  • Index

Usage statistics

Number of accesses: 0
Over the last 30 days: 0
Detailed statistics