FinUniversity Electronic Library

Details

Marin, Ivan. Big data analysis with Python: combine Spark and Python to unlock the powers of parallel computing and machine learning / Ivan Marin, Ankit Shukla and Sarang VK. — 1 online resource (276 pages) — <URL:http://elib.fa.ru/ebsco/2102158.pdf>.

Record creation date: 4/20/2019

Subject: Big data; Python (Computer program language); Cloud computing; Machine learning

Collections: EBSCO

Allowed Actions:

The 'Read' and 'Download' actions will be available if you log in or access the site from another network.

Group: Anonymous

Network: Internet

Annotation

Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control the data avalanche for you. With this book, you'll learn effective techniques to aggregate data into useful dimensions for posterior analysis, extract ...

Document access rights

Network                       User group   Actions
Finuniversity Local Network   All          Read, Print, Download
Internet                      Readers      Read, Print
Internet (current session)    Anonymous    none

Table of Contents

  • Cover
  • Copyright
  • Table of Contents
  • Preface
  • Chapter 1: The Python Data Science Stack
    • Introduction
    • Python Libraries and Packages
      • IPython: A Powerful Interactive Shell
      • Exercise 1: Interacting with the Python Shell Using the IPython Commands
      • The Jupyter Notebook
      • Exercise 2: Getting Started with the Jupyter Notebook
      • IPython or Jupyter?
      • Activity 1: IPython and Jupyter
      • NumPy
      • SciPy
      • Matplotlib
      • Pandas
    • Using Pandas
      • Reading Data
      • Exercise 3: Reading Data with Pandas
      • Data Manipulation
      • Selection and Filtering
      • Selecting Rows Using Slicing
      • Exercise 4: Data Selection and the .loc Method
      • Applying a Function to a Column
      • Activity 2: Working with Data Problems
    • Data Type Conversion
      • Exercise 5: Exploring Data Types
    • Aggregation and Grouping
      • Exercise 6: Aggregation and Grouping Data
      • NumPy on Pandas
    • Exporting Data from Pandas
      • Exercise 7: Exporting Data in Different Formats
    • Visualization with Pandas
      • Activity 3: Plotting Data with Pandas
    • Summary
  • Chapter 2: Statistical Visualizations
    • Introduction
    • Types of Graphs and When to Use Them
      • Exercise 8: Plotting an Analytical Function
    • Components of a Graph
      • Exercise 9: Creating a Graph
      • Exercise 10: Creating a Graph for a Mathematical Function
    • Seaborn
    • Which Tool Should Be Used?
    • Types of Graphs
      • Line Graphs
      • Time Series Plots
      • Exercise 11: Creating Line Graphs Using Different Libraries
    • Pandas DataFrames and Grouped Data
      • Activity 4: Line Graphs with the Object-Oriented API and Pandas DataFrames
      • Scatter Plots
      • Activity 5: Understanding Relationships of Variables Using Scatter Plots
      • Histograms
      • Exercise 12: Creating a Histogram of Horsepower Distribution
      • Boxplots
      • Exercise 13: Analyzing the Behavior of the Number of Cylinders and Horsepower Using a Boxplot
    • Changing Plot Design: Modifying Graph Components
      • Title and Label Configuration for Axis Objects
      • Exercise 14: Configuring a Title and Labels for Axis Objects
      • Line Styles and Color
      • Figure Size
      • Exercise 15: Working with Matplotlib Style Sheets
    • Exporting Graphs
      • Activity 6: Exporting a Graph to a File on Disk
      • Activity 7: Complete Plot Design
    • Summary
  • Chapter 3: Working with Big Data Frameworks
    • Introduction
    • Hadoop
      • Manipulating Data with the HDFS
      • Exercise 16: Manipulating Files in the HDFS
    • Spark
      • Spark SQL and Pandas DataFrames
      • Exercise 17: Performing DataFrame Operations in Spark
      • Exercise 18: Accessing Data with Spark
      • Exercise 19: Reading Data from the Local Filesystem and the HDFS
      • Exercise 20: Writing Data Back to the HDFS and PostgreSQL
    • Writing Parquet Files
      • Exercise 21: Writing Parquet Files
      • Increasing Analysis Performance with Parquet and Partitions
      • Exercise 22: Creating a Partitioned Dataset
    • Handling Unstructured Data
      • Exercise 23: Parsing Text and Cleaning
      • Activity 8: Removing Stop Words from Text
    • Summary
  • Chapter 4: Diving Deeper with Spark
    • Introduction
    • Getting Started with Spark DataFrames
      • Exercise 24: Specifying the Schema of a DataFrame
      • Exercise 25: Creating a DataFrame from an Existing RDD
      • Exercise 26: Creating a DataFrame Using a CSV File
    • Writing Output from Spark DataFrames
      • Exercise 27: Converting a Spark DataFrame to a Pandas DataFrame
    • Exploring Spark DataFrames
      • Exercise 28: Displaying Basic DataFrame Statistics
      • Activity 9: Getting Started with Spark DataFrames
    • Data Manipulation with Spark DataFrames
      • Exercise 29: Selecting and Renaming Columns from the DataFrame
      • Exercise 30: Adding and Removing a Column from the DataFrame
      • Exercise 31: Displaying and Counting Distinct Values in a DataFrame
      • Exercise 32: Removing Duplicate Rows and Filtering Rows of a DataFrame
      • Exercise 33: Ordering Rows in a DataFrame
      • Exercise 34: Aggregating Values in a DataFrame
      • Activity 10: Data Manipulation with Spark DataFrames
    • Graphs in Spark
      • Exercise 35: Creating a Bar Chart
      • Exercise 36: Creating a Linear Model Plot
      • Exercise 37: Creating a KDE Plot and a Boxplot
      • Activity 11: Graphs in Spark
    • Summary
  • Chapter 5: Handling Missing Values and Correlation Analysis
    • Introduction
    • Setting up the Jupyter Notebook
    • Missing Values
      • Exercise 38: Counting Missing Values in a DataFrame
      • Exercise 39: Counting Missing Values in All DataFrame Columns
      • Fetching Missing Value Records from the DataFrame
    • Handling Missing Values in Spark DataFrames
      • Exercise 40: Removing Records with Missing Values from a DataFrame
      • Exercise 41: Filling Missing Values with a Constant in a DataFrame Column
    • Correlation
      • Exercise 42: Computing Correlation
      • Activity 12: Missing Value Handling and Correlation Analysis with PySpark DataFrames
    • Summary
  • Chapter 6: Exploratory Data Analysis
    • Introduction
    • Defining a Business Problem
      • Problem Identification
      • Requirement Gathering
      • Data Pipeline and Workflow
      • Identifying Measurable Metrics
      • Documentation and Presentation
    • Translating a Business Problem into Measurable Metrics and Exploratory Data Analysis (EDA)
      • Data Gathering
      • Analysis of Data Generation
      • KPI Visualization
      • Feature Importance
      • Exercise 43: Identify the Target Variable and Related KPIs from the Given Data for the Business Problem
      • Exercise 44: Generate the Feature Importance of the Target Variable and Carry Out EDA
    • Structured Approach to the Data Science Project Life Cycle
      • Data Science Project Life Cycle Phases
      • Phase 1: Understanding and Defining the Business Problem
      • Phase 2: Data Access and Discovery
      • Phase 3: Data Engineering and Pre-processing
      • Activity 13: Carry Out Mapping to Gaussian Distribution of Numeric Features from the Given Data
      • Phase 4: Model Development
    • Summary
  • Chapter 7: Reproducibility in Big Data Analysis
    • Introduction
    • Reproducibility with Jupyter Notebooks
      • Introduction to the Business Problem
      • Documenting the Approach and Workflows
      • Explaining the Data Pipeline
      • Explain the Dependencies
      • Using Source Code Version Control
      • Modularizing the Process
    • Gathering Data in a Reproducible Way
      • Functionalities in Markdown and Code Cells
      • Explaining the Business Problem in the Markdown
      • Providing a Detailed Introduction to the Data Source
      • Explain the Data Attributes in the Markdown
      • Exercise 45: Performing Data Reproducibility
    • Code Practices and Standards
      • Environment Documentation
      • Writing Readable Code with Comments
      • Effective Segmentation of Workflows
      • Workflow Documentation
      • Exercise 46: Missing Value Preprocessing with High Reproducibility
    • Avoiding Repetition
      • Using Functions and Loops for Optimizing Code
      • Developing Libraries/Packages for Code/Algorithm Reuse
      • Activity 14: Carry Out Normalization of Data
    • Summary
  • Chapter 8: Creating a Full Analysis Report
    • Introduction
    • Reading Data in Spark from Different Data Sources
      • Exercise 47: Reading Data from a CSV File Using the PySpark Object
      • Reading JSON Data Using the PySpark Object
    • SQL Operations on a Spark DataFrame
      • Exercise 48: Reading Data in PySpark and Carrying Out SQL Operations
      • Exercise 49: Creating and Merging Two DataFrames
      • Exercise 50: Subsetting the DataFrame
    • Generating Statistical Measurements
      • Activity 15: Generating Visualization Using Plotly
    • Summary
  • Appendix
  • Index

Usage statistics

Access count: 0
Last 30 days: 0