Pandas Beginners Guide. From None to One!

Introduction to Pandas:

Pandas is a popular open-source library for data manipulation and analysis in Python. It provides powerful tools for cleaning, transforming, and analyzing data, making it a valuable resource for anyone working with data in Python.

The key features of Pandas include:

  1. Data Structures: Pandas provides two main data structures, Series and DataFrame, which allow for efficient and flexible storage and manipulation of data. Series is a one-dimensional array-like object that can hold any data type, while DataFrame is a two-dimensional table-like data structure that can store heterogeneous data.

  2. Data Cleaning and Transformation: Pandas provides a wide range of tools for cleaning and transforming data, such as handling missing values, dropping duplicates, and transforming data into different formats.

  3. Data Selection and Indexing: Pandas allows for easy selection and indexing of data based on specific criteria, such as slicing data, selecting rows and columns, and filtering data based on conditions.

  4. Grouping and Aggregation: Pandas allows for grouping and aggregation of data based on specific criteria, such as calculating summary statistics (e.g., mean, median, and standard deviation) for subsets of data.

  5. Importing and Exporting Data: Pandas provides functionality for reading and writing data in various formats, such as CSV, Excel, SQL, and JSON.

Pandas is useful for data analysis for several reasons:

  1. Efficiency: Pandas is designed to handle large datasets efficiently, making it an ideal tool for working with big data.

  2. Flexibility: Pandas provides a wide range of tools for working with data, making it easy to clean, transform, and analyze data in different ways.

  3. Integration: Pandas integrates well with other Python libraries, such as NumPy and Matplotlib, allowing for easy data analysis and visualization.

Overall, Pandas is a powerful and versatile library that is widely used in data analysis and data science for its efficiency, flexibility, and ease of use.

Installing Pandas:

Here is a step-by-step guide on how to install Pandas in Python:

  1. Open a command prompt or terminal window on your computer.

  2. Type "pip install pandas" and press enter. This will initiate the installation of the Pandas library through the Python Package Manager (pip).

  3. Wait for the installation process to complete. This may take a few minutes depending on your internet connection speed and computer specifications.

  4. Once the installation is complete, you can verify that Pandas has been installed by opening a Python interpreter and typing "import pandas" and pressing enter. If there are no error messages, this means that the installation was successful.

Alternatively, if you are using an integrated development environment (IDE) for Python, such as Jupyter Notebook or PyCharm, you can install Pandas through the IDE's built-in package manager. Simply search for "pandas" in the package manager and install it from there.

That's it! You now have Pandas installed on your computer and can start using it for data analysis and manipulation in Python.

Pandas Data Structures:

Pandas provides two main data structures for data manipulation and analysis: Series and DataFrame.

A Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, or even Python objects. It is similar to a NumPy array but has additional functionality such as labeling the data with an index. A Series consists of two parts: the data and the index.

To create a Series in Pandas, you can use the pd.Series() constructor. For example, the following code creates a Series of four integers with default index:

import pandas as pd

data = [1, 2, 3, 4]
s = pd.Series(data)
print(s)

Output:

0    1
1    2
2    3
3    4
dtype: int64

A DataFrame is a two-dimensional labeled data structure that can hold multiple data types, such as integers, floats, strings, and Python objects. It is similar to a spreadsheet or a SQL table and consists of rows and columns. A DataFrame can be created from various data sources, such as a CSV file, a SQL query, or a Python dictionary.

To create a DataFrame in Pandas, you can use the pd.DataFrame() constructor. For example, the following code creates a DataFrame with three columns and three rows:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'gender': ['F', 'M', 'M']}
df = pd.DataFrame(data)
print(df)

Output:

       name  age gender
0     Alice   25      F
1       Bob   30      M
2   Charlie   35      M

In a DataFrame, each column is a Series. You can access a column of a DataFrame using the column name as a key, like this:

ages = df['age']
print(ages)

Output:

0    25
1    30
2    35
Name: age, dtype: int64

In summary, a Series is a one-dimensional labeled array, and a DataFrame is a two-dimensional labeled data structure with multiple columns. They are the two main data structures in Pandas and are essential for data manipulation and analysis.

Importing and Exporting Data:

Pandas provides a range of functions to import and export data from various file formats. Here's how you can read data from and write data to some common file formats:

CSV

To read data from a CSV file, you can use the pd.read_csv() function. For example, the following code reads data from a CSV file called data.csv and creates a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
print(df)

To write data to a CSV file, you can use the to_csv() method of a DataFrame. For example, the following code writes the DataFrame df to a CSV file called output.csv:

df.to_csv('output.csv', index=False)

The index=False argument specifies that the index of the DataFrame should not be written to the CSV file.

Excel

To read data from an Excel file, you can use the pd.read_excel() function. For example, the following code reads data from an Excel file called data.xlsx and creates a DataFrame:

import pandas as pd

df = pd.read_excel('data.xlsx')
print(df)

To write data to an Excel file, you can use the to_excel() method of a DataFrame. For example, the following code writes the DataFrame df to an Excel file called output.xlsx:

df.to_excel('output.xlsx', index=False)

SQL

To read data from a SQL database, you can use the pd.read_sql() function. For example, the following code reads data from a SQL database and creates a DataFrame:

import pandas as pd
import sqlite3

conn = sqlite3.connect('example.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
print(df)

To write data to a SQL database, you can use the to_sql() method of a DataFrame. For example, the following code writes the DataFrame df to a SQL table called new_table:

import sqlite3

conn = sqlite3.connect('example.db')
df.to_sql('new_table', conn, if_exists='replace', index=False)

The if_exists='replace' argument specifies that if the table new_table already exists, it should be replaced with the DataFrame df. The index=False argument specifies that the index of the DataFrame should not be written to the SQL table.

JSON

To read data from a JSON file, you can use the pd.read_json() function. For example, the following code reads data from a JSON file called data.json and creates a DataFrame:

import pandas as pd

df = pd.read_json('data.json')
print(df)

To write data to a JSON file, you can use the to_json() method of a DataFrame. For example, the following code writes the DataFrame df to a JSON file called output.json:

df.to_json('output.json', orient='records')

The orient='records' argument specifies that the output should be in JSON record format.

Indexing and Selection:

In Pandas, you can select specific rows and columns in a DataFrame or Series using indexing and selection techniques. Here are some commonly used methods:

Indexing Columns

To select a single column of a DataFrame, you can use the square bracket notation with the column name as the index. For example, the following code selects the column called column_name:

df['column_name']

You can also select multiple columns by passing a list of column names to the square bracket notation. For example, the following code selects the columns called column1 and column2:

df[['column1', 'column2']]

Indexing Rows

To select a single row of a DataFrame, you can use the loc[] or iloc[] accessor with the row label or integer index, respectively. For example, the following code selects the row with index 0:

df.iloc[0]

You can also select multiple rows by passing a list of row labels or integer indexes to the loc[] or iloc[] accessor. For example, the following code selects the rows with indexes 0 and 1:

df.iloc[[0, 1]]

Slicing Data

To select a range of rows or columns, you can use slicing. For example, the following code selects the first three rows:

df.iloc[:3]

You can also slice columns by passing a range of column names to the square bracket notation. For example, the following code selects the columns from column1 to column3:

df.loc[:, 'column1':'column3']

Boolean Indexing

Boolean indexing allows you to select rows or columns based on a condition. For example, the following code selects the rows where the value in column_name is greater than 5:

df[df['column_name'] > 5]

You can also use multiple conditions by using the & (and) or | (or) operators. For example, the following code selects the rows where the value in column1 is greater than 5 and the value in column2 is less than 10:

df[(df['column1'] > 5) & (df['column2'] < 10)]

These are some of the common methods for indexing and selecting data in Pandas.

Data Cleaning and Transformation:

Data cleaning and transformation are crucial steps in data analysis. Pandas provides a wide range of functions to handle missing data, drop duplicates, rename columns, and apply various functions to the data.

Handling Missing Data

Missing data is a common problem in data analysis. Pandas provides functions to handle missing data such as isna(), fillna(), and dropna(). The isna() function returns a Boolean mask indicating which values are missing:

df.isna()

The fillna() function allows you to fill missing values with a specific value or method:

df.fillna(0) # fill missing values with 0
df.fillna(method='ffill') # forward fill missing values
df.fillna(method='bfill') # backward fill missing values

The dropna() function removes missing values from the data:

df.dropna() # drop rows with missing values

Dropping Duplicates

Duplicates can distort your analysis results. Pandas provides the drop_duplicates() function to remove duplicates from your data:

df.drop_duplicates() # drop duplicates

Renaming Columns

Pandas allow you to rename columns using the rename() function:

df.rename(columns={'old_name': 'new_name'})

Applying Functions to Data

Pandas provides several functions to apply various functions to the data, such as apply(), map(), and applymap(). The apply() function applies a function to each column or row of the DataFrame:

df.apply(function, axis=0) # apply function to each column
df.apply(function, axis=1) # apply function to each row

The map() function applies a function to each element of a Series:

series.map(function)

The applymap() function applies a function to each element of a DataFrame:

df.applymap(function)

These are some of the common methods for data cleaning and transformation in Pandas.

Aggregation and Grouping:

Aggregation and grouping are essential concepts in data analysis. Pandas provides a wide range of functions to group data based on specific criteria, calculate summary statistics, and aggregate data.

Grouping Data

Pandas allow you to group data based on specific criteria using the groupby() function. The groupby() function returns a group object that you can use to apply functions on each group:

grouped = df.groupby('column')

You can also group data based on multiple columns:

grouped = df.groupby(['column1', 'column2'])

Aggregating Data

Pandas provide several functions to aggregate data, such as mean(), median(), sum(), count(), and max(). These functions return summary statistics for each group:

grouped.mean() # calculate the mean of each group
grouped.median() # calculate the median of each group
grouped.sum() # calculate the sum of each group
grouped.count() # count the number of rows in each group
grouped.max() # calculate the maximum value in each group

Applying Custom Functions

You can also apply custom functions to each group using the apply() function:

grouped.apply(function)

The function parameter can be any Python function that operates on a DataFrame or Series.

Aggregating Data with Multiple Functions

Pandas allows you to apply multiple aggregation functions to each group using the agg() function:

grouped.agg(['mean', 'median', 'sum'])

This function returns a DataFrame with summary statistics for each group.

Filtering Data

You can filter data based on specific criteria using the filter() function:

grouped.filter(lambda x: x['column'].mean() > 10)

This function returns a DataFrame containing only the rows that satisfy the filter condition.

These are some of the common methods for aggregating and grouping data in Pandas.

Visualization:

Pandas provides a wide range of plotting and visualization functions that allow you to explore and analyze your data. Here are some of the basic plotting functions available in Pandas:

Line Plot

The plot() function can be used to create a line plot of a Series or DataFrame:

series.plot()
df.plot()

Bar Plot

The plot.bar() function can be used to create a bar plot of a Series or DataFrame:

series.plot.bar()
df.plot.bar()

Histogram

The plot.hist() function can be used to create a histogram of a Series or DataFrame:

series.plot.hist()
df.plot.hist()

Scatter Plot

The plot.scatter() function can be used to create a scatter plot of two Series:

df.plot.scatter(x='column1', y='column2')

Box Plot

The plot.box() function can be used to create a box plot of a Series or DataFrame:

series.plot.box()
df.plot.box()

Pie Chart

The plot.pie() function can be used to create a pie chart of a Series:

series.plot.pie()

Customization

Pandas provides several options to customize your plots, such as changing the title, axis labels, legend, and color scheme:

df.plot(title='Title', xlabel='X label', ylabel='Y label', legend=True, color=['red', 'green', 'blue'])

These are some of the basic plotting functions available in Pandas. You can explore more advanced plotting functions and customization options in the Pandas documentation.

Case Study:

let's work through an end-to-end example of using Pandas to analyze a real-world dataset.

The Dataset

We will be using the "Titanic" dataset, which contains information about passengers on the Titanic, including their age, sex, ticket class, and whether they survived the sinking of the ship. The dataset can be downloaded from Kaggle: kaggle.com/c/titanic/data

Step 1: Loading the Data

We will use the read_csv() function in Pandas to load the data from the CSV file:

import pandas as pd

df = pd.read_csv('titanic.csv')

Step 2: Exploring the Data

Let's take a look at the first few rows of the DataFrame using the head() function:

df.head()

This will give us a preview of the data and its structure.

Next, let's use the info() function to get more information about the data types of each column and the number of non-null values:

df.info()

This will help us identify any missing values or columns with incorrect data types.

Step 3: Cleaning the Data

Let's start by dropping any columns that are not relevant to our analysis:

df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Next, let's check for missing values and deal with them accordingly:

df.isnull().sum()

This will give us a count of missing values in each column. We can see that there are missing values in the "Age", "Embarked", and "Fare" columns.

For the "Age" column, let's fill in the missing values with the median age:

median_age = df['Age'].median()
df['Age'].fillna(median_age, inplace=True)

For the "Embarked" column, let's fill in the missing values with the mode (most common value):

mode_embarked = df['Embarked'].mode()[0]
df['Embarked'].fillna(mode_embarked, inplace=True)

For the "Fare" column, let's fill in the missing values with the median fare:

median_fare = df['Fare'].median()
df['Fare'].fillna(median_fare, inplace=True)

Let's check again for missing values:

df.isnull().sum()

Now we have no missing values in the DataFrame.

Step 4: Analyzing the Data

Let's start by exploring the survival rate of passengers based on their gender:

df.groupby('Sex')['Survived'].mean()

This will give us the mean survival rate for male and female passengers. We can see that female passengers had a higher survival rate than male passengers.

Next, let's explore the survival rate based on ticket class:

df.groupby('Pclass')['Survived'].mean()

This will give us the mean survival rate for passengers in each ticket class. We can see that passengers in first class had a higher survival rate than passengers in second or third class.

Finally, let's create a visualization to explore the relationship between age and survival:

import matplotlib.pyplot as plt

survived = df[df['Survived'] == 1]
not_survived = df[df['Survived'] == 0]

plt.hist([survived['Age'], not_survived['Age']], stacked=True, bins=20)
plt.legend(['Survived', 'Not survived'])
plt.xlabel('Age')
plt.ylabel('Count')
plt.show

Conclusion:

Pandas is a powerful Python library that provides functionality for working with structured data. It offers two main data structures - Series and DataFrame - that can be used to manipulate and analyze data. Pandas provides tools for importing and exporting data, cleaning and transforming data, and aggregating and grouping data. It also offers basic visualization functionality to help explore and analyze data.

In this end-to-end example using the Titanic dataset, we demonstrated how to load and explore data, clean and transform data, and analyze data using Pandas. With its comprehensive set of tools and functions, Pandas is an essential library for any data analyst or data scientist working with structured data in Python.

I hope this helps, you!!

More such articles:

https://medium.com/techwasti

https://www.youtube.com/channel/UCiTaHm1AYqMS4F4L9zyO7qA

https://www.techwasti.com/

\==========================**=========================

If this article adds any value to you then please clap and comment.

Let’s connect on Stackoverflow, LinkedIn, & Twitter.

Did you find this article valuable?

Support techwasti by becoming a sponsor. Any amount is appreciated!