Requirements
Source data acquisition – password: 6qpt
- Import the file and view the original data
- Combine population data with state abbreviation data
- Delete the duplicate abbreviation column in the merged data
- View columns with missing data
- Find which state/region values leave the state column as NaN, and deduplicate them
- Fill in the correct state value for each of these states/regions, so that no NaN remains in the state column
- Merge in the state area data (area)
- Find the missing data in the area (sq. mi) column and identify which rows contain it
- Remove rows with missing data
- Find the population data for 2010
- Calculate the population density of each state
- Sort with df.sort_values() and find the five states with the highest population density
1. Data preprocessing
- Import package
import pandas as pd
import numpy as np
from pandas import DataFrame
- Import the files and view the original data
abb = pd.read_csv('./data/state-abbrevs.csv')
abb  # state: full name; abbreviation: abbreviated form
pop = pd.read_csv('./data/state-population.csv')
pop
area = pd.read_csv('./data/state-areas.csv')
area
- Combine the state data [abb] and the population data [pop]
Using how='outer' ensures data integrity: rows without a match on the other side are kept rather than dropped.
(See also: [Python 3] detailed usage of pandas.merge.)
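To see why how='outer' matters, here is a minimal sketch on two toy DataFrames (the values below are made up for illustration): an inner merge would silently drop the unmatched 'PR' row, while an outer merge keeps it and fills the missing cells with NaN.

left = DataFrame({'state': ['Alabama', 'Alaska'], 'abbreviation': ['AL', 'AK']})
right = DataFrame({'state/region': ['AL', 'PR'], 'population': [100, 200]})
# how='outer' keeps rows from both sides; unmatched cells become NaN
pd.merge(left, right, left_on='abbreviation', right_on='state/region', how='outer')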
abb_pop = pd.merge(abb, pop, left_on='abbreviation', right_on='state/region', how='outer')
abb_pop.head()
- Delete the abbreviation column from the merged data
# inplace=True acts on the original DataFrame instead of returning a copy
abb_pop.drop(labels='abbreviation', axis=1, inplace=True)
abb_pop.head()
- View missing data in table
info()
isnull().any(axis=0)⭐
abb_pop.info()
# Use isnull() in combination with any()
abb_pop.isnull().any(axis=0)
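As a quick illustration of what isnull().any(axis=0) returns, here is a minimal sketch on toy data (not the real dataset): the result holds one Boolean per column, True where that column contains at least one missing value.

demo = DataFrame({'state': ['Alabama', None], 'population': [100, 200]})
demo.isnull().any(axis=0)  # state: True, population: False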
2. Data processing
- Find which state/region values leave state as NaN, then deduplicate [i.e., find the abbreviations whose corresponding state value is null, and deduplicate those abbreviations]
abb_pop.head()
Steps:
- Find the null values in state
- Find the abbreviations corresponding to those null values
- Deduplicate the abbreviations
abb_pop['state'].isnull()
# Use the Boolean Series as a mask to select the matching rows
abb_pop.loc[abb_pop['state'].isnull()]
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()
array(['PR', 'USA'], dtype=object)
Fill in the correct state value for each of the states/regions found, so that no NaN remains in the state column
- fillna(): fills every gap with the same single value – not applicable here (a map-based alternative is sketched after the implementation below)
- PR: fill in the corresponding full name
- USA: fill in the corresponding full name
Implementation:
- Using each abbreviation, find the rows where the full name is null
- Fill those null values with the corresponding full name
# 1. Find the rows corresponding to PR
abb_pop['state/region'] == 'PR'
Most of the displayed values are False, but there are True values hidden among them!
abb_pop.loc[abb_pop['state/region'] == 'PR']
# 2. Get the row indices
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
indexs
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458, 2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469, 2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480, 2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491, 2492, 2493, 2494, 2495], dtype='int64')
# 3. Batch assignment: fill in the full name for every PR row
abb_pop.loc[indexs, 'state'] = 'Puerto Rico'
abb_pop.loc[abb_pop['state/region'] == 'PR']
The same operation applies to 'USA':
abb_pop['state/region'] == 'USA'
indexsusa = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[indexsusa, 'state'] = 'United States'
abb_pop.loc[abb_pop['state/region'] == 'USA']
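As an alternative to the index-and-assign steps above: fillna() with a single value cannot supply a different full name per abbreviation, but it can be combined with Series.map. A minimal sketch, assuming a hand-built mapping dict covering the two abbreviations found earlier:

# Hypothetical mapping from abbreviation to full name (an assumption, not from the source)
full_names = {'PR': 'Puerto Rico', 'USA': 'United States'}
# map() builds a Series of replacements aligned row by row; fillna() then uses it
# only where 'state' is NaN, leaving the existing full names untouched
abb_pop['state'] = abb_pop['state'].fillna(abb_pop['state/region'].map(full_names))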
- Merge in the state area data (area)
abb_pop.head()
area.head()
abb_pop_area = pd.merge(abb_pop, area, on='state', how='outer')
abb_pop_area
- Find the missing data in the area (sq. mi) column and identify which rows contain it
abb_pop_area['area (sq. mi)'].isnull()
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()]
# Get the indices of those rows
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458, 2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469, 2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480, 2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491, 2492, 2493, 2494, 2495, 2496, 2497, 2498, 2499, 2500, 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 2509, 2510, 2511, 2512, 2513, 2514, 2515, 2516, 2517, 2518, 2519, 2520, 2521, 2522, 2523, 2524, 2525, 2526, 2527, 2528, 2529, 2530, 2531, 2532, 2533, 2534, 2535, 2536, 2537, 2538, 2539, 2540, 2541, 2542, 2543], dtype='int64')
- Remove rows with missing data
# In the drop family of functions, axis=0 means rows and axis=1 means columns
abb_pop_area.drop(labels=indexs, axis=0, inplace=True)
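Equivalently, dropna() with the subset parameter removes those rows in one step, without collecting the indices first; a minimal sketch:

# Drop only the rows where 'area (sq. mi)' is missing; NaNs in other columns are untouched
abb_pop_area.dropna(subset=['area (sq. mi)'], inplace=True)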
- Find the population data for 2010
query(): conditional query ⭐
abb_pop_area.head()
abb_pop_area.info()
abb_pop_area.query('ages == "total" & year == 2010')
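For comparison, here is the same filter written as a plain Boolean mask, which is essentially what query() evaluates under the hood:

mask = (abb_pop_area['ages'] == 'total') & (abb_pop_area['year'] == 2010)
abb_pop_area.loc[mask]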
- Calculate the population density of each state (population / area) and add the result to the source data as a new column
abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']  # 'midu' = density
abb_pop_area.head()
- Sort with df.sort_values() and find the five states with the highest population density
abb_pop_area.sort_values(by='midu')
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)
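Putting the last two requirements together: restrict to the 2010 totals, sort by density in descending order, and take the first five rows. A minimal sketch reusing the columns defined above:

top5 = (abb_pop_area.query('ages == "total" & year == 2010')
                    .sort_values(by='midu', ascending=False)
                    .head(5))
top5[['state', 'midu']]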
Summary
- On the indexing of iloc and loc
- loc: explicit (label-based) index
- iloc: implicit (position-based) index
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).iloc[0]['state']
'District of Columbia'
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).loc[51]['state']
'Alaska'
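Why iloc[0] and loc[51] return different states: sorting reorders the rows but keeps their original labels, so iloc counts positions in the new order while loc still looks up the old label. A toy sketch:

s = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])
t = s.sort_index()  # values now appear in the order 'b', 'c', 'a'
t.iloc[0]  # 'b' -> implicit, position-based
t.loc[2]   # 'a' -> explicit, label-based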