pandas 專題:美國普查

郭耀仁

來源

CourseraIntroduction to Data Science in Python 課程第二週作業

資料

In [1]:
import pandas as pd

census_df = pd.read_csv('https://storage.googleapis.com/py_ml_datasets/census.csv')
census_df.shape
Out[1]:
(3193, 100)
In [2]:
census_df.head()
Out[2]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
0 40 3 6 1 0 Alabama Alabama 4779736 4780127 4785161 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
1 50 3 6 1 1 Alabama Autauga County 54571 54571 54660 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183193 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
3 50 3 6 1 5 Alabama Barbour County 27457 27457 27341 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
4 50 3 6 1 7 Alabama Bibb County 22915 22919 22861 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861

5 rows × 100 columns

問題一

哪一個州(state)的郡(county)數最多?(注意 SUMLEV 變數)

def answer_one():
    '''
    這個函數應該回傳一個文字,州名
    '''
    return "答案"

問題二

假如僅考慮每州(state)人口最多的三個郡(county)計算人口總和,哪三個州總和數最多?(利用 CENSUS2010POP 變數)

def answer_two():
    '''
    這個函數應該回傳一個清單,三個州名
    '''
    return "答案"

問題三

哪個郡(county)在 2010-2015 期間人口改變數量的絕對值最高?(考慮 POPESTIMATE2010 到 POPESTIMATE2015 這六個變數) 提示:如果五年的人口數分別為 100, 120, 80, 105, 100, 130 則人口改變數量的絕對值為 |130-80| = 50

def answer_three():
    '''
    這個函數應該回傳一個文字,郡名
    '''
    return "答案"

問題四

篩選出屬於 REGION 1 或 2、開頭名稱為 Washington 並且 POPESTIMATE2015 大於 POPESTIMATE2014 的郡(county)

def answer_four():
    '''
    這個函數應該回傳一個 DataFrame,外型為 5x2,
    變數名稱為 ['STNAME', 'CTYNAME'],
    索引值由小到大排列
    '''
    return "答案"

參考解答

In [3]:
def answer_one():
    filtered_df = census_df[census_df['SUMLEV'] == 50]
    grouped_census_df = filtered_df[['STNAME', 'CTYNAME']].groupby(['STNAME']).count()
    max_idx = grouped_census_df['CTYNAME'].idxmax()
    return max_idx

answer_one()
Out[3]:
'Texas'
In [4]:
def answer_two():
    filtered_df = census_df[census_df['SUMLEV'] == 50]
    sorted_df = filtered_df.sort_values(['STNAME', 'CENSUS2010POP'], ascending = [True, False])
    unique_states = list(set(sorted_df['STNAME'].values))
    sum_pops = []
    for state in unique_states:
        state_df = sorted_df[sorted_df['STNAME'] == state]
        sum_pop = state_df['CENSUS2010POP'][:3].sum()
        sum_pops.append(sum_pop)
    ser = pd.Series(sum_pops, index = unique_states)
    ser = ser.sort_values(ascending = False)
    top_three = list(ser[:3].index)
    return top_three

answer_two()
Out[4]:
['California', 'Texas', 'Illinois']
In [5]:
def answer_three():
    filtered_df = census_df[census_df['SUMLEV'] == 50]
    filtered_df = filtered_df.reset_index(drop = True)
    nrow = filtered_df.shape[0]
    pop_diff = []
    for i in range(nrow):
        cty = filtered_df.loc[i, 'POPESTIMATE2010':'POPESTIMATE2015']
        max_pop = cty.max()
        min_pop = cty.min()
        diff_abs = abs(max_pop - min_pop)
        pop_diff.append(diff_abs)
    filtered_df['Pop_Diff_Abs'] = pop_diff
    filtered_df = filtered_df.set_index('CTYNAME', drop = True)
    max_idx = filtered_df['Pop_Diff_Abs'].idxmax()
    return max_idx

answer_three()
Out[5]:
'Harris County'
In [6]:
def answer_four():
    filtered_df = census_df[census_df['SUMLEV'] == 50]
    is_region_1or2 = filtered_df['REGION'].isin([1, 2])
    region_1or2 = filtered_df[is_region_1or2]
    ctyname_pattern = "^Washington."
    is_Washington_something = region_1or2['CTYNAME'].str.match(ctyname_pattern)
    washing_df = region_1or2[is_Washington_something]
    is_larger = washing_df['POPESTIMATE2015'] > washing_df['POPESTIMATE2014']
    final_df = washing_df[is_larger]
    final_df = final_df.loc[:, ['STNAME', 'CTYNAME']]
    return final_df

answer_four()
Out[6]:
STNAME CTYNAME
896 Iowa Washington County
1419 Minnesota Washington County
2345 Pennsylvania Washington County
2355 Rhode Island Washington County
3163 Wisconsin Washington County