Power of Python Pandas

Power of Python Pandas

The ease of extracting and summarizing large amounts of data using Python Pandas is powerful.  Below is an example of using airline data to find out how many passengers went to an airport, accident rate based on reference codes, deaths and the causes of accidents.  With a few lines of code, you can quickly see the power of Python Pandas.  Enjoy!

Examine passenger volume data and incident and accident data for airlines and airports. See the prior posting on previously manipulated air transportation data.

The Data

Data is from the U.S. Dept. of Transportation (USDOT) and the U.S. Federal Aviation Administration (FAA). The USDOT data are for calendar year 2014.

You can find additional documentation on the FAA site, above, if you find that it’s useful.

Note that in the USDOT data file there are commas in quantities that exceed one thousand. Where this is the case, the quantity is enclosed in double quotes.

Python Panda Tasks

Using Python Pandas and the data outlined above, there are four main tasks:

Task A:  Determine for the six U.S. airports ABQ, BOI, BUF, CMH, SEA, and TPA, how many passenger departures and arrivals they each had in 2014, and also which airline was each airport’s largest departure carrier and the largest arrival carrier.

Task B:  For each of these six airports, determine what airports the largest number of arrivals were from, and the airport that the largest number of departures went to.

Task C:  For each of these six airports, determine the number of accidents or incidents that occurred at them between 2010 and 2014 inclusive, according to the FAA. Determine the number of deaths that occurred in each event for each of these five airports.

Task D:  Determine what the top ten (primary) causes of 2010-2014 incidents and accidents are for all events resulting in deaths regardless of where they occurred. Provide descriptions (not codes) for the causes.

Air trips and bad things – winter 2017 CC by SA Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0

Task A:  Passenger Volume for Six Airports and Largest Carrier of Departures and Arrivals

import pandas as pd

import csv

filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding=’utf-8′, thousands=’,’,

na_values=”)

df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])

#choose the selected airports

selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

#create a dataframe

dfselect=pd.DataFrame(df)

#reduce the dataframe to only selectedairports

#————Question 1A ———

paxdf = pd.DataFrame(filein, columns = [‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’ ])

selectoriginapt = paxdf[paxdf.OriginApt.isin([‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’])]

q1paxbyapt = selectoriginapt.pivot_table([‘Total’], index = [‘OriginApt’], aggfunc=sum)

print (‘\nQ1: Largest airports by number of passengers’)

print (q1paxbyapt)

#————–Questin 1B—————

#part b largest carriers by departrues or OriginApt

selairpots = dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’)]

q1carriersbyorigin = pd.DataFrame(selairpots)

carrbyorg=q1carriersbyorigin.loc[q1carriersbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’,

‘DestApt’,’Carrier’, ‘Total’])]

carrbyorg[‘Total’].convert_objects(convert_numeric=True)

AptsLgArrivalsQ1B = carrbyorg.groupby([‘OriginApt’])[‘Carrier’, ‘Total’].max()

print (‘Select Airports largest carriers by Origin’)

print (AptsLgArrivalsQ1B)

print (‘Select Airports largest carriers by Destination: NA – no flight based on selected airports in DestApt’)

selairpotsdest =dfselect.loc[dfselect[‘DestApt’].isin(selapts),( ‘DestApt’,’OriginApt’, ‘Carrier’, ‘Total’)]

q1carriersbydest = pd.DataFrame(selairpotsdest)

carrbydest=q1carriersbydest.loc[q1carriersbydest[‘DestApt’].isin(selairpotsdest),([‘OriginApt’, ‘DestApt’,’Carrier’, ‘Total’])]

AptsLgArrivalsQ1B = carrbydest.groupby([‘DestApt’])[‘Carrier’, ‘Total’].max()

[/code]

Output

Q1  Largest airports by number of passengers
Total
OriginApt
ABQ 264
BOI 98
BUF 3891
CMH 57213
SEA 3678369
TPA 615820
Select airports largest carriers by Origin
Carrier Total
OriginApt
ABQ UA 133
BOI ZX 59
BUF YX 272
CMH XP 4617
SEA UA 33414
TPA WS 25275

Task B:  Largest Arrivals and Departures based on Six Selected Airports

</pre>
<p style="padding-left: 60px;">import pandas as pd</p>
<p style="padding-left: 60px;">import csv</p>
<p style="padding-left: 60px;">#Read File</p>
<p style="padding-left: 60px;">filein=pd.read_csv("d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv", encoding</p>
<p style="padding-left: 60px;">='utf-8', thousands=',', na_values='')</p>
<p style="padding-left: 60px;">df=filein.filter(items=['OriginApt', 'DestApt', 'Total', 'Carrier'])</p>
<p style="padding-left: 60px;">#choose the selected airports</p>
<p style="padding-left: 60px;">selapts =['ABQ', 'BOI', 'BUF', 'CMH', 'SEA', 'TPA']</p>
<p style="padding-left: 60px;">#create a dataframe</p>
<p style="padding-left: 60px;">dfselect=pd.DataFrame(df)</p>
<p style="padding-left: 60px;">#reduce the dataframe to only selectedairports</p>
<p style="padding-left: 60px;">selairpots =dfselect.loc[dfselect['OriginApt'].isin(selapts),('OriginApt',</p>
<p style="padding-left: 60px;">'DestApt', 'Total')]</p>
<p style="padding-left: 60px;">#part 2 - no airlines in the selapts are in the 6 selected airports</p>
<p style="padding-left: 60px;">selairpots2 =dfselect.loc[dfselect['DestApt'].isin(selapts),('OriginApt',</p>
<p style="padding-left: 60px;">'DestApt', 'Total')]</p>
<p style="padding-left: 60px;">#create subset of dataframe to use for this portion of the question</p>
<p style="padding-left: 60px;">q1arrivalsbyorigin = pd.DataFrame(selairpots)</p>
<p style="padding-left: 60px;">NumArrivalbyOrig3=q1arrivalsbyorigin.loc[q1arrivalsbyorigin['OriginApt'].isin(selapts),(['OriginApt', 'DestApt', 'Total'])]</p>
<p style="padding-left: 60px;">#Convert the column to numbers - two lines leave the below in otherwise will create an error</p>
<p style="padding-left: 60px;">#ck on types of data - 2 lines</p>
<p style="padding-left: 60px;">q1arrivalsbyorigin['Total'] = q1arrivalsbyorigin['Total'].apply(pd.to_numeric, errors='coerce') #messes up numbers don't use</p>
<p style="padding-left: 60px;">q1arrivalsbyorigin['Total'].convert_objects(convert_numeric=True)</p>
<p style="padding-left: 60px;">#set index</p>
<p style="padding-left: 60px;">print ('Question 2 Part A: The number of arrivals from each origin are: ')</p>
<p style="padding-left: 60px;">AptsLgArrivalsQ2A = q1arrivalsbyorigin.groupby(['OriginApt'])['DestApt', 'Total'].max()</p>
<p style="padding-left: 60px;">print (AptsLgArrivalsQ2A)</p>
<p style="padding-left: 60px;">print ('Question 2 Part A: There are no selected airports in the destination ')</p>

<h3>

Output:

Question 2 Part A: The number of arrivals from each origin are:
OriginApt DestApt Total
ABQ YVR 133
BOI YYZ 59
BUF YYZ 272
CMH YYZ 4617
SEA YYZ 33414
TPA ZRH 25275
Question 2 Part A: There are no selected airports in the destination

Task C:  Number of Accidents/Incidents at Selected Airports with the Associated Number of Deaths.

</pre>
<p style="padding-left: 60px;">import pandas as pd</p>
<p style="padding-left: 60px;">import zipfile</p>
<p style="padding-left: 60px;">#read the zip file</p>
<p style="padding-left: 60px;">zip_ref = zipfile.ZipFile('d:/GrE4/a2010_14.zip', 'r')</p>
<p style="padding-left: 60px;">zip_ref.extractall('d:/GrE4/a2010_14')</p>
<p style="padding-left: 60px;">zip_ref.close()</p>
<p style="padding-left: 60px;">#assign na values</p>
<p style="padding-left: 60px;">na_values = ['', 'N/A', '0', '000000']</p>
<p style="padding-left: 60px;">filein = pd.read_table('d:/GrE4/a2010_14/A2010_14.txt', na_values='NULL',)</p>
<p style="padding-left: 60px;">df=filein.filter(items=['c143', 'c76', 'c1'])</p>
<p style="padding-left: 60px;">#remove spaces from the end of the words</p>
<p style="padding-left: 60px;">df['c143'] = df['c143'].str.strip()</p>
<p style="padding-left: 60px;">df['c1'] = df['c1'].str.strip()</p>
<p style="padding-left: 60px;">#put the cleaned version into a dataframe</p>
<p style="padding-left: 60px;">dffaa = pd.DataFrame(df)</p>
<p style="padding-left: 60px;">dffaa2 = pd.DataFrame(df)</p>
<p style="padding-left: 60px;">dffaa2.to_csv('d:/20170223df2-dffaa2.csv')</p>
<p style="padding-left: 60px;">selapts = ['ABQ', 'BOI', 'BUF', 'CMH', 'SEA', 'TPA']</p>
<p style="padding-left: 60px;">print ('----------incidents ------------------')</p>
<p style="padding-left: 60px;">#The GOLDEN NUGGET on the lookup of the subset and return results</p>
<p style="padding-left: 60px;">numincidents=dffaa2.loc[dffaa2['c143'].isin(selapts),'c1']</p>
<p style="padding-left: 60px;">selectairports=dffaa2.loc[dffaa2['c143'].isin(selapts),(['c143', 'c1']) ]</p>
<p style="padding-left: 60px;">dffaa2.set_index(['c143', 'c1'])</p>
<p style="padding-left: 60px;">print ('Question 3 Part A: The number of incident and accidents by airport are:')</p>
<p style="padding-left: 60px;">#Create a new dataframe</p>
<p style="padding-left: 60px;">dfselectairport=pd.DataFrame(selectairports)</p>
<p style="padding-left: 60px;">#print count incidents/accidents by airport</p>
<p style="padding-left: 60px;">NumAIbyAirport=dfselectairport.groupby(['c143', 'c1'])['c1'].count()</p>
<p style="padding-left: 60px;">print NumAIbyAirport</p>
<p style="padding-left: 60px;">#The GOLDEN NUGGET on the lookup of the subset and return results</p>
<p style="padding-left: 60px;">numincidents=dffaa2.loc[dffaa2['c143'].isin(selapts),'c76']</p>
<p style="padding-left: 60px;">selectairports2=dffaa.loc[dffaa['c143'].isin(selapts),(['c143', 'c76']) ]</p>
<p style="padding-left: 60px;">dffaa.set_index(['c143', 'c76'])</p>
<p style="padding-left: 60px;">#------------------------------</p>
<p style="padding-left: 60px;">print ('Question 3 Part B: The number of deaths by airport are:')</p>
<p style="padding-left: 60px;">#Create a new dataframe</p>
<p style="padding-left: 60px;">dfselectairport2=pd.DataFrame(selectairports2)</p>
<p style="padding-left: 60px;">#print count incidents/accidents by airport</p>
<p style="padding-left: 60px;">deathsbyairport=dfselectairport2.groupby(['c143'])['c76'].count()</p>
<p style="padding-left: 60px;">print deathsbyairport</p>
<p style="padding-left: 60px;">#boolean version to see what's in the df</p>
<p style="padding-left: 60px;">selaptmatch=dffaa['c76'].isin(selapts)</p>
<p style="padding-left: 60px;">#The GOLDEN NUGGET on the lookup of the subset and return results</p>
<p style="padding-left: 60px;">numdeaths=dffaa.loc[dffaa['c143'].isin(selapts),'c76'].count()</p>
<p style="padding-left: 60px;">print ('---The total number of deaths at all airports: ')</p>
<p style="padding-left: 60px;">print numdeaths</p>

<h3>

Output

Question 3 Part A: The number of incident and accidents by airport are:
ABQ A 5
I 2
BOI A 3
I 3
CMH A 1
I 3
SEA A 3
I 3
TPA A 2
I 4
Name: c1, dtype: int64
Question 3 Part B: The number of deaths by airport are:
ABQ 7
BOI 6
CMH 4
SEA 6
TPA 6
Name: c76, dtype: int64
The total number of deaths at all airports:
29

Task D:  Top 10 Causes of Death with Descriptions

</pre>
<p style="padding-left: 30px;">zip_ref = zipfile.ZipFile('d:/GrE4/a2010_14.zip', 'r')</p>
<p style="padding-left: 30px;">zip_ref.extractall('d:/GrE4/a2010_14')</p>
<p style="padding-left: 30px;">zip_ref.close()</p>
<p style="padding-left: 30px;">#assign na values</p>
<p style="padding-left: 30px;">na_values = ['', 'N/A', '0', '000000']</p>
<p style="padding-left: 30px;">filein = pd.read_table('d:/GrE4/a2010_14/A2010_14.txt', na_values='NULL',dtype={'c78':'str'})</p>
<p style="padding-left: 30px;">df=filein.filter(items = [None, 'c78'])</p>
<p style="padding-left: 30px;">#fill na's with blanks so they can be deleted in the next step - did so to only come up with accident causes</p>
<p style="padding-left: 30px;">df = df.fillna({</p>
<p style="padding-left: 30px;">'c78': ''})</p>
<p style="padding-left: 30px;">#strip spaces</p>
<p style="padding-left: 30px;">df['c78']=df['c78'].str.strip()</p>
<p style="padding-left: 30px;">#compress to only lines with values</p>
<p style="padding-left: 30px;">filter=df['c78'] !=""</p>
<p style="padding-left: 30px;">cldf2=df[filter]</p>
<p style="padding-left: 30px;">cldf2.to_csv('d:/q4cldf.csv') #check the file</p>
<p style="padding-left: 30px;">#convert set index and create new df</p>
<p style="padding-left: 30px;">cldf3 = cldf2.set_index('c78')</p>
<p style="padding-left: 30px;">cldf4 = pd.DataFrame(cldf3)</p>
<p style="padding-left: 30px;">sublevel_map = {</p>
<p style="padding-left: 30px;">'32': 'Pilot Fatigue/Windshield', '85':'Loss of Tail/Gusty Winds', '73':'Settling w/power / Pilot Fatigue', '72':'Settling w/ Power / Windshield'}</p>
<p style="padding-left: 30px;">#cldf4['c78'].map(lambda x: sublevel_map[x.lower()])</p>
<p style="padding-left: 30px;">mapping = {32:'Pilot Fatigue/Windshield', 85:'Loss of Tail/Gusty Winds', 73:'Settling w/power / Pilot Fatigue',72:'Settling w/ Power / Windshield', 24:'Windshield/Rough Sod', 27:'Windshield/Settling w/ Power', 61:'Struck Animal/Vision obscured', 29:'Windshield/invalid', 21:'Windshield/Vision Obscured'}</p>
<p style="padding-left: 30px;">cldf2.index.names = list(map(lambda name: mapping.get(name, name), cldf4.index.names))</p>
<p style="padding-left: 30px;">cldf2.rename(columns=mapping, inplace=True)</p>
<p style="padding-left: 30px;">#cldf2.replace([4],['jojojo'])</p>
<p style="padding-left: 30px;">#cldf2 = cldf2'[c78'].replace({'32':'Pilot Fatigue/Windshield', '85':'Loss of Tail/Gusty Winds', '73':'Settling w/power / Pilot Fatigue', '72':'Settling w/ Power / Windshield', '24':'Windshield/Rough Sod', '27':'Windshield/Settling w/ Power', '61':'Struck Animal/Vision obscured', '29':'Windshield/invalid', '21':'Windshield/Vision Obscured'})</p>
<p style="padding-left: 30px;">df.rename(columns = {'c78': 'Primary_Cause'}, inplace=True)</p>
<p style="padding-left: 30px;">#replace the codes with reasons in text</p>
<p style="padding-left: 30px;">cldf2.is_copy = False</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "32", 'c78'] = '32: Pilot Fatigue/Windshield'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "85", 'c78'] = '85: Loss of Tail/Gusty Winds'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "73", 'c78'] = '73: Settling w/power / Pilot Fatigue'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "72", 'c78'] = '72: Settling w/ Power / Windshield'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "29", 'c78'] = '29: Windshield/invalid'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "24", 'c78'] = '24: Windshield/Rough Sod'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "27", 'c78'] = '27: Windshield/Settling w/ Power'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "61", 'c78'] = '61: Struck Animal/Vision obscured'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "21", 'c78'] = '21: Windshield/Vision Obscured'</p>
<p style="padding-left: 30px;">cldf2.loc[cldf2['c78'] == "71", 'c78'] = '**********************Other********************'</p>
<p style="padding-left: 30px;">#show the top ten using groupby method</p>
<p style="padding-left: 30px;">top10=cldf2.groupby(['c78']).size().sort_values(inplace=False, ascending=False)[:(10)]</p>
<p style="padding-left: 30px;">print 'The Top 10 Reasons are:', top10</p>

Output

The Top 10 Reasons are: c78
32: Pilot Fatigue/Windshield 549
85: Loss of Tail/Gusty Winds 224
73: Settling w/power / Pilot Fatigue 87
72: Settling w/ Power / Windshield 85
24: Windshield/Rough Sod 62
27: Windshield/Settling w/ Power 52
61: Struck Animal/Vision obscured 50
29: Windshield/invalid 44
21: Windshield/Vision Obscured 38
**********************other******************** 35

As shown above, Python Pandas are a fabulous tool to extract information from a database, make inquiries on and summarize data.

Leave a Reply