Python Data Types
Python Data Types: Putting Airline Data In Order
The data used in the python data types example is from OpenFlights.org and contains three data files, one for airports, one for routes and one for airlines. The data are for up to January 2012. The data in the file airports.dat looks like this:
1,”Goroka”,”Goroka”,”Papua New Guinea”,”GKA”,”AYGA”,-6.081689,145.391881,5282,10,”U”,”Pacific/Port_Moresby”
2,”Madang”,”Madang”,”Papua New Guinea”,”MAG”,”AYMD”,-5.207083,145.7887,20,10,”U”,”Pacific/Port_Moresby”
3,”Mount Hagen”,”Mount Hagen”,”Papua New Guinea”,”HGU”,”AYMH”,-5.826789,144.295861,5388,10,”U”,”Pacific/Port_Moresby”
4,”Nadzab”,”Nadzab”,”Papua New Guinea”,”LAE”,”AYNZ”,-6.569828,146.726242,239,10,”U”,”Pacific/Port_Moresby”
What you have here is a character (comma in this case) separated value file.
Here are the fields in this file, according to OpenFlights.org:
- Airport ID : Unique OpenFlights identifier for this airport.
- Name : Name of airport. May or may not contain the City name
- City : Main city served by airport. May be spelled differently from Name
- Country : Country or territory where airport is located
- IATA/FAA : 3-letter FAA code, for airports located in Country “United States of America”. 3-letter IATA code, for all other airports. Blank if not assigned
- ICAO : 4-letter ICAO code. Blank if not assigned
- Latitude : Decimal degrees, usually to six significant digits. Negative is South, positive is North
- Longitude : Decimal degrees, usually to six significant digits. Negative is West, positive is East
- Altitude : In feet
- Timezone : Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5
- DST : Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)
- Tz : database time zoneTimezone in “tz” (Olson) format, eg. “America/Los_Angeles”
OpenFlights says: the data is ISO 8859-1 (Latin-1) encoded, with no special characters.
Note: Rules for daylight savings time change from year to year and from country to country. The current data is an approximation for 2009, built on a country level. Most airports in DST-less regions in countries that generally observe DST (eg. AL, HI in the USA, NT, QL in Australia, parts of Canada) are marked incorrectly.”
The other two files, routes.dat and airlines.dat, are similar to airports.dat. The fields in airlines.dat are:
- Airline ID : Unique OpenFlights identifier for this airline
- Name : Name of the airline
- Alias : Alias of the airline. For example, All Nippon Airways is commonly known as “ANA”
- IATA : 2-letter IATA code, if available
- ICAO : 3-letter ICAO code, if available
- Callsign : Airline callsign
- Country : Country or territory where airline is incorporated
- Active : “Y” if the airline is or has until recently been operational, “N” if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (eg. Ansett/AN), will incorrectly show as “Y”
Additional information about the airlines.dat data from OpenFlights:
The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for “NULL” to indicate that no value is available, and is understood automatically by MySQL if imported. Notes: Airlines with null codes/callsigns/countries generally represent user-added airlines. Since the data is intended primarily for current flights, defunct IATA codes are generally not included. For example, “Sabena” is not listed with a SN IATA code, since “SN” is presently used by its successor Brussels Airlines.
routes.dat has the following data fields:
- Airline : 2-letter (IATA) or 3-letter (ICAO) code of the airline.
- Airline ID : Unique OpenFlights identifier for airline (see Airline).
- Source airport : 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
- Source airport ID : Unique OpenFlights identifier for source airport (see Airport)
- Destination airport : 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
- Destination airport ID : Unique OpenFlights identifier for destination airport (see Airport)
- Codeshare : “Y” if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
- Stops : Number of stops on this flight (“0” for direct)
- Equipment : 3-letter codes for plane type(s) generally used on this flight, separated by spaces
Here’s some additional information about routes.dat: The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for “NULL” to indicate that no value is available, and is understood automatically by MySQL if imported. (Note: the \N is how missings values were coded. It’s not be taken as the newline character.)
Notes:
- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately
- Routes where one carrier operates both its own and codeshare flights are listed only once
To Do
- Read each data file into a Pandas DataFrame. Add meaningful names to the columns of each DataFrame
- Print the first three rows of each DataFrame and the results it produces
- Check each DataFrame for duplicate records. For each, report the number of duplicates you found
- Provide commented, syntactically correct code and the results it produced
- Describe the data types of the columns in each of the DataFrames
- Provide commented, syntactically correct code and the results it produced
- Determine how many of the airlines are “defunct”
- Provide a definition of what a defunct airline is
- Provide commented, syntactically correct code and the results it produced
- Determine how many “routes from nowhere” there are in the data. These are flights that don’t originate from an airport
- Provide your commented, syntactically correct code and the results it produced
- Save your Data Frames for future use. Pickle them, put them in a shelve db, on in a the tables of a SQL db
- Provide commented, syntactically correct code and the results it produced
The Task at Hand: Getting Airline Data In Order CC by SA Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0.
Code for Different Python Data Types
#Christa Taylor
import pandas as pd
import csv
import sys
import pickle
#—-Attention on the Airlines file—-
print ‘—AIRLINES—\ n’
#reads each of the files using a comma delimiter
#no header airlines_columns creates the column names that is called upon in airlines
airlines_columns =[‘AirlineID’,’Name’,’Alias’,’IATA’,’ICAO’,’CallSign’,’Country’
#added index_col in airlines only as that appears to be what it is so no need to duplicate
airlines = pd.read_csv(‘airlines.dat’, sep=’,’, encoding=’utf-8′,header=None, names
airlines.shape
print airlines[0:3]
#replace \N values – need the r in front otherwise will return it as a end of line function
airlines[‘Alias’] = airlines[‘Alias’].replace(to_replace=[r”\ \N”],value=’NO ALIAS’
#replace blank values with No Data – put zero in the blanks first then replace the zero
airlines[‘IATA’].fillna(0)
airlines[‘ICAO’].fillna(0)
airlines[‘CallSign’].fillna(0)
#item 1: change NaN
airlines.dropna(how=’any’).shape
airlines[‘IATA’].fillna(value=’NO DATA’,inplace=True)
airlines[‘ICAO’].fillna(value=’NO DATA’,inplace=True)
airlines[‘CallSign’].fillna(value=’MISSING’,inplace=True)
airlines.shape
#airlines[‘Destination Airport ID’] = routes[‘Destination Airport ID’].replace(to_replace=[r”\\N”],value=’MISSING’,regex=True)
#item 1: change NaN to dashes
airlines.replace({‘\N’: ‘-‘}, regex=True)
airlines.dropna(how=’any’).shape
airlines.shape
# item 2: Report the number of duplicate records found based on several columns but not all columns
dup_airlines = airlines.duplicated().sum()
print (‘\ nThe number of duplicates in the Airlines data: ‘) + repr (dup_airlines)
print (‘\ n The duplicated entries in Airlines are: \ n’) + repr (airlines.loc[airlines
#remove the duplicated entries last for fun but use False to remove all
airlines.drop_duplicates(keep=’first’)
#item 3: describe the data types of the columns in the DataFrame
#remember object is a string
print ‘\ nData types in airlines are: \ n’ + repr (airlines.dtypes)
# item 4: how many airlines are defunct. Look in column for item and sum
defunctair = airlines[‘Active’].str.contains(r’N’).sum()
print (‘Definition: A defunct airline is where Active is equal to “N”‘)
print (‘\ nThe number of defunct airlines are: ‘) + repr(defunctair) +’.’
# item 6: Save DataFrames to pickled file and then check with output to csv file
airlines.to_pickle(“airlinepickled.pkl”)
#check to see what the pickled file looks like
airlines = pd.read_pickle(“airlinepickled.pkl”)
airlines.to_csv(‘airlinespickled.csv’, encoding=’utf-8′)
#—-Attention now to the Routes file—-
print ‘\ n—ROUTES—\ n’
#item 1: Read the file and add names to the colums
#same as above – read comma delimited file, create column headers and assign
routes_columns =[‘Airline’,’ID’,’Source Airport’,’Source Airport ID’,’Destination Airport’
routes= pd.read_csv(‘routes.dat’, sep=”,”, encoding=’utf-8′,header=None, names=routes_columns)
routes.shape
print routes[0:3]
#replace \N values – need the r in front otherwise will return it as a end of line function
routes[‘ID’] = routes[‘ID’].replace(to_replace=[r”\ \N”],value=’MISSING’,regex=True
routes[‘Source Airport’] = routes[‘Source Airport’].replace(to_replace=[r”\ \N”],value
routes[‘Source Airport ID’] = routes[‘Source Airport ID’].replace(to_replace=[r”
routes[‘Destination Airport ID’] = routes[‘Destination Airport ID’].replace(to_replace
#item 1: change NaN to dashes
routes.replace({‘\N’: ‘-‘}, regex=True)
routes.dropna(how=’any’).shape
routes.shape
# item 2: Report the number of duplicate records found after the first one
routes.duplicated().sum()
routes.loc[routes.duplicated(keep=’first’), :]
#different counts all duplicates
routes.loc[routes.duplicated(keep=False), :]
# item 2: Report the number of duplicate records found based on several columns but not all columns – second method to ck only subset of columns not all
dup_routes = routes.duplicated(subset=[‘Airline’,’ID’,’Source Airport’,’Source Airport ID’
print (‘\ nThe number of duplicates in the routes data: ‘) + repr (dup_routes) +
#remove the duplicated entries last for fun but use False to remove all
print (‘\ n The duplicated entries in routes are: \ n’) + repr (routes.loc[routes
#remove the duplicated entries last for fun but use False to remove all
routes.drop_duplicates(keep=’first’)
#item 3: describe the data types of the columns in the DataFrame
#remember object is a string
print ‘\ nData types in routes are: \ n’ + repr (routes.dtypes)
#item 5: how many routes from nowhere
#for fun drops all routes with any rows with missing values
#routes.dropna(how=’any’).shape
#for missing data in either (how=any) source aiport or source airport ID (for both how would = all)
#routes.dropna(subset=[‘Source Airport’, ‘Source Airport ID’],how=’any’).shape
import numpy as np
df=routes[‘Source Airport ID’]
np.bincount
nw_count = pd.Series(df).value_counts()
nw_count = nw_count[‘NOWHERE’]
print ‘\ nThe number of routes from nowhere are: ‘ + repr(nw_count) +’.\ n’
# item 6: Save DataFrames to pickled file and then check with output to csv file
routes.to_pickle(“routespickled.pkl”)
#check to see what the pickled file looks like
routes = pd.read_pickle(“routespickled.pkl”)
routes.to_csv(‘routespickled.csv’, encoding=’utf-8′)
#—-Attention now to the Airports file—-
print ‘\ n —AIRPORTS—\ n’
#item 1: Read the file and add names to the colums
#same as above – read comma delimited file, create column headers and assign
airports_columns =[‘Airport ID’,’Name’,’City’,’Country’,’IATA/FAA’,’ICAO’,’Latitude’
airports = pd.read_csv(‘airports.dat’, sep=”,”, encoding=’utf-8′,header=None, names
airports.shape
print airports[0:3]
#replace blank values with No Data – put zero in the blanks first then replace the zero
airports[‘IATA/FAA’].fillna(0)
#item 1: change NaN
airports.dropna(how=’any’).shape
airports[‘IATA/FAA’].fillna(value=’No Data’,inplace=True)
airports.shape
#item 1: change NaN to dashes
airports.replace({‘\N’: ‘-‘}, regex=True)
airports.dropna(how=’any’).shape
airports.shape
#replace \N values – need the r in front otherwise will return it as a end of line function
airports[‘ICAO’] = airports[‘ICAO’].replace(to_replace=[r”\ \N”],value=’MISSING’
#item 1: change NaN to dashes
airports.replace({‘\N’: ‘-‘}, regex=True)
airports.dropna(how=’any’).shape
airports.shape
# item 2: Report the number of duplicate records
dup_airports = airports.duplicated(subset=[‘Name’, ‘City’, ‘Country’,’IATA/FAA’,
print (‘\ n The number of duplicates in the Airports data: ‘) + repr (dup_airports)
#remove the duplicated entries last for fun but use False to remove all
print (‘\ n The duplicated entries in airports are: \ n’) + repr (airports.loc[airports
#remove the duplicated entries last for fun but use False to remove all
airports.drop_duplicates(keep=’first’)
# item 3: describe the data types of the columns in the DataFrame
print ‘\ nData types in airports are: \ n’ + repr (airports.dtypes)
# item 6: Save DataFrames to pickled file and then check with output to csv file
airports.to_pickle(“airportspickled.pkl”)
#check to see what the pickled file looks like
airports = pd.read_pickle(“airportspickled.pkl”)
airports.to_csv(‘airportspickled.csv’, encoding=’utf-8′)
print ‘—————-END——————‘
Output:
—AIRLINES—
Name Alias IATA ICAO CallSign Country Active
AirlineID
1 Private flight \N – NaN NaN NaN Y
2 135 Airways \N NaN GNL GENERAL United States N
3 1Time Airline \N 1T RNX NEXTIME South Africa Y
The number of duplicates in the Airlines data: 1.
The duplicated entries in Airlines are:
Name Alias IATA ICAO CallSign Country
\
AirlineID
169 Angola Air Charter NO ALIAS NO DATA AGO ANGOLA CHARTER Angola
Active
AirlineID
169 N
Data types in airlines are:
Name object
Alias object
IATA object
ICAO object
CallSign object
Country object
Active object
dtype: object
Definition: A defunct airline is where Active is equal to “N”
The number of defunct airlines are: 4886.
—ROUTES—
Airline ID Source Airport Source Airport ID Destination Airport \
0 2B 410 AER 2965 KZN
1 2B 410 ASF 2966 KZN
2 2B 410 ASF 2966 MRV
1
Destination Airport ID Codeshare Stops Equipment
0 2990 NaN 0 CR2
1 2990 NaN 0 CR2
2 2962 NaN 0 CR2
The number of duplicates in the routes data: 0.
The duplicated entries in routes are:
Empty DataFrame
The number of routes from nowhere are: 239.
—AIRPORTS—
Airport ID Name City Country IATA/FAA ICAO \
0 1 Goroka Goroka Papua New Guinea GKA AYGA
1 2 Madang Madang Papua New Guinea MAG AYMD
2 3 Mount Hagen Mount Hagen Papua New Guinea HGU AYMH
Latitude Longitude Altitude Timezone DST Tz
0 -6.081689 145.391881 5282 10.0 U Pacific/Port_Moresby
1 -5.207083 145.788700 20 10.0 U Pacific/Port_Moresby
2 -5.826789 144.295861 5388 10.0 U Pacific/Port_Moresby
The number of duplicates in the Airports data: 29.
The duplicated entries in airports are:
Airport ID Name City \
6523 7950 Ol Kiombo Mara Intrepids
6526 7953 Busan Busan
6533 7960 Xakan Xakan
6538 7965 Khao Sok National Park Surat Thani
6562 7989 JALOU Jalu
6632 8059 Hong Kong Macau Ferry Terminal Hong Kong
6634 8061 Macau Taipa Ferry Terminal Macau
6702 8129 TCO Tengiz
6723 8151 Field 21 Wainwright
2
6732 8160 Crocodile Camp Air Strip Tsavo East
6740 8168 Steenberg Helipad Cape Town
6768 8197 Atmautluak Airport Atmautluak
6833 8262 Algeciras Heliport Algeciras
6835 8264 Kai Tak International Airport Hong Kong
6842 8271 Singita Sabi Sands Sabi Sands
6897 8326 SCC4 MFO
6898 8327 SCC4 MFO
6954 8383 Pogapa Airstrip Pogapa
6993 8425 Matheson Island Airport Matheson Island
7018 8450 Athuruga Athuruga
7137 8569 North Sea Buchan Alpha
7211 8643 Boipeba Boipeba Island
7219 8651 Khulna Seaplane Landing Site Khulna
7220 8652 Khulna Seaplane Landing Site Khulna
7605 9040 Kitoi Khamar-Daban
7674 9109 Costa maya port Costa maya
7687 9122 Mount Aso helipad Aso
7735 9170 Royal Island Resort Royal Island Resort
7878 9313 Komandoo Komandoo
Country IATA/FAA ICAO Latitude Longitude Altitude \
6523 Kenya No Data MISSING -1.409569 35.110788 5006
6526 South Korea No Data MISSING 35.179444 129.075556 0
6533 Botswana No Data MISSING 1.000000 1.000000 1
6538 Thailand No Data MISSING 8.936667 98.530278 1200
6562 Libya No Data MISSING 29.142222 21.380556 0
6632 Hong Kong No Data MISSING 22.289372 114.152153 0
6634 Macau No Data MISSING 22.163300 113.574040 0
6702 Kazakhstan No Data MISSING 46.303056 53.427500 0
6723 Canada No Data MISSING 52.830601 -111.100998 2260
6732 Kenya No Data MISSING -3.076400 39.242800 0
6740 South Africa No Data MISSING -34.070582 18.424067 50
6768 United States No Data MISSING 60.866667 -162.273056 18
6833 Spain No Data MISSING 36.128889 -5.441111 98
6835 Hong Kong No Data MISSING 22.317700 114.202000 30
6842 South Africa No Data MISSING -24.840000 31.410000 3500
6897 Egypt No Data MISSING 30.524819 34.584025 2037
6898 Egypt No Data MISSING 30.524819 34.584025 2037
6954 Indonesia No Data MISSING -3.751800 136.842481 6060
6993 Canada No Data MISSING 51.732222 -96.934444 725
7018 Maldives No Data MISSING 3.000000 72.000000 0
7137 United Kingdom No Data MISSING 57.903900 0.031900 0
7211 Brazil No Data MISSING -13.565589 -38.939241 5
7219 Bangladesh No Data MISSING 22.464897 89.351261 0
7220 Bangladesh No Data MISSING 22.464897 89.351261 0
7605 Russia No Data MISSING 52.111000 101.986000 2000
7674 Mexico No Data MISSING 18.714000 -87.709000 0
7687 Japan No Data MISSING 32.884308 131.064638 1350
7735 Maldives No Data MISSING 5.162911 73.053578 0
7878 Maldives No Data MISSING 5.494003 73.424183 0
Timezone DST Tz
6523 3.0 N Africa/Nairobi
6526 9.0 U Asia/Seoul
6533 1.0 N \N
6538 7.0 U Asia/Bangkok
6562 2.0 U Africa/Tripoli
6632 8.0 N Asia/Hong_Kong
6634 8.0 N Asia/Macau
6702 5.0 U Asia/Oral
6723 -7.0 A America/Edmonton
6732 3.0 U Africa/Nairobi
6740 2.0 N Africa/Johannesburg
6768 -9.0 A America/Anchorage
6833 1.0 E Europe/Madrid
6835 8.0 U Asia/Hong_Kong
6842 2.0 N Africa/Johannesburg
6897 2.0 E Asia/Jerusalem
6898 2.0 E Asia/Jerusalem
6954 9.0 N Asia/Jayapura
6993 -6.0 A America/Winnipeg
7018 5.0 U Indian/Maldives
7137 0.0 U \N
7211 -3.0 S America/Fortaleza
7219 6.0 N Asia/Dhaka
7220 6.0 N Asia/Dhaka
7605 9.0 N Asia/Irkutsk
7674 -6.0 U America/Mexico_City
7687 9.0 N Asia/Tokyo
7735 5.0 U Indian/Maldives
7878 5.0 N Indian/Maldives
Data types in airports are:
Airport ID int64
Name object
City object
Country object
IATA/FAA object
ICAO object
Latitude float64
Longitude float64
Altitude int64
Timezone float64
DST object
Tz object
dtype: object
—————-END———–