Python Code Snippets

Handy Python Code Snippets

For when those moments arise and you know you’ve done something before and yet can’t seem to quickly find it, some quick Python Code that has served me well.

Use Variable Filenames

#variables
keyword = ‘term_to_use’
timeframe = ‘2020’
folder_in = ‘raw_data/’
folder_out = ‘processed’

#Open file:
df3 = pd.read_csv (“C:/users/” + ” folder_in + “all_raw_files/” + one_file + “.csv”, low_memory=False)

Write to file:  outputfile3 = open(“C:/users/” + folder_out + subfolder + keyword + timeframe +”_OUT” + “.csv”, ‘w’, encoding=”utf-8″, newline=”)  #add “index=False” to eliminate index column

Compare Two Data Frames

# Get all different values
diff_in_df = pd.merge(df1, df2, on=”optional_columnname”, how=’outer’, indicator=’Exist’)
diff_in_df = diff_in_df.loc[diff_in_df[‘Exist’] != ‘both’]
diff_in_df.info(verbose=True)

Create New Data Frame using Column Names

df2 = df[[“name”, “id”]].copy()
Rename all column names:  df2.columns = [“col1_new_name”, “col2_new_name2”]
Rename one column:  df.rename(columns = {‘index’: ‘ranking’}, inplace=True)

Or just drop specific columns:  df2 = df2.drop(columns = [‘column_to_dropA’, ‘column_to_dropB’])

Drop duplicates in specific column: df.drop_duplicates(subset = [‘col_name’], keep = “first”, inplace = True)

Get shape of dt:  df.shape

Get type:  type(df)

Text Cleaning

Remove Characters and Pop String (leaves NO spaces)

import string
string_to_clean = “STr1ing 2x space @Some4weird_ch^r-2021″
#string_to_clean = str(df[‘term’])

# Translation table which removes ONLY all digits and leaves no spaces between the characters
# setup the translation table that removes all numbers
translation_table = str.maketrans(”, ”, string.digits)

clean_string = string_to_clean.translate(translation_table)
print(clean_string)

String Split – keep last item
df[‘stringA’] = df.column_name.str.rsplit(‘.’).str.get(-1)

Drop text than is less than x in character length:  df2 = df[‘col_name_with_term’].str.findall(‘\w{x,}’).str.join(‘ ‘)  #change x to desired value

Drop Duplicate Terms in a Column

#df2 = df.drop_duplicates(‘col_name’)

Remove Custom Words

removewords = {‘goofyword1’, ‘omggoofyword2’, ‘supergoofyword3’    }
df2 = df[~df.term.isin(removewords)]

Clean Strings using RegEx

pattern = “[,.;@#?-!&$0123456789]+” #remove numeric values
pattern2 = ‘(?<!\d)[.,-;:](?!\d)’  #second option
df[‘term2’]= df[‘term’].str.replace(pattern,’ ‘, regex = True)
df[‘term2’]

Process Time

#pip install pytest-timeit
import timeit
tic = timeit.default_timer()
<code to process and time>
toc = timeit.default_timer()
toc – tic #elapsed time in seconds (ns = nanoseconds (10-9 ) µs is micro seconds (10-6)
print (toc – tic)

Quick Table

#!pip install sidetable
import sidetable
frame.stb.freq([‘category_name’, ‘cat2′], style=True) #, value=’numeric_variable’, thresh=5, #other options:  “value_counts, freq

table = table[tabl[‘category_name’].apply(lambda x: len(x)>1)]  #to filter the table with variable > 1 in column ‘category_name’ but can only be done with ‘style=False’
note: tabl should be table but WP will create a table with the code.

Dates

Format Date/Time

pd.to_datetime(df[‘ddate’], infer_datetime_format=True)

import datetime
frame[‘r_date’] = np.datetime64(‘2021-01-12’)
frame[‘days’] = r_date – frame[‘ddate’]
if frame[‘days’].dt.total_seconds !=0:
frame[‘pre_date’] = frame[‘ddate’]
else:
“0”

# combine all words into one big long string
import nltk
text_combined = str(text)
doc = nlp(text_combined)

for ent in doc.ents:
print(ent.text,ent.label_)
text_combined

Read a CSV file as a dictionary object

import csv
with open(‘C:/users/…filename.csv’) as myFile:

reader = csv.DictReader(myFile)
for row in reader:

print(row[‘term’])  #update the variable for the column(s) you want to import

Build a Dataframe

import pandas as pd

df = pd.DataFrame({‘w’: [], ‘sim’: []})

for wo, simi in zip(splitted, evaluated_s):

df = df.append({‘word’: wo, ‘sim’: evaluated_s}, ignore_index=True)

With CSV File:

import pandas as pd
c = pd.read_csv(“C:/filename.csv”, index_col=False)
print(c.dtypes)  #view datatypes

c = c.drop(columns = [‘Unnamed: 0’, ‘date’)  #drop columns