Top Words in COVID & Corona Domain Names – Text Basics
Top Words in COVID & Corona Domain Names
A few basics on determining the most popular terms in a dataframe of text. In this case, the dataframe is domain names that have been exported into a csv. For my example, I have a list of ~205,000 domain names with that has been compiled with ‘covid’ and/or ‘corona’. My purpose is to understand the other most popular terms associated with these two terms.
import pandas as pd import glob from pandas import ExcelWriter from pandas import ExcelFile domains = pd.read_csv("c:/users/chris/allcorona2.csv", delimiter=',' ) domains.head(n=10) #domains.tail(n=10) #create a subset of the dataframe domains = domains[['tld', 'domainname', 'term_split', 'termlength', 'spellerror', 'lema']] #or delete specific columns domains.drop(columns = ['ref', 'spcorrpercent'] ) domains
Clean the text by removing any spaces at the beginning of the word, lowercase the text and remove extra spaces. You’ll also see that I found some words that my dictionary did not properly split i.e. ‘co vid’ so I’ve taken these words and replaced them with the correct term – ‘covid’.
#split the words into unique elements, clean, stack, etc. import pandas as pd import re #create a new variable using term_split domains['clean'] = domains['term_split'] #remove any spaces at the beginning of the text domains['clean'] = domains['clean'].str.strip() # change all text to be lowercase domains['clean'] = [word.lower() for word in domains['clean']] #remove hyphens domains['clean'] = domains['clean'].astype(str).apply(lambda x: x.replace('-', '')) #my dictionary wasn't always so perfect in seperating 'covid' and corona - so I'm padding the term first domains['clean'] = domains['clean'].replace(["c ovid", "co vid", "covi d", "cov id", "covi d"], " covid ") domains['clean'] = domains['clean'].replace(["c orona", "co rona", "cor ona", "coro na", "coron a"], " corona ") #ensure terms are properly represented using various methods - replace, lambda domains['clean'] = domains['clean'].apply(str) domains['clean'] = domains['clean'].replace(to_replace = "virus", value = " virus ") domains['clean'] = domains['clean'].replace(str(domains['clean'])," ").replace("*virus*"," virus ") domains['clean'] = domains['clean'].replace({"corona", " corona "}) # then clean up any double spacing I created domains['clean'] = domains['clean'].astype(str).apply(lambda x: x.replace(' ', '')) domains['clean'] = domains['clean'].astype(str).apply(lambda x: x.replace('covid', ' covid ')) domains['clean'].str.split().tolist() #expand=False set(domains['clean']) #Other useful methods to clean text for ease of reference # .stack() # .drop_duplicates() # .replace(r'[\-\!\@\#\$\%\^\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\;\&\?\|]','')
With the terms all cleaned, I consolidate all of the terms by flattening them into a list and run my counter.
import itertools from itertools import chain import collections #add in a column for the number of words in each domainname domains['num_words'] = [len(sentence.split()) for sentence in domains['clean']] # Flatten the list of words domainwords = list(chain.from_iterable(map(str.split, domains['clean']))) # Create counter word_counts = collections.Counter(domainwords) # show the top x words word_counts.most_common(20)
Looking at the results of the counter, I noticed there are some words I want to exclude. Note, as these are domain names using a out-of-the-box cleaner does not fit my purpose so I’m doing it by hand. This is particularly helpful if you have scrapped files and want to remove html coding.
import itertools from itertools import chain import collections # remove words not wanted in the list. Keep in mind the domain names had to have the term covid or corona to be included in the dataset remove_words = ['a', 'and', 'to', 'the', 'in', 'of', 'for', 'is', 'on', 'e', 'it', 'be', 'as', 's', 'covid', 'corona', '19'] domainwords = [word.lower() for word in domainwords if word.lower() not in remove_words] # Create counter word_counts = collections.Counter(domainwords) # show the top x words word_counts.most_common(100)
Lists are boring…so a quick bar graph to show the top 25 terms.
from matplotlib import pyplot as plt word_counts_a = pd.DataFrame(word_counts.most_common(25), columns=['words', 'count']) word_counts_a.head() fig, ax = plt.subplots(figsize=(15, 8)) # Plot horizontal bar graph word_counts_a.sort_values(by = 'count').plot.barh(x = 'words', y = 'count', ax = ax , color = "blue") ax.set_title("Top Words contained with 'COVID' & 'Corona' Domain Names Registered in 2020") plt.show() len(word_counts_a)
We can easily see the biggest term is ‘virus’ with the corona/covid term which isn’t a big surprise but let’s zoom in a bit more, bar chart after the top three words.
#view a subset word_counts_b = word_counts_df[3:40] fig, ax = plt.subplots(figsize=(15, 17)) # Plot horizontal bar graph word_counts_b.sort_values(by = 'count').plot.barh(x = 'words', y = 'count', ax = ax , color = "purple") ax.set_title("Top Words (after the top word) contained with 'COVID' & 'Corona' Domain Names Registered in 2020") plt.show() len(word_counts_a) word_counts_df
So now we have a better perspective on the relative volume of the other terms.
Create a Pandas dataframe with the top 5,000 terms for further analysis. One last step:
word_counts_df = pd.DataFrame(word_counts.most_common(5000), columns = ['words', 'count']) #view a subset word_counts_b = word_counts_df[1:40] word_counts_df
And now, I can work on analyzing these words or take the entire dataset for further analysis.