STA141B Final Project: Gun Violence Analysis
¶

Group1: Haitong Zhu, Mingyu Zhu, Kaizhong Mu, Yibo Li
¶

I. Introduction¶

The school shooting incident that occurred on February 13th, 2023, at Michigan State University, was not an isolated event in the United States. Sadly, cases of gun violence have become all too common in recent years, with the number of incidents increasing steadily. This trend is alarming and raises serious concerns about the safety of individuals, especially for those who are part of vulnerable groups, including international students. As such, there is a pressing need to understand the root causes of gun violence and identify any patterns or commonalities among the perpetrators and their actions.

To address this issue, our group undertook this project to explore gun violence data in the United States. We analyze and visualize several aspects of the data, including the frequency and location of incidents, the demographics of suspects and victims, and the types of firearms used. We also examine underlying factors that may be associated with gun violence, such as the unemployment rate.

Goal¶

In this project, we explore the features of gun violence incidents to better understand this complex social issue. Specifically, we ask the following questions (when/where/who/why):

  1. When does gun violence occur? Are there any time-related patterns in gun violence incidents?
  2. Where do gun violence incidents occur most often? Which states have the highest number of incidents?
  3. Who is involved in gun violence? What are the demographics of victims and suspects in gun violence incidents?
  4. Why does gun violence occur? What are some of the common reasons behind gun violence incidents?
  5. Is there an association between the unemployment rate and the frequency of gun violence incidents?

II. Dataset¶

a. Data Description¶

Our dataset is a CSV file from GitHub. Its author collected the data from the Gun Violence Archive, covering 2013 to 2018.

The dataset contains 29 columns:

  1. column name: incident_id, Description: gunviolencearchive.org ID for incident;
  2. column name: date, Description: date of occurrence;
  3. column name: state, Description: state of occurrence;
  4. column name: city_or_county, Description: city or county of occurrence;
  5. column name: address, Description: address of occurrence;
  6. column name: n_killed, Description: number of people killed in the incident;
  7. column name: n_injured, Description: number of people injured in the incident;
  8. column name: incident_url, Description: link to gunviolencearchive.org webpage containing details of incident;
  9. column name: source_url, Description: link to online news story concerning incident;
  10. column name: incident_url_fields_missing, Description: ignore, always False;
  11. column name: congressional_district, Description: the congressional district of the incident;
  12. column name: gun_stolen, Description: whether the gun is stolen or unknown, format: key: gun ID, value: 'Unknown' or 'Stolen';
  13. column name: gun_type, Description: the type of gun used in the incident, format: key: gun ID, value: description of gun type;
  14. column name: incident_characteristics, Description: list of incident characteristics;
  15. column name: latitude, Description: Ignore;
  16. column name: location_description, Description: description of location where incident took place;
  17. column name: longitude, Description: Ignore;
  18. column name: n_guns_involved, Description: number of guns involved;
  19. column name: notes, Description: additional notes about the incident;
  20. column name: participant_age, Description: age of each participant, format: key: participant ID;
  21. column name: participant_age_group, Description: age group of each participant, format: key: participant ID, value: description of age group, e.g. 'Adult 18+';
  22. column name: participant_gender, Description: gender of each participant, format: key: participant ID, value: 'Male' or 'Female';
  23. column name: participant_name, Description: name of each participant, format: key: participant ID;
  24. column name: participant_relationship, Description: relationship between the victims and suspects, format: key: participant ID, value: relationship of participant to other participants;
  25. column name: participant_status, Description: the status of the participants after the incident, format: key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed';
  26. column name: participant_type, Description: identifies victims and suspects, format: key: participant ID, value: 'Victim' or 'Subject-Suspect';
  27. column name: sources, Description: links to online news stories concerning incident;
  28. column name: state_house_district, Description: Ignore;
  29. column name: state_senate_district, Description:Ignore;
In [93]:
import requests
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sn
import plotly.express as px
from wordcloud import WordCloud
import nltk
from nltk.corpus import wordnet
from bs4 import BeautifulSoup
In [94]:
gun = pd.read_csv('2013-2018.csv') # read the data, retrieved from https://github.com/jamesqo/gun-violence-data
gun_save = gun.copy() # keep an independent copy of the original dataset

b. Data Cleaning¶

  • Drop meaningless columns
  • Modify messy cells into cells of lists
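To illustrate the second step: the participant and gun columns pack several values into one string of the form `ID::value||ID::value`. The following is a minimal sketch of turning such a cell into a list. `parse_cell` is a hypothetical helper name and the sample string is made up to match the format described in the column list above.

```python
def parse_cell(cell):
    """Split an 'ID::value||ID::value' string into a list of values.

    Anything that is not a string, or that contains no '::',
    is returned unchanged (for example free-text notes).
    """
    if not isinstance(cell, str) or "::" not in cell:
        return cell
    values = []
    for item in cell.split("||"):
        # Keep only the text after the first '::', whatever the ID length
        values.append(item.split("::", 1)[-1])
    return values


# Made-up sample in the dataset's format, including a two-digit ID
sample = "0::Victim||1::Victim||10::Subject-Suspect"
print(parse_cell(sample))
```

Splitting on the separator, rather than dropping a fixed number of characters, keeps two-digit IDs such as `10::` from leaving a stray colon behind.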
In [95]:
# these following columns are not needed in our project, so we deleted them here

del gun["incident_id"] # gunviolencearchive.org ID for incident
del gun["incident_url"] # link to gunviolencearchive.org webpage containing details of incident
del gun["source_url"] # link to online news story concerning incident
del gun["incident_url_fields_missing"] # ignore, always False
del gun["participant_status"] # key: participant ID, value: 'Arrested', 'Killed', 'Injured', or 'Unharmed'
del gun["sources"] # links to online news stories concerning incident
del gun['participant_name']

def modify_cell(x):
    """Turn an 'ID::value||ID::value' cell into a list of values; leave other cells unchanged."""
    if pd.isna(x):
        return x
    if "::" not in x:
        return x
    values = []
    for ele in x.split("||"):
        # keep the text after the first '::' so that IDs of any length (e.g. '10::') are removed
        values.append(ele.split("::", 1)[-1])
    return values
In [96]:
gun_str = gun.applymap(str) # convert every cell in the dataframe to a string
gun_str = gun_str.applymap(modify_cell) # clean the cells with the function above
pd.set_option('display.max_colwidth', None)
In [97]:
# convert the n_killed column back to integer for later data visualization
gun_str['n_killed'] = gun_str['n_killed'].astype(int) 
# convert the n_injured column back to integer for later data visualization
gun_str['n_injured'] = gun_str['n_injured'].astype(int)

def column_sum(selected_column):
    '''This function is used to sum up all integers from a column'''
    x = 0
    for i in range(len(selected_column)):
        x = x + selected_column[i]
    return x
In [98]:
# Create new empty lists to store the number of victims and suspects in each incident
Victim = [0] * len(gun_str['participant_type'])
suspect = [0] * len(gun_str['participant_type'])

# Loop through the participant type column and count the number of victims and suspects in each incident
for i in range(len(gun_str['participant_type'])):
    Victim[i] = gun_str['participant_type'][i].count('Victim')
    suspect[i] = gun_str['participant_type'][i].count('Subject-Suspect')

# Create new columns in the dataframe for the number of victims and suspects in each incident
gun_str['Number_of_Victim'] = Victim
gun_str['Number_of_suspect'] = suspect

Here is a preview of the dataframe after cleaning:

In [99]:
gun_str.head(5)
Out[99]:
date state city_or_county address n_killed n_injured congressional_district gun_stolen gun_type incident_characteristics ... notes participant_age participant_age_group participant_gender participant_relationship participant_type state_house_district state_senate_district Number_of_Victim Number_of_suspect
0 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 14.0 nan nan Shot - Wounded/Injured||Mass Shooting (4+ victims injured or killed excluding the subject/suspect/perpetrator, one location)||Possession (gun(s) found during commission of other crimes)||Possession of gun by felon or prohibited person ... Julian Sims under investigation: Four Shot and Injured [20] [Adult 18+, Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Male, Male, Male, Female] nan [Victim, Victim, Victim, Victim, Subject-Suspect] nan nan 4 1
1 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 43.0 nan nan Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Mass Shooting (4+ victims injured or killed excluding the subject/suspect/perpetrator, one location)||Gang involvement ... Four Shot; One Killed; Unidentified shooter in getaway car [20] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Male] nan [Victim, Victim, Victim, Victim, Subject-Suspect] 62.0 35.0 4 1
2 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 9.0 [Unknown, Unknown] [Unknown, Unknown] Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Shots Fired - No Injuries||Bar/club incident - in or around establishment ... nan [25, 31, 33, 34, 33] [Adult 18+, Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Male, Male, Male, Male, Male] nan [Subject-Suspect, Subject-Suspect, Victim, Victim, Victim] 56.0 13.0 3 2
3 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 6.0 nan nan Shot - Dead (murder, accidental, suicide)||Officer Involved Incident||Officer Involved Shooting - subject/suspect/perpetrator killed||Drug involvement||Kidnapping/abductions/hostage||Under the influence of alcohol or drugs (only applies to the subject/suspect/perpetrator ) ... nan [29, 33, 56, 33] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Female, Male, Male, Male] nan [Victim, Victim, Victim, Subject-Suspect] 40.0 28.0 3 1
4 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 6.0 [Unknown, Unknown] [Handgun, Handgun] Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Suicide^||Murder/Suicide||Attempted Murder/Suicide (one variable unsuccessful)||Domestic Violence ... Two firearms recovered. (Attempted) murder suicide - both succeeded in fulfilling an M/S and did not succeed, based on details. [18, 46, 14, 47] [Adult 18+, Adult 18+, Teen 12-17, Adult 18+] [Female, Male, Male, Female] [Family] [Victim, Victim, Victim, Subject-Suspect] 62.0 27.0 3 1

5 rows × 24 columns

III. Data Exploring¶

a. Distribution of Gun Violence by time¶

By Year¶

First, we would like to know how gun violence has changed over time, so we created a bar plot of incidents per year.

The counts for 2013 and 2018 are much smaller than those for 2014 to 2017 and are inconsistent with the increasing trend over those years, so we exclude both years from the plot.

The dataset holds only a few records for 2013, and for 2018 it covers only the first three months.
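One way to check which years are only partially covered is to count how many distinct calendar months appear in each year. The sketch below does this on a handful of made-up dates (they are not taken from the dataset); with the real data, `dates` would be the parsed `date` column.

```python
import pandas as pd

# Hypothetical sample dates, used only to illustrate the check
dates = pd.to_datetime(pd.Series([
    "2017-01-15", "2017-06-02", "2017-09-09", "2017-12-30",
    "2018-01-04", "2018-02-11", "2018-03-20",
]))

# Number of distinct calendar months observed in each year
months_per_year = dates.dt.month.groupby(dates.dt.year).nunique()
print(months_per_year)
```

On the full dataset, a completely covered year would show 12 months, so any year with a smaller value is a candidate for exclusion from year-over-year comparisons.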

In [100]:
# create a new variable, df1_added_year, which is a copy of the original dataframe gun_str
df1_added_year = gun_str.copy()

# extract the year from the date column of the dataframe and create a new column 'Year_of_incident'
df1_added_year['Year_of_incident'] = pd.to_datetime(df1_added_year['date']).dt.year
incident_counts= df1_added_year['Year_of_incident'].value_counts().sort_index()
incident_counts_del = incident_counts.drop(labels=[2013, 2018])
In [101]:
fig, ax = plt.subplots(figsize=(6,4))

incident_counts_del.plot(ax=ax, kind='bar', xlabel='Year', ylabel='Number of Incidents',
                         width = 0.5,alpha=0.5)

# Add text to each bar
for i, v in enumerate(incident_counts_del.values):
    ax.text(i, v+500, str(v), ha='center')

# Set the title
ax.set_title('Gun Violence Incidents by Year (2014-2017)')
Out[101]:
Text(0.5, 1.0, 'Gun Violence Incidents by Year (2014-2017)')

By Month¶

As the graph above shows, the number of gun violence incidents in the United States rose each year from 2014 to 2017, so over this period gun violence shows an increasing trend.

Now let us look at the distribution of gun violence by month.

In [102]:
# make a copy of the original dataframe gun_str
df1_month = gun_str.copy()

# extract the year and month from the date column of the dataframe and create a new column 'year_month_of_incidence'
df1_month['year_month_of_incidence'] = pd.to_datetime(df1_month['date']).dt.to_period('M')

# group the dataframe by year_month_of_incidence and count the number of incidents
incidents_by_month = df1_month.groupby('year_month_of_incidence')['date'].count()
In [124]:
df2_month = df1_month[df1_month['year_month_of_incidence'].dt.year != 2013]

# group the dataframe by year_month_of_incidence and count the number of incidents
incidents_by_month_2 = df2_month.groupby('year_month_of_incidence')['date'].count()

# create a line chart of incidents by month
fig_monthly_incidents_2 = px.line(incidents_by_month_2, x=incidents_by_month_2.index.astype(str), y='date',
                                labels={'x': 'Month-Year', 'date': 'Number of Incidents'},
                                title='Gun Violence Incidents by Month from 2014 to 2018')
fig_monthly_incidents_2.show(renderer='notebook')

The interactive plot of gun violence incidents by month shows a general increasing trend and a clear seasonal trend. February consistently has the lowest number of incidents each year, while the summer months, particularly July and August, have the highest number of incidents. Winter months, in general, have fewer incidents compared to summer months. This pattern suggests that gun violence incidents may be influenced by seasonal factors or environmental conditions.
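Part of February's low count may simply reflect that it is the shortest month. A rough way to control for this is to divide each monthly count by the number of days in that month. The sketch below uses made-up counts (not values from the dataset) and the standard library's `calendar.monthrange`.

```python
import calendar

# Hypothetical monthly incident counts: (year, month) -> count
monthly_counts = {
    (2017, 1): 3100,
    (2017, 2): 2800,
    (2017, 7): 3410,
}

# Incidents per day, which removes the effect of month length
per_day = {
    ym: count / calendar.monthrange(ym[0], ym[1])[1]
    for ym, count in monthly_counts.items()
}

for (year, month), rate in per_day.items():
    print(f"{year}-{month:02d}: {rate:.1f} incidents per day")
```

In this toy example January and February have the same daily rate even though February's total is lower, while July remains higher on both measures.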

b. Distribution of Gun Violence by location¶

Next, we aim to investigate the relationship between gun violence incidents and their geographical locations, particularly the states in the United States. To do this, we counted all the incidents by state and created an interactive map for visualization purposes.

In [104]:
# Make a new df to include and count the incidents by state and rename the columns
states_count = (gun_str['state']
               .value_counts()
               .rename_axis('state_name')       # name the index so the result does not depend on pandas' default column names
               .reset_index(name='incidents'))  # columns: state_name, incidents

# create a dict for creating a new column state_abbreviation
us_state_to_abbrev = {
    "Alabama": "AL","Alaska": "AK","Arizona": "AZ","Arkansas": "AR","California": "CA",
    "Colorado": "CO","Connecticut": "CT","Delaware": "DE","Florida": "FL","Georgia": "GA",
    "Hawaii": "HI","Idaho": "ID","Illinois": "IL","Indiana": "IN","Iowa": "IA","Kansas": "KS",
    "Kentucky": "KY","Louisiana": "LA","Maine": "ME","Maryland": "MD","Massachusetts": "MA",
    "Michigan": "MI","Minnesota": "MN","Mississippi": "MS","Missouri": "MO","Montana": "MT",
    "Nebraska": "NE","Nevada": "NV","New Hampshire": "NH","New Jersey": "NJ","New Mexico": "NM",
    "New York": "NY","North Carolina": "NC","North Dakota": "ND","Ohio": "OH","Oklahoma": "OK",
    "Oregon": "OR","Pennsylvania": "PA","Rhode Island": "RI","South Carolina": "SC","South Dakota": "SD",
    "Tennessee": "TN","Texas": "TX","Utah": "UT","Vermont": "VT","Virginia": "VA",
    "Washington": "WA","West Virginia": "WV","Wisconsin": "WI","Wyoming": "WY",
    "District of Columbia": "DC","American Samoa": "AS","Guam": "GU","Northern Mariana Islands": "MP",
    "Puerto Rico": "PR","United States Minor Outlying Islands": "UM","U.S. Virgin Islands": "VI",
}

# make a new column which contains the 2 letters abbr. of the original state name for later mapping
states_count['state_abbr'] = states_count['state_name'].map(us_state_to_abbrev) 
In [125]:
# generate a map of the US territory divided by each states, the colors show the magnitude of gun violence incident in each state
# the darker the color, the more gun violence incidents reported in that state
# moving mouse on different colored states will show the number of gun violence incidents happened, the state name, and the state abbreviation.

fig_incidents_states = px.choropleth(states_count,
                                     locations='state_abbr',
                                     locationmode='USA-states',
                                     scope='usa',
                                     color='incidents',
                                     color_continuous_scale="Viridis_r",
                                     hover_data=['state_name', 'incidents'])
fig_incidents_states.update_layout(title_text='Gun Violence Incidents by State')
fig_incidents_states.show(renderer='notebook')

We have created a map of the United States that displays the number of gun violence incidents in each state. The map uses a color scale in which darker colors indicate more incidents. Hovering the mouse over any state reveals the number of gun violence incidents, the state name, and the state abbreviation. The map shows that Illinois reported the highest number of incidents, with 17.556k, followed by California with 16.306k incidents and Florida with 15.029k incidents. Texas ranks fourth among all the states. Overall, the color of the map becomes lighter as one moves from the east coast to the west coast, with the exception of California, which remains very dark. Similarly, the southern states report more incidents than the northern states.
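The ordering of the top three states can be checked by sorting the counts. The sketch below uses only the three figures mentioned in the text, assuming the hover labels `17.556k`, `16.306k` and `15.029k` stand for 17,556, 16,306 and 15,029 incidents.

```python
# Incident counts quoted in the text (hover labels read as whole numbers)
state_counts = {"Florida": 15029, "Illinois": 17556, "California": 16306}

# Sort states from most to fewest incidents
ranking = sorted(state_counts.items(), key=lambda kv: kv[1], reverse=True)

for rank, (state, count) in enumerate(ranking, start=1):
    print(f"{rank}. {state}: {count:,}")
```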

c. Common Characteristics in Gun Violence cases¶

The dataset also lets us look for common characteristics among the people involved in gun violence cases. We try to find which gender and age groups are most often involved.

First, we extract gender and age from the original dataframe and separate them into suspects and victims.
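The separation relies on the participant lists being aligned: the j-th age and the j-th participant type describe the same person. A minimal sketch of the idea for a single incident follows; `split_by_type` is a hypothetical helper, and the values are those of row 2 in the preview table.

```python
def split_by_type(values, types):
    """Return (victim_values, suspect_values) from two aligned lists."""
    victims = [v for v, t in zip(values, types) if t == "Victim"]
    suspects = [v for v, t in zip(values, types) if t == "Subject-Suspect"]
    return victims, suspects


# Ages and participant types of one incident (row 2 of the preview table)
ages = ["25", "31", "33", "34", "33"]
types = ["Subject-Suspect", "Subject-Suspect", "Victim", "Victim", "Victim"]

victim_ages, suspect_ages = split_by_type(ages, types)
print(victim_ages, suspect_ages)
```

This only works when both lists have the same length, which is why rows with mismatched lengths have to be filtered out first.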

In [106]:
# keep the rows where 'participant_age' has as many entries as 'participant_type'

df_age = gun_str[gun_str['participant_age'].apply(len) == gun_str['participant_type'].apply(len)]

# of those, keep the rows where 'participant_age_group' also matches 'participant_type' in length
df_age_group2 = df_age[df_age['participant_age_group'].apply(len) == df_age['participant_type'].apply(len)]
# and finally the rows where 'participant_gender' matches as well
df_gender2 = df_age_group2[df_age_group2['participant_gender'].apply(len) == df_age_group2['participant_type'].apply(len)]
gun = df_gender2.copy()  # rename the dataframe; copy it so that new columns can be added safely
In [107]:
import warnings
warnings.filterwarnings('ignore')   # to hide the warning messages

# Use list comprehension to create lists for each participant type
victim_ages = [[age for j, age in enumerate(row['participant_age']) if row['participant_type'][j] == 'Victim'] for i, row in gun.iterrows()]
suspect_ages = [[age for j, age in enumerate(row['participant_age']) if row['participant_type'][j] == 'Subject-Suspect'] for i, row in gun.iterrows()]
victim_age_groups = [[age_group for j, age_group in enumerate(row['participant_age_group']) if row['participant_type'][j] == 'Victim'] for i, row in gun.iterrows()]
suspect_age_groups = [[age_group for j, age_group in enumerate(row['participant_age_group']) if row['participant_type'][j] == 'Subject-Suspect'] for i, row in gun.iterrows()]
victim_genders = [[gender for j, gender in enumerate(row['participant_gender']) if row['participant_type'][j] == 'Victim'] for i, row in gun.iterrows()]
suspect_genders = [[gender for j, gender in enumerate(row['participant_gender']) if row['participant_type'][j] == 'Subject-Suspect'] for i, row in gun.iterrows()]

# Add new columns to the 'gun' DataFrame to store the participant data
gun['suspect_ages'] = suspect_ages
gun['victim_ages'] = victim_ages
gun['victim_age_group'] = victim_age_groups
gun['suspect_age_group'] = suspect_age_groups
gun['victim_genders'] = victim_genders
gun['suspect_genders'] = suspect_genders
In [108]:
gun[['victim_ages','suspect_ages','victim_age_group','suspect_age_group','victim_genders',
     'suspect_genders']].head(10)
Out[108]:
victim_ages suspect_ages victim_age_group suspect_age_group victim_genders suspect_genders
2 [33, 34, 33] [25, 31] [Adult 18+, Adult 18+, Adult 18+] [Adult 18+, Adult 18+] [Male, Male, Male] [Male, Male]
3 [29, 33, 56] [33] [Adult 18+, Adult 18+, Adult 18+] [Adult 18+] [Female, Male, Male] [Male]
4 [18, 46, 14] [47] [Adult 18+, Adult 18+, Teen 12-17] [Adult 18+] [Female, Male, Male] [Female]
6 [51, 40, 9, 5, 2] [15] [Adult 18+, Adult 18+, Child 0-11, Child 0-11, Child 0-11] [Teen 12-17] [Male, Female, Male, Female, Female] [Male]
14 [34, 28, 23, 29] [29] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Adult 18+] [Male, Male, Male, Male] [Male]
18 [18, 22, 21, 29] [19, 22, 23] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Adult 18+, Adult 18+, Adult 18+] [Male, Female, Female, Male] [Male, Male, Male]
23 [18, 18, 18, 19] [41] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Adult 18+] [Male, Male, Male, Female] [Male]
27 [18, 18, 18, 19] [15, 17] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Teen 12-17, Teen 12-17] [Male, Male, Male, Male] [Male, Male]
29 [23, 34, 17, 25] [] [Adult 18+, Adult 18+, Teen 12-17, Adult 18+] [] [Male, Male, Female, Female] []
32 [33, 28, 29, 21] [25] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Adult 18+] [Male, Female, Male, Male] [Male]

Gender¶

We show the gender distribution of victims and suspects with pie charts.

In [109]:
gun_reindex=gun.reset_index()

# Select relevant columns for age and gender data
age_data = gun_reindex[['victim_ages', 'suspect_ages', 'victim_age_group', 'suspect_age_group']]
gender_data = gun_reindex[['victim_genders', 'suspect_genders']]

# Flatten victim and suspect gender lists and count male/female frequency
victim_genders = [gender for gender_list in gender_data['victim_genders'] for gender in gender_list]
victim_male_count = victim_genders.count('Male')
victim_female_count = victim_genders.count('Female')

suspect_genders = [gender for gender_list in gender_data['suspect_genders'] for gender in gender_list]
suspect_male_count = suspect_genders.count('Male')
suspect_female_count = suspect_genders.count('Female')
In [110]:
# Combined Plot
fig, axs = plt.subplots(1, 2)
fig.suptitle("Gender Ratio")
counts = [victim_male_count, victim_female_count]
counts2 = [suspect_male_count, suspect_female_count]
labels = ['Male', 'Female']
colors = ['slategrey', 'peachpuff']
# Subplot 1
axs[0].pie(counts, labels=labels, colors=colors, autopct="%1.1f%%")
axs[0].axis('equal')
axs[0].set_title('Gender Ratio of Victims')
# Subplot 2
axs[1].pie(counts2, labels=labels, colors=colors, autopct='%1.1f%%')
axs[1].axis("equal")
axs[1].set_title('Gender Ratio of Suspects')
Out[110]:
Text(0.5, 1.0, 'Gender Ratio of Suspects')

Most of the people involved in gun violence cases are male, among both victims and suspects. This is especially true of suspects, about 92.5 percent of whom are male.

Age¶

Similarly, we visualize the age distributions with histograms.

In [111]:
# create an empty list to store the victim ages
all_victim_ages = []

# iterate through each row of the DataFrame and append all the values in the "victim_ages" column to the new list
for index, row in age_data.iterrows():
    all_victim_ages += row['victim_ages']
In [112]:
# Convert the elements in the list to integers 

# create an empty list to store the integers
new_victim_age_list = []

for item in all_victim_ages:
    # use try/except to skip the elements that cannot be converted to integers
    try: 
        new_victim_age_list.append(int(item))
    except ValueError:
        continue
In [113]:
# create an empty list to store the suspect ages
all_suspect_ages = []

# iterate through each row of the DataFrame and append all the values in the "suspect_ages" column to the new list
for index, row in age_data.iterrows():
    all_suspect_ages += row['suspect_ages']

# Convert the elements in the list to integers

# create an empty list to store the integers
new_suspect_age_list = []

for item in all_suspect_ages:
    # use try/except to skip the elements that cannot be converted to integers
    try:
        new_suspect_age_list.append(int(item))
    except ValueError:
        continue
In [114]:
# Combined Plot
plt.rcParams["figure.autolayout"] = True
fig, axs = plt.subplots(1, 2, figsize = (9,4))
fig.suptitle('Distribution of Ages', fontsize=15)
x_lab = "Age"
y_lab = "Frequency"
# Subplot 1
axs[0].hist(new_victim_age_list, bins = 100, width = 1.5, color ='orange')
axs[0].set_title('Distribution of Victim Ages',fontsize=10)
axs[0].set_xlim([0, 80])
# Subplot 2
axs[1].hist(new_suspect_age_list, bins = 100, width = 1.5, color ='green')
axs[1].set_title('Distribution of Suspect Ages',fontsize=10)
axs[1].set_xlim([0, 80])
for i in range(2):
    axs[i].set_xlabel(x_lab)
    axs[i].set_ylabel(y_lab)

The distribution plots reveal that both victim and suspect age distributions are skewed to the right, suggesting that individuals involved in gun violence tend to be younger. The age groups with the highest frequency of involvement are 18-25 for victims and 18-30 for suspects.

d. Most Used Weapons in Crime¶

We wanted to know which type of gun was most frequently used in crime in each state between 2013 and 2018.

The dataset has a gun_type column that lists the type of each gun involved in an incident. We first combine the entries of this column for each state, then remove labels that are uninformative for our question (such as 'Unknown' and the generic 'Handgun'). Finally, we count how many times each remaining gun type appears, take the most frequent type in each state, and discuss the results shown in the pie chart below.
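The counting step can be sketched with the standard library's `collections.Counter`. The state names and gun lists below are hypothetical and serve only to show the mechanics; they are not values from the dataset.

```python
from collections import Counter

# Hypothetical gun-type lists per state, for illustration only
guns_by_state = {
    "State A": ["9mm", "Rifle", "9mm", "Unknown", "Shotgun"],
    "State B": ["Rifle", "Unknown", "Rifle", "9mm"],
    "State C": ["Unknown", "Unknown"],
}
ignored = {"Unknown"}  # labels that carry no information about the weapon

most_frequent = {}
for state, guns in guns_by_state.items():
    counts = Counter(g for g in guns if g not in ignored)
    # most_common(1) returns an empty list when nothing is left after filtering
    top = counts.most_common(1)
    most_frequent[state] = top[0] if top else (None, 0)

print(most_frequent)
```

A state whose entries are all uninformative ends up with `(None, 0)`, which makes it easy to exclude from a chart of the most frequent types.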

In [115]:
# start from a copy of the cleaned dataframe
df = gun_str.copy()

# we only focus on the gun types in each state from 2013 to 2018,
# so we keep just the columns needed for this analysis
df = df[['date', 'state', 'gun_type']]

def append_lists(x):
    """Concatenate the gun-type lists of all incidents in a group."""
    result = []
    for l in x:
        # cells that were not parsed into a list (e.g. the string 'nan') hold no gun types
        if isinstance(l, list):
            result.extend(l)
    return result

# collect the gun types of all incidents in each state
grouped_df = df.groupby('state')['gun_type'].apply(append_lists).to_frame()

def filters(lst):
    """Strip any ':' left at the start of an entry."""
    return [element.lstrip(':') for element in lst]
grouped_df['gun_type'] = grouped_df['gun_type'].apply(filters)

# drop uninformative labels such as 'Unknown' and the generic 'Handgun'
stopwords = ["Unknown", "a", "n",":Unknown","::Unknown","Handgun","U","K","O","o","w",":"]
def filter_gun(lst):
    return [x for x in lst if x not in stopwords]
grouped_df['gun_type'] = grouped_df['gun_type'].apply(filter_gun)

# one row per state, with all of its gun types in 'gun_type_sum'
df = grouped_df.rename(columns={'gun_type': 'gun_type_sum'}).reset_index()

def get_most_frequent(row):
    """Return the most frequent gun type of a state and how often it occurs."""
    if len(row['gun_type_sum']) == 0:
        return pd.Series({'most_frequent': 0, 'count': 0})
    value_counts = pd.Series(row['gun_type_sum']).value_counts()
    return pd.Series({'most_frequent': value_counts.index[0], 'count': value_counts.iloc[0]})

result = df.merge(df.apply(get_most_frequent, axis=1), left_index=True, right_index=True)
In [116]:
# assume result is the DataFrame with the 'most_frequent' column
char = result[['most_frequent','count']]
category_counts = char['most_frequent'].value_counts()

# create a color map
color_map = plt.get_cmap('Blues')

# create a pie chart
fig, ax = plt.subplots()
ax.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%', 
       colors=color_map(np.linspace(0.3, 0.95, len(category_counts))), startangle=10, wedgeprops=dict(width=0.5, alpha=0.8), pctdistance=0.8)
ax.set_title('Most frequent gun types by state')
fig.tight_layout()
plt.show()

We find that the 9mm pistol is the most frequent gun type in the largest share of states. This is not surprising, since the 9mm pistol has several advantages: it is relatively easy to use, which makes it a popular choice for people with little firearms experience, and it is easier to carry than heavier weapons. The second most common type is the rifle. Rifles are accurate at long range and can cause severe harm, which may be why they are frequently used in crime.

e. Causes of Gun Violence¶

To answer this question, numerical data is of limited use; instead, we use natural language processing (NLP).

The dataframe has a column that briefly summarizes the characteristics of each gun violence (GV) case. Analyzing the most frequent words in these incident reports gives a general idea of the main reasons people commit gun crimes.
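As a complement to word-level counts, the category labels themselves can be counted, since each `incident_characteristics` cell is a `||`-separated list of labels. This is a different and simpler technique than the lemmatization used in this section, shown here only as a sketch; the two sample cells are shortened from the preview table.

```python
from collections import Counter

# Two incident_characteristics cells, shortened from the preview table
cells = [
    "Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Gang involvement",
    "Shot - Wounded/Injured||Shots Fired - No Injuries",
]

# Count each complete label instead of individual words
label_counts = Counter(
    label.strip()
    for cell in cells
    for label in cell.split("||")
)
print(label_counts.most_common(3))
```

Counting whole labels keeps multi-word categories such as 'Shot - Dead (murder, accidental, suicide)' together instead of spreading them over several separate words.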

In [117]:
lemmatizer = nltk.WordNetLemmatizer()
def change_tag(tag):
    """
    FUNCTION: convert a Penn Treebank POS tag (as returned by nltk.pos_tag) into a WordNet tag.
    For example, NN, NNS and NNP are all nouns, represented as 'n' in WordNet.
    RETURN: the WordNet POS tag
    """
    table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    return table.get(tag[0], wordnet.NOUN)  # Default to a noun.
In [118]:
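# join the reports with a space so that the last word of one report
# does not run into the first word of the next
raw = " ".join(gun_reindex["incident_characteristics"])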

words = nltk.word_tokenize(raw)
words = [x.lower() for x in words if x.isalnum()]
words_tags = nltk.pos_tag(words)
words = [lemmatizer.lemmatize(w, change_tag(t)) for (w, t) in words_tags]
stopword_self = ["gun", "dead", "fire", "find"]
stopwords = nltk.corpus.stopwords.words("english") + stopword_self
words = [w for w in words if w not in stopwords]

With the help of the Python wordcloud package, we can easily visualize the most frequent words in the reports.

In [119]:
#fq=nltk.FreqDist(words)
#%matplotlib inline 
#fq.plot(50,cumulative=False)
In [120]:
text = ' '.join(words)
stopwords=['injuredshot','le']
wordcloud = WordCloud(width=1500, height=1000, random_state=45,stopwords=stopwords, background_color="black", collocations=False).generate(text)

# display the word cloud
plt.figure(figsize=(15,10))
plt.title("WordCloud for GV Report")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

The words accidental, murder and suicide have the largest font in the picture above. Note that these three words appear together in the category label 'Shot - Dead (murder, accidental, suicide)', so their size largely reflects how often fatal shootings are recorded.

Most surprisingly, suicide is one of the largest words in the graph. This finding highlights the importance of addressing mental health issues and providing support for those who may be struggling with depression, anxiety or other conditions that may lead to suicidal thoughts. It also emphasizes the need for measures to help prevent people at risk of self-harm from obtaining firearms.

f. Relationship between Unemployment Rate and GV cases¶

Is unemployment one of the reasons people commit gun violence? To explore this question, our group decided to compute the correlation between the unemployment rate and the number of GV cases across states.

In [121]:
# scraping from website
url = 'https://www.icip.iastate.edu/tables/employment/unemployment-states'
tables = pd.read_html(url)
df = tables[1].drop(["FIPS", "1980", "1990", "2000", "2010", "2018"], axis=1).drop(0, axis = 0)
df = df.set_index("Area Name")
# Then we extract the number of GV cases per state in 2017 and merge them into the dataframe
start_date = '2017-01-01'
end_date = '2017-12-31'
# note: the strict '>' leaves out incidents dated 2017-01-01
mask = (gun_reindex['date'] > start_date) & (gun_reindex['date'] <= end_date)
gv_2017 = gun_reindex.loc[mask]
# incidents per state, indexed by state name
count2017 = gv_2017['state'].value_counts().astype(float).to_frame('incidents')
merged_df = pd.merge(df[['2017']], count2017, left_index=True, right_index=True)
merged_df.columns = ["Unemployment Rate 2017", "Gun Violence"]
merged_df = merged_df.applymap(float)
# Correlation calculation
corr = merged_df["Unemployment Rate 2017"].corr(merged_df["Gun Violence"])
In [122]:
corr
Out[122]:
0.26200459307773666

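The correlation is about 0.262.

This number indicates a weak but positive correlation between the two columns.

In other words, states with a higher unemployment rate tend to report somewhat more gun violence cases. A correlation alone does not show that unemployment causes gun violence, and the case counts are not adjusted for state population.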
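For reference, the Pearson correlation is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand and compares it with NumPy's `np.corrcoef`; the five data points are made up for illustration and are not taken from either data source.

```python
import numpy as np

# Made-up values for five hypothetical states
unemployment = np.array([3.0, 4.0, 4.5, 5.0, 6.0])              # percent
incidents = np.array([900.0, 2500.0, 1200.0, 3000.0, 1800.0])   # counts

# Pearson r: sum of co-deviations over the product of the deviation norms
dx = unemployment - unemployment.mean()
dy = incidents - incidents.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by NumPy
r_numpy = np.corrcoef(unemployment, incidents)[0, 1]

# r squared is the share of variance explained by a straight-line fit
print(f"r = {r_manual:.3f}, r^2 = {r_manual ** 2:.3f}")
```

Applied to the value reported above, r ≈ 0.262 gives r² ≈ 0.069, so a straight-line fit on the unemployment rate accounts for roughly 7% of the variation in state incident counts.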

For better visualization, we create a scatter plot and fit a line:

In [123]:
x = merged_df["Unemployment Rate 2017"]
y = merged_df["Gun Violence"]
a, b = np.polyfit(x, y, 1)
plt.scatter(x, y, alpha = 0.5)
plt.plot(x, a*x+b, color = "pink")
plt.xlabel("Unemployment Rate 2017 (%)")
plt.ylabel("Gun Violence Cases")
plt.title("Scatter Plot of Unemployment Rate 2017 vs Gun Violence")
Out[123]:
Text(0.5, 1.0, 'Scatter Plot of Unemployment Rate 2017 vs Gun Violence')

IV. Conclusion¶

  • Distribution of Gun Violence by time

    • Gun violence cases show an increasing trend from 2014 to 2017
    • Gun violence in the United States peaks in the summer
  • Distribution of Gun Violence by location:

    • Illinois has the most gun violence cases cumulatively from 2013 to 2018
    • California, Florida and Texas also have many gun violence cases
    • States in the middle of the United States usually have fewer gun violence cases than the coastal states
  • Characteristics of people involved in Gun Violence:

    • Gender: Among both victims and suspects, most people involved in gun violence are male
    • Age: Among both victims and suspects, most people involved in gun violence are between 18 and 30
  • Most common gun type used in Gun Violence: 9mm (Pistol)

  • Underlying causes of Gun Violence:

    • Accidental
    • Murder
    • Suicide
  • Relationship between Unemployment Rate and Gun Violence:

    • Weak positive correlation (about 0.262)
    • States with higher unemployment tend to report more gun violence cases, but this does not establish causation

Our project seeks to shed light on this complex issue and provide a comprehensive understanding of gun violence in the United States. We hope that our findings will contribute to ongoing efforts to prevent and reduce gun violence, making the country a safer place for all individuals.

V. References¶

https://github.com/jamesqo/gun-violence-data

https://www.gunviolencearchive.org/

https://www.icip.iastate.edu/tables/employment/unemployment-states