Building a Logistic Regression
Create a logistic regression based on the bank data provided.
The data is based on the marketing campaign efforts of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Note that the first column of the dataset is the index.
Import the relevant libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# This workaround may no longer be needed with recent versions of the library
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
Load the ‘Example_bank_data.csv’ dataset.
from google.colab import files
uploaded = files.upload()
raw_data = pd.read_csv('Example_bank_data.csv')
raw_data
We want to know whether the bank marketing strategy was successful, so we need to transform the outcome variable into 0s and 1s in order to perform a logistic regression.
# Create a copy of the data before altering it, so the original data we loaded stays unchanged.
data = raw_data.copy()
# Removes the index column that came with the data
data = data.drop(['Unnamed: 0'], axis = 1)
# We use the map function to change any 'yes' values to 1 and 'no' values to 0.
data['y'] = data['y'].map({'yes':1, 'no':0})
data
# Check the descriptive statistics
data.describe()
Declare the dependent and independent variables
y = data['y']
x1 = data['duration']
Simple Logistic Regression
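# Add a constant (intercept) column to the predictor
x = sm.add_constant(x1)
# Fit the logistic regression of y on the constant and duration
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()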
# Get the regression summary
results_log.summary()
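To read the coefficients in the summary, it can help to see how a logit model turns a call duration into a probability. The following is a minimal sketch in plain Python. The intercept and slope values are hypothetical placeholders, not the values fitted above, and `predicted_probability` and `odds_ratio` are illustrative helpers rather than part of statsmodels.

```python
import math

# Hypothetical coefficients for illustration only; substitute the 'const'
# and 'duration' values reported in the regression summary.
CONST = -1.70
DURATION_COEF = 0.0051


def predicted_probability(duration, const=CONST, coef=DURATION_COEF):
    """Return P(y = 1) for a given duration under a logit model."""
    log_odds = const + coef * duration
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coef=DURATION_COEF, step=1.0):
    """Multiplicative change in the odds when duration rises by `step`."""
    return math.exp(coef * step)


if __name__ == "__main__":
    for d in (0, 200, 400, 800):
        print(d, round(predicted_probability(d), 3))
    print("odds ratio per 100 units:", round(odds_ratio(step=100), 3))
```

With a positive slope, longer durations move the predicted probability toward 1, and the exponentiated coefficient gives the factor by which the odds of subscribing change per unit of duration.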
# Create a scatter plot of x1 (Duration, no constant) and y (Subscribed)
plt.scatter(x1,y,color = 'C0')
# Don't forget to label your axes!
plt.xlabel('Duration', fontsize = 20)
plt.ylabel('Subscription', fontsize = 20)
plt.show()
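# Print floats with two decimals for readability
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
# To restore the default formatting, use: np.set_printoptions(formatter=None)
# Predicted probabilities for the observations used to fit the model
results_log.predict()
# Actual outcomes, for comparison
np.array(data['y'])
# Confusion matrix (rows: actual values, columns: predicted values)
results_log.pred_table()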
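# Label the confusion matrix so it is easier to read
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
cm_df
# Training accuracy: correct predictions (the diagonal) divided by all observations
cm = np.array(cm_df)
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum()
accuracy_train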
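The confusion matrix and accuracy can also be reproduced by hand, which makes the classification cutoff explicit. This is a sketch in plain Python using made-up probabilities and labels. `confusion_matrix` and `accuracy` are hypothetical helpers, the 0.5 threshold is assumed to match the default used by `pred_table()`, and the layout (rows actual, columns predicted) follows the labels used in `cm_df`.

```python
def confusion_matrix(actual, probabilities, threshold=0.5):
    """Return [[TN, FP], [FN, TP]]: rows are actual, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for y_true, p in zip(actual, probabilities):
        # Classify as 1 when the predicted probability exceeds the threshold
        y_pred = 1 if p > threshold else 0
        matrix[y_true][y_pred] += 1
    return matrix


def accuracy(matrix):
    """Share of observations on the diagonal (correctly classified)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    # Made-up data for illustration only
    actual = [0, 0, 1, 1, 1, 0]
    probs = [0.20, 0.65, 0.80, 0.40, 0.55, 0.10]
    cm = confusion_matrix(actual, probs)
    print(cm)
    print(accuracy(cm))
```

Because this accuracy is computed on the same data used to fit the model, it is a training accuracy, as the name `accuracy_train` suggests.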