Analyze A/B Test Results

Introduction

A/B tests are very commonly performed by data analysts and data scientists. For this project, you will be working to understand the results of an A/B test run by an e-commerce website. Your goal is to work through this notebook to help the company understand if they should:

Implement the new webpage,
Keep the old webpage, or
Perhaps run the experiment longer to make their decision.

Part I - Probability

To get started, let's import our libraries.

In [1]:

import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
#We are setting the seed to assure you get the same answers on quizzes as we set up
random.seed(42)


ToDo 1.1

Now, read in the ab_data.csv data and store it in df. The data has a total of 5 columns, described below:

Data columns	Purpose	Valid values
user_id	Unique ID	Int64 values
timestamp	Time stamp when the user visited the webpage	-
group	In the current A/B experiment, the users are categorized into two broad groups. The `control` group users are expected to be served the `old_page`, and the `treatment` group users are expected to be served the `new_page`. However, some inaccurate rows are present in the initial data, such as a `control` group user matched with a `new_page`.	`['control', 'treatment']`
landing_page	It denotes whether the user visited the old or new webpage.	`['old_page', 'new_page']`
converted	It denotes whether the user decided to pay for the company's product. Here, `1` means yes, the user bought the product.	`[0, 1]`

Use your data frame to answer the questions in Quiz 1 of the classroom.

Tip: Please save your work regularly.

a. Read in the dataset from the ab_data.csv file and take a look at the top few rows here:

In [2]:

df = pd.read_csv('ab_data.csv')
df.head()

Out[2]:

	user_id	timestamp	group	landing_page	converted
0	851104	2017-01-21 22:11:48.556739	control	old_page	0
1	804228	2017-01-12 08:01:45.159739	control	old_page	0
2	661590	2017-01-11 16:55:06.154213	treatment	new_page	0
3	853541	2017-01-08 18:28:03.143765	treatment	new_page	0
4	864975	2017-01-21 01:52:26.210827	control	old_page	1

b. Use the cell below to find the number of rows in the dataset.

In [3]:

df.shape

Out[3]:

(294478, 5)

c. The number of unique users in the dataset.

In [4]:

df.nunique()

Out[4]:

user_id         290584
timestamp       294478
group                2
landing_page         2
converted            2
dtype: int64

d. The proportion of users converted.

In [5]:

df.converted.mean()*100

Out[5]:

11.965919355605511

e. The number of times when the "group" is treatment but "landing_page" is not a new_page.

In [6]:

df.query('landing_page != "new_page" & group == "treatment"').shape

Out[6]:

(1965, 5)

f. Do any of the rows have missing values?

In [7]:

df.isnull().any(axis=1).sum()

Out[7]:

ToDo 1.2

In a particular row, the group and landing_page columns should have either of the following acceptable values:

user_id	timestamp	group	landing_page	converted
XXXX	XXXX	`control`	`old_page`	X
XXXX	XXXX	`treatment`	`new_page`	X

It means, the control group users should match with old_page; and treatment group users should match with the new_page.

However, for the rows where treatment does not match with new_page or control does not match with old_page, we cannot be sure if such rows truly received the new or old webpage.

Use Quiz 2 in the classroom to figure out how we should handle the rows where the group and landing_page columns don't match.
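Before deciding what to do with the mismatched rows, it can help to see how many rows fall into each group/landing_page combination. The following is a minimal sketch on a small made-up data frame (the `toy` rows are hypothetical and not taken from ab_data.csv); the same two checks could be run on df.

```python
import pandas as pd

# Hypothetical toy rows, only to illustrate the check (not from ab_data.csv)
toy = pd.DataFrame({
    'group':        ['control', 'control', 'treatment', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'new_page'],
})

# Count every group/landing_page combination.
# control/new_page and treatment/old_page are the mismatched cells.
counts = pd.crosstab(toy['group'], toy['landing_page'])
print(counts)

# A row is consistent when "is treatment" and "is new_page" agree
is_consistent = (toy['group'] == 'treatment') == (toy['landing_page'] == 'new_page')
n_mismatched = int((~is_consistent).sum())
print(n_mismatched)
```

The cross-tabulation shows where the mismatches are, while the boolean mask gives a filter that can be reused to keep only the consistent rows.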

a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new data frame in df2.

In [8]:

# Remove the inaccurate rows, and store the result in a new dataframe df2
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df2 = pd.concat(
    [df.query('landing_page == "new_page" & group == "treatment"'),
     df.query('landing_page == "old_page" & group == "control"')],
    ignore_index=True)

df2.head()

Out[8]:

	user_id	timestamp	group	landing_page	converted
0	661590	2017-01-11 16:55:06.154213	treatment	new_page	0
1	853541	2017-01-08 18:28:03.143765	treatment	new_page	0
2	679687	2017-01-19 03:26:46.940749	treatment	new_page	1
3	817355	2017-01-04 17:58:08.979471	treatment	new_page	1
4	839785	2017-01-15 18:11:06.610965	treatment	new_page	1

In [9]:

df2.tail()

Out[9]:

	user_id	timestamp	group	landing_page
290580	718310	2017-01-21 22:44:20.378320	control	old_page
290581	751197	2017-01-03 22:28:38.630509	control	old_page
290582	945152	2017-01-12 00:51:57.078372	control	old_page
290583	734608	2017-01-22 11:45:03.439544	control	old_page
290584	697314	2017-01-15 01:20:28.957438	control	old_page

In [10]:

# Double Check all of the incorrect rows were removed from df2 - 
# Output of the statement below should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

Out[10]:

ToDo 1.3

Use df2 and the cells below to answer questions for Quiz 3 in the classroom.

a. How many unique user_ids are in df2?

In [11]:

df2.nunique()

Out[11]:

user_id         290584
timestamp       290585
group                2
landing_page         2
converted            2
dtype: int64

b. There is one user_id repeated in df2. What is it?

In [12]:

df2[df2.user_id.duplicated()].user_id

Out[12]:

1404    773192
Name: user_id, dtype: int64

c. Display the rows for the duplicate user_id.

In [13]:

df2[df2.user_id.duplicated()]

Out[13]:

	user_id	timestamp	group	landing_page	converted
1404	773192	2017-01-14 02:55:59.590927	treatment	new_page	0

d. Remove one of the rows with a duplicate user_id, from the df2 data frame.

In [14]:

# Remove one of the rows with a duplicate user_id..
# Hint: The dataframe.drop_duplicates() may not work in this case because the rows with duplicate user_id are not entirely identical. 
df2.drop(index = 1404 , inplace = True)

# Check again if the row with a duplicate user_id is deleted or not
df2[df2.user_id.duplicated()]

Out[14]:

	user_id	timestamp	group	landing_page	converted
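The hint in the cell above notes that drop_duplicates() may not work when whole rows are compared, because the two rows differ in their timestamp. As an alternative to dropping by index, the following minimal sketch restricts the comparison to the user_id column through the `subset` argument. The `toy` rows are hypothetical and only illustrate the call.

```python
import pandas as pd

# Hypothetical toy rows: user 2 appears twice with different timestamps
toy = pd.DataFrame({
    'user_id':   [1, 2, 2, 3],
    'timestamp': ['t1', 't2', 't3', 't4'],
    'converted': [0, 1, 1, 0],
})

# subset= limits the duplicate check to user_id, so rows that differ
# elsewhere (here: timestamp) are still treated as duplicates.
# keep='first' retains the first occurrence of each user_id.
deduped = toy.drop_duplicates(subset='user_id', keep='first')
print(deduped)
```

This avoids hard-coding a row index, which would silently point at the wrong row if the data frame were rebuilt in a different order.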

ToDo 1.4

Use df2 in the cells below to answer the quiz questions related to Quiz 4 in the classroom.

a. What is the probability of an individual converting regardless of the page they receive?

Tip: The probability you'll compute represents the overall "converted" success rate in the population and you may call it $p_{population}$.

In [15]:

df2.converted.sum() / df2.shape[0]

Out[15]:

0.11959708724499628

b. Given that an individual was in the control group, what is the probability they converted?

In [16]:

df_c = df2.query('group == "control"')
df_c.converted.sum() / df_c.shape[0]

Out[16]:

0.1203863045004612

c. Given that an individual was in the treatment group, what is the probability they converted?

In [17]:

df_t = df2.query('group == "treatment"')
df_t.converted.sum() / df_t.shape[0]

Out[17]:

0.11880806551510564

Tip: The probabilities you've computed in the points (b). and (c). above can also be treated as conversion rate. Calculate the actual difference (obs_diff) between the conversion rates for the two groups. You will need that later.

In [18]:

# Calculate the actual difference (obs_diff) between the conversion rates for the two groups.

obs_diff = df_t.converted.sum() / df_t.shape[0] - df_c.converted.sum() / df_c.shape[0] 

obs_diff

Out[18]:

-0.0015782389853555567

d. What is the probability that an individual received the new page?

In [19]:

df2.query('landing_page == "new_page"').shape[0] / df2.shape[0]

Out[19]:

0.5000619442226688

e. Consider your results from parts (a) through (d) above, and explain below whether the new treatment group users lead to more conversions.

Conclusions:
From the results of (b) and (c), the treatment conversion rate (0.1188) is lower than the control conversion rate (0.1204), so the old page converted slightly better than the new page in this sample.
However, the difference (about 0.0016 in absolute value) is very small and comes from a single sample, so these probabilities alone give no evidence that the new page leads to more conversions. A hypothesis test is needed to judge whether the difference is significant.

Part II - A/B Test

Since a timestamp is associated with each event, you could run a hypothesis test continuously as long as you observe the events.

However, then the hard questions would be:

Do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?
How long do you run to render a decision that neither page is better than another?

These questions are the difficult parts associated with A/B tests in general.

ToDo 2.1

For now, consider you need to make the decision just based on all the data provided.

Recall that you just calculated that the "converted" probability (or rate) for the old page is slightly higher than that of the new page (ToDo 1.4.c).

If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should be your null and alternative hypotheses ( $H_{0}$ and $H_{1}$ )?

You can state your hypothesis in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the "converted" probability (or rate) for the old and new pages respectively.

Null hypothesis $H_{0}$ : $p_{old} \geq p_{new}$
Alternative hypothesis $H_{1}$ : $p_{old} < p_{new}$

ToDo 2.2 - Null Hypothesis $H_{0}$ Testing

Under the null hypothesis $H_{0}$, assume that $p_{new}$ and $p_{old}$ are equal. Furthermore, assume that $p_{new}$ and $p_{old}$ both are equal to the converted success rate in the df2 data regardless of the page. So, our assumption is:

$$p_{new} = p_{old} = p_{population}$$

In this section, you will:

Simulate (bootstrap) a sample data set for each group, and compute the "converted" probability $p$ for those samples.

Use a sample size for each group equal to the ones in the df2 data.

Compute the difference in the "converted" probability for the two samples above.

Build the sampling distribution for the "difference in the converted probability" between the two simulated samples over 10,000 iterations, and calculate an estimate.

Use the cells below to provide the necessary parts of this simulation. You can use Quiz 5 in the classroom to make sure you are on the right track.
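The steps above can be sketched in a vectorized form. This is only an illustration under stated assumptions, not the notebook's own solution: `p_null`, `n_new`, and `n_old` are hypothetical placeholder values (in the notebook they would come from df2), and each binomial draw stands in for one simulated group of 0/1 conversions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical inputs for illustration; in the notebook these would be
# df2.converted.mean() and the sizes of the treatment and control groups
p_null = 0.12
n_new, n_old = 1000, 1000
n_iter = 10_000

# Each binomial draw is the number of conversions among n users who all
# convert with probability p_null; dividing by n gives a simulated rate
new_rates = rng.binomial(n_new, p_null, size=n_iter) / n_new
old_rates = rng.binomial(n_old, p_null, size=n_iter) / n_old

# Sampling distribution of the difference in rates under the null
p_diffs_null = new_rates - old_rates
print(p_diffs_null.mean(), p_diffs_null.std())
```

Drawing the conversion counts directly avoids creating 10,000 full arrays of 0's and 1's, so the simulation finishes quickly even with group sizes in the hundreds of thousands.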

a. What is the conversion rate for $p_{new}$ under the null hypothesis?

In [20]:

df2.converted.mean()

Out[20]:

0.11959708724499628

b. What is the conversion rate for $p_{old}$ under the null hypothesis?

In [21]:

df2.converted.mean()

Out[21]:

0.11959708724499628

c. What is $n_{new}$, the number of individuals in the treatment group?

Hint: The treatment group users are shown the new page.

In [22]:

df_t.shape[0]

Out[22]:

d. What is $n_{old}$, the number of individuals in the control group?

In [23]:

df_c.shape[0]

Out[23]:

e. Simulate Sample for the treatment Group
Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null hypothesis.

Hint: Use numpy.random.choice() method to randomly generate $n_{new}$ number of values.
Store these $n_{new}$ 1's and 0's in the new_page_converted NumPy array.

In [24]:

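# Simulate a sample for the treatment group under the null hypothesis:
# n_new draws of 0/1, each converting with probability p_new = df2.converted.mean()
p_new = df2.converted.mean()
new_page_converted = np.random.choice([0, 1], size=df_t.shape[0], p=[1 - p_new, p_new])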

f. Simulate Sample for the control Group
Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null hypothesis.
Store these $n_{old}$ 1's and 0's in the old_page_converted NumPy array.

In [25]:

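# Simulate a sample for the control group under the null hypothesis:
# n_old draws of 0/1, each converting with probability p_old = df2.converted.mean()
p_old = df2.converted.mean()
old_page_converted = np.random.choice([0, 1], size=df_c.shape[0], p=[1 - p_old, p_old])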

g. Find the difference in the "converted" probability $(p'_{new} - p'_{old})$ for your simulated samples from the parts (e) and (f) above.

In [26]:

new_page_converted.mean() - old_page_converted.mean()

Out[26]:

-0.00054029127550037082

h. Sampling distribution
Re-create new_page_converted and old_page_converted and find the $(p'_{new} - p'_{old})$ value 10,000 times using the same simulation process you used in parts (a) through (g) above.

Store all $(p'_{new} - p'_{old})$ values in a NumPy array called p_diffs.

In [27]:

# Sampling distribution 
p_diffs = []
size = df2.shape[0]
for _ in range(10000):
    samp = df2.sample(size, replace=True)
    new_page_converted = samp.query('group == "treatment"').converted.mean()
    old_page_converted = samp.query('group == "control"').converted.mean()

    p_diffs.append(new_page_converted - old_page_converted)

In [38]:

np.mean(p_diffs)

Out[38]:

-0.001570631157834118

In [33]:

obs_diff

Out[33]:

-0.0015782389853555567

i. Histogram
Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.

Also, use plt.axvline() method to mark the actual difference observed in the df2 data (recall obs_diff), in the chart.

Tip: Display title, x-label, and y-label in the chart.

In [34]:

# Convert to numpy array
p_diffs = np.array(p_diffs)

# Plot sampling distribution
plt.hist(p_diffs);

In [35]:

# create distribution under the null hypothesis
null_vals = np.random.normal(0, p_diffs.std(), p_diffs.size)

plt.hist(null_vals);

In [36]:

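# Plot observed statistic with the null distribution
plt.hist(null_vals);
plt.title('Null distribution of the difference in conversion rates')
plt.xlabel('p_new - p_old')
plt.ylabel('Frequency')
plt.axvline(obs_diff, c='red')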

Out[36]:

<matplotlib.lines.Line2D at 0x7f94df3b0c50>

j. What proportion of the p_diffs are greater than the actual difference observed in the df2 data?

In [37]:

# p-value
(null_vals > obs_diff).mean()

Out[37]:

0.90900000000000003

k. Please explain in words what you have just computed in part j above.

What is this value called in scientific studies?
What does this value signify in terms of whether or not there is a difference between the new and old pages? Hint: Compare the value above with the "Type I error rate (0.05)".

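This value is called the p-value: the probability of observing a difference at least as favorable to the alternative hypothesis as obs_diff, assuming the null hypothesis ( $H_{0}$ ) is true.
If the p-value is large, the data are consistent with the null hypothesis, so we fail to reject $H_{0}$ (there is no evidence that the new page has a better conversion rate).
If the p-value <= Type I error rate, we reject the null hypothesis ( $H_{0}$ ) in favor of the alternative hypothesis ( $H_{1}$ ) (the new page has a better conversion rate).
Conclusion: the p-value (0.909) > Type I error rate (0.05), so we fail to reject the null hypothesis ( $H_{0}$ ); there is no evidence that the new page has a better conversion rate.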

l. Using Built-in Methods for Hypothesis Testing
We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance.

Fill in the statements below to calculate the:

convert_old: number of conversions with the old_page
convert_new: number of conversions with the new_page
n_old: number of individuals who were shown the old_page
n_new: number of individuals who were shown the new_page

In [39]:

import statsmodels.api as sm

# number of conversions with the old_page
convert_old = df_c.converted.sum()

# number of conversions with the new_page
convert_new = df_t.converted.sum()

# number of individuals who were shown the old_page
n_old = df_c.shape[0]

# number of individuals who received new_page
n_new = df_t.shape[0]

m. Now use sm.stats.proportions_ztest() to compute your test statistic and p-value. Here is a helpful link on using the built-in.

The syntax is:

proportions_ztest(count_array, nobs_array, alternative='larger')

where,

count_array = represents the number of "converted" for each group
nobs_array = represents the total number of observations (rows) in each group
alternative = choose one of the values from ['two-sided', 'smaller', 'larger'] depending upon two-tailed, left-tailed, or right-tailed respectively.
Hint:
It's a two-tailed test if you defined $H_{1}$ as $(p_{new} \neq p_{old})$.
It's a left-tailed test if you defined $H_{1}$ as $(p_{new} < p_{old})$.
It's a right-tailed test if you defined $H_{1}$ as $(p_{new} > p_{old})$.

The built-in function above will return the z_score, p_value.

About the two-sample z-test

Recall that you have plotted a distribution p_diffs representing the difference in the "converted" probability $(p'_{new} - p'_{old})$ for your two simulated samples 10,000 times.

Another way for comparing the mean of two independent and normal distributions is a two-sample z-test. You can perform the Z-test to calculate the Z_score, as shown in the equation below:

$$Z_{score} = \frac{(p'_{new} - p'_{old}) - (p_{new} - p_{old})}{\sqrt{\frac{\sigma_{new}^{2}}{n_{new}} + \frac{\sigma_{old}^{2}}{n_{old}}}}$$

where,

$p'$ is the "converted" success rate in the sample
$p_{new}$ and $p_{old}$ are the "converted" success rate for the two groups in the population.
$\sigma_{new}$ and $\sigma_{old}$ are the standard deviations for the two groups in the population.
$n_{new}$ and $n_{old}$ represent the size of the two groups or samples (they are nearly the same in our case)

Z-test is performed when the sample size is large, and the population variance is known. The z-score represents the distance between the two "converted" success rates in terms of the standard error.

Next step is to make a decision to reject or fail to reject the null hypothesis based on comparing these two values:

$Z_{score}$
$Z_{α}$ or $Z_{0.05}$, also known as the critical value at the 95% confidence level. $Z_{0.05}$ is 1.645 for one-tailed tests, and 1.960 for two-tailed tests. You can determine the $Z_{α}$ from the z-table manually.

Decide if your hypothesis is either a two-tailed, left-tailed, or right-tailed test. Accordingly, reject OR fail to reject the null based on the comparison between $Z_{score}$ and $Z_{α}$.

Hint:
For a right-tailed test, reject null if $Z_{score}$ > $Z_{α}$.
For a left-tailed test, reject null if $Z_{score}$ < $-Z_{α}$.

In other words, we determine whether or not the $Z_{score}$ lies in the "rejection region" in the distribution. A "rejection region" is an interval where the null hypothesis is rejected if the $Z_{score}$ lies in that region.
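To make the formula concrete, here is a minimal sketch of a two-sample z-test for proportions written with the standard library only. It assumes the null hypothesis $p_{new} = p_{old}$, so the $(p_{new} - p_{old})$ term is zero, and it estimates both variances from a pooled conversion rate, which is a common choice rather than something prescribed by this notebook. The function names and the counts are hypothetical.

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-sample z statistic for H0: p_a == p_b (hypothetical helper)."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Under H0 both groups share one rate, estimated from the pooled data
    p_pool = (count_a + count_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical counts: 110 of 1000 conversions (new) vs 120 of 1000 (old)
z = two_proportion_z(110, 1000, 120, 1000)

# Right-tailed test (H1: p_new > p_old): area to the right of z
p_right = 1 - normal_cdf(z)
print(z, p_right)

# Reject H0 at the 5% level only if z exceeds the critical value 1.645
reject_null = z > 1.645
print(reject_null)
```

A negative z-score in a right-tailed test lands far from the rejection region, which is why the corresponding p-value is well above 0.5.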

Reference:

Example 9.1.2 in "9.1: Comparison of Two Population Means - Large Independent Samples" (Chapter 9, Two-Sample Problems), courtesy www.stats.libretexts.org

Tip: You don't have to dive deeper into the z-test for this exercise. Try to get an overview of what a z-score signifies in general.

In [40]:

import statsmodels.api as sm
# ToDo: Complete the sm.stats.proportions_ztest() method arguments
z_score, p_value = sm.stats.proportions_ztest([convert_new,convert_old], [n_new,n_old], alternative='larger')

print(z_score, p_value)

-1.31092419842 0.905058312759

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?

Tip: Notice whether the p-value is similar to the one computed earlier. Accordingly, can you reject/fail to reject the null hypothesis? It is important to correctly interpret the test statistic and p-value.

The z-score ( $Z_{score}$ ) represents the distance between the two "converted" success rates in terms of the standard error. It is used to decide whether to reject or fail to reject the null hypothesis by comparing it with $Z_{α}$, where:
For a right-tailed test, reject the null if $Z_{score}$ > $Z_{α}$.
For a left-tailed test, reject the null if $Z_{score}$ < $-Z_{α}$.
If the p-value <= Type I error rate, we reject the null hypothesis ( $H_{0}$ ) in favor of the alternative hypothesis ( $H_{1}$ ) (the new page has a better conversion rate).
In our case (a right-tailed test):
$Z_{score}$ (-1.31092419842) < $Z_{α}$ (1.645)
p-value (0.905) > Type I error rate (0.05)
which means the null hypothesis ( $H_{0}$ ) is not rejected: there is no evidence that the new page has a better conversion rate.
Although the values are slightly different (0.905 here versus 0.909 from the simulation), the z-score and p-value agree with the findings in parts j. and k.

Part III - A regression approach

ToDo 3.1

In this final part, you will see that the result you achieved in the A/B test in Part II above can also be achieved by performing regression.

a. Since each row in the df2 data is either a conversion or no conversion, what type of regression should you be performing in this case?

Logistic regression.

b. The goal is to use statsmodels library to fit the regression model you specified in part a. above to see if there is a significant difference in conversion based on the page type a customer receives. However, you first need to create the following two columns in the df2 data frame:

intercept - It should be 1 in the entire column.
ab_page - It's a dummy variable column, having a value 1 when an individual receives the treatment, otherwise 0.

In [41]:

df2['intercept'] = 1
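# get_dummies orders its columns alphabetically (new_page, old_page);
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df2[['ab_page' , 'old_page']] = pd.get_dummies(df2['landing_page'], dtype=int)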

df2.head()

Out[41]:

	user_id	timestamp	group	landing_page	converted	intercept	ab_page
0	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1
1	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1
2	679687	2017-01-19 03:26:46.940749	treatment	new_page	1	1	1
3	817355	2017-01-04 17:58:08.979471	treatment	new_page	1	1	1
4	839785	2017-01-15 18:11:06.610965	treatment	new_page	1	1	1

c. Use statsmodels to instantiate your regression model on the two columns you created in part (b). above, then fit the model to predict whether or not an individual converts.

In [42]:

log_mod = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
results = log_mod.fit()

Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6

d. Provide the summary of your model below, and use it as necessary to answer the following questions.

In [43]:

results.summary2()

Out[43]:

Model:	Logit	No. Iterations:	6.0000
Dependent Variable:	converted	Pseudo R-squared:	0.000
Date:	2021-11-25 21:47	AIC:	212780.3502
No. Observations:	290584	BIC:	212801.5095
Df Model:	1	Log-Likelihood:	-1.0639e+05
Df Residuals:	290582	LL-Null:	-1.0639e+05
Converged:	1.0000	Scale:	1.0000

	Coef.	Std.Err.	z	P>\|z\|	[0.025	0.975]
intercept	-1.9888	0.0081	-246.6690	0.0000	-2.0046	-1.9730
ab_page	-0.0150	0.0114	-1.3109	0.1899	-0.0374	0.0074

e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II?

Hints:

What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in Part II?
You may comment on if these hypotheses (Part II vs. Part III) are one-sided or two-sided.
You may also compare the current p-value with the Type I error rate (0.05).

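The p-value associated with ab_page is 0.1899.
In logistic regression (Part III) the test is two-sided ( $H_{0}$ : the ab_page coefficient is 0, i.e. $p_{new} = p_{old}$; $H_{1}$ : the coefficient is not 0, i.e. $p_{new} \neq p_{old}$ ), while in the A/B test (Part II) the test is one-sided ( $H_{1}$ : $p_{new} > p_{old}$ ). That is why the p-value differs from the value we found in Part II. The two values are consistent with each other, since 2 × (1 - 0.905) ≈ 0.19.
p-value (0.1899) > Type I error rate (0.05), so we fail to reject the null hypothesis ( $H_{0}$ ): there is no evidence that the page type affects conversion.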
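The link between the Part II and Part III p-values can be checked numerically. The sketch below takes the z statistic for ab_page reported in the summary (-1.3109) and converts it into a right-tailed and a two-sided p-value using the standard normal distribution. `normal_cdf` is a small helper written for this illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = -1.3109  # z statistic for ab_page, copied from the regression summary

# Part II style, right-tailed (H1: p_new > p_old): area to the right of z
p_right_tailed = 1 - normal_cdf(z)

# Part III style, two-sided (H1: p_new != p_old): area in both tails beyond |z|
p_two_sided = 2 * (1 - normal_cdf(abs(z)))

print(round(p_right_tailed, 4), round(p_two_sided, 4))
```

The same test statistic gives roughly 0.905 for the one-sided test and roughly 0.19 for the two-sided test, matching the values from proportions_ztest and from the regression summary.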

f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add to your regression model. Are there any disadvantages to adding additional terms into your regression model?

It is a good idea to consider other factors because many things besides the page can affect whether an individual converts (for example age, education, or country), and including them in the regression model can explain more of the variation in conversion.
The disadvantage is that the model becomes more complex and harder to interpret, and higher-order terms may be needed to capture the relationships. The added variables may also be correlated with each other (multicollinearity), which makes the coefficient estimates unreliable.

g. Adding countries
Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in.

You will need to read in the countries.csv dataset and merge together with your df2 datasets on the appropriate rows. You call the resulting data frame df_merged. Here are the docs for joining tables.
Does it appear that country had an impact on conversion? To answer this question, consider the three unique values, ['UK', 'US', 'CA'], in the country column. Create dummy variables for these country columns.
Hint: Use pandas.get_dummies() to create dummy variables. You will utilize two columns for the three dummy variables.
Provide the statistical output as well as a written response to answer this question.

In [44]:

# Read the countries.csv
df_3 = pd.read_csv('countries.csv')
df_3.head()

Out[44]:

	user_id	country
0	834778	UK
1	928468	US
2	822059	UK
3	711597	UK
4	710616	UK

In [45]:

# Join with the df2 dataframe
df_m = df2.merge(df_3, on ='user_id' , how='inner')
df_m.nunique()

Out[45]:

user_id         290584
timestamp       290584
group                2
landing_page         2
converted            2
intercept            1
ab_page              2
old_page             2
country              3
dtype: int64

In [46]:

# Create the necessary dummy variables
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df_m[['CA','UK','US']]=pd.get_dummies(df_m['country'], dtype=int)
df_m.head()

Out[46]:

	user_id	timestamp	group	landing_page	converted	intercept	ab_page	country	CA	UK	US
0	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1	US	0	0	1
1	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1	US	0	0	1
2	679687	2017-01-19 03:26:46.940749	treatment	new_page	1	1	1	CA	1	0	0
3	817355	2017-01-04 17:58:08.979471	treatment	new_page	1	1	1	UK	0	1	0
4	839785	2017-01-15 18:11:06.610965	treatment	new_page	1	1	1	CA	1	0	0

h. Fit your model and obtain the results
Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

Provide the summary results (statistical output), and your conclusions (written response) based on the results.

Tip: Conclusions should include both statistical reasoning, and practical reasoning for the situation.
Hints:
Look at all of p-values in the summary, and compare against the Type I error rate (0.05).
Can you reject/fail to reject the null hypotheses (regression model)?
Comment on the effect of page and country to predict the conversion.

In [47]:

# Fit your model, and summarize the results

log_mod = sm.Logit(df_m['converted'], df_m[['intercept', 'ab_page' , 'UK' , 'US']])
results = log_mod.fit()
results.summary2()

Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6

Out[47]:

Model:	Logit	No. Iterations:	6.0000
Dependent Variable:	converted	Pseudo R-squared:	0.000
Date:	2021-11-25 21:48	AIC:	212781.1253
No. Observations:	290584	BIC:	212823.4439
Df Model:	3	Log-Likelihood:	-1.0639e+05
Df Residuals:	290580	LL-Null:	-1.0639e+05
Converged:	1.0000	Scale:	1.0000

	Coef.	Std.Err.	z	P>\|z\|	[0.025	0.975]
intercept	-2.0300	0.0266	-76.2488	0.0000	-2.0822	-1.9778
ab_page	-0.0149	0.0114	-1.3069	0.1912	-0.0374	0.0075
UK	0.0506	0.0284	1.7835	0.0745	-0.0050	0.1063
US	0.0408	0.0269	1.5161	0.1295	-0.0119	0.0934

In [48]:

np.exp(-0.0149)*100,np.exp(0.0506)*100,np.exp(0.0408)*100

Out[48]:

(98.521045572274687, 105.19020483004984, 104.16437559600236)

Comparing the p-values against the Type I error rate (0.05), it is clear that the p-values for ab_page, UK, and US (0.1912, 0.0745, 0.1295) are all greater than the Type I error rate (0.05).

So, we fail to reject the null hypothesis ( $H_{0}$ ): there is no evidence that the page or the country has a significant effect on conversion, and in particular no evidence that the new page has a better conversion rate.

For users who receive the new page (ab_page = 1), the odds of converting are 0.985 times the odds for users who receive the old page (about 1.5% lower), holding all other variables constant.
If an individual is from the UK, the odds of converting are 5.19% higher than if they are from CA, holding all other variables constant.
If an individual is from the US, the odds of converting are 4.16% higher than if they are from CA, holding all other variables constant.
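Exponentiating a logistic regression coefficient gives an odds ratio, not a probability. The short sketch below repeats the np.exp() step with the coefficients copied from the summary table and also reports the percentage change in the odds. The dictionary names are chosen for this illustration only.

```python
import math

# Coefficients copied from the summary table above (baseline: CA, old page)
coefs = {'ab_page': -0.0149, 'UK': 0.0506, 'US': 0.0408}

# exp(coefficient) is the multiplicative change in the odds of converting
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# (odds ratio - 1) * 100 is the percentage change in the odds
pct_changes = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f'{name}: odds ratio {odds_ratios[name]:.4f}, '
          f'change in odds {pct_changes[name]:+.2f}%')
```

An odds ratio below 1 (ab_page) means lower odds than the baseline, and a ratio above 1 (UK, US) means higher odds. With conversion rates near 12%, a change in odds of a few percent corresponds to a much smaller change in the conversion probability itself.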

In [49]:

# Adding Higher order terms

df_m['UK_new'] = df_m['UK'] * df_m['ab_page']
df_m['US_new'] = df_m['US'] * df_m['ab_page']
df_m.head()

Out[49]:

	user_id	timestamp	group	landing_page	converted	intercept	ab_page	country	CA	UK	US	UK_new	US_new
0	661590	2017-01-11 16:55:06.154213	treatment	new_page	0	1	1	US	0	0	1	0	1
1	853541	2017-01-08 18:28:03.143765	treatment	new_page	0	1	1	US	0	0	1	0	1
2	679687	2017-01-19 03:26:46.940749	treatment	new_page	1	1	1	CA	1	0	0	0	0
3	817355	2017-01-04 17:58:08.979471	treatment	new_page	1	1	1	UK	0	1	0	1	0
4	839785	2017-01-15 18:11:06.610965	treatment	new_page	1	1	1	CA	1	0	0	0	0

In [50]:

# Fit the model, and summarize the results

log_mod = sm.Logit(df_m['converted'], df_m[['intercept', 'ab_page' , 'UK' , 'US', 'UK_new', 'US_new']])
results = log_mod.fit()
results.summary2()

Optimization terminated successfully.
         Current function value: 0.366109
         Iterations 6

Out[50]:

Model:	Logit	No. Iterations:	6.0000
Dependent Variable:	converted	Pseudo R-squared:	0.000
Date:	2021-11-25 21:48	AIC:	212782.6602
No. Observations:	290584	BIC:	212846.1381
Df Model:	5	Log-Likelihood:	-1.0639e+05
Df Residuals:	290578	LL-Null:	-1.0639e+05
Converged:	1.0000	Scale:	1.0000

	Coef.	Std.Err.	z	P>\|z\|	[0.025	0.975]
intercept	-2.0040	0.0364	-55.0077	0.0000	-2.0754	-1.9326
ab_page	-0.0674	0.0520	-1.2967	0.1947	-0.1694	0.0345
UK	0.0118	0.0398	0.2957	0.7674	-0.0663	0.0899
US	0.0175	0.0377	0.4652	0.6418	-0.0563	0.0914
UK_new	0.0783	0.0568	1.3783	0.1681	-0.0330	0.1896
US_new	0.0469	0.0538	0.8718	0.3833	-0.0585	0.1523

In [51]:

np.exp(-0.0674)*100 , np.exp(0.0118)*100 , np.exp(0.0175)*100 , np.exp(0.0783)*100 , np.exp(0.0469)*100

Out[51]:

(93.482119806188351,
 101.18698946484011,
 101.76540221507618,
 108.14470441230692,
 104.80172021191829)

Comparing the p-values against the Type I error rate (0.05), it is clear that the p-values of all predictors (ab_page, UK, US, UK_new, US_new) are greater than the Type I error rate (0.05).

So, we fail to reject the null hypothesis ( $H_{0}$ ): there is no evidence that the page, the country, or their interaction has a significant effect on conversion, and in particular no evidence that the new page has a better conversion rate.

Because the model includes interaction terms, each coefficient is interpreted relative to the baseline (CA users on the old page):
For CA users, the odds of converting on the new page are 0.935 times the odds on the old page (about 6.52% lower).
For old-page users, the odds of converting are 1.19% higher for UK users than for CA users.
For old-page users, the odds of converting are 1.77% higher for US users than for CA users.
The new-page odds ratio for UK users is 1.081 times the new-page odds ratio for CA users (about 8.14% higher).
The new-page odds ratio for US users is 1.048 times the new-page odds ratio for CA users (about 4.80% higher).
None of these effects is statistically significant.
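With interaction terms in the model, a single coefficient no longer describes a whole country/page combination. One way to read the fit is to add up the relevant coefficients for each combination and convert the resulting log-odds into a predicted conversion rate. The sketch below does this with the coefficients copied from the summary table. `predicted_rate` is a hypothetical helper written for this illustration.

```python
import math

# Coefficients copied from the interaction-model summary above
# (baseline: CA users on the old page)
b = {'intercept': -2.0040, 'ab_page': -0.0674, 'UK': 0.0118,
     'US': 0.0175, 'UK_new': 0.0783, 'US_new': 0.0469}

def predicted_rate(country, new_page):
    """Predicted conversion probability for one country/page combination."""
    log_odds = b['intercept']
    if country in ('UK', 'US'):
        log_odds += b[country]               # country main effect
    if new_page:
        log_odds += b['ab_page']             # page main effect
        if country in ('UK', 'US'):
            log_odds += b[country + '_new']  # country-by-page interaction
    # Inverse logit turns log-odds into a probability
    return 1 / (1 + math.exp(-log_odds))

rates = {(country, page): predicted_rate(country, page == 'new')
         for country in ('CA', 'UK', 'US') for page in ('old', 'new')}

for key, rate in rates.items():
    print(key, round(rate, 4))
```

All six predicted rates fall within about one percentage point of each other, which fits the practical conclusion that neither the page nor the country makes a meaningful difference to conversion.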

Andrew Samy
Mar, 27 2022