Monte Carlo & Python

Oooooootay!!!!! Now, we’re going to use Python to simulate the Monte Carlo concept. Now, what in God’s name is Monte Carlo? It’s a concept that builds upon the predictive analytics we’ve been doing.

Okay, the best way to explain a Monte Carlo is to go back to linear regressions. In linear regressions, we predict results based on the data we have in front of us, but what if that data fluctuates? Furthermore, what if the one sample we happen to use sits on one end of the spectrum or the other? We’d get extreme results or outliers, not the results we’d get any other time.

To get around that, we’d run a Monte Carlo, which generates random inputs hundreds or even thousands of times so the extremes average out and we get a more reliable prediction.

The Monte Carlo method came out of the nuclear weapons work at Los Alamos in the 1940s, where Stanislaw Ulam and John von Neumann used random sampling to study how neutrons move through material. They had limited uranium, which posed a serious problem for physical testing. So, they ran Monte Carlos to find reliable numbers for the material needed and reduced how much material they wasted.

Here we’re going to use Python to run Monte Carlos for us. We’re going to simulate the rolling of dice and see how many times we win and lose.

First, as always, we import the tools we need to make this thing work. This time we only need one: the random module. So, let’s import:
import random

What random does is just as its name suggests: it picks random numbers for us.

This is going to be the basis for our function, which will simulate a dice roll. First, let’s create our function to roll the dice.
def rollDice():
    roll = random.randint(2,12)
    return roll

Okay, random.randint(2,12) is the main worker in this. It picks a random whole number from 2 to 12, both ends included, which covers every total you can get from a pair of dice. (It treats every total as equally likely, which real dice don’t, but it keeps the example simple.) We assign that number to the variable roll, then return roll to be used later in the code.
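One thing worth knowing: random.randint(2,12) makes every total from 2 to 12 equally likely, while a real pair of dice lands on 7 far more often than on 2 or 12. If you want the simulation to behave like two physical dice, a minimal sketch is to roll each die on its own and add them up. The name rollTwoDice and the tally at the bottom are my own additions, not part of the program in this post.

```python
import random

def rollTwoDice():
    # Roll each six-sided die separately, then add the two faces.
    die1 = random.randint(1, 6)
    die2 = random.randint(1, 6)
    return die1 + die2

# Tally 10,000 rolls to see how often each total shows up.
random.seed(1)  # fixed seed so the run can be repeated
tally = {total: 0 for total in range(2, 13)}
for _ in range(10000):
    tally[rollTwoDice()] += 1

for total in range(2, 13):
    print(total, tally[total])
```

Swapping this in for rollDice() would change the win rate, since totals of 8 or more come up less often with real dice.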

Next, we set our counters. For my program, I’m going to use 3. One to keep count of how many times the Monte Carlo has run, one to tabulate the wins, and another to tabulate the losses. Code is here:

count = 0
wins = 0
losses = 0

Next, we do another form of looping. This time we do a while loop. A while loop does just what it says: it keeps running while the condition you give it stays true. So, here’s the code and I’ll go thru and explain.

while count < 1000:
    result = rollDice()
    if result < 8:
        print("You lose.")
        losses += 1
    else:
        print("You are a winner.")
        wins += 1
    print(result)
    count += 1

Okay, so this says while the count is less than 1000, run the rollDice() function and save what it returns to result. Then we use an if statement to say if that result is less than 8, print out “You lose.” and add 1 to the number of losses. If not, print out “You are a winner.” and add 1 to the number of wins. After each roll the program will also print out the result, then add 1 to the count. It will do this until the count reaches 1000 and then stop.

Lastly, we summarize everything. We’ll use another if statement whose output changes depending on whether we got more wins than losses or vice versa. Here’s the code:

if wins > losses:
    print("Congratulations, you won " + str(wins) + " times.")
    print("You lost " + str(losses) + " times.")
else:
    print("You won " + str(wins) + " times.")
    print("However, you lost " + str(losses) + " times.")

print("Please try again.")

Okay, what we did was have the program print out the wins variable, which shows how many wins we counted, then the same for the losses. Depending on which is larger, the wording of the output changes. Then we add a closing message (“Please try again.”) and that’s that. We’ve made a program that runs Monte Carlos for us.
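If you’d rather get one number back instead of 1000 lines of printout, here’s a rough sketch of the same idea wrapped in a function (written for Python 3). The names runMonteCarlo and trials are my own. Since randint(2,12) has 11 equally likely results and 5 of them (8 through 12) count as wins, the win rate should settle somewhere near 5/11, or about 0.45, as the number of trials grows.

```python
import random

def rollDice():
    return random.randint(2, 12)

def runMonteCarlo(trials):
    # Count how many of the rolls are wins (8 or higher, same rule as above).
    wins = 0
    for _ in range(trials):
        if rollDice() >= 8:
            wins += 1
    return wins / trials  # fraction of rolls that won

random.seed(42)  # fixed seed so the run can be repeated
for trials in (10, 1000, 100000):
    print(trials, runMonteCarlo(trials))
```

The small runs bounce around a lot and the big run barely moves, which is the whole reason for running a Monte Carlo thousands of times.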

Here are the screens.

Screenshot from 2015-11-16 13:12:04

Python & Multiple Regression

So, now we’re going to code for multiple regression in Python. Multiple Regression takes multiple inputs in order to predict the output. This means that we’re going to have to learn how to write multiple inputs into our arrays in order to do the regression. It’s simple.

Arrays (Python calls them lists), which are what we have been using to hold our data, can hold a lot of different types of input. They can hold numbers, they can hold strings, but they can also hold other arrays. So, what we’re going to do is make our x array hold multiple x arrays, write the data into each array, then pass the x array to the multiple regression so it gets all of the x inputs at once.

First, I’ll post the data that we are doing a regression on in order for it to be easier to follow along with when we’re coding it.

Now, we’re going to make an array for each of the inputs in the data. We then take those arrays and place them into the x array that we’re going to use to do the regression.

y=[]
bed = []
bath = []
sqFt = []
x = [bed, bath, sqFt]
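To see what an array holding other arrays looks like once it has data in it, here’s a tiny sketch with made-up numbers (these three houses are not from the CSV). One catch: x holds one list per input, while a regression wants one row per house, so the data has to be flipped at some point. zip(*x) is a plain-Python way to do that flip; the rows name is my own.

```python
# Made-up data for three houses, just to show the shape.
bed = [3, 2, 4]
bath = [2, 1, 3]
sqFt = [1500, 900, 2200]
x = [bed, bath, sqFt]

print(x[0])     # the whole bed list: [3, 2, 4]
print(x[2][1])  # square footage of the second house: 900

# Flip from "one list per input" to "one row per house".
rows = [list(r) for r in zip(*x)]
print(rows)     # [[3, 2, 1500], [2, 1, 900], [4, 3, 2200]]
```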

Now, we write the data from the CSV into the arrays we set up in the last step.

for row in file:  # file is the csv.reader for the data
    bed.append(int(row[5]))
    bath.append(int(row[6]))
    sqFt.append(int(row[7]))
    y.append(float(row[8]))

Okay, this time we’re using 2 different modules to make this thing work: Numpy & StatsModels. We’re gonna import them both and give each one a shorter alias.

import numpy as nump
import statsmodels.api as gres

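Now, we have a function that does all the work, and all we’re gonna do is place our variables in. Then we’ll call the function to see the output.

def model(y, x):
    # flip x so each row is one observation and each column is one input
    x = nump.array(x).T
    # add a column of 1s so the regression can fit an intercept
    x = gres.add_constant(x)
    # fit ordinary least squares and hand back the results
    results = gres.OLS(endog=y, exog=x).fit()
    return results

print(model(y,x).summary())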

And here’s the result.

Screenshot from 2015-11-11 15:51:54

Showing Off

Aight, we know how to get Python to work for us. Now, let’s get more efficient. Our objective this time is to extract the GDP data from our CSV and run a regression. Simple. But this time we’re going to separate the data by region.

Now, the GDP data is already compiled and we’ll turn it into a CSV for Python to read and use it. But I’ll display it at the bottom.

Now, last time we covered importing the tools we need and we’ll do that again.

import csv
import scipy.stats as num

Now, it’s almost the same as last time, we need to open the file in the environment for it to read it. Turn that into a variable then call it for use. Make arrays, fill them, and then run the regression. What’s so different?

This time we’re going to make functions which will give us the exact data we’re looking for when we call the function.

First, let’s make a function. To make a function you first have to define it. To define a function you use the def keyword, then the name of the function, followed by a pair of parentheses and a colon. Should look something like this:
def theNameOfYourFunctionHere():

Normally, the parentheses hold what are called arguments. Basically, that’s the data you input into the function to make it work its magic for you. For this, we won’t need to input anything.

Now, after you create your function, press enter to get the next line, tab to indent, then place all of your code into your function. As long as it remains lined up with your original indent, it’ll all run under the name of your function. Create your function, place all of your code underneath, and that’s it. When you want to run it, call the function in your prompt and it’ll run.

So, I created 4 functions. One to run all the data as a whole, then 3 others that each run one region separately. In order to separate the regions, right under my for loop, I placed a bit of code:
if row[4] == "EMEA":
    uno.append(float(row[2]))
    dos.append(float(row[3]))

Basically, I placed an if statement inside the loop to filter out the stuff I wanted. Putting one block inside another like this is called nesting in programming. What it does is read the 5th column of each row in the CSV file and check whether it’s EMEA. If it is, the program grabs what’s in the 3rd and 4th columns of that row, which are where the GDP and Fertility data are held. Since computers start counting at 0, we call the 5th column by calling the 4th spot or row[4]. Same with the GDP and Fertility columns. 3rd column? row[2]. 4th column? row[3].
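Here’s a minimal sketch of that filter in action. The CSV text below is made up and only stands in for the real GDP file (the column positions match what’s described above, but the countries, numbers, and the APAC label are invented), and regionData is a hypothetical name. It also shows a function that does take an argument, so a single function can cover any region.

```python
import csv
import io

# Made-up rows: country, year, GDP, fertility, region
sample = io.StringIO(
    "Aland,2014,100.5,1.8,EMEA\n"
    "Bland,2014,250.0,2.1,APAC\n"
    "Cland,2014,75.25,3.0,EMEA\n"
)

def regionData(rows, region):
    uno = []  # GDP values for the chosen region
    dos = []  # fertility values for the chosen region
    for row in rows:
        if row[4] == region:  # 5th column holds the region
            uno.append(float(row[2]))
            dos.append(float(row[3]))
    return uno, dos

gdp, fertility = regionData(csv.reader(sample), "EMEA")
print(gdp)        # [100.5, 75.25]
print(fertility)  # [1.8, 3.0]
```

Writing one function per region works fine too; passing the region in as an argument just saves some copy and paste.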

I did this for each of the regions and all you have to do is call each function to get it to run the data. I added a prompt to the top to inform the user how to use the program and that’s it. Fully functional program that runs the data it’s given and separates out what you want. Here’s the screens.

Screenshot from 2015-11-09 18:55:09

Screenshot from 2015-11-09 18:55:22

Screenshot from 2015-11-09 18:55:27

Becoming a Snake Charmer

We’ve started our Python journey and learned how to make Python work math for us. Now we’re gonna run Python from a file, read & manipulate data from a file, and use a little iteration to work it like the pros do.

First, before we start coding we’re gonna do a little pre-gaming. Instead of inputting the data into Python by hand, we’re gonna have Python read the data and input the values itself. But first we have to put the data into a format that Python can read.

Python can read data from .CSV files. You can create .CSV files from almost any spreadsheet program. This means that with Excel, Libre & Open Office’s Calc programs, or Google Sheets, you can create the files needed for Python to read.

Once you’ve created the files, open any text editor to start coding the Python program that will handle all of this. The one thing to remember when naming your file is the .py suffix on the end of the file name, which marks the file as a Python program.

Alright, with everything all set, let’s start coding.

The first thing you want to do is import all the tools you’re gonna need to do the job. So, we start off with Scipy’s stats functions and a new tool, CSV.

CSV is the module in Python that allows you to read .CSV files into your program. We’ll be importing both of these tools. So, the first lines should look like this:
import csv
import scipy.stats as num

Next, we’ll read the file into the program then assign it to a variable in order to make it easier to call later. Code looks like this:
c = open('12-12.csv', 'rt')
file = csv.reader(c)

So, in that code, we used the open function to open the file, then assigned it to a variable. Then we used csv’s reader to read that variable and assigned the result to its own variable. Sounds complicated but all we did was take 2 actions and chain them together.

Now, we’re gonna run a loop to grab all of the data out of the file and read it into arrays which we can manipulate. So, what we need are 2 arrays to hold the X and Y values. So, we’re gonna initialize 2 arrays, one for X and one for Y. The [] stands for an empty array. So, you just declare X and Y equal to the brackets and you’ve got the arrays. Code is here:
x = []
y = []

Next, we run a loop that will read each row of the .CSV file. Then we’ll take that data and put it into the arrays we made earlier. So, we’ll run a for loop. The loop will read each row until it reaches the end of the file. We’re gonna grab each value as the reader reads ’em and place them into the arrays. Here’s the code then I’ll explain more:
for row in file:
    x.append(int(row[0]))
    y.append(int(row[1]))

Okay, the for loop reads each row in the file variable we made from that .CSV file. Then as the reader reads them, we use the .append method to add them to the arrays we made for X and Y. Since the .CSV has 2 columns, the reader hands us each row as a little list with 2 values, side by side. In order to grab each value, we have to tell row which position goes into which array. The 1st position is X, the 2nd position is Y. However, in programming the 1st position in an array is counted as 0. So, row[0] is for X, row[1] is for Y.

Lastly, we have to turn the data in the .CSV file into something that Scipy can work with. Most of the time, when you have a program reading something from another file, you have to do something called parsing. Programs see different types of data. Text is known as strings, and when the csv module reads the data in the file, it sees every value as a string, even the numbers. Strings, however, can’t be used in math, so we have to convert them into numbers. This is what parsing is. In Python, whole numbers fall into the int data type (numbers with decimals use float). So, when we use .append we parse the data into an int by placing int() inside the .append call, and inside of int() we place row[]. So, basically, we told the program what we wanted to grab with row[], turned it into an actual number with int(), then added that value to the array. It’s a mouthful but that little bit of code does all of that.
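To see the string problem for yourself, here’s a small sketch. The two-line CSV text is made up, and io.StringIO just lets it stand in for a file on disk; the variable names are my own.

```python
import csv
import io

# Two made-up rows standing in for a .CSV file.
rows = list(csv.reader(io.StringIO("3,10\n5,14\n")))

first = rows[0]
print(first)  # ['3', '10']: the quotes show these are strings

glued = first[0] + first[1]            # strings get stuck together
added = int(first[0]) + int(first[1])  # numbers get added

print(glued)  # 310
print(added)  # 13

# Values with a decimal point need float() instead of int().
half = float("2.5")
print(half)   # 2.5
```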

Now that we’ve got all the data, let’s confirm we got all of it by printing it out like so:
print(x)
print(y)

Now that we’ve got our X and Y variables all neat and in a format Scipy can deal with, we’ll let Scipy do its thing. We already imported Scipy’s stats functions, so now we call on those functions, tell them what we want, and what variables we’re gonna use. Here’s the code:
slope, intercept, rvalue, pvalue, stderr = num.linregress(x,y)

Now, once we’ve declared what we want, let’s display ’em.
print(slope)
print(intercept)
print(rvalue)
print(pvalue)
print(stderr)
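If you’re curious what goes into the slope and intercept, the least-squares formulas are short enough to write by hand. This is a sketch of the textbook formulas (written for Python 3), not SciPy’s actual code; simpleRegress is my own name and the data points are made up so the answer is easy to check.

```python
def simpleRegress(x, y):
    n = len(x)
    meanX = sum(x) / n
    meanY = sum(y) / n
    # How x and y move together, and how much x moves on its own.
    ssxy = sum((xi - meanX) * (yi - meanY) for xi, yi in zip(x, y))
    ssxx = sum((xi - meanX) ** 2 for xi in x)
    slope = ssxy / ssxx
    intercept = meanY - slope * meanX
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # made-up points that sit exactly on y = 2x + 1

slope, intercept = simpleRegress(x, y)
print(slope, intercept)  # 2.0 1.0
```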

That’s it. We’ve made Python work for us. Below, I’ll display my code and the outputs for all my problems.

Problem 12.7

12-7

Problem 12.8

12-8

Problem 12.11

12-11

Problem 12.12

12-12

Using Python to do the dirty work……

Alright, so we run using Python. So now, let’s USE Python. Python can be used for a whole host of activities. Today, we’ll use Python to run linear regression.

So, my book is a 6th edition while the current book is 8th so my numbers and problems will look a little different, but the commands are what’s going to be key.

As Prof. Holman went over in class, in order to make sure that you are able to run the commands needed to run linear regression in the Python console, make sure you load or import the modules or packages needed to run the work.

So, we’ll load Scipy’s stats module, which is the tool that lets you run Linear Regression. So, in the console type in:
import scipy.stats

Now, the tool is loaded and we’re ready to go.

To call different functions and methods, you type the module name, then a period, then the name of the function. Then you specify the data by placing it inside parentheses. Sometimes the function lives in a module inside another module, which means adding an extra period and name.

So, for stats functions in Scipy you type:
scipy.stats

And to go a step further and do a Pearson, you type:
scipy.stats.pearsonr(Your x variable, Your y variable)

Syntax in programming is IMPORTANT. If you don’t put these commands in exactly right, your program won’t run, and that includes capitalization (it’s pearsonr, not Pearsonr). That said, Python is much more forgiving than other programming languages.

Typically, to run Linear Regression the code would look like this:
scipy.stats.linregress(x,y)

But what if that’s too much and you don’t wanna type all of that? You can import a module under a shorter name (an alias), then call its functions from that name.

So, let’s say I want to use the name stuff. I can call all of the stats functions from stuff with this code:
import scipy.stats as stuff

Now, when I want to run Linear Regression, I type:
stuff.linregress(x,y)

If it’s Pearson then I type:
stuff.pearsonr(x,y)

You can even assign a variable to a single function. Suppose I want the variable lr to only do Linear Regression. The command then becomes:
lr = stuff.linregress

Then to run Linear Regression using your variable, you type:
lr(x,y)
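Here’s a small sketch of that aliasing trick that runs without installing anything. It uses Python’s built-in statistics module in place of scipy.stats purely as a stand-in (that swap, and the names avg and data, are my own); the mechanics are the same for any module.

```python
import statistics as stuff  # module alias, same idea as "import scipy.stats as stuff"

avg = stuff.mean  # no parentheses: this stores the function itself, it doesn't call it
data = [2, 4, 6, 8]

print(stuff.mean(data))   # call the function through the module alias
print(avg(data))          # call the same function through the shorter variable
print(avg is stuff.mean)  # True: both names point at the same function
```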

So, with that let’s get these problems done. My book is a different edition so I chose different problems. I chose 12.7, 12.8, 12.11, and 12.12.

My variables of x and y have the problem number behind it. So, x in 12.7 becomes x7, y becomes y7. X in 12.8 becomes x8, y becomes y8. So on and so forth.

Now, the screenshot of the code.

Problem 12.7
12.7

Problem 12.8
12.8

Problem 12.11
12.11

Problem 12.12
12.12

So, that’s it. That’s Python in a nutshell. I’m working on finding modules that will do data visualization as well. Good luck and Godspeed.

PYTHON!!!!!!!!!!!!!

So, today we use Python. Python is the programmer’s coding language. It’s what God would use to program. Simple, easy to use compared to other languages, and straightforward. Today, we print “Hello World”.

Basically, you open your console and type this into the console

print("Hello World")

print is the command that displays whatever you place after it, as long as you wrap that text in double quotes ("") or single quotes ('').

Your code should look like this:

Screenshot from 2015-10-28 11:16:47

Regression Output Interpretation

Okay, we’re doing an interpretation of an Anova (Analysis of Variance) chart. Using all the different tests you can find out if you’re getting the correct info & if it’s significant or just some numbers.

This time I have an Anova chart and I’ve color coded most of the chart to make it easier to read. Most of the things you need to find are Bolded and I’ll drape the answer in the color on the sheet to find. I’ll also explain different parts that hung me up that might help you understand better.

What is the R-Squared value?
0.9603 (SSR divided by SSyy, using the sums of squares at the end of this post)
How many observations are in the data set being analyzed?
137
Name the variables and their corresponding coefficients.
OPEC Spot Price 2.18980366377032
U.S. Finished Motor Gasoline Production 0.0123271155276218
U.S. Natural Gas Wellhead Price 6.06254143285014

What are the t-stat values for each independent variable?
OPEC Spot Price 23.8605965749108
U.S. Finished Motor Gasoline Production 3.77447566972457
U.S. Natural Gas Wellhead Price 6.15120186774688

What are the p-values for each independent variable?
OPEC Spot Price 0
U.S. Finished Motor Gasoline Production 0.000240640334457683
U.S. Natural Gas Wellhead Price 0.00000000841889297062203

Okay, now’s a good place to pause and explain what all this means in order for the next question to make sense. What we are looking at are the P & T values. These 2 play into each other and I’ll explain. A T value is the coefficient divided by its standard error, and it tells you whether the individual number you’re getting from a variable is statistically significant. So, does this stat I’m looking at mean anything? At α = .05 with a decent-sized sample, a T value bigger than about 2 (or smaller than about -2) is statistically significant, and it will come with a P value that’s below alpha. The further the T value is from 0, the more significant it is.

When examining P values, we are asking: if this variable really had no effect, how likely is it that we’d see a T value this far from 0 just by chance? Imagine that you’re looking for 4 leaf clovers and you’re searching by neighborhood. Imagine you found multiple batches of 4 leaf clovers. Would you be able to honestly say that your town has more clovers than anywhere else, or did you just get lucky in that neighborhood or even on that block? The P value puts a number on “just got lucky”. A small P value means the significance you found is very unlikely to be a fluke or random occurrence.

P Values have to be below Alpha (usually α = .05, or 1 in 20) to be significant. The lower the P value, the more significant it is.

So, in order for a stat to be significant it must:
1. Have a T Value above about 2 in absolute value (at α = .05)
2. Have a P Value below .05

Now, let’s answer the next question.
Which variable is the most significant?
OPEC Spot Price. Its T Value is the furthest from 0 AND its P Value is lower than those of the other variables.
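A quick sketch of that comparison in Python, using the T and P values from the answers above (T values rounded). Checking p < alpha and ranking by the size of the T value is the standard approach; the variable names are my own.

```python
alpha = 0.05

# name: (t-stat, p-value), copied from the answers above
results = {
    "OPEC Spot Price": (23.8606, 0.0),
    "U.S. Finished Motor Gasoline Production": (3.7745, 0.000240640334457683),
    "U.S. Natural Gas Wellhead Price": (6.1512, 0.00000000841889297062203),
}

# Keep every variable whose p-value is below alpha.
significant = [name for name, (t, p) in results.items() if p < alpha]

# Rank by how far the t-stat is from 0, biggest first.
ranked = sorted(results, key=lambda name: abs(results[name][0]), reverse=True)

print(significant)
print(ranked[0])  # OPEC Spot Price
```

All three variables pass the test here; the ranking is what separates them.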

Aight, let’s stop again. Remember when I said T values examine the significance of a single stat? The F Statistic takes that concept and expands it to the whole model, to find out if all the variables together are significant. In the same role as the P Value, F Significance makes sure that the result you get isn’t a fluke. It’s called the P Value of the F Stat at times.
What is the F statistic for the model?
1071.416213
What is the significance level of the F statistic?
0

Okay, a little terminology flip & we’re home free. The next 3 are easy enough to find, we’ve done these before. What may get confusing is understanding what is being asked. Depending on which text you’re reading & who is teaching, each Sum of Squares can be referred to by a different name or even a different acronym.

Sum of Squares Error or SSE can also be referred to as the Residual Sum of Squares.
Sum of Squares Regression or SSR is also referred to as the Regression Sum of Squares.
Sum of Squares for Y or SSyy is the most confusing, though. It is referred to as the Sum of All Y Squares or the Total Sum of Squares or SST. It’s the other two added together: SSyy = SSR + SSE.
What is SSE, the Sum of Squares Error?
24037.6954816273
What is SSyy, the Sum of Squares for y?
604963.484671533
What is SSR, the Sum of Squares Regression?
580925.789189906
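These numbers all tie together, so you can check the sheet against itself. A short sketch using the standard formulas: SSR + SSE should equal SSyy, R-Squared is SSR / SSyy, and the F statistic is (SSR / k) / (SSE / (n - k - 1)), where n is the number of observations and k is the number of independent variables. The variable names are my own; the values are copied from the answers above.

```python
sse = 24037.6954816273    # Sum of Squares Error
ssr = 580925.789189906    # Sum of Squares Regression
ssyy = 604963.484671533   # Sum of Squares for y (total)
n = 137                   # observations
k = 3                     # independent variables

# The regression and error pieces should add up to the total.
print(abs((ssr + sse) - ssyy) < 1e-6)  # True

rSquared = ssr / ssyy
fStat = (ssr / k) / (sse / (n - k - 1))

print(round(rSquared, 4))  # about 0.9603
print(round(fStat, 2))     # about 1071.42, matching the F statistic above
```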

Aight, that’s all I got. Hope I helped explain things. Later.

Demonstration Problem 13.1

Aight, now we’re doing a full readover of an ANOVA or Analysis of Variance. I don’t have the sheet itself so I’ll just answer.

What is the R-Squared value?
.710
How many observations are in the data set being analyzed?
15
Name the variables and their corresponding coefficients.
X Variable 1 5.7103
X Variable 2 -0.4169
X Variable 3 -3.4715

What are the t-stat values for each independent variable?
X Variable 1 3.19
X Variable 2 -1.29
X Variable 3 -2.41

What are the p-values for each independent variable?
X Variable 1 0.0087
X Variable 2 0.2222
X Variable 3 0.0349

Which variable is the most significant?
X Variable 1
What is the F statistic for the model?
8.96
What is the significance level of the F statistic?
0.0027
What is SSE, the Sum of Squares Error?
131723.20
What is SSyy, the Sum of Squares for y?
453670.00
What is SSR, the Sum of Squares Regression?
321946.82

Multiple Regression

Aight, so now we’re doing the Multiple Regression Formula. Multiple Regression is the same as Linear Regression; the only difference is that instead of plugging in one input, we’re gonna be plugging in multiple inputs, thus the name Multiple Regression.

For 13.2, we have 3 inputs (x) to find the y. We let Sheets do the hard work, then pull the coefficients and the intercept from the table. I highlighted the coefficients, intercept, and R Squared; I’ll explain how that last one factors in later.

So, the equation looks like this.

y = 118.475414523152 - 0.0646331667755486x1 - 0.87302230844643x2 + 0.331020145498286x3

Y Hat gives you the projected value of the formula. You use this to compare to the Actual Ys & create residuals.

But what if you want to find out how accurate the formula itself is? This is where R Squared comes into play.

The R Squared is in the table above as well as the Outputs. R Squared measures how much of the variation in the actual Y values the formula explains, on a scale of 0 to 1. 0 means the formula explains none of it. 1 means it’s dead on. So, now we look at R Squared.

R² = 0.973233579280226

So judging by the R Squared, the formula predicts the values pretty accurately.

Next, we want to plug in the values for the equation to see what the projected values are for it.

We’re using x1 = 33, x2 = 29, x3 = 13. The outputs for that are marked in red in the table above but also they are worked out in the text below.


-0.0646331667755486 * x1 = -0.0646331667755486 * 33 = -2.132894504
-0.87302230844643 * x2 = -0.87302230844643 * 29 = -25.31764694
0.331020145498286 * x3 = 0.331020145498286 * 13 = 4.303261891

And the equation…..


y = 118.475414523152 - 2.132894504 - 25.31764694 + 4.303261891

Y Hat = 95.32813497
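The same plug-and-chug can be handed to Python. A minimal sketch, with the intercept and coefficients copied from the 13.2 equation above; the variable names are my own.

```python
intercept = 118.475414523152
coefficients = [-0.0646331667755486, -0.87302230844643, 0.331020145498286]
inputs = [33, 29, 13]  # x1, x2, x3

# Multiply each coefficient by its input, add them up, then add the intercept.
yHat = intercept + sum(b * x for b, x in zip(coefficients, inputs))
print(round(yHat, 4))  # 95.3281
```

Change the numbers in inputs to get the projected value for any other set of x values.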

Next, we’ll do 13.6. First, the table. R Squared, Coefficients, and the Intercept are all highlighted in yellow.

So, the equation looks like this.

y = 17.6772412270152 - 0.059444423724484x1 - 0.118365834687559x2

The coefficients for the equation are negative. So, that basically means the larger the debt ratio & dividend payout, the lower the Insider Ownership Rate is among companies.

However, the R Squared for the formula might be a reason to pause.

R² = 0.00459097147147536

It’s at the very bottom of the 0 to 1 scale: the formula explains less than half a percent of the variation in the data, so its predicted values may not land anywhere close to the actual values.

In-Class Activity 10/5

So, you know we’re knocking out Multiple Regression. Our objective is to use the data to re-create the Regression’s Equation for the model.

Luckily, Sheets does most of the work for you. It’s done almost the same as Linear Regression, except for your X input you’re going to select the entire range of X variables.

What you need to pick out the parts of the equation are the X Coefficients & the Intercept. I highlighted them in yellow.

So, if we’re reading this correctly, the equation for 13.1 should read like this:

y = 25.029 - .05x1 + 1.928x2

x1 is the number of square feet, and -.05 is its coefficient. x2 is the age of the house, and 1.928 is its coefficient.

Next we do Activity 13.5

One extra X variable means one extra range to include in your input and its output in the table.