Introduction:
The goals of this assignment were to run a regression in SPSS, interpret the regression output and predict results from given data, manipulate Excel data and join it in ArcGIS, create a standardized residuals map in ArcGIS, and connect the statistical and spatial outputs generated. There were two parts to this assignment. Part 1 used SPSS to test the relationship between crime and free school lunches in Town X. Part 2 used SPSS to run regression analyses, including a multiple regression, of a number of variables against 911 calls in Portland, Oregon. Three maps were also produced in Part 2 and are discussed below.
Part 1:
The following background information was given for part one of this assignment. Data includes for given neighborhoods: percent of kids that get free lunch in the given areas and the crime rate per 100,000 people. A study on crime rates and poverty was conducted for Town X. The local news station got a hold of some data and made a claim that as the number of kids that get free lunches increases so does crime. You feel this might be a bit silly so you run a regression equation with the given data. Using SPSS to do your regression, determine if the news is correct. If a new area of town was identified as having 23.5% with a free lunch, what would the corresponding crime rate be? How confident are you in this result?
Figure 1: This is the SPSS output that was generated in the part 1 regression analysis of this assignment.
Results/questions answered:
Question 1: Is the prediction correct?
The R squared value in the SPSS output (Figure 1) was 0.173, which is very low. This shows that there is a weak relationship between crime and free school lunches in Town X. The independent variable in this problem was PerFreeLunch, and the dependent variable was the crime rate in the town (seen in the table as CrimeRate). The value of 0.173 indicates that PerFreeLunch does not explain CrimeRate very well: only 17.3% of the variation in CrimeRate in Town X is explained by PerFreeLunch. An R squared value closer to 1 would have indicated that the prediction was correct. Overall, the relationship between the variables is weak, so the news station's prediction is not supported.
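To make the idea of R squared as "proportion of variation explained" concrete, the following is a minimal sketch of a one-predictor least-squares fit. The numbers are invented for illustration and are not the Town X data, and the function name fit_simple_regression is hypothetical; SPSS reports the same three quantities in its coefficients and model summary tables.

```python
# Minimal sketch: ordinary least squares with one predictor, plus R squared.
# The data below are made up for illustration; they are NOT the Town X data.

def fit_simple_regression(x, y):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (unexplained variation / total variation).
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_resid / ss_total
    return intercept, slope, r_squared


# Hypothetical neighborhoods: percent free lunch and crime rate per 100,000.
per_free_lunch = [10.0, 20.0, 30.0, 40.0, 50.0]
crime_rate = [30.0, 25.0, 60.0, 40.0, 70.0]

intercept, slope, r_squared = fit_simple_regression(per_free_lunch, crime_rate)
print(f"y = {intercept:.3f} + {slope:.3f}x, R squared = {r_squared:.3f}")
```

An R squared of 0.173, as in the output discussed above, would mean the unexplained variation is about 83% of the total.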
Question 2: If a new area of town was identified as having 23.5% with a
free lunch, what would the corresponding crime rate be?
The general form of the regression equation used to answer this question is y = a + bx, where a is the constant and b is the slope (or regression coefficient). The regression equation from the SPSS output was: y = 21.819 + 1.685x. In this case, the x value put into the regression equation was 23.5%, entered as 0.235. With this x value, y = 21.819 + 1.685(0.235) = 22.215, so this would be the predicted crime rate per 100,000 people for the new area of Town X.
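The arithmetic above can be checked with a short sketch. The constant and slope are the values reported from the SPSS output; the function name predict_crime_rate is hypothetical. The second call is only a sensitivity check: it assumes, without confirmation from the data file, that the variable might be stored in percentage points (23.5) rather than as a proportion (0.235), in which case the same equation gives a very different answer.

```python
# Sketch of the prediction step, using the coefficients reported in the text.
INTERCEPT = 21.819  # constant (a) from the SPSS output
SLOPE = 1.685       # regression coefficient (b) for PerFreeLunch


def predict_crime_rate(per_free_lunch):
    """Predicted crime rate per 100,000 people: y = a + b * x."""
    return INTERCEPT + SLOPE * per_free_lunch


# x entered as a proportion, as done in this report.
as_proportion = predict_crime_rate(0.235)

# ASSUMPTION for comparison only: x stored in percentage points instead.
as_percent = predict_crime_rate(23.5)

print(f"x = 0.235 -> {as_proportion:.3f}")
print(f"x = 23.5  -> {as_percent:.3f}")
```

Which input is appropriate depends on how PerFreeLunch is coded in the data set, so it is worth checking the units of the column before relying on either value.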
Question 3: How confident are you in this result?
The null hypothesis in this situation is that there is no significant relationship between PerFreeLunch and CrimeRate, and the alternative hypothesis states that there is a significant relationship between PerFreeLunch and CrimeRate. The significance value for this regression analysis came out as .005, tested at the 95% confidence level (a significance level of .05). Because .005 is below .05, we reject the null hypothesis and conclude that there is a statistically significant relationship between PerFreeLunch and CrimeRate. So although the R squared value indicates that the relationship between these two variables is weak, the hypothesis test rejects the null hypothesis, which means that some level of relationship between these variables is present.
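The decision rule used in the hypothesis test can be written out as a small sketch. The function name decide and the wording of the returned strings are illustrative only; the threshold of .05 corresponds to the 95% level used in this assignment.

```python
# Sketch of the decision rule: compare the SPSS "Sig." value to alpha.
ALPHA = 0.05  # corresponds to the 95% confidence level used in this report


def decide(p_value, alpha=ALPHA):
    """Return the hypothesis-test decision for a given significance value."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Significance value reported for the PerFreeLunch regression.
print(decide(0.005))
```

Note that this rule only says whether a relationship is detectable; it says nothing about how strong the relationship is, which is what R squared describes.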
I am not very confident in these results. The predicted crime rate from the second question is extremely low, at only about 22 crimes per 100,000 people. The R squared value produced in this regression analysis was 0.173, which is extremely low and indicates a weak relationship between these variables. In the end, PerFreeLunch explains only 17.3% of the variation in CrimeRate in Town X, which leads to very little confidence in the predicted value, even though the hypothesis test indicates that some relationship is present.
Part 2:
The instructions for this part of the assignment were as follows:
The following data was provided for you regarding 911 calls
in Portland, OR. The City of Portland is
concerned about adequate responses to 911 calls. They are curious what factors might provide
explanations as to where the most calls come from. A company is interested in building a new
hospital and they are wondering how large an ER to build and the best place to
build it. While you cannot answer the
size of the ER, you might be able to provide some idea as to the influences
related to more or less calls and possibly where to build the hospital. The following data is provided to you:
Calls (number of 911 calls per census tract), Jobs, Renters,
LowEduc (Number of people with no HS Degree), AlcoholX (alcohol sales),
Unemployed, ForgnBorn (Foreign Born Pop), Med Income, CollGrads (Number of
College Grads)
Step 1 Part 2:
Dependent variable: 911 Calls
Independent variables: AlcoholX, LowEduc, PopDensity
I chose the three independent variables above and used them in regression analyses to see how much each one influences the dependent variable, which is 911 calls in Portland, Oregon. AlcoholX represents alcohol sales in Portland, LowEduc represents the number of people without a high school diploma, and PopDensity represents the population density per census tract in Portland. Using SPSS, a separate regression analysis was run for each independent variable to determine how much it influences the dependent variable.
The null hypothesis for this situation is that there is no statistical relationship between the independent variable (one of the three) and the dependent variable (911 calls). The alternative hypothesis is that there is a statistical relationship between the independent variable (one of the three) and the dependent variable of 911 calls.
The following four questions were answered for each regression done for each of the three independent variables.
Question 2: How strong of predictors are your independent variables? How do you know?
Question 3: State how much change in each variable for ONE unit of change?
Question 4: Do you reject or fail to reject the null hypothesis?
Figure 2: This is the SPSS output for the regression analysis done between the independent variable AlcoholX and the dependent variable of 911 calls.
LowEduc: The relationship between the independent variable LowEduc and the dependent variable of 911 calls is positive. The regression equation for this analysis is: Y = 0.166X + 3.931. The slope of 0.166 means that for each one-unit increase in LowEduc (one additional person without a high school degree), the predicted number of 911 calls increases by 0.166. The R squared value for this output was 0.567, which is fairly high and indicates that LowEduc (the independent variable) explains 56.7% of the variation in 911 calls in Portland (the dependent variable). This shows that LowEduc is a strong predictor of 911 calls. The significance value for this regression was .000, and because this value is less than .05, we reject the null hypothesis. Rejecting the null hypothesis means that there is a statistically significant relationship between LowEduc and 911 calls. Figure 3 below is the output table used for this analysis.
Figure 3: This is the regression analysis output for LowEduc (independent) and 911 calls (dependent).
PopDensity: The relationship between the independent variable PopDensity and the dependent variable of 911 calls is positive. The regression equation produced from the SPSS output (seen in Figure 4) is: Y = 21909.074X + 20.616. The slope of 21,909.074 means that for each one-unit increase in PopDensity, the predicted number of 911 calls increases by 21,909.074. The R squared value for this regression was 0.004, which is extremely low and indicates that PopDensity explains only 0.4% of the variation in 911 calls. The significance value for this regression was 0.555, and because this is greater than .05, we fail to reject the null hypothesis. In other words, the analysis found no evidence of a statistical relationship between PopDensity and 911 calls in Portland. PopDensity is therefore not a strong predictor of 911 calls, given both the extremely low R squared value and the high significance value.
Figure 4: This is the SPSS output from the regression analysis of PopDensity (independent) and 911 calls (dependent).
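The "one unit of change" interpretation for the LowEduc and PopDensity equations can be illustrated with a brief sketch. The slopes and constants are the ones reported above; the dictionary MODELS, the function names, and the input values (100 people, a density of 0.001) are made up for illustration.

```python
# Sketch: what the slope means in the two regression equations reported above.
MODELS = {
    "LowEduc": {"slope": 0.166, "intercept": 3.931},
    "PopDensity": {"slope": 21909.074, "intercept": 20.616},
}


def predicted_calls(name, x):
    """Predicted 911 calls for a tract: Y = slope * X + intercept."""
    model = MODELS[name]
    return model["slope"] * x + model["intercept"]


def change_per_unit(name, x):
    """Change in predicted calls when X increases by exactly one unit."""
    return predicted_calls(name, x + 1) - predicted_calls(name, x)


# Hypothetical tract with 100 people who have no high school degree.
print(round(predicted_calls("LowEduc", 100), 3))
# The one-unit change equals the slope, regardless of the starting value.
print(round(change_per_unit("LowEduc", 100), 3))
print(round(change_per_unit("PopDensity", 0.001), 3))
```

The very large PopDensity slope most likely reflects the scale of that variable (small decimal values per tract) rather than a strong effect, which is consistent with its R squared of 0.004.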
Step 2 Part 2 Choropleth Map and Residual Map:
In this part of the assignment, two maps were produced. The first was a choropleth map of the number of 911 calls per census tract in Portland, Oregon, shown and explained below as Figure 5. The second was a standardized residual map, created using the independent variable from above that had the highest R squared value, which was LowEduc. This residual map is shown and explained below as Figure 6.
Figure 5: This is the map that was generated for the number of 911 calls per census tract in Portland. The largest numbers of calls, ranging from 57 to 176, were made in the cluster of tracts in the northern part of Portland (shown in maroon). A similarly high number of 911 calls is also seen in one tract in the southeast corner of the map. Tracts in the central part of Portland had around 19 to 56 calls, indicated by the dark orange color. High numbers of 911 calls are therefore clustered in the north part of Portland, with somewhat high numbers in the central part of the city. This map can be used to help the construction company determine where (and how large) the new hospital should be in Portland. Since most of the 911 calls are concentrated in the north (maroon) area of Portland, this would most likely be the best choice for where to build the new hospital.
Figure 6: This residual map shows the standardized residuals (in standard deviations) for the regression of 911 calls (the dependent variable) on LowEduc (the independent variable), using the equation Y = 0.166X + 3.931. The yellow areas indicate where LowEduc did a good job of explaining 911 calls; the more yellow areas there are, the better the prediction made by the regression equation. The dark red and dark blue areas indicate the outliers, where the regression equation either over-predicted (dark blue) or under-predicted (dark red) the number of 911 calls. The lighter blue and red areas indicate the same over- or under-prediction, but to a much smaller degree. The red areas, where LowEduc under-predicted 911 calls, run north to south through the middle of the city, indicating that LowEduc under-predicted 911 calls in the inner city. The blue areas, where LowEduc over-predicted 911 calls, are located more in the outer tracts surrounding the city.
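The colors in a standardized residuals map come from a simple calculation, sketched below in simplified form. The tract values are invented, the function names are hypothetical, and the cut-off of 0.5 standard deviations for the yellow class is an assumption about the map legend; ArcGIS computes its own standardized residual field, which may differ slightly from this simplified version.

```python
import statistics

# Sketch: standardized residuals for the LowEduc equation Y = 0.166X + 3.931.
# Tract values below are hypothetical, not the Portland data.
SLOPE = 0.166
INTERCEPT = 3.931


def standardized_residuals(observed, predicted):
    """Residual = observed - predicted, rescaled to standard deviation units."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_resid = statistics.mean(residuals)
    sd_resid = statistics.stdev(residuals)
    return [(r - mean_resid) / sd_resid for r in residuals]


def classify(z, cutoff=0.5):
    """Map a standardized residual to a map class (cutoff is an assumption)."""
    if z > cutoff:
        return "under-predicted (red)"   # more calls than the model expected
    if z < -cutoff:
        return "over-predicted (blue)"   # fewer calls than the model expected
    return "well predicted (yellow)"


low_educ = [50, 120, 200, 310, 400]   # hypothetical LowEduc per tract
observed_calls = [10, 30, 35, 70, 60]  # hypothetical 911 calls per tract
predicted = [SLOPE * x + INTERCEPT for x in low_educ]

z_scores = standardized_residuals(observed_calls, predicted)
labels = [classify(z) for z in z_scores]
for calls, z, label in zip(observed_calls, z_scores, labels):
    print(f"calls={calls:3d}  z={z:+.2f}  {label}")
```

A positive residual means the tract had more calls than the equation predicted, which is why the red tracts are the ones where LowEduc alone falls short.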
Step 3 Part 2 Multiple Regression:
The following instructions were given for step 3 of this assignment:
Please run a multiple regression report (in
SPSS) with all the variables listed above (Calls (number of 911 calls per
census tract), Jobs, Renters, LowEduc (Number of people with no HS Degree),
AlcoholX (alcohol sales), Unemployed, ForgnBorn (Foreign Born Pop), Med Income,
CollGrads (Number of College Grads)).
Turn on the Collinearity Diagnostics by clicking. Is Multicollinearity
present? Explain the results provided by
SPSS.
This part of the assignment consisted of running a multiple regression with the variables listed in the instructions above. Collinearity diagnostics were also generated during the multiple regression analysis, and another standardized residuals map was generated based on the output. Figure 7 shows and explains the outputs for this multiple regression analysis.
Figure 7: This is the output that was generated in SPSS for the multiple regression analysis, including the collinearity diagnostics. The R squared value for this output is .783, which is quite high and indicates that the independent variables in this regression together explain 78.3% of the variation in 911 calls in Portland. The "Beta" column shows which independent variable had the most influence on this output, which turns out to be LowEduc, with a Beta value of .614. The collinearity diagnostics table can be used to check whether multicollinearity is present in this data set. As a rule of thumb, a condition index above 30 signals multicollinearity, and in this output none of the condition index values exceeded 30, which indicates that multicollinearity is not present. The eigenvalues column can also be interpreted: eigenvalues close to 0 account for little variance, which means multicollinearity may be present and that further investigation is needed to determine which variable is causing it.
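The link between the eigenvalue and condition index columns can be sketched as follows. Each condition index is the square root of the largest eigenvalue divided by that dimension's eigenvalue, so eigenvalues near 0 produce large condition indices. The eigenvalues below are invented and are not those in Figure 7; the function names are hypothetical, and the threshold of 30 is the rule of thumb used in this report.

```python
import math

# Sketch: condition indices from eigenvalues, with the rule-of-thumb cut-off.
THRESHOLD = 30  # rule of thumb used in this report


def condition_indices(eigenvalues):
    """Condition index for each dimension: sqrt(largest eigenvalue / eigenvalue)."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]


def multicollinearity_flagged(eigenvalues, threshold=THRESHOLD):
    """True if any condition index exceeds the threshold."""
    return any(ci > threshold for ci in condition_indices(eigenvalues))


# HYPOTHETICAL eigenvalues, ordered as in an SPSS diagnostics table.
hypothetical_eigenvalues = [6.5, 1.2, 0.6, 0.4, 0.2, 0.06, 0.03, 0.01]

indices = condition_indices(hypothetical_eigenvalues)
print([round(ci, 2) for ci in indices])
print("Multicollinearity flagged:", multicollinearity_flagged(hypothetical_eigenvalues))
```

In this made-up case the smallest eigenvalue gives a condition index of about 25.5, which stays under 30, so nothing is flagged.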
The next set of instructions for this part of the assignment was as follows:
Now run a Stepwise approach and explain the results. Which variable is the most important? Explain using the SPSS output. Go back to ArcGIS and make a Residual map
using the variables selected from the Stepwise Approach. Think how can this map help with the study
question?
Stepwise Multiple Regression:
Figure 8: This is the stepwise multiple regression output for step 3 of this assignment. The output is analyzed below, along with responses to the questions posed in the instructions above.
The stepwise method of multiple regression adds independent variables to the model one at a time and keeps only those that contribute significantly, which tends to leave out redundant variables and gives a result that is less influenced by collinearity. The three independent variables retained as most important and influential in this multiple regression were Renters, LowEduc, and Jobs. Together these three variables have an R squared value of 0.771, which is quite high and indicates that they explain 77.1% of the variation in 911 calls. In the "B" column of the output table above, each of these variables has a positive slope, and therefore a positive relationship with 911 calls. All three variables also have significance values below .05, so in this case we reject the null hypothesis (which states that there is no relationship between the variables). This supports the alternative hypothesis, which states that there is a statistical relationship between these three independent variables and the dependent variable of 911 calls. The Beta values for these three independent variables (found in the coefficients table above) are also the highest among all of the independent variables, which is another indicator that these variables are the most important in this stepwise multiple regression analysis. Another standardized residuals map is shown and explained below as Figure 9.
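The difference between the "B" and "Beta" columns discussed above can be shown with a small sketch. B is in the original units of each variable, while Beta rescales it by the standard deviations of the predictor and the dependent variable, which is what makes Beta values comparable across variables. The data and the B value here are invented for illustration, and the function name standardized_beta is hypothetical.

```python
import statistics

# Sketch: converting an unstandardized slope (B) to a standardized one (Beta).
# Beta = B * (standard deviation of X) / (standard deviation of Y)


def standardized_beta(b, x_values, y_values):
    """Rescale slope b so it is expressed in standard deviation units."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


# HYPOTHETICAL data and slope; not taken from the Portland output.
x_values = [10.0, 20.0, 30.0, 40.0, 50.0]
y_values = [30.0, 25.0, 60.0, 40.0, 70.0]
b = 0.95  # least-squares slope for these made-up values

beta = standardized_beta(b, x_values, y_values)
print(f"B = {b}, Beta = {beta:.3f}")
```

Because Beta is unit-free, a variable counted in thousands of people and one counted in dollars can be compared directly, which is why the Beta column is used above to rank the variables.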
Figure 9: This is the standardized residuals map created from the multiple regression analysis that was completed for this part of the assignment. The range of colors described for Figure 6 also applies to the legend and colors in this map, because both are standardized residuals maps. Blue areas on the map indicate where the regression over-predicted the number of 911 calls, and red areas indicate where the number of 911 calls was under-predicted by the multiple regression analysis. This map can be interpreted in the same way as the one in Figure 6, and all three maps produced in this assignment were helpful in determining the ideal location in Portland, Oregon for a new hospital. The map in this figure shows that in the central part of Portland, the number of 911 calls that were under-predicted is quite high. This same pattern was seen in Figure 6 (and Figure 5), and it indicates that central Portland (where the red areas are located) is where the construction company should choose to build the new hospital.
Conclusion:
SPSS was used throughout this assignment to run regression analyses, which were then interpreted to determine whether statistical relationships were present. In part 1 of the assignment, a regression analysis was used to test the relationship between free school lunches and crime rates in Town X. In part 2 of the assignment, a multiple regression and a stepwise multiple regression were run to determine how different independent variables influence the dependent variable of 911 calls in Portland. Three maps were generated in this assignment, all of which tie back to and help answer the study question at hand. The results of the multiple regression analysis and the maps generated in part 2 indicate that the best place for the construction company to build a new hospital in Portland, Oregon would be in the central area of the city (the red areas that are clustered in the central part of each map).