Friday, May 19, 2017

Assignment 6: Regression Analysis

Introduction:                          
The goals of this assignment were to run a regression in SPSS, interpret a regression output and predict results from given data, manipulate Excel data and perform joins in ArcGIS, create a standardized residuals map in ArcGIS, and connect the statistical and spatial outputs generated. There were two parts to this assignment. Part 1 used SPSS to test the relationship between crime and free school lunches in Town X. Part 2 used SPSS to perform a multiple regression analysis comparing a number of variables to 911 calls in Portland, Oregon. Three maps were also produced in Part 2, and are discussed below.



Part 1:
The following background information was given for Part 1 of this assignment. For each given neighborhood, the data include the percent of kids that get free lunch and the crime rate per 100,000 people. A study on crime rates and poverty was conducted for Town X. The local news station got hold of some data and claimed that as the number of kids who get free lunches increases, so does crime. You feel this might be a bit silly, so you run a regression with the given data. Using SPSS to do your regression, determine whether the news is correct. If a new area of town was identified as having 23.5% of kids on free lunch, what would the corresponding crime rate be? How confident are you in this result?

Figure 1: This is the SPSS output that was generated in the part 1 regression analysis of this assignment.

Results/questions answered:
Question 1: Is the prediction correct?

The R squared value in the SPSS output shown below was 0.173, which is very low. This shows that there is a weak relationship between crime and free school lunches in Town X. The independent variable in this problem was PerFreeLunch, and the dependent variable was crime in the town (seen in the table as CrimeRate). The 0.173 value indicates that PerFreeLunch does not explain CrimeRate very well; if the R squared value were closer to 1, this would support the claim that was made. However, only 17.3% of the variation in CrimeRate in Town X is explained by PerFreeLunch. Overall, the relationship between the variables is weak, making the news station's claim false.

Question 2: If a new area of town was identified as have 23.5% with a free lunch, what would the corresponding crime rate be? 

The regression equation takes the form y = a + bx, where a is the constant (intercept) and b is the slope (regression coefficient). The regression equation produced was y = 21.819 + 1.685x. In this case, the x value put into the equation was 23.5%, entered as the proportion 0.235, which gives y = 22.215, so this would be the predicted crime rate in Town X per 100,000 people. (If the percentage variable were instead coded in whole percentage points, x = 23.5 would give a predicted rate of about 61.4.)
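As a sketch, this prediction can be reproduced in a few lines of Python. The coefficients are taken from the SPSS output in Figure 1; entering the percentage as the proportion 0.235 follows the calculation above.

```python
# Prediction from the Part 1 regression line (coefficients from the SPSS
# output in Figure 1). The x value is assumed to be coded as a proportion
# (0.235), as in the calculation above.
intercept = 21.819
slope = 1.685

def predict_crime_rate(pct_free_lunch):
    """Predicted crime rate per 100,000 for a given free-lunch share."""
    return intercept + slope * pct_free_lunch

print(predict_crime_rate(0.235))   # ~22.215, matching the text
print(predict_crime_rate(23.5))    # ~61.4 if x were coded in whole percents
```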

Question 3: How confident are you in this result?

The null hypothesis in this situation is that there is no significant relationship between PerFreeLunch and CrimeRate, and the alternative hypothesis states that there is a significant relationship between PerFreeLunch and CrimeRate. The significance (p) value for this regression came out as .005, tested at the 95% confidence level. With this value being below .05, we reject the null hypothesis, concluding that there is a statistical relationship between PerFreeLunch and CrimeRate. So although the R squared output suggests the relationship is not a strong one, the hypothesis testing steps do reject the null hypothesis, indicating that there is some level of relationship between these variables.
I am not very confident in these results. The R squared value produced in this regression analysis was 0.173, which is very low and indicates a weak relationship between these variables: PerFreeLunch explains only 17.3% of the variation in CrimeRate in Town X. A statistically significant relationship (from the hypothesis test) can still be a weak one, so while there appears to be some relationship between the variables, it is not strong enough to support the news station's claim, and the predicted crime rate of about 22 per 100,000 should be treated with little confidence.


Part 2:
The instructions for this part of the assignment were as follows:
The following data was provided for you regarding 911 calls in Portland, OR.  The City of Portland is concerned about adequate responses to 911 calls.  They are curious what factors might provide explanations as to where the most calls come from.  A company is interested in building a new hospital and they are wondering how large an ER to build and the best place to build it.  While you cannot answer the size of the ER, you might be able to provide some idea as to the influences related to more or less calls and possibly where to build the hospital.  The following data is provided to you:

Calls (number of 911 calls per census tract), Jobs, Renters, LowEduc (Number of people with no HS Degree), AlcoholX (alcohol sales), Unemployed, ForgnBorn (Foreign Born Pop), Med Income, CollGrads (Number of College Grads) 

Step 1 Part 2:
Dependent variable: 911 Calls
Independent variables: AlcoholX, LowEduc, PopDensity

The three independent variables above were chosen and then used to run regression analyses in order to see how much each influences the dependent variable, which is 911 calls in Portland, Oregon. AlcoholX represents alcohol sales in Portland, LowEduc represents the number of people without a high school diploma, and PopDensity represents the population density per census tract. Using SPSS, a regression analysis was done for each independent variable to determine how much it influences the dependent variable.
The null hypothesis for this situation is that there is no statistical relationship between the independent variable (one of the three) and the dependent variable (911 calls). The alternative hypothesis is that there is a statistical relationship between the independent variable (one of the three) and the dependent variable of 911 calls.
The following four questions were answered for each regression run on each of the three independent variables.
Question 1: What are the relationships? 
Question 2: How strong of predictors are your independent variables?  How do you know?
Question 3: How much does the dependent variable change for ONE unit of change in each independent variable?
Question 4: Do you reject or fail to reject the null hypothesis?
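The per-variable regressions described above can be sketched outside SPSS with `scipy.stats.linregress`. The arrays below are small synthetic stand-ins for the Portland per-tract data, not the actual values, so only the workflow (slope, r squared, p-value, decision rule) carries over.

```python
# Sketch of a single-predictor regression like the ones run in SPSS.
# Synthetic stand-in data: 'calls' plays the role of 911 calls per tract,
# 'low_educ' a correlated predictor. Not the real Portland values.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
calls = rng.poisson(20, size=50).astype(float)            # dependent variable
low_educ = calls * 0.5 + rng.normal(0, 1.5, size=50)      # correlated predictor

res = linregress(low_educ, calls)   # x = predictor, y = 911 calls
print(f"slope={res.slope:.3f}  intercept={res.intercept:.3f}")
print(f"r^2={res.rvalue**2:.3f}  p-value={res.pvalue:.4f}")
# Decision rule used in the write-up: reject H0 when p < .05
print("reject H0" if res.pvalue < 0.05 else "fail to reject H0")
```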

 AlcoholX: The relationship between this independent variable and the dependent variable of 911 calls is positive. The following regression equation was produced using the output SPSS table (Figure 2): Y = 3.069E-5X + 9.590. The significance level for this output was .000, which is extremely low; since it is below .05, we reject the null hypothesis (and accept the alternative hypothesis). This says that there is a statistical relationship between AlcoholX and 911 calls in Portland. The R squared value from the regression was 0.152, a very low coefficient of determination, indicating that AlcoholX explains only 15.2% of the variation in 911 calls. For one unit of change in AlcoholX, 911 calls change by 3.069E-5. This variable is not a strong predictor in this case, due to its low R squared value.


Figure 2: This is the SPSS output for the regression analysis done between the independent variable AlcoholX and the dependent variable of 911 calls.

LowEduc: The relationship between the independent variable LowEduc and the dependent variable of 911 calls is positive. The regression equation for this analysis is: Y = 0.166X + 3.931. The value of 0.166 is the change in the dependent variable (911 calls) for one unit of change in LowEduc. The R squared value for this output was 0.567, which is fairly high and indicates that LowEduc explains 56.7% of the variation in 911 calls in Portland. This shows that LowEduc is a strong predictor of the dependent variable. The significance level calculated for this regression was .000, and because this value is less than .05, we reject the null hypothesis. By rejecting the null hypothesis, we conclude that there is a statistical relationship between LowEduc and 911 calls. Figure 3 shown below is the output table used for this analysis.




Figure 3: This is the regression analysis output for LowEduc (independent) and 911 calls (dependent).

PopDensity: The relationship between the independent variable PopDensity and the dependent variable of 911 calls is positive. The regression equation produced from the SPSS output (seen in Figure 4) is: Y = 21909.074X + 20.616. The value 21,909.074 is the change in the dependent variable for one unit of change in PopDensity. The R squared value for this regression was 0.004, which is extremely low and indicates that PopDensity does a very poor job of explaining 911 calls, accounting for only 0.4% of the variation. The significance level for this regression was 0.555, and because this is greater than .05, we fail to reject the null hypothesis. This says that there is no statistical relationship between PopDensity and 911 calls in Portland. PopDensity is not a strong predictor of 911 calls, due to the extremely low R squared value and the high significance level that led to failing to reject the null hypothesis.
Figure 4: This is the SPSS output from the regression analysis of PopDensity (independent) and 911 calls (dependent).


Step 2 Part 2 Choropleth Map and Residual Map:

In this part of the assignment, two maps were produced. The first was a choropleth map of the number of 911 calls per census tract in Portland, Oregon; it can be found below as Figure 5 and is explained there. The second was a standardized residual map, created using the independent variable from above with the highest R squared value, LowEduc. This residual map can be seen below as Figure 6 and is also explained there.



Figure 5: This is the map that was generated for the number of 911 calls per census tract in Portland. The largest numbers of calls, ranging from 57 to 176, were made in the cluster of tracts in the northern part of Portland (represented by the maroon color). A high number of 911 calls is also seen in one tract in the southeast corner of the map. The central part of Portland gets around 19-56 calls, indicated by the dark orange color. There is clustering of high call counts in the north part of Portland, and the central part of the city also has a somewhat high number of calls. This map can be used to help the construction company determine where (and how large) the new hospital should be. Since most of the 911 calls are concentrated in the northern (maroon) area of Portland, this would most likely be the best choice for where to build the new hospital.

Figure 6: This residual map shows the standard deviations of the residuals for LowEduc (the independent variable tested against 911 calls, the dependent). The yellow areas in the map indicate where LowEduc did a good job of explaining 911 calls; the more yellow areas, the better the prediction made by the LowEduc regression equation (Y = 0.166X + 3.931). The areas that are dark red and dark blue indicate the outliers, where the regression equation has either over-predicted (dark blue) or under-predicted (dark red) the dependent variable. The lighter blue and red areas indicate the same over- or under-prediction, but at a much lower level. The red areas, where LowEduc under-predicted 911 calls, run right through the middle of the city (ranging north to south), indicating that in the inner-city area LowEduc under-predicted the calls. The over-predictions (blue areas) are located more on the outer/surrounding tracts of the city.
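A minimal sketch of how the standardized residuals behind a map like this are computed: residual = observed minus predicted, divided by the residuals' standard deviation. The numbers below are illustrative, not the Portland data; the regression line is the LowEduc equation quoted above.

```python
# Standardized residuals, as mapped in a residual map:
# z_i = (observed_i - predicted_i) / s, with s the residuals' std dev.
import numpy as np

observed = np.array([12.0, 40.0, 7.0, 55.0, 20.0])    # 911 calls (illustrative)
low_educ = np.array([50.0, 210.0, 30.0, 300.0, 95.0])  # predictor (illustrative)

# Regression line from the LowEduc output quoted above (Figure 3)
predicted = 0.166 * low_educ + 3.931

residuals = observed - predicted
std_resid = residuals / residuals.std(ddof=1)

# Tracts with |z| beyond ~2 would show as the dark over/under colors
for z in std_resid:
    print(f"{z:+.2f}", "outlier" if abs(z) > 2 else "")
```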


Step 3 Part 2 Multiple Regression:
The following instructions were given for step 3 of this assignment:
 Please run a multiple regression report (in SPSS) with all the variables listed above (Calls (number of 911 calls per census tract), Jobs, Renters, LowEduc (Number of people with no HS Degree), AlcoholX (alcohol sales), Unemployed, ForgnBorn (Foreign Born Pop), Med Income, CollGrads (Number of College Grads)).  Turn on the Collinearity Diagnostics by clicking.  Is Multicollinearity present?  Explain the results provided by SPSS.  

This part of the assignment consisted of running a multiple regression with the variables listed in the instructions above. Collinearity diagnostics were also generated during the multiple regression analysis, and another standardized residuals map was generated based on the output. Figure 7 shows and explains the outputs for this multiple regression analysis.

Figure 7: This is the output that was generated using SPSS for the multiple regression analysis; the collinearity diagnostics can also be seen. The R squared value for this output is .783, which is quite high, indicating that the independent variables together explain 78.3% of the variation in 911 calls in Portland. The "Beta" column can be used to find out which independent variable had the most influence, which turns out to be LowEduc, with a Beta value of .614. The collinearity diagnostics table can also be interpreted to see if multicollinearity is present in this data set. For multicollinearity to be flagged, the condition index must exceed 30, and in this situation none of the dimensions had a condition index above 30, which indicates that multicollinearity is not present. The eigenvalues column can also be interpreted: eigenvalues close to 0 account for little variance, which means multicollinearity may be present and that you should investigate further to determine which variable may be causing it.
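The condition index SPSS reports can be sketched by hand: it is the square root of the ratio of the largest eigenvalue to each eigenvalue of the scaled cross-products matrix. A minimal sketch with synthetic data (two nearly collinear predictors plus one independent one):

```python
# Collinearity diagnostics sketch: condition index from the eigenvalues
# of the unit-scaled cross-products matrix. Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
Xs = X / np.linalg.norm(X, axis=0)          # scale columns to unit length
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]   # descending order
cond_index = np.sqrt(eigvals[0] / eigvals)

print("eigenvalues:", np.round(eigvals, 4))
print("condition indices:", np.round(cond_index, 2))
# Rule of thumb used above: condition index > 30 flags multicollinearity
print("multicollinearity suspected" if cond_index.max() > 30
      else "no strong multicollinearity")
```

With the near-duplicate predictor, the smallest eigenvalue collapses toward zero and the largest condition index climbs well past 30, which is exactly the pattern the diagnostics table is built to reveal.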

The next set of instructions for this part of the assignment are as follows: 
Now run a Stepwise approach and explain the results.  Which variable is the most important?  Explain using the SPSS output.  Go back to ArcGIS and make a Residual map using the variables selected from the Stepwise Approach.  Think how can this map help with the study question?

Stepwise Multiple Regression:


Figure 8: This is the stepwise multiple regression output for step 3 of this assignment. Below this output will be analyzed, along with responses to the questions posed in the instructions (seen above).

The stepwise method of multiple regression only retains independent variables that add significant explanatory power, which helps screen out redundant, collinear predictors and gives a more parsimonious result. The three independent variables that were most important and influential in this multiple regression were Renters, LowEduc, and Jobs. These three variables have a combined R squared value of 0.771, which is quite high and indicates that together they explain 77.1% of the variation in 911 calls. Looking at the "B" column in the output table above, each of these variables has a positive slope, and therefore a positive relationship with 911 calls. All three variables also have significance levels less than .05, so in each case we reject the null hypothesis (which states that there is no relationship between the variables). This means we accept the alternative hypothesis, which states that there is a statistical relationship between these three independent variables and the dependent variable of 911 calls. The Beta values for these three variables (found in the coefficients table above) are also the highest among all of the independent variables, another indicator that they are the most important in this stepwise analysis. Below, another standardized residuals map can be found (and is explained) as Figure 9.
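A minimal forward-stepwise sketch, assuming a simple R-squared-improvement entry rule rather than SPSS's exact entry/removal criteria, with synthetic data in which only two of four candidate predictors actually matter:

```python
# Forward stepwise selection sketch: at each step, add the candidate
# predictor that most improves R^2; stop when no candidate improves it
# by at least a threshold. Not SPSS's exact algorithm; synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)  # cols 0, 2 matter

def r_squared(cols):
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

selected, remaining, current = [], [0, 1, 2, 3], 0.0
while remaining:
    best_col = max(remaining, key=lambda c: r_squared(selected + [c]))
    best_r2 = r_squared(selected + [best_col])
    if best_r2 - current < 0.02:    # entry threshold (arbitrary choice)
        break
    selected.append(best_col)
    remaining.remove(best_col)
    current = best_r2

print("selected predictors:", selected)   # the informative columns enter first
```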
Figure 9: This is the standardized residuals map created from the multiple regression analysis completed for this part of the assignment. The range of colors described in Figure 6 also applies to the legend and colors in this map, because both are standardized residuals maps. Blue areas indicate where the regression over-predicted the 911 calls, and red areas indicate where the calls were under-predicted. This map can be interpreted the same way as the one in Figure 6, and all three maps produced in this assignment were helpful in determining the ideal location in Portland, Oregon to build a new hospital. The map in this figure shows that in the central part of Portland, the number of 911 calls was substantially under-predicted. The same pattern was seen in Figure 6 (and Figure 5), and it indicates that central Portland (where the red areas are clustered) is where the construction company should choose to build the new hospital.

Conclusion:
SPSS was used throughout this assignment to generate regression analyses, which were then interpreted to determine the presence of statistical relationships. In Part 1 of the assignment, regression was used to test the relationship between free school lunches and crime rates in Town X. In Part 2, a multiple regression and a stepwise multiple regression were done to determine how different independent variables influence the dependent variable of 911 calls in Portland. Three maps were generated in this assignment, all of which tie back to and help answer the study question at hand. The results of the multiple regression analysis and the maps generated in Part 2 indicate that the best place for the construction company to build a new hospital in Portland, Oregon would be in the central area of the city (the red areas clustered in the central part of each map).





Monday, April 24, 2017

Assignment 5: Correlation and Spatial Autocorrelation


Assignment 5 covered correlation and spatial autocorrelation and consisted of two parts. Part one focused on census tracts and population in Milwaukee, WI; SPSS was used to create a correlation matrix, which can be found below.
Figure 1

Part 1: Correlation
 
The correlation matrix displays various interesting patterns. For example, there is a strong correlation between the white population and the number of manufacturing employees, with a value of 0.735; being fairly close to 1, this indicates a more linear relationship than some of the others, such as the correlation between the black and white populations, which is -0.582. The negative sign indicates an inverse relationship. The Hispanic population's correlation with the number of manufacturing employees was 0.303, a moderate positive value, which looks high when compared to -0.221, the value for the relationship between the black population and the number of manufacturing workers. This sheds light on who holds manufacturing jobs in Milwaukee, suggesting that tracts with larger black populations tend to have fewer manufacturing workers than tracts with larger Hispanic or white populations. Another strong correlation can be seen between the white population and median household income, with a value of 0.585. This value becomes more striking when compared to the correlation between the black population and median household income, which is -0.417. The difference outlines the inverse association between the black population and median household income. Another interesting comparison involves the black and white populations and the number of retail employees: the white population had a value of 0.722, while the black population had a value of -0.152.
This is a strong correlation for the white population, suggesting that tracts with larger white populations have more retail employees, while the association is slightly negative for the black population. Overall, it can be inferred that the white population in Milwaukee tends to be associated with manufacturing, retail, and finance employment, and with higher median household income. Tracts with larger black populations tend to have significantly lower household income and weaker associations with finance, retail, and manufacturing employment. The Hispanic population is positively associated with manufacturing and retail jobs, but less so with finance positions. Median household income was also slightly higher for the Hispanic population than for the black population.
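A correlation matrix like Figure 1 can be reproduced outside SPSS with pandas. The column names and values below are made-up stand-ins for the Milwaukee tract variables, so only the mechanics carry over, not the numbers.

```python
# Pairwise Pearson correlation matrix, as SPSS produces.
# Illustrative stand-in data, not the Milwaukee census values.
import pandas as pd

df = pd.DataFrame({
    "white_pop":     [1200, 800, 1500, 300, 950],
    "black_pop":     [100, 600, 80, 900, 400],
    "manuf_emp":     [300, 180, 390, 60, 220],
    "median_income": [52000, 38000, 61000, 29000, 44000],
})

corr = df.corr()                             # Pearson by default
print(corr.round(3))
print(corr.loc["white_pop", "manuf_emp"])    # one cell of the matrix
```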
Part 2: Spatial Autocorrelation
 
The second part of assignment 5 focused on spatial autocorrelation, and GeoDa and SPSS were used to gather and analyze the data. For this question, data was given from the Texas Election Commission (TEC) for the 1980 and 2016 Presidential Elections. The data consisted of the percent of Democratic votes and the voter turnout for each election. The US Census FactFinder website was used to download the 2015 percent Hispanic population, along with a shapefile of Texas and all of its counties. With the data given and downloaded, the task was to analyze the patterns of each election and determine whether there is clustering of voting patterns in the state, as well as clustering in voter turnout. The TEC will provide the results to the governor in order to detect any changes in election patterns over the past 36 years.
Various spatial tools were used to answer this question, including SPSS, GeoDa, and the US Census FactFinder. To answer the question from the TEC, it must be determined whether there is spatial autocorrelation in the voting results of each election and in voter turnout. The first step was using the FactFinder website to download the necessary data and the Texas county shapefile. Once all of the data was gathered, the Hispanic population data and voting data were joined using ArcMap. The joined table was then exported as a new shapefile, since GeoDa requires the data in shapefile form. Once GeoDa was opened, a spatial weights file was created in order to test for spatial autocorrelation in both elections, voter turnout, and the Hispanic population. Rook contiguity was used for the contiguity weight, and GeoDa was then used to produce the Moran's I charts and LISA cluster maps for this data set. There were three variables to choose from when calculating the Moran's I and LISA maps: POLY_ID, SHAPE_AREA, and SHAPE_LENGTH. The three Moran's I charts that were created are shown below, and three LISA cluster maps were also created from the same three variables. Univariate Moran's I and LISA were both run in GeoDa, and the same weights file was used for all calculations. A discussion of the results can be found below the diagrams.
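What GeoDa computes as Moran's I can be sketched by hand for a tiny example. The formula below is the standard univariate Moran's I with contiguity weights; the "counties" here are a toy strip of four cells, not the Texas data.

```python
# Moran's I by hand: I = (n / S0) * (z' W z) / (z' z),
# where z are deviations from the mean, W the contiguity weights,
# and S0 the sum of all weights. Toy 1x4 strip of "counties".
import numpy as np

x = np.array([10.0, 8.0, 4.0, 2.0])   # attribute per cell (e.g. % Democratic)

# Rook-style contiguity for four cells in a row: each touches its neighbors
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

z = x - x.mean()
n, S0 = len(x), W.sum()
morans_i = (n / S0) * (z @ W @ z) / (z @ z)
print(round(morans_i, 4))   # 0.4: positive, neighbors tend to be similar
```

A positive value (here 0.4) means neighboring cells have similar values, which is the clustering the LISA maps then localize; values near zero, like the POLY_ID result below, mean no spatial pattern.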
Figure 2: POLY_ID Moran's I
Figure 3: SHAPE_AREA Moran's I
 
Figure 4: SHAPE_LENGTH Moran's I


Figure 5: POLY_ID LISA Cluster Map
 
Figure 6: SHAPE_AREA LISA Cluster Map
 
Figure 7: SHAPE_LENGTH LISA Cluster Map
 
 

The POLY_ID Moran's I chart shows essentially no relationship: the very low Moran's I of 0.0271803 indicates a weak spatial association for this variable.
The SHAPE_AREA Moran's I chart shows a positive relationship and a strong association. Most of the data points are clustered in one area of the chart, and while some outliers can be seen, the chart shows a strong association between neighboring values. The Moran's I value for this chart is 0.554015, significantly higher than the Moran's I calculated for POLY_ID, showing strong positive spatial autocorrelation.
The SHAPE_LENGTH Moran's I chart appears quite similar to the SHAPE_AREA chart and also shows a positive relationship. It shows a strong association, though not as strong as SHAPE_AREA's, because there are more outliers present and the data is not as tightly clustered. The Moran's I calculated for SHAPE_LENGTH is 0.49795, very close to the value found for SHAPE_AREA, showing the similarity between these two variables. Both SHAPE_LENGTH and SHAPE_AREA showed high spatial autocorrelation.
The POLY_ID LISA cluster map did not provide helpful results for the question at hand. This map came out quite scattered, and does not show any patterns of clustering or any indication of change in Democratic voters. This map shows high red values more towards the east of the map, possibly indicating some type of movement.
The SHAPE_AREA LISA cluster map showed a similar pattern to the one seen in the SHAPE_LENGTH map. The southwest corner of Texas appears to be the point of interest when it comes to Democratic votes. The blue areas indicating low values are a lot more scattered in this map compared to those in SHAPE_LENGTH. There is a high concentration of Democratic voters in the southwest corner counties of Texas, indicating that this should be an area of focus for the governor to look at.
The SHAPE_LENGTH LISA cluster map shows clustering in the southwest corner of Texas, indicating that a large number of people living in the red-highlighted counties voted Democratic in both elections; this provides an idea of where the Democratic voter population in Texas is located. In the blue-highlighted areas, there was a low amount of Democratic voter turnout, and these counties appear to be concentrated in the northeast corner of Texas. It can be noted that the southwest corner of Texas will be important for future Democratic votes.
In this part of the assignment, data was given from the Texas Election Commission for the 1980 and 2016 Presidential Elections: the percent Democratic votes and the voter turnout for both elections. Hispanic population data from 2015 was also downloaded from the US Census FactFinder website. The patterns in the data were analyzed using SPSS and GeoDa, producing the Moran's I charts and LISA cluster maps provided above, which were then analyzed to determine patterns of voting in Texas counties. There was a similar trend of clustering in the SHAPE_LENGTH and SHAPE_AREA LISA cluster maps, and a lot of similarity between the Moran's I values and charts of these two variables as well. The similar trend in the cluster maps indicated that there is a high number of Democratic voters located in the counties in the southwest corner of Texas (the dark red counties in the maps above). The POLY_ID data did not provide helpful results: the Moran's I value was quite low, the chart indicated essentially no relationship, and the LISA cluster map showed scattered values throughout the Texas counties. Overall, the governor should focus on the counties of Texas located in the southwest corner.


 



Wednesday, April 5, 2017

Assignment 4: Hypothesis Testing

The objectives of this assignment include:

  • Distinguish between a z or t test
  • Calculate a z and t test
  • Use the steps of hypothesis testing
  • Make decisions about the null and alternative hypotheses
  • Utilize real-world data connecting stats and geography


Part 1: Significance Testing with Z and T tests
1.

The chart above was completed using the data from columns B-D to calculate the values in columns E-G.
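The z and t statistics behind those columns can be sketched as follows. The formulas are the standard one-sample tests (z when the population sigma is known, t with the sample standard deviation); the numbers in the example call are illustrative, not taken from the chart.

```python
# One-sample test statistics: z uses a known population sigma,
# t uses the sample standard deviation s. Same form, different inputs.
import math

def z_stat(sample_mean, pop_mean, sigma, n):
    """z = (xbar - mu) / (sigma / sqrt(n)), sigma known."""
    return (sample_mean - pop_mean) / (sigma / math.sqrt(n))

def t_stat(sample_mean, pop_mean, s, n):
    """t = (xbar - mu) / (s / sqrt(n)), s estimated from the sample."""
    return (sample_mean - pop_mean) / (s / math.sqrt(n))

# Illustrative values: sample mean 105 vs hypothesized 100, spread 15, n = 36
print(z_stat(105, 100, 15, 36))   # (105 - 100) / (15 / 6) = 2.0
print(t_stat(105, 100, 15, 36))   # same arithmetic, compared to t table (df = 35)
```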


2. A Department of Agriculture and Livestock Development organization in Kenya estimates that yields in a certain district should approach the following amounts in metric tons (averages based on data from the whole country) per hectare: groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results:
                                    


  1. Test the hypothesis for each of these products.  Assume that each are 2 tailed with a Confidence Level of 95% *Use the appropriate test
  2. Be sure to present the null and alternative hypotheses for each as well as conclusions
  3. What are the probabilities values for each crop? 
  4. What are the similarities and differences in the results
Null Hypothesis-Ground Nuts: There is no difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Ground Nuts: There is a difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.

Conclusion-Ground Nuts: Reject the null hypothesis: there is a difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.


Null Hypothesis-Cassava: There is no difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Cassava: There is a difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.


Conclusion-Cassava: Reject the null hypothesis: there is a difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.

Null Hypothesis-Beans: There is no difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Beans: There is a difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.


Conclusion-Beans: Reject the null hypothesis: there is a difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.

Probabilities:
Ground Nuts: 0.78344
Cassava: 0.99144
Beans: 0.96403


Similarities and Differences:
-The test statistic that was calculated for ground nuts as well as the one calculated for cassava both fell on the left side of the negative critical value of -0.0192.
-Ground nuts, cassava, and beans are all similar because they all had the same conclusion, which was to reject the null hypothesis, indicating that there was a difference between each of these specific crop yields and the overall average yields for Kenya.
-All three crops are also similar because they all used the same two critical values to compare the test statistics to.
-One difference is that the probability for ground nuts was 0.78344, which was significantly lower than the probabilities found for cassava and beans.




3. A researcher suspects that the level of a particular stream’s pollutant is higher than the allowable limit of 4.2 mg/l.  A sample of n= 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4.  

Null Hypothesis: There is no difference between the allowable limit of stream pollutants in the particular stream and the mean pollutant level.


Alternative Hypothesis: There is a difference between the allowable limit of stream pollutants in the particular stream and the mean pollutant level.


Probability: 0.97403


Conclusion: Fail to reject the null hypothesis. There is no significant difference between the allowable limit of stream pollutants in the particular stream and the mean pollutant level.
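The test statistic for this problem can be reconstructed directly from the numbers given in the problem statement. A minimal sketch in Python (not part of the original SPSS/hand workflow, just an illustration of the one-sample formula):

```python
import math

# Given values from the problem statement
limit = 4.2   # allowable pollutant limit (mg/l), the hypothesized mean
xbar = 6.4    # sample mean pollutant level (mg/l)
s = 4.4       # sample standard deviation
n = 17        # sample size

# One-sample test statistic: (sample mean - hypothesized mean) / standard error
t = (xbar - limit) / (s / math.sqrt(n))
print(round(t, 4))  # ≈ 2.0616
```

Compared with a two-tailed 5% critical value of about 2.12 for 16 degrees of freedom, this statistic falls just short, which is consistent with failing to reject the null hypothesis; a one-tailed test (matching the researcher's directional suspicion) would lead to the opposite conclusion.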


Part 2
Using block group data (two shapefiles) for the City of Eau Claire and Eau Claire County, the hypothesis testing steps were followed in order to answer the study question: is the average value of homes for the City of Eau Claire block groups significantly different from the block groups for Eau Claire County? After noting the study question for this part of the assignment, the null and alternative hypotheses were created. The null hypothesis is: there is no significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County. The alternative hypothesis is: there is a significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County. The statistical test chosen for this problem was a z test, and it was a one-tailed test at the 0.05 significance level (95% confidence). The test statistic was then calculated using the z test equation and came out to be -2.572. Using this significance level and the standard statistical tables (areas under a normal distribution), the critical value was found to be 1.64. This value was placed on a graph and compared to the test statistic. The conclusion of this problem was to fail to reject the null hypothesis, stating that there is no significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County.
The map below was created to show the patterns of average home value for the City of Eau Claire and Eau Claire County. The concentration of the highest average home values appears to be centralized around the city area of Eau Claire, in the northwest corner of the county. As you move farther away from the city, the average home values appear to be much lower.
The general area around the main city in Eau Claire appears to be the most populated area in the county, which coincides with the results of the z test that concluded there is no difference between the average home value for the City of Eau Claire and that of homes in Eau Claire County. The similar patterns of average home value for the city and the county, which were centralized in the city/northwest corner of the map, help to further explain the results of this problem, which indicated no significant difference in value.
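The decision step described above can be sketched in Python, using the test statistic (-2.572) and one-tailed critical value (1.64) reported in this part. Only the comparison logic is illustrated here, not the SPSS calculation itself:

```python
z_stat = -2.572   # z test statistic for the home-value comparison (from the write-up)
critical = 1.64   # one-tailed critical value at the 0.05 significance level

# One-tailed decision rule: reject only if the statistic exceeds the critical value
if z_stat > critical:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(decision)  # fail to reject the null hypothesis
```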













Wednesday, March 8, 2017

Quantitative Methods Assignment 3: Z scores and Probability

The purpose of this assignment is to study the geography of foreclosures in Dane County, Wisconsin. To do so, foreclosures in the area from 2011 and 2012 were spatially analyzed. A map was created to analyze the spatial patterns of the foreclosures and answer the question of what the chance of foreclosures looks like for Dane County in 2013. In order to answer this question, data were placed in ArcGIS and manipulated with various tools in order to reveal the spatial patterns contained in the 2011 and 2012 census tract data, and to make predictions about foreclosure possibilities for Dane County in 2013.

 
Two different calculations were used when spatially analyzing the data set: z-scores and probability. A z-score is based on a normal distribution; it measures how many standard deviations an observation lies from the mean, and it can be converted into a probability. A z-score can therefore be used to determine how unusual an observation is within a data set. As a z-score increases, so does the cumulative probability of observing a value at or below it. Probability is the ratio of the frequency of a particular outcome to the frequency of all outcomes in an event. In this assignment, the data being analyzed came from county officials in Dane County: the addresses of foreclosures in the area were given, then geocoded and added to the census tracts for Dane County. The foreclosure data cover 2011 and 2012, and these two years will be used to predict the chance of foreclosures in the county in 2013.
The z-score was calculated for three census tracts in this assignment, tracts 122.01, 31, and 114.01. These calculations were completed for the 2011 and 2012 data sets. (See Figure 1 below) Two maps were also created that depict the pattern of house foreclosures in Dane County in 2011 and 2012 (Figure 2).

Figure 1:
Tract 122.01 (Mean: 11.392523, Standard deviation: 8.77303)
2011 Z Score: -0.614671
2012 Z Score: -0.614671 (unchanged from 2011)

Tract 31 Z Score-2011: 1.43072
2012 Z Score: 0.753158


Tract 114.01 Z Score-2011: 2.348958
2012 Z Score: 3.146858
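The tract 122.01 numbers in Figure 1 can be reproduced directly. This sketch assumes the tract's foreclosure count was 6, which is the value implied by the reported mean, standard deviation, and z-score (the count itself is not stated in the write-up):

```python
import math

mean = 11.392523   # mean foreclosures per tract (from Figure 1)
sd = 8.77303       # standard deviation (from Figure 1)
x = 6              # assumed count for tract 122.01, back-solved from the z-score

# Z-score: how many standard deviations the count lies from the mean
z = (x - mean) / sd
print(round(z, 5))  # ≈ -0.61467

# Cumulative probability from the standard normal CDF (using the magnitude of z)
p = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
print(round(p, 4))  # ≈ 0.7306, close to the 0.7291 read from a z table
```

The small gap between 0.7306 and the reported 0.7291 comes from table lookups rounding the z-score to two decimals.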

Figure 2:

The results of this assignment show the shift in where house foreclosures in Dane County were most concentrated in 2011 and 2012. In 2011, the concentration of house foreclosures was located in the northern part of the county; this pattern can be seen in the map on the left in Figure 2, where the dark maroon color indicates where the most house foreclosures were located. In 2012, this high concentration shifted to the east side of the county. The shift can be seen directly in Figure 2 by comparing the two maps and locating the clockwise movement of the high concentration area.
The z-score values calculated for tract 122.01 for 2011 and 2012 came out to be the same, indicating no change in the pattern of house foreclosures for this specific census tract. The z-scores calculated for tract 31 were the two scores closest to the mean in this data set. The z-score in 2011 for tract 31 was 1.43, and this value went down to 0.75 in 2012, meaning that the number of house foreclosures within this tract in 2012 was much closer to the mean of 11.39; a count this close to the mean is likely to recur for this tract within Dane County. The z-scores calculated for tract 114.01 were both quite high: 2.35 in 2011, increasing to 3.15 in 2012. This indicates that the house foreclosure counts in this census tract are farther from the mean than any of the other z-scores calculated in this assignment. The probability for the two z-scores for tract 122.01 is 0.7291. The probability for the z-score for tract 31 in 2011 is 0.9236, and for 2012 it is 0.7734. The probability for the z-score for tract 114.01 in 2011 is 0.9906, and for 2012 it is 0.9991.
If the patterns seen in 2012 for Dane County hold, the foreclosure counts for tract 122.01 (in both years) and for tract 31 in 2012 would be exceeded roughly 25% of the time, since their probabilities are near 0.73 and 0.77. The 2011 count for tract 31 and both counts for tract 114.01 would be exceeded far less often, about 8% of the time or less, because their much higher probabilities place them farther into the upper tail of the distribution.
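These exceedance percentages follow directly from the probabilities listed earlier: the chance that a count is exceeded is one minus its cumulative probability. A small sketch using the reported values (assumed here to be cumulative normal probabilities):

```python
# Reported cumulative probabilities for each tract and year
probs = {
    "122.01 (2011 & 2012)": 0.7291,
    "31 (2011)": 0.9236,
    "31 (2012)": 0.7734,
    "114.01 (2011)": 0.9906,
    "114.01 (2012)": 0.9991,
}

# Exceedance chance = 1 - cumulative probability
for tract, p in probs.items():
    print(f"Tract {tract}: exceeded about {round((1 - p) * 100, 1)}% of the time")
```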

The results of the assignment indicate a shift in the concentration of house foreclosures in Dane County, Wisconsin, between 2011 and 2012. The high concentration area of foreclosures shifts clockwise, from the north part of the county in 2011 to the east part of the county in 2012. This spatial pattern can be seen by comparing the two maps in Figure 2. The patterns of 2012, along with the high z-score calculations, suggest that this high concentration of house foreclosures will most likely continue to shift clockwise around the county. One possible explanation is that when foreclosures occur in one part of a county, many residents move to a different part of town, making that area more populated and leaving foreclosures behind. I think that the overall high z-score values found for the 2012 census tracts, and the high z-scores found in 2011 and 2012 for tract 114.01, are significant and helped in my prediction of what 2013 will look like.
The information gathered in this assignment will be useful to the county officials of Dane County, because seeing the 2011 to 2012 shift will be essential to estimating the 2013 spatial pattern. The z-score calculations and probabilities found in this assignment can also be applied to further research on the movement of the high concentration area of house foreclosures in Dane County. One recommendation I have is for county officials to prepare for house foreclosures to become more concentrated in the southern area of the county; based on the spatial patterns seen in 2011 and 2012, it is likely that the foreclosures will continue to shift clockwise. Another suggestion is to keep an eye on census tracts like 114.01, whose high z-score values flag unusually high foreclosure counts.

Monday, February 20, 2017

Quantitative Methods in Geography: Assignment 2-Descriptive Statistics and Mean Centers

The purpose of this assignment was to get familiar with various statistical methods and computer programs. The assignment required the use of a calculator and paper in order to hand-calculate standard deviations, and the use of MS Excel to compute statistics from a given data set. There were two parts to the assignment: part one was hand calculation of the data, and part two was calculating mean centers and weighted mean centers for the data given. The topic of the data studied in this assignment was cycling. The story behind this assignment is as follows: you are looking to invest a large sum of money in a cycling team. During the last race of the Tour de Geographia, the overall individual winner won $300,000, with 25% going to the team owner. The team that won gained $400,000 in a variety of ways, and 35% went to the team owner. With the skills and knowledge learned in quantitative methods, this cycling team data will be analyzed in order to decide which team to invest in. Typically, team Astana has produced a race winner; however, team Tobler has recently been making waves as an up-and-coming team in the cycling circuit. Using the race times for each team member, calculations for each team will be made regarding range, mean, median, mode, kurtosis, skewness, and standard deviation. Each of these terms is defined below.

Range: Range is the difference between the highest and lowest values in a data set (highest minus lowest).
Mean: Mean is another term for average (also known as sample mean). This is found by adding all of the values in a data set up to get the sum, and then dividing this response by the total number of values that are in the data set. Some examples of common uses for mean include batting averages and even the number of beers that you drink in a week.
Median: The median is the middle observation in your data set. When “n” is an odd number, you take the middle value in the ordered data set; when it is an even number, you take the average of the two middle values.
Mode: The mode of a data set is the most frequent occurring value within the set.
Kurtosis: Kurtosis refers to the relative peakedness or flatness of the distribution of your data set, compared to the normal distribution curve. If kurtosis for a data set is negative, the distribution is called platykurtic, meaning it is flat and spread out from the mean. If the kurtosis is positive, the distribution is called leptokurtic, meaning it has a sharper peak and heavier tails than the normal curve. Finding the kurtosis for a data set helps you figure out what the outliers are trying to say about the data set.
Skewness: Skewness measures the asymmetry of the distribution of a data set. If a data set has a skewness of 0, the distribution is symmetric. A positive skewness means the right tail is longer, with outliers pulling the distribution to the right; a negative skewness means the left tail is longer, with outliers pulling it to the left. Skewness helps you better understand your data set by showing which way the distribution curve is being pulled by the outliers.
Standard Deviation: The standard deviation tells you how tightly the values in a data set are clustered around the mean. Data that follow a normal distribution cluster around the mean, and the standard deviation describes that typical spread. Standard deviations also help us understand outliers and how they can influence a distribution curve by pulling the mean of a data set to either the left or the right.
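The statistics defined above can be computed for any small sample. A sketch using Python's statistics module, with the usual sample formulas for skewness and excess kurtosis (the same ones Excel's SKEW and KURT functions use); the race times here are made up for illustration, not the assignment's actual data:

```python
import statistics as st

times = [5, 7, 7, 8, 9]  # hypothetical values, not the actual race times

n = len(times)
rng = max(times) - min(times)   # range: highest minus lowest
mean = st.mean(times)
median = st.median(times)
mode = st.mode(times)
sd = st.stdev(times)            # sample standard deviation

# Sample skewness and excess kurtosis (Excel SKEW / KURT formulas)
skew = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / sd) ** 3 for x in times)
kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) \
    * sum(((x - mean) / sd) ** 4 for x in times) \
    - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(rng, mean, median, mode, round(sd, 4), round(skew, 4), round(kurt, 4))
```

For this sample the skewness is negative (a longer left tail) and the excess kurtosis is positive (leptokurtic), the same qualitative reading applied to the teams' race times below.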

Results:
After looking over the standard deviations that were calculated for both teams, I will be investing in team Tobler because the standard deviation for this team is smaller, meaning its race times cluster more tightly around the mean. This smaller standard deviation means that most of the observations in team Tobler’s data set fell closer to the mean than the racing times for team Astana did. I decided to pick team Tobler because, while they may be new to the cycling circuit, their team members’ race times proved to be more centered around the team’s mean race time. With this being said, Tobler’s race times are more consistent as a team, and this makes them a better investment choice when trying to decide which team will make you more money in the future. The use of standard deviations for both teams’ racing times was the most helpful tool when making my investment decision. While the average race time for team Astana was slightly quicker (2,276.67 minutes, versus 2,285.47 minutes for Tobler), the race times within Astana were not as close to the team’s average as they were for team Tobler. This can be seen by comparing Astana’s rather high standard deviation of 16.63 with the much lower standard deviation of 7.62 calculated for team Tobler’s racing times. The range of each team’s racing times was another statistic used to decide which team would be best to invest in. The range for team Astana was significantly higher than Tobler’s: 70 for Astana compared to 31 for Tobler. The smaller range for team Tobler shows that this team’s members all have a similar skill level, with race times much closer together and more similar to the team’s average than those of team Astana. Other statistical methods used to help make the decision include the skewness and kurtosis of each team’s race times.
While both teams had a negative skewness (Astana: -0.003, Tobler: -1.56), Astana’s data set was much less skewed than Tobler’s, meaning Astana’s race times are not as left-skewed as Tobler’s. Both data sets had a positive kurtosis value as well (Astana: 1.17, Tobler: 2.93), which indicates that both distributions have a high (leptokurtic) peak, with the peak of Tobler’s curve being a little higher. Skewness and kurtosis were not the main factors in deciding which team to invest in, but seeing these values for each team helped in visualizing the race times. Although team Astana appears to be the best choice at first glance, team Tobler turns out to be the better team to invest in due to its lower standard deviation and the lower range of its data set. The standard deviation calculations (along with the other calculations defined above) that were hand-written for each team are shown in the photograph below.
(Photograph of the hand-written standard deviation calculations for both teams.)

The goal of part two of this assignment was to calculate mean centers and weighted mean centers. Population data for Wisconsin counties from 2000 to 2015 were used to make the calculations. The geographic mean center of population at the county level was calculated for Wisconsin, along with the weighted mean centers of population for 2000 and 2015, which were weighted by population. The completed map showing the three mean center data points is shown below, along with definitions for geographic mean center and weighted mean center.
Geographic Mean Center: The mean center is a measure of central tendency that is also spatial. A measure of central tendency indicates the middle or center of a distribution (the mean, median, and mode are examples). The mean center is plotted on a Cartesian plane with X and Y coordinates, like latitude and longitude, and is constructed from the average of the X and Y values in a data set. The mean center answers the question: where is the center of the data?
Weighted Mean Center: The weighted mean center considers the frequencies of grouped data in a data set. Points are then weighted by the frequencies. 
As seen in the completed map, the weighted mean center for Wisconsin county populations shifted to the right (eastward) from 2000 to 2015, and both of the weighted mean centers are slightly above the mean center of this data set. This shift may have resulted from a change in where the majority of people in Wisconsin are located, perhaps toward a more urban and crowded area. It could also reflect some type of economic shift, or a shift in the housing market that caused a large number of people to move into a specific county. I found it interesting that the weighted mean centers of population for both 2000 and 2015 are located in one county, Wood County. It will be interesting to see how this weighted mean center shifts in years to come, and whether or not it will still be located within Wood County. Overall, the weighted mean center points are both located above the mean center point (green), and the population change from 2000 to 2015 can be seen as a shift to the right and slightly down (from the purple point to the red).
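The weighted mean center described above is just a population-weighted average of coordinates. A minimal sketch with made-up county centroids and populations (not the actual Wisconsin data), showing how the weighted center is pulled toward the high-population point:

```python
# Hypothetical county centroids (x, y) and populations, for illustration only
counties = [
    ((-91.0, 44.5), 100_000),
    ((-89.5, 43.1), 500_000),
    ((-88.0, 43.0), 950_000),
]

# Geographic mean center: simple average of the coordinates
n = len(counties)
mean_x = sum(x for (x, _), _ in counties) / n
mean_y = sum(y for (_, y), _ in counties) / n

# Weighted mean center: coordinates weighted by population
total_pop = sum(pop for _, pop in counties)
wx = sum(x * pop for (x, _), pop in counties) / total_pop
wy = sum(y * pop for (_, y), pop in counties) / total_pop

print((round(mean_x, 3), round(mean_y, 3)))
print((round(wx, 3), round(wy, 3)))
```

Because the easternmost point carries the largest population, the weighted mean center lands noticeably east of the unweighted mean center, which is the same mechanism behind the map's shift from 2000 to 2015.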