Monday, April 24, 2017

Assignment 5: Correlation and Spatial Autocorrelation


The topic of assignment 5 was correlation and spatial autocorrelation, and the assignment consisted of two parts. Part one focused on census tracts and population in Milwaukee, WI. In part one, SPSS was used to create a correlation matrix, which can be found below.
Figure 1

Part 1: Correlation
 
The correlation matrix displays several interesting patterns. For example, there is a strong positive correlation between the white population and the number of manufacturing employees. The value given for this correlation is 0.735, which is fairly close to 1, making the relationship between these two variables more linear than others, such as the correlation between the black and white populations, which is -0.582. That value indicates a somewhat weaker relationship, and its negative sign means the relationship runs in the opposite direction: as one population increases across tracts, the other tends to decrease. The Hispanic population’s relationship with the number of manufacturing employees had a value of 0.303, a weak to moderate positive correlation that still stands out next to -0.221, the value for the relationship between the black population and the number of manufacturing workers. This sheds light on who holds manufacturing jobs in Milwaukee, and suggests that manufacturing employment is less associated with the black population than with the Hispanic and white populations.
Another strong correlation can be seen in the relationship between the white population and median household income, with a value of 0.585. This value stands out when compared with the relationship between the black population and median household income, which is given as -0.417. The difference is striking: median household income tends to rise with the white population and fall with the black population. Another interesting comparison can be seen in the relationship between the black and white populations and the number of retail employees. The white population had a value of 0.722, while the black population had a value of -0.152. This is a strong correlation for the white population, and suggests that retail employment is closely associated with the white population and much less so with the black population.
Overall, the correlations suggest that the white population in Milwaukee tends to work in manufacturing, retail, and finance, and tends to have a higher median household income. The black population tends to have a significantly lower household income, and fewer members of that population work in finance, retail, and manufacturing. The Hispanic population does work in manufacturing and retail jobs, but not as many work in finance positions. Median household income was also slightly higher for the Hispanic population than for the black population.
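The values in the matrix are Pearson correlation coefficients. As a rough illustration of how one such value is computed, here is a minimal Python sketch. The tract counts are made-up numbers used only for illustration (they are not the Milwaukee data), and the function name `pearson_r` is a hypothetical helper, not something taken from SPSS.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical per-tract values, NOT the assignment data
white_population = [1200, 3400, 560, 4100, 2300]
manufacturing_employees = [150, 420, 90, 510, 260]

r = pearson_r(white_population, manufacturing_employees)
print(round(r, 3))
```

A result near 1 or -1 indicates a nearly linear relationship, a result near 0 indicates little linear relationship, and the sign gives the direction.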
Part 2: Spatial Autocorrelation
 
The second part of assignment 5 focused on spatial autocorrelation, and GeoDa and SPSS were used to gather data to be analyzed. For this question, data was given from the Texas Election Commission (TEC) for the 1980 and 2016 Presidential Elections. The data consisted of the percent of Democratic votes and the voter turnout for each election. The US Census fact finder website was used to download the percent Hispanic population for 2015, as well as a shapefile of Texas and all of its counties. With the data given and downloaded, the task was to analyze the patterns of each election and determine whether there is clustering of voting patterns and of voter turnout in the state. The TEC will provide the results to the governor in order to detect any changes in election patterns over the past 36 years.
Several tools were used to answer this question, including SPSS, GeoDa, and ArcMap, along with the US Census fact finder website. To answer the question from the TEC, it must be determined whether there is spatial autocorrelation in the voting results and the voter turnout for each election. The first step was to use the US Census fact finder website to download the necessary data and the Texas county shapefile. Once all of the data was gathered, the Hispanic population data and the voting data were joined together using ArcMap. The joined table was then exported as a new shapefile, because GeoDa requires the data to be in shapefile form. Once GeoDa was opened, a spatial weight was created in order to see whether there is spatial autocorrelation for both elections, voter turnouts, and Hispanic populations. Rook contiguity was used for the contiguity weight, and GeoDa was then used to produce the Moran’s I charts and LISA cluster maps for this data set. There were three variables to choose from when calculating the Moran’s I and LISA maps: POLY_ID, SHAPE_AREA, and SHAPE_LENGTH. The three Moran’s I charts that were created are shown below. Three LISA cluster maps were also created based on the same three variables. Univariate Moran’s I and univariate LISA were both used in GeoDa, and the same weight was used for all calculations. A discussion of the results can be found below the diagrams.
Figure 2: POLY_ID Moran's I
Figure 3: SHAPE_AREA Moran's I
 
Figure 4: SHAPE_LENGTH Moran's I


Figure 5: POLY_ID LISA Cluster Map
 
Figure 6: SHAPE_AREA LISA Cluster Map
 
Figure 7: SHAPE_LENGTH LISA Cluster Map
 
 

The POLY_ID Moran’s I chart shows a null relationship, meaning that there is essentially no spatial autocorrelation based on POLY_ID. This can be seen in the very low Moran’s I of 0.0271803, which is close to 0 and indicates a weak association.
The SHAPE_AREA Moran’s I chart shows a positive relationship and a strong association. Most of the data points are clustered in one area of the chart, and although some outliers can be seen, the overall pattern is one of strong association. The Moran’s I value for this chart is 0.554015, which is much higher than the Moran’s I calculated for POLY_ID.
The SHAPE_LENGTH Moran’s I chart appears to be quite similar to the SHAPE_AREA chart, and also shows a positive relationship. This chart shows a strong association, but not as strong as that of SHAPE_AREA, because there are more outliers present and the data points are not as clustered. The Moran’s I calculated for SHAPE_LENGTH is 0.49795, which is close to the value found for SHAPE_AREA, showing the similarity between these two variables. Both SHAPE_LENGTH and SHAPE_AREA showed high spatial autocorrelation.
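To make the Moran’s I values above more concrete, here is a minimal Python sketch of global Moran’s I with rook contiguity. It is only an illustration under stated assumptions: it uses a small rectangular grid instead of the Texas county shapefile, it uses simple binary (0/1) weights, and the helper names `rook_neighbors` and `morans_i` are hypothetical, not GeoDa functions.

```python
def rook_neighbors(rows, cols):
    """Map each grid cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            # Rook contiguity: up, down, left, right (no diagonals)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    adjacent.append(nr * cols + nc)
            neighbors[r * cols + c] = adjacent
    return neighbors

def morans_i(values, neighbors):
    """Global Moran's I using binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of products of deviations over every neighboring pair
    cross = sum(dev[i] * dev[j] for i, adj in neighbors.items() for j in adj)
    # Total weight: one unit for each directed neighbor link
    s0 = sum(len(adj) for adj in neighbors.values())
    total_sq_dev = sum(d * d for d in dev)
    return (n / s0) * (cross / total_sq_dev)

# Hypothetical grids, NOT the election data
clustered = [1, 1, 0, 0,
             1, 1, 0, 0]   # similar values sit next to each other
checkerboard = [1, 0,
                0, 1]      # every neighbor has the opposite value

print(morans_i(clustered, rook_neighbors(2, 4)))
print(morans_i(checkerboard, rook_neighbors(2, 2)))
```

Positive values indicate that similar values tend to be neighbors (clustering), values near 0 indicate no spatial pattern, and negative values indicate that neighbors tend to be dissimilar.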
The POLY_ID LISA cluster map did not provide helpful results for the question at hand. The map is quite scattered and does not show any patterns of clustering or any indication of change in Democratic voters. It does show high (red) values more towards the east of the map, possibly indicating some type of movement.
The SHAPE_AREA LISA cluster map showed a pattern similar to the one seen in the SHAPE_LENGTH map. The southwest corner of Texas appears to be the point of interest when it comes to Democratic votes. The blue areas indicating low values are much more scattered in this map than in the SHAPE_LENGTH map. There is a high concentration of Democratic voters in the southwest corner counties of Texas, indicating that this should be an area of focus for the governor.
The SHAPE_LENGTH LISA cluster map shows clustering in the southwest corner of Texas. This indicates that a large number of people living in the red highlighted counties voted Democratic in both elections, which provides an idea of where the Democratic voter population in Texas is located. The blue highlighted areas show where Democratic voter turnout was low, and these counties appear to be more in the northeast corner of Texas. The southwest corner of Texas will likely be important for future Democratic votes.
In this part of the assignment, data was given from the Texas Election Commission for the 1980 and 2016 Presidential Elections. The data was the percent Democratic votes and the voter turnout for both elections. Hispanic population data from 2015 was also downloaded from the US Census fact finder website. The patterns in the data were analyzed using SPSS and GeoDa, producing the Moran’s I charts and LISA cluster maps provided above. These charts and maps were then analyzed in order to determine patterns of voting in Texas counties. The SHAPE_LENGTH and SHAPE_AREA LISA cluster maps showed a similar trend of clustering, and the Moran’s I values and charts of these two variables were also very similar. The similar trend in the cluster maps indicated that there is a high number of Democratic voters located in the counties in the southwest corner of Texas (the dark red highlighted counties in the maps above). The POLY_ID data did not provide helpful results: the Moran’s I value was quite low and the chart indicated a null relationship. The LISA cluster map for POLY_ID showed scattered data throughout the Texas counties. Overall, the governor should focus on the counties located in the southwest corner of Texas.


 



Wednesday, April 5, 2017

Assignment 4: Hypothesis Testing

The objectives of this assignment include:

  • Distinguish between a z test and a t test
  • Calculate a z and t test
  • Use the steps of hypothesis testing
  • Make decisions about the null and alternative hypotheses
  • Utilize real-world data connecting stats and geography


Part 1: Significance Testing with Z and T tests
1.

The chart above was completed using the data from columns B-D to calculate the values in columns E-G.


2. A Department of Agriculture and Livestock Development organization in Kenya estimates that yields in a certain district should approach the following amounts in metric tons per hectare (averages based on data from the whole country): groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results:
                                    


  1. Test the hypothesis for each of these products. Assume that each is two-tailed with a confidence level of 95%. Use the appropriate test.
  2. Be sure to present the null and alternative hypotheses for each, as well as conclusions.
  3. What are the probability values for each crop?
  4. What are the similarities and differences in the results?
Null Hypothesis-Ground Nuts: There is no difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Ground Nuts: There is a difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.

Conclusion-Ground Nuts: Reject the null hypothesis; there is a difference between the yield of ground nuts in the certain district in Kenya and the overall averages of crop yields in Kenya.


Null Hypothesis-Cassava: There is no difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Cassava: There is a difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.


Conclusion-Cassava: Reject the null hypothesis; there is a difference between the yield of cassava in the certain district in Kenya and the overall averages of crop yields in Kenya.

Null Hypothesis-Beans: There is no difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.


Alternative Hypothesis-Beans: There is a difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.


Conclusion-Beans: Reject the null hypothesis; there is a difference between the yield of beans in the certain district in Kenya and the overall averages of crop yields in Kenya.

Probabilities:
Ground Nuts: 0.78344
Cassava: 0.99144
Beans: 0.96403


Similarities and Differences:
-The test statistics calculated for ground nuts and for cassava both fell to the left of the negative critical value of -0.0192.
-Ground nuts, cassava, and beans are all similar because they all had the same conclusion, which was to reject the null hypothesis that there was no difference between each of these specific crop yields and the overall average yields for Kenya.
-All three crops are also similar because their test statistics were compared to the same two critical values.
-One difference is that the probability for ground nuts was 0.78344, which was noticeably lower than the probabilities found for cassava and beans.




3. A researcher suspects that the level of a particular stream’s pollutant is higher than the allowable limit of 4.2 mg/l.  A sample of n= 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4.  

Null Hypothesis: There is no difference between the allowable limit of stream pollutants in the particular stream and the mean pollutant level.


Alternative Hypothesis: There is a difference between the allowable limit of stream pollutants in the particular stream and the mean pollutant level.


Probability: 0.97403


Conclusion: Fail to reject the null hypothesis. There is no significant difference between the allowable limit of stream pollutants and the mean pollutant level in the particular stream.
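The test statistic for this question can be reproduced from the numbers given in the problem. The Python sketch below is a minimal illustration, assuming a one-sample t test with n - 1 = 16 degrees of freedom and a 95% confidence level. The two critical values are read from a standard t table, and the function name `one_sample_t` is a hypothetical helper.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t statistic for a one-sample test of a mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Values stated in the problem
t_stat = one_sample_t(sample_mean=6.4, hypothesized_mean=4.2,
                      sample_sd=4.4, n=17)

# Critical values from a standard t table for 16 degrees of freedom
# at a 95% confidence level (assumed level, not stated in the problem)
T_CRIT_TWO_TAILED = 2.120
T_CRIT_ONE_TAILED = 1.746

reject_two_tailed = abs(t_stat) > T_CRIT_TWO_TAILED
reject_one_tailed = t_stat > T_CRIT_ONE_TAILED

print(round(t_stat, 3))  # 2.062
print("two-tailed:", "reject" if reject_two_tailed else "fail to reject")
print("one-tailed:", "reject" if reject_one_tailed else "fail to reject")
```

The statistic falls between the two critical values, so the decision depends on whether the test is treated as two-tailed (a difference in either direction) or one-tailed (higher than the limit).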


Part 2
Using block group data (2 shapefiles) for the City of Eau Claire and Eau Claire County, the hypothesis testing steps were followed in order to answer the study question: is the average value of homes for the City of Eau Claire block groups significantly different from the block groups for Eau Claire County? After noting the study question for this part of the assignment, the null and alternative hypotheses were created. The null hypothesis is that there is no significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County. The alternative hypothesis is that there is a significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County.
The statistical test chosen for this problem was a z test, and it was a one-tailed test with a confidence level of 95%. The test statistic was then calculated using the z test equation and came out to be -2.572. Using the 95% confidence level and the standard statistical table (areas under a normal distribution), the critical value was found to be 1.64. This value was placed on a graph and compared to the test statistic. The conclusion of this problem was to fail to reject the null hypothesis, which states that there is no significant difference between the average value of homes for the City of Eau Claire and the block groups for Eau Claire County.
The map below was created to show the patterns of average home value for the City of Eau Claire and Eau Claire County. The highest average home values appear to be concentrated around the city area of Eau Claire, in the northwest corner of the county. Farther away from the city, the average home values appear to be much lower.
The general area around the main city in Eau Claire appears to be the most populated area in the county, which coincides with the results of the z test that concluded there is no difference between the average home value for the City of Eau Claire and that of homes in Eau Claire County. The similar patterns of average home value for the city and the county, which were centralized in the city/northwest corner of the map, help to further explain the results of this problem, which indicated no significant difference in value.
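The comparison of the test statistic to the critical value can be sketched in Python as follows. This is a minimal illustration that assumes an upper-tailed decision rule, since the direction of the one-tailed test is not stated above. The helper names are hypothetical, and `z_statistic` is included only to show the usual form of the z equation; its inputs are not the Eau Claire data.

```python
import math

def z_statistic(sample_mean, population_mean, population_sd, n):
    """Usual z statistic for a sample mean against a population mean."""
    return (sample_mean - population_mean) / (population_sd / math.sqrt(n))

def reject_upper_tail(z, critical_value):
    """Upper-tailed rule (assumed): reject only if z exceeds the critical value."""
    return z > critical_value

# Values reported in the text above
REPORTED_Z = -2.572
CRITICAL_Z = 1.64

decision = "reject" if reject_upper_tail(REPORTED_Z, CRITICAL_Z) else "fail to reject"
print(decision)  # fail to reject
```

Under this assumed rule, the reported statistic of -2.572 does not exceed 1.64, so the null hypothesis is not rejected.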