Monday, February 20, 2017

Quantitative Methods in Geography: Assignment 2-Descriptive Statistics and Mean Centers

The purpose of this assignment was to become familiar with various statistical methods and computer programs. It required a calculator and paper to calculate standard deviations by hand, and MS Excel to compute statistics from a given data set. There were two parts to this assignment: part one was hand calculations on the data, and part two was calculating mean centers and weighted mean centers for the data given. The topic of the data studied in part one was cycling. The story behind this assignment is as follows: you are looking to invest a large sum of money in a cycling team. During the last race of the Tour de Geographia, the overall individual winner won $300,000, with 25% going to the team owner. The team that won gained $400,000 in a variety of ways, and 35% went to the team owner. With the skills and knowledge learned in quantitative methods, this cycling team data will be analyzed in order to decide which team to invest in. Typically, team Astana has produced a race winner; however, team Tobler has recently been making waves as an up-and-coming team in the cycling circuit. Using the race times of each team's members, calculations will be made for each team regarding range, mean, median, mode, kurtosis, skewness, and standard deviation. Each of these terms is defined below.

Range: Range can be defined as the lowest value in a data set subtracted from the highest value in the data set.
Mean: Mean is another term for average (also known as the sample mean). It is found by adding up all of the values in a data set to get the sum, and then dividing this sum by the total number of values in the data set. Some examples of common uses for the mean include batting averages and even the number of beers that you drink in a week.
Median: The median is the middle observation in your data set once the values are sorted. When “n” is an odd number, you take the middle value in the data set; when it is an even number, you take the average of the two middle values.
Mode: The mode of a data set is the most frequently occurring value within the set.
Kurtosis: Kurtosis refers to how peaked or flat the distribution of your data set is when compared to the normal distribution curve. If kurtosis for a data set is negative, the distribution is called platykurtic, which means that it is flatter and more spread out from the mean than a normal curve (as a rule of thumb, any value less than -1 is platykurtic). If the kurtosis is positive, the distribution is called leptokurtic, which means that it has a higher, sharper peak and heavier tails than a normal curve (any value greater than 1 is leptokurtic). Finding the kurtosis for a data set helps you figure out how much the outliers are influencing the data set.
Skewness: Skewness measures the symmetry of the distribution of data in a set around its mean. If a data set has a skewness of 0, the distribution is symmetric. Skewness can be either positive or negative, and the sign shows which way the distribution curve for your data set is being pulled by the outliers: a positive value means a longer tail to the right, and a negative value means a longer tail to the left.
Standard Deviation: The standard deviation is a measure of spread that tells you how tightly the values in a data set are clustered around the mean. A small standard deviation means most values fall close to the mean, while a large one means they are spread farther from it. It is often interpreted alongside the normal distribution, which describes data that clusters around the mean. Standard deviations also help us understand outliers and how they can influence a distribution curve by pulling the mean (or average) of a data set to either the left or the right.
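To make the definitions above concrete, here is a minimal Python sketch of the same statistics. The race times in it are hypothetical and are not the assignment's data. The skewness and kurtosis functions follow the sample-adjusted formulas used by Excel's SKEW and KURT functions, which is an assumption about how the Excel values in this post were produced; the function names are my own.

```python
import statistics


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


def sample_skewness(values):
    """Sample skewness (same formula as Excel's SKEW). Needs n >= 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(values):
    """Sample excess kurtosis (same formula as Excel's KURT). Needs n >= 4."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def describe(values):
    """Collect the descriptive statistics defined in this post."""
    return {
        "range": value_range(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": statistics.stdev(values),
        "skewness": sample_skewness(values),
        "kurtosis": sample_excess_kurtosis(values),
    }


# Hypothetical race times in minutes (NOT the assignment's data).
times = [2260, 2271, 2275, 2275, 2280, 2290, 2301]

for name, value in describe(times).items():
    print(f"{name}: {value:.2f}")
```

Running the same function on each team's times gives values that can be compared side by side, which is the comparison made in the results below.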

Results:
After looking over the standard deviations that were calculated for both teams, I will be investing in team Tobler because this team has the smaller standard deviation. This smaller value means that most of the observations in team Tobler’s data set fell closer to the mean/average than the racing times for team Astana did. I decided to pick team Tobler because, while they may be new to the cycling circuit, their team members’ race times proved to be more centered around the team’s mean/average race time. In other words, Tobler’s race times are more consistent as a team, and this makes them a better investment choice when trying to decide which team will make you more money in the future. The standard deviations of both teams’ racing times were the most helpful tool that I used when making my investment decision. While the average race time for team Astana was slightly quicker (2,276.67 minutes, compared with 2,285.47 minutes for Tobler), the team members’ race times within Astana were not as close to the team’s average time as they were for team Tobler. This can be seen by comparing Astana’s rather high standard deviation of 16.63 with the much lower standard deviation of 7.62 that was calculated for team Tobler’s racing times. The range of each team’s racing times was another statistic that was used to decide which team would be best to invest in. The range for team Astana was more than twice that of Tobler: 70 for Astana compared with 31 for Tobler. The smaller range for team Tobler shows that this team’s members all have a similar skill level, with race times that are much closer together and closer to the team’s average time than those of team Astana. Other statistics that were used to help decide which team to back include the skewness and kurtosis of each team’s race times.
While both teams had a negative skewness (Astana: -0.003, Tobler: -1.56), Astana’s data set is nearly symmetric, while Tobler’s is clearly skewed to the left. Both data sets had a positive kurtosis value as well (Astana: 1.17, Tobler: 2.93), which indicates that both data sets have a high (leptokurtic) peak on their distribution curves, with the peak on Tobler’s curve being higher. Skewness and kurtosis were not the main factors used for deciding which team to invest in, but seeing these values for each team helped in visualizing the race times and deciding which team is best to put money on. Although team Astana appears to be the best choice at first glance at both teams’ race times, team Tobler ends up being the best team to invest in due to the lower standard deviation and the smaller range of the team’s data set. The standard deviation calculations (along with the other calculations that were defined above) that were written out by hand for each team are shown in the photograph below.
[Photograph of the hand calculations: IMG_6994.JPG]

The goal of part two of this assignment was to calculate mean centers and weighted mean centers. Population data for Wisconsin counties from 2000 and 2015 were used to make the calculations. The geographic mean center of the Wisconsin counties was calculated, along with the weighted mean center of population for 2000 and for 2015, each weighted by county population. The completed map that shows the three mean center data points is shown below, along with definitions for geographic mean center and weighted mean center.
Geographic Mean Center: Mean center is a measure of central tendency that is also spatial. A measure of central tendency is a measure that indicates the middle or center of a distribution (these include mean, median, and mode). Mean center is tied to a Cartesian plane, which uses X and Y coordinates, like those of longitude and latitude. Mean center is constructed from the average of the X values and the average of the Y values included in a data set. Mean center answers the question: where is the center of the data?
Weighted Mean Center: The weighted mean center considers the frequencies of grouped data in a data set. Each point is weighted by its frequency (here, county population), so points with larger weights pull the center toward themselves.
As seen in the completed map, the weighted mean center for Wisconsin county populations shifted to the right from 2000 to 2015, and both of the weighted mean centers are slightly above the mean center of this data set. This shift to the right of the weighted mean center may have resulted from a shift in where the majority of people in Wisconsin are located, which may be in more urban and crowded areas. It could also have resulted from some type of economic shift, or some kind of shift in the housing market that caused a large number of people to move in that direction. I found it interesting that the weighted mean centers of population for 2000 and 2015 are both located in one county, Wood County. It will be interesting to see how this weighted mean center shifts in years to come, and I wonder whether or not it will still be located within Wood County. Overall, the weighted mean center points are both located above the mean center point (green), and the population change from 2000 to 2015 can be seen as a shift to the right and a little bit down (from the purple point to the red).
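The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the county coordinates and populations are hypothetical, the function names are my own, and the actual points on the map were not produced with this code.

```python
def mean_center(points):
    """Mean center: average of the X values and average of the Y values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


def weighted_mean_center(points, weights):
    """Weighted mean center: each coordinate is multiplied by its weight
    (for example, county population) before averaging."""
    total = sum(weights)
    mean_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (mean_x, mean_y)


# Hypothetical county centroids (x, y) and populations, NOT real Wisconsin data.
centroids = [(1.0, 5.0), (4.0, 4.0), (6.0, 1.0), (3.0, 2.0)]
population_2000 = [10_000, 25_000, 90_000, 15_000]
population_2015 = [9_000, 27_000, 110_000, 16_000]

print("mean center:", mean_center(centroids))
print("weighted 2000:", weighted_mean_center(centroids, population_2000))
print("weighted 2015:", weighted_mean_center(centroids, population_2015))
```

Because the unweighted mean center depends only on the county locations, it is the same for both years; only the weighted centers move when the population changes.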

Wednesday, February 1, 2017

Quantitative Methods in Geography, Assignment 1

The goals of assignment one are as follows: to differentiate between levels of measurement, to differentiate between classification methods, to retrieve data from the U.S. Census and join it to spatial data, and to enhance cartographic knowledge. The first part of this assignment was to define the difference between nominal, ordinal, interval, and ratio data. Nominal data is data that names objects, such as state names. Nominal data is usually shown as a label of some kind. With nominal data, each feature has its own value, and the values do not carry any quantitative meaning. Single symbol maps are used when representing nominal data. Below is an example map that shows nominal data.
[Figure: example map showing nominal data]
Source: https://web.natur.cuni.cz/~langhamr/lectures/vtfg1/mapinfo_2/barvy/colors.html
Ordinal data is a categorical data type in which the categories have an order. It may consist of some type of ranking from low to high, it could range from village to town to city, or it could be shown through variations in symbol color and size, all in order to indicate an increase in value. Ordinal data scales usually measure non-numeric concepts. Ordinal data is represented with unique values maps. Below is an example map that shows ordinal data.
[Figure: example map showing ordinal data]
Source: https://www.e-education.psu.edu/natureofgeoinfo/book/export/html/1553
An interval scale is a regular numeric scale. In this case, the order of the values is known along with the exact differences between the values, unlike with ordinal data. Some good examples of interval data include the Celsius temperature scale, the pH scale, and time of day, because in all of these cases the increments between values are known and measurable (as well as consistent). One thing to remember about interval data is that it does not have a “true zero” or absolute zero. Without an absolute zero, ratios cannot be computed; for example, 20 degrees Celsius is not twice as warm as 10 degrees Celsius. Addition and subtraction can be done with interval data. Interval data is represented with quantities maps. Below is an example map that shows interval data.
[Figure: example map showing interval data]
Source: http://support2.dundas.com/OnlineDocumentation/RSMap/DesigningMaps.html
Ratio data tell the order of the values and the exact difference between values, and they also have an absolute zero. Some examples of ratio data include height, weight, population, and rainfall. Having an absolute zero allows a wide range of descriptive and inferential statistics to be applied to ratio data. Addition, subtraction, multiplication, and division can all be done with ratio data. Quantities maps are used to represent ratio data. Below is an example map showing ratio data.
[Figure: example map showing ratio data]
Source: http://sites.uci.edu/randersonlab/available-data-2/

The goal of part two of this assignment is to provide maps that will be presented to potential clients by a new hire at an agriculture consulting/marketing company. The company is interested in increasing the number of women who are the principal operators of farms. The company should concentrate its message in areas that women tend to visit often, and in areas where farmers may go in their leisure time. Bringing this message to places where women and farmers commonly spend leisure time would be an effective way to draw them in to look at the message. Three maps will be created for this project, one for each classification method: equal interval based on range, quantile, and natural breaks. Equal interval based on range is a classification method where each class has an equal range of values. This can be used when data is distributed evenly. The quantile classification method is when each class has about the same number of features. The natural breaks method is when data values that cluster are placed into a single class, and class breaks are set where there are gaps between the clusters. This method can be used when data is distributed unevenly. Once these maps are completed, the next step will be to decide which map would be best for the potential clients to see and to explain why this is the best choice.
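The difference between the equal interval and quantile methods described above can be sketched in Python. This is a simplified illustration, not how ArcMap computes its classes: the function names and the county values are hypothetical, and natural breaks is left out because it requires an optimization step (commonly the Jenks method) that is too long for a short example.

```python
import bisect
import math


def equal_interval_breaks(values, k):
    """Upper class limits for k classes that each span the same range of values."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [low + width * i for i in range(1, k)] + [high]


def quantile_breaks(values, k):
    """Upper class limits for k classes that each hold about the same number of features."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[math.ceil(i * n / k) - 1] for i in range(1, k)]
    return cuts + [ordered[-1]]


def class_counts(values, breaks):
    """Count how many values fall in each class (value <= upper limit)."""
    counts = [0] * len(breaks)
    for v in values:
        counts[bisect.bisect_left(breaks, v)] += 1
    return counts


# Hypothetical, unevenly distributed county values (NOT the census data).
values = [1, 2, 3, 4, 5, 6, 7, 8, 13, 40, 90, 100]

for name, breaks in [
    ("equal interval", equal_interval_breaks(values, 3)),
    ("quantile", quantile_breaks(values, 3)),
]:
    print(name, "breaks:", breaks, "counts:", class_counts(values, breaks))
```

With skewed values like these, equal interval puts most features into the lowest class, while quantile spreads them evenly across classes, which is why the two methods can produce very different-looking maps from the same data.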

The first step in this process was to gather data from the census fact finder website. The dataset chosen was 2010 SF1 100% Data, and the geography was set to all counties in Wisconsin. After the data was located, the next step was to download the shapefiles from this page into my folder for this assignment in order to begin the project. The next step was to open ArcMap, prepare the Excel document so it could be used in ArcGIS, and add the Wisconsin shapefile that was just downloaded to the map. The “add data” button in ArcMap was then used to add “Sheet 1” of the Excel document to the GIS so that the data could be joined. To join Sheet 1 to the Wisconsin shapefile, the field in the layer that the join is based on was set to Geo_ID, the table to join to this layer was set to Sheet 1, and the field in the table to base the join on was set to Geo_ID. The next step in this project was to change the coordinate system to the “USA Contiguous Albers Equal Area Conic” projection. Three maps were then created based on female farm operation in Wisconsin counties, and the final product can be seen below.
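The join step described above was done through the ArcMap interface, but the idea behind it can be sketched in plain Python: every county feature is matched to the table row that has the same Geo_ID value. The records, the IDs, and the "female_operators" field below are hypothetical, and the function name is my own.

```python
def join_on_key(features, table_rows, key="Geo_ID"):
    """Attach the fields of the matching table row to each feature.

    Features with no matching row are kept unchanged, similar to a
    join that keeps all records of the target layer.
    """
    lookup = {row[key]: row for row in table_rows}
    joined = []
    for feature in features:
        merged = dict(feature)
        match = lookup.get(feature[key], {})
        for field, value in match.items():
            if field != key:
                merged[field] = value
        joined.append(merged)
    return joined


# Hypothetical shapefile attributes and spreadsheet rows (NOT the census data).
county_features = [
    {"Geo_ID": "G001", "name": "County A"},
    {"Geo_ID": "G002", "name": "County B"},
    {"Geo_ID": "G003", "name": "County C"},
]
sheet_1 = [
    {"Geo_ID": "G002", "female_operators": 120},
    {"Geo_ID": "G001", "female_operators": 85},
]

for record in join_on_key(county_features, sheet_1):
    print(record)
```

The order of the rows in the table does not matter, because the match is made on the shared key field rather than on row position.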


I think that the map that should be shown to potential clients is the quantile classification map. This map shows the greatest range of values, which can be seen from the large area of maroon that covers this map compared to the maps from the other two classification methods. With more of a range of colors, which represent different values of female operators in each county, this map will be easier for clients to read and understand. The quantile classification map also gives the reader a good idea of where female farm operators are most concentrated in Wisconsin counties. When looking at the quantile map, it can be seen that female farmers are most concentrated in central and southern Wisconsin. Because of this, it makes sense that potential clients should direct their marketing to these general areas in order to reach the largest number of female farm operators. The variation in the value colors was the greatest for the quantile classification map of the three maps created in this project, and this is why it is the best choice to present to a potential client.