This assignment will take a number of weeks to complete and will require a good understanding of data science and management for successful completion. It is imperative that students take heed of the following points in relation to doing this assignment:
• Ensure that you clearly understand the requirements for the assignment – what has to be done and what are the deliverables.
• If you do not understand any of the assignment requirements – Please ASK the course coordinator or your tutor.
• Each time you work on any aspect of the assignment, reread the assignment requirements to ensure that what is required is clearly understood.
Answer:
Introduction
Unemployment is the situation in which people are actively seeking employment but cannot find it. The unemployment rate is calculated as the number of people seeking employment divided by the labor force.
The unemployment rate has its own impact on a country's growth, and various factors affect it. The unemployment rate is generally higher in developing countries than in developed countries. In this study, we examine the unemployment rate in Australia and New Zealand. We also examine the relationship between school enrollment, tertiary (% gross) and the unemployment rate. Finally, we group countries by unemployment rate using k-means cluster analysis.
This study may be useful for sociologists, demographers, researchers and academicians. The data were collected from the World Bank (https://databank.worldbank.org).
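The definition above can be written as a one-line calculation. The sketch below is illustrative only: the function name unemployment_rate and the example figures are hypothetical and are not taken from the World Bank data.

```r
# Unemployment rate (%) = people seeking employment / labor force * 100.
# The function name and the numbers below are hypothetical illustrations.
unemployment_rate <- function(unemployed, labor_force) {
  stopifnot(labor_force > 0, unemployed >= 0, unemployed <= labor_force)
  100 * unemployed / labor_force
}

# Example: 60 job seekers in a labor force of 1000 people
example_rate <- unemployment_rate(60, 1000)
print(example_rate)
```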
 Data Setup
The data file is saved in CSV (comma-separated values) format and is read into R as
# Load the data
> DATA=read.csv("DATA.csv", header = TRUE)

The data file contains 962 rows and 19 columns.
> dim(DATA)
[1] 962  19
The structure of the data can be inspected as
> str(DATA)
 Exploratory Data Analysis
We used the dplyr library for the required data extraction. First, the library is loaded as
# Library for the required data extraction
> library(dplyr)
The unemployment rate and school enrollment, tertiary (% gross) for the given years are extracted from the dataset as
# Data extraction
# Unemployment rate in Australia from 2001 to 2014
> UER_AUS=na.omit(as.numeric(t(filter(DATA, Series.Code=="SL.UEM.TOTL.ZS", Country.Code=="AUS")[,5:18])))
> UER_AUS
 [1] 6.8 6.4 5.9 5.4 5.0 4.8 4.4 4.2 5.6 5.2 5.1 5.2
[13] 5.7 6.0
# Unemployment rate in New Zealand from 2001 to 2014
> UER_NZL=na.omit(as.numeric(t(filter(DATA, Series.Code=="SL.UEM.TOTL.ZS", Country.Code=="NZL")[,5:18])))
> UER_NZL
 [1] 5.4 5.3 4.8 4.0 3.8 3.9 3.7 4.2 6.1 6.5 6.5 6.9
[13] 6.2 5.6
# School enrollment, tertiary (% gross) in Australia from 2001 to 2014
> SET_AUS=na.omit(as.numeric(t(filter(DATA, Series.Code=="SE.TER.ENRR", Country.Code=="AUS")[,5:18])))
> SET_AUS
 [1] 67.00505 75.75243 73.39426 71.69843 72.29192
 [6] 71.48292 72.51995 72.91854 76.76537 80.91708
[11] 83.47076 85.41392 86.55455
attr(,"na.action")
[1] 14
attr(,"class")
[1] "omit"
# School enrollment, tertiary (% gross) in New Zealand from 2001 to 2014
> SET_NZL=na.omit(as.numeric(t(filter(DATA, Series.Code=="SE.TER.ENRR", Country.Code=="NZL")[,5:18])))
> SET_NZL
 [1] 66.59294 67.27668 68.98010 83.60093 80.64162
 [6] 78.68379 78.92032 78.03154 82.60359 82.51750
[11] 81.70712 80.84335 79.71429 80.88294
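The four extraction calls above differ only in the series code and the country code, so they could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: the helper name get_series and the toy data frame toy are hypothetical, the column names Series.Code and Country.Code and the positions 5:18 for the years 2001 to 2014 are taken from the calls above, and base R subsetting is used in place of the dplyr filter function so that the example has no package dependency.

```r
# Hypothetical helper: returns one indicator for one country as a plain
# numeric vector with missing values dropped. year_cols = 5:18 mirrors the
# column positions used in the extraction calls above.
get_series <- function(data, series, country, year_cols = 5:18) {
  rows <- data$Series.Code == series & data$Country.Code == country
  values <- as.numeric(unlist(data[rows, year_cols]))
  as.numeric(na.omit(values))  # as.numeric() drops the na.action attribute
}

# Toy stand-in for DATA (made-up values; 4 identifier columns + 14 year columns)
toy <- data.frame(
  Country.Name = c("A", "A"), Country.Code = c("AAA", "AAA"),
  Series.Name = c("s1", "s2"), Series.Code = c("S1", "S2"),
  matrix(c(1:14, c(1:13, NA)), nrow = 2, byrow = TRUE)
)

s1 <- get_series(toy, "S1", "AAA")  # complete series
s2 <- get_series(toy, "S2", "AAA")  # last year missing, so one value fewer
print(length(s1))
print(length(s2))
```

With the real data, a call such as get_series(DATA, "SL.UEM.TOTL.ZS", "AUS") would be expected to play the role of the UER_AUS extraction, assuming DATA has the layout described above.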
We referred to Hopkins, Glass and Hopkins (1987), Larsen and Marx (2017), Hoel (1954), Berenson et al. (2012), Bickel and Doksum (2015), Casella and Berger (2002), DeGroot and Schervish (2012), Devore and Berk (2007), Groebner et al. (2008), Ross (2014), Hogg and Craig (1995) and Serfling (2009).
One Variable Analysis:
Summary statistics (minimum, first quartile, median, mean, third quartile and maximum) for the unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand are obtained as
# Summary statistics for unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand
> summary(UER_AUS)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.200   5.025   5.300   5.407   5.850   6.800
> summary(UER_NZL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.700   4.050   5.350   5.207   6.175   6.900
> summary(SET_AUS)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  67.01   72.29   73.39   76.17   80.92   86.55
> summary(SET_NZL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  66.59   78.19   80.18   77.93   81.50   83.60
The mean unemployment rate is higher in Australia (5.407) than in New Zealand (5.207), whereas the mean school enrollment, tertiary (% gross) is higher in New Zealand (77.93) than in Australia (76.17). The other measures can be compared in the same way.
The standard deviation is obtained to study the variation in the unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand as
# Standard deviation for unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand
> sd(UER_AUS)
[1] 0.7247963
> sd(UER_NZL)
[1] 1.136435
> sd(SET_AUS)
[1] 6.070781
> sd(SET_NZL)
[1] 5.822633
Variation in the unemployment rate is higher for New Zealand than for Australia, whereas variation in school enrollment, tertiary (% gross) is higher for Australia than for New Zealand.
Boxplots of the unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand are plotted to examine the variation in more detail (Figure 1 and Figure 2).
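A minimal sketch of how such a boxplot could be drawn is given below. It covers only the unemployment rate, the values are copied from the extraction output above, and the plot title, axis label and the object names bp and iqr_values are illustrative choices rather than the settings used for the figures in this report.

```r
# Unemployment rates for 2001 to 2014, copied from the extraction output
UER_AUS <- c(6.8, 6.4, 5.9, 5.4, 5.0, 4.8, 4.4, 4.2, 5.6, 5.2, 5.1, 5.2, 5.7, 6.0)
UER_NZL <- c(5.4, 5.3, 4.8, 4.0, 3.8, 3.9, 3.7, 4.2, 6.1, 6.5, 6.5, 6.9, 6.2, 5.6)

# Side-by-side boxplots; title and axis label are illustrative
bp <- boxplot(list(Australia = UER_AUS, `New Zealand` = UER_NZL),
              main = "Unemployment rate, 2001 to 2014",
              ylab = "Unemployment rate (%)")

# Interquartile ranges as a numeric companion to the plot
iqr_values <- c(Australia = IQR(UER_AUS), NewZealand = IQR(UER_NZL))
print(iqr_values)
```

The interquartile range is the height of each box, so it gives a single number to set beside the standard deviations reported above.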

Two variable analysis:
Correlation coefficients between the unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand are obtained.
# Correlation coefficient between unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand
# For Australia, school enrollment, tertiary (% gross) for year 2014 is not available.
> cor(UER_AUS[1:13], SET_AUS)
[1] -0.08380169
> cor(UER_NZL, SET_NZL)
[1] 0.1171504
There is a weak negative correlation (about -0.08) between the unemployment rate and school enrollment, tertiary (% gross) for Australia, and a weak positive correlation (about 0.12) for New Zealand.
The scatter plots in Figure 3 and Figure 4 show the relation between the unemployment rate and school enrollment, tertiary (% gross) for Australia and New Zealand.
From Figure 3 and Figure 4, we report that
• For New Zealand, the unemployment rate tends to increase slightly as school enrollment, tertiary (% gross) increases.
• For Australia, the unemployment rate tends to decrease slightly as school enrollment, tertiary (% gross) increases.
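Because both correlation coefficients are close to zero, it may be worth checking whether they differ significantly from zero. The sketch below uses cor.test from base R on the values printed in the extraction output; it is offered as a possible follow-up check, not as part of the original analysis, and the object names ct_aus and ct_nzl are illustrative.

```r
# Values copied from the extraction output (Australia: 2001 to 2013 only,
# because tertiary enrollment is not available for 2014)
UER_AUS13 <- c(6.8, 6.4, 5.9, 5.4, 5.0, 4.8, 4.4, 4.2, 5.6, 5.2, 5.1, 5.2, 5.7)
SET_AUS <- c(67.00505, 75.75243, 73.39426, 71.69843, 72.29192, 71.48292,
             72.51995, 72.91854, 76.76537, 80.91708, 83.47076, 85.41392,
             86.55455)
UER_NZL <- c(5.4, 5.3, 4.8, 4.0, 3.8, 3.9, 3.7, 4.2, 6.1, 6.5, 6.5, 6.9, 6.2, 5.6)
SET_NZL <- c(66.59294, 67.27668, 68.98010, 83.60093, 80.64162, 78.68379,
             78.92032, 78.03154, 82.60359, 82.51750, 81.70712, 80.84335,
             79.71429, 80.88294)

# Pearson correlation with a two-sided test of H0: correlation = 0
ct_aus <- cor.test(UER_AUS13, SET_AUS)
ct_nzl <- cor.test(UER_NZL, SET_NZL)

print(c(estimate = unname(ct_aus$estimate), p.value = ct_aus$p.value))
print(c(estimate = unname(ct_nzl$estimate), p.value = ct_nzl$p.value))
```

A large p-value here would mean that, with only 13 or 14 yearly observations, the data do not establish a relationship in either direction.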
 Advanced Analysis
K-means clustering and linear regression are carried out in this section. For clustering, we referred to Romesburg (2004) and Kaufman and Rousseeuw (2009).
 Clustering
Clustering is a grouping technique. A set of objects is divided into groups (clusters) so that objects in the same cluster are more similar to each other, in some characteristic, than to objects in other clusters. K-means clustering is the grouping technique in which the objects are divided into k groups.
K-means clustering by unemployment rate for the year 2014 for East Asia and Pacific countries:
The first step is data extraction. The unemployment rate for all East Asia and Pacific countries for the year 2014 is extracted into the vector UER_2014.

The 23 countries for which the unemployment rate is available for the year 2014 are grouped into 3 clusters using k-means clustering as
> kmeans(UER_2014,3)
K-means clustering with 3 clusters of sizes 11, 7, 5

Cluster means:
      [,1]
1 3.881818
2 1.571429
3 6.560000

Clustering vector:
 [1] 3 1 2 1 3 1 3 1 1 1 2 2 2 1 1 3 2 3 1 1 2 1 2

Within cluster sum of squares by cluster:
[1] 3.996363 3.434286 3.452000
 (between_SS / total_SS =  87.0 %)

Available components:

[1] "cluster"      "centers"      "totss"
[4] "withinss"     "tot.withinss" "betweenss"
[7] "size"         "iter"         "ifault"
We can group the countries using the clustering vector, where cluster 2 is the low unemployment group (mean 1.57), cluster 1 is the medium unemployment group (mean 3.88) and cluster 3 is the high unemployment group (mean 6.56). Australia and New Zealand are in the high unemployment group. About 87% of the total variation is explained by the clusters.
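The cluster numbers returned by kmeans are arbitrary and depend on randomly chosen starting centers, so a rerun may label the same groups differently. The following sketch shows one way to make the result reproducible and to relabel the clusters by increasing mean. The vector rates holds made-up values for ten unnamed countries, since the UER_2014 values are not listed in this report, and the names rank_of_cluster, level and explained are illustrative.

```r
# Hypothetical 2014 unemployment rates for ten unnamed countries
# (made-up values; they do not come from the World Bank data)
rates <- c(0.8, 1.2, 1.9, 3.4, 3.9, 4.3, 4.6, 5.9, 6.4, 7.1)

set.seed(1)                                     # fixes the random starting centers
fit <- kmeans(rates, centers = 3, nstart = 25)  # 25 restarts, best solution kept

# Relabel clusters by increasing cluster mean: rank 1 = low, rank 3 = high
rank_of_cluster <- rank(fit$centers[, 1])
level <- factor(rank_of_cluster[fit$cluster], levels = 1:3,
                labels = c("Low", "Medium", "High"))

# Share of total variation explained by the clusters (between_SS / total_SS)
explained <- fit$betweenss / fit$totss

print(data.frame(rates, level))
print(round(100 * explained, 1))
```

Relabeling by cluster mean keeps the interpretation (low, medium, high) stable even when the raw cluster numbers change between runs.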
 Linear Regression
We referred to Baayen (2008) and Hair et al. (1998) for this section. We fit a trend to the unemployment rate for Australia and New Zealand using linear regression on the year.
Data for the unemployment rate are available for 2001 to 2014. We first plot the unemployment rate versus year, in Figure 5 for Australia and Figure 6 for New Zealand, to understand the shape of the trend.
Australia:
Figure 5: Unemployment rate for Australia from 2001 to 2014
The unemployment rate for Australia decreases from 2001 to 2008 and then starts to increase. We fit a second order polynomial to the unemployment rate for Australia as
> UER_AUS=as.numeric(t(filter(DATA, Series.Code=="SL.UEM.TOTL.ZS", Country.Code=="AUS")[,5:18]))
> Year=2001:2014
> UERdataAUS=data.frame(Year,UER_AUS)
> result1=lm(UER_AUS~Year+I(Year^2),data=UERdataAUS)
> summary(result1)

Call:
lm(formula = UER_AUS ~ Year + I(Year^2), data = UERdataAUS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52549 -0.13655 -0.01422  0.08283  0.84371

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.654e+05  2.604e+04   6.351 5.44e-05 ***
Year        -1.647e+02  2.594e+01  -6.349 5.46e-05 ***
I(Year^2)    4.100e-02  6.461e-03   6.347 5.47e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3486 on 11 degrees of freedom
Multiple R-squared:  0.8042, Adjusted R-squared:  0.7686
F-statistic: 22.59 on 2 and 11 DF,  p-value: 0.0001272
Since the p-value of the overall F-test is below 0.05 (p = 0.0001272) and R^{2} is 0.804, we conclude that the second order polynomial fits the unemployment rate in Australia well. All coefficients are significant.
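A fitted second order polynomial a + b*Year + c*Year^2 with c > 0 reaches its minimum at Year = -b / (2c), which gives a model-based estimate of when the unemployment rate stopped falling. The sketch below refits the same model on the values printed in the extraction output and computes that turning point; the names fit_aus, turning_year and r2_aus are illustrative and the calculation is an optional extension, not part of the original analysis.

```r
# Unemployment rate for Australia, 2001 to 2014 (from the extraction output)
Year <- 2001:2014
UER_AUS <- c(6.8, 6.4, 5.9, 5.4, 5.0, 4.8, 4.4, 4.2, 5.6, 5.2, 5.1, 5.2, 5.7, 6.0)

# Same model as in the report: second order polynomial in Year
fit_aus <- lm(UER_AUS ~ Year + I(Year^2))
b <- coef(fit_aus)

# Minimum of the fitted parabola: -b / (2c)
turning_year <- unname(-b["Year"] / (2 * b["I(Year^2)"]))
r2_aus <- summary(fit_aus)$r.squared

print(turning_year)
print(r2_aus)
```

The turning point can be compared with the plot of the unemployment rate against year to check that the fitted curve bends where the data do.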
New Zealand:
The unemployment rate for New Zealand is shown in Figure 6. We fit a second order polynomial to the unemployment rate for New Zealand as
> UER_NZL=as.numeric(t(filter(DATA, Series.Code=="SL.UEM.TOTL.ZS", Country.Code=="NZL")[,5:18]))
> Year=2001:2014
> UERdataNZL=data.frame(Year,UER_NZL)
> result2=lm(UER_NZL~Year+I(Year^2),data=UERdataNZL)
> summary(result2)

Call:
lm(formula = UER_NZL ~ Year + I(Year^2), data = UERdataNZL)

Residuals:
     Min       1Q   Median       3Q      Max
-1.33571 -0.66909 -0.03819  0.78777  1.19396

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.137e+05  6.811e+04   1.670    0.123
Year        -1.135e+02  6.786e+01  -1.672    0.123
I(Year^2)    2.830e-02  1.690e-02   1.674    0.122

Residual standard error: 0.912 on 11 degrees of freedom
Multiple R-squared:  0.455, Adjusted R-squared:  0.3559
F-statistic: 4.592 on 2 and 11 DF,  p-value: 0.03549
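R^{2} is 0.455, which suggests that the fit is not very good. The p-value of the overall F-test is below 0.05 (p = 0.03549), so we conclude that there is a significant relation between year and unemployment rate for New Zealand, although none of the individual coefficients is significant at the 5% level (each p-value is about 0.12).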
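Over 2001 to 2014, Year and Year^2 are almost perfectly collinear, which can make the individual t-tests of a raw polynomial fit hard to interpret even when the overall F-test is significant. One common remedy, shown here as a minimal sketch and not as the method used in this report, is to center the year before squaring it. The variable name Yc is an assumption introduced for illustration; the unemployment values are copied from the extraction output. Centering leaves the fitted curve, R^{2} and the test of the quadratic term unchanged; only the intercept and the linear term are re-expressed relative to the middle of the period.

```r
# Unemployment rate for New Zealand, 2001 to 2014 (from the extraction output)
Year <- 2001:2014
UER_NZL <- c(5.4, 5.3, 4.8, 4.0, 3.8, 3.9, 3.7, 4.2, 6.1, 6.5, 6.5, 6.9, 6.2, 5.6)

# Centered year (hypothetical helper variable): Yc = 0 at the mid-period
Yc <- Year - mean(Year)

# Same second order polynomial, written in terms of the centered year
fit_nzl <- lm(UER_NZL ~ Yc + I(Yc^2))
coefs <- summary(fit_nzl)$coefficients

p_linear <- coefs["Yc", "Pr(>|t|)"]          # slope at the middle of the period
p_quadratic <- coefs["I(Yc^2)", "Pr(>|t|)"]  # curvature, same test as before
r2_nzl <- summary(fit_nzl)$r.squared

print(round(coefs, 4))
print(r2_nzl)
```

In the centered model the linear coefficient measures the trend at the middle of the period, so its t-test answers a more meaningful question than the test of the Year coefficient in the raw parameterization.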
Conclusion
We observed that the mean unemployment rate is higher in Australia than in New Zealand, whereas the mean school enrollment, tertiary (% gross) is higher in New Zealand than in Australia. Variation in the unemployment rate is higher for New Zealand than for Australia, whereas variation in school enrollment, tertiary (% gross) is higher for Australia than for New Zealand.
There is a weak negative correlation between the unemployment rate and school enrollment, tertiary (% gross) for Australia, and a weak positive correlation for New Zealand.
We grouped the countries using the clustering vector, where cluster 2 is the low unemployment group, cluster 1 is the medium unemployment group and cluster 3 is the high unemployment group. Australia and New Zealand are in the high unemployment group.
We fitted a second order polynomial to the unemployment rate in Australia and found it suitable, with all coefficients significant. For New Zealand, the overall relation between year and unemployment rate is significant, but the fit is weaker (R^{2} = 0.455).
Reflection
Filtering the data and handling the missing values of the variables of interest were the main problems in this analysis. We used the filter function from the dplyr library to filter the data and the na.omit function to omit the missing values. The analysis became more engaging once the desired data were available in the desired format. This study gave us confidence in handling the analysis of larger data sets.
References
Anderberg, M.R., 2014. Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks (Vol. 19). Academic press.
Baayen, R.H., 2008. Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.
Berenson, M., Levine, D., Szabat, K.A. and Krehbiel, T.C., 2012. Basic business statistics: Concepts and applications. Pearson higher education AU.
Bickel, P.J. and Doksum, K.A., 2015. Mathematical statistics: basic ideas and selected topics, volume I (Vol. 117). CRC Press.
Casella, G. and Berger, R.L., 2002. Statistical inference (Vol. 2). Pacific Grove, CA: Duxbury.
DeGroot, M.H. and Schervish, M.J., 2012. Probability and statistics. Pearson Education.
Devore, J.L. and Berk, K.N., 2007. Modern mathematical statistics with applications. Cengage Learning.
Groebner, D.F., Shannon, P.W., Fry, P.C. and Smith, K.D., 2008. Business statistics. Pearson Education.
Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E. and Tatham, R.L., 1998. Multivariate data analysis (Vol. 5, No. 3, pp. 207-219). Upper Saddle River, NJ: Prentice Hall.
Hoel, P.G., 1954. Introduction to mathematical statistics (2nd ed.).
Hogg, R.V. and Craig, A.T., 1995. Introduction to mathematical statistics (5th edition) (pp. 269-278). Upper Saddle River, New Jersey: Prentice Hall.
Hopkins, K.D., Glass, G.V. and Hopkins, B.R., 1987. Basic statistics for the behavioral sciences. Prentice-Hall, Inc.
Kaufman, L. and Rousseeuw, P.J., 2009. Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
Larsen, R.J. and Marx, M.L., 2017. An introduction to mathematical statistics and its applications (Vol. 5). Pearson.
Romesburg, C., 2004. Cluster analysis for researchers. Lulu.com.
Ross, S.M., 2014. Introduction to probability models. Academic press.
Serfling, R.J., 2009. Approximation theorems of mathematical statistics (Vol. 162). John Wiley & Sons.