I have attached all necessary information in **Statistics assignment in Word**.

Due date: 2nd April 2022

Reference: APA style

Word count: 700-1000 words

I have attached other students’ work that was provided by the lecturer. I have also attached the assigned variables. My student ID ends with 7, so please refer to the topic from the variables assigned.

Please let me know if you need any more information for this assignment, and please also send a quote for the assignment.

Statistics Assignment

The purpose of this assignment is to give you an opportunity to demonstrate your skills in describing and analysing data using concepts and tools that we have developed in the course so far.

Below are instructions on how to collect a specified set of data and what to do with it. Your goal is to produce a report in MS Word discussing the data and submit this along with a single MS Excel workbook showing your workings. A suggested target range for the word count of the report is 700-1000 words.

I have prepared and attached an example Excel workbook which I will refer to below. Note: my Excel workbook is not a model answer. You may choose to use different visualisations and do not necessarily need all the computed statistics and charts I have included. It really depends on the features of the data you have, so you need to use your own judgement as to how to best present and describe the data. Besides, the primary output for this assignment is the report itself, not the workbook.

**Data collection**

Collect quantitative data on two variables from the Sustainable Development Report 2021 website.

· Go to __https://dashboards.sdgindex.org/__ and browse around the site to become familiar with its purpose and the information publicly available there.

· Go to the “Downloads” page and click on “Database EXCEL” to download the database of indicators used to assess countries’ progress towards the UN Sustainable Development Goals. You will be taking data from the **“SDR2021 Data” sheet** in the workbook. Starting from **column AR** of that sheet, there are columns of cross-country data for the SDG indicators, one row for each country. Note that from **row 195** the data are for regional blocs so these should be __excluded__ from the data you take.

· You have been assigned __two__ variables according to the last digit of your Student ID number. You can find the variables assigned to you in the attached file “Assigned variables.xlsx”. For example, my student ID number (a long long time ago in a galaxy not so far away) ended with 2 so I would be using variables “Poverty headcount ratio at $1.90/day (%)” and “Cereal yield (tonnes per hectare of harvested land)”. I have chosen pairs of variables that may potentially have a statistical relationship. If you wish you are welcome to switch one of the variables with another one from the database that you are interested in investigating and you think is related to the variable you retain.

· **P.S: My student number ends with 7. Therefore, my topics are:**

| Last Digit of Student ID | Variables | SDG # |
| --- | --- | --- |
| 7 | Corruption Perception Index (worst 0-100 best) | 16 |
|  | Population with access to clean fuels and technology for cooking (%) | 7 |

· **Please refer to the “Assigned variables” Excel sheet for more information.**

· Look up your variables in the Data Explorer on the website or in the report from page 75 (some newer variables are not included on the website yet, it seems). The main thing you want to understand is what a given value of each of your variables means. E.g. I found that the “Poverty headcount ratio at $1.90/day (%)” is the estimated percentage of the population that is living under the poverty threshold of US$1.90 a day.

· Each indicator has a number of associated columns in the Database workbook, **the first column of the set has the data**, the others can be safely ignored. So for the indicators I use in my example Excel workbook, I took **data from columns QZ and GO (and only down to row 194)**. Using Excel’s Find tool is a quick way to find your data. Copy and paste the data you will use into your workbook. You should keep the country names alongside the data so that you can identify which observation is for which country.
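If the sheet is ever read programmatically rather than by hand in Excel, the column labels quoted above (AR, QZ, GO) need converting to numeric positions. The helper below is a minimal sketch of that conversion; the function name `column_index` is hypothetical and not part of the assignment materials.

```python
def column_index(letters: str) -> int:
    """Convert an Excel column label such as "AR" to a 0-based column index."""
    index = 0
    for ch in letters.upper():
        # Treat the label as a base-26 number where A=1, ..., Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1  # shift from 1-based to 0-based


first_indicator = column_index("AR")   # first indicator column on the data sheet
example_columns = [column_index("QZ"), column_index("GO")]
print(first_indicator, example_columns)
```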

Prepare your data.

· Construct __separate__ univariate data sets for analysis. There will probably be many countries where there is no estimate for the indicators you are looking at. **If there is no observation recorded, do not assume the observed value is zero**. In general, missing observations in data rarely mean they should be replaced with zeros. Also consider if it is appropriate to include observations that are recorded as zero. In my example Excel workbook I have retained observations of zero for CO2 emissions because this suggests those countries are not exporting fossil fuels, while blank cells mean there is no observation. It is fine to have blank cells within your data ranges, Excel will usually ignore them (as long as they are truly blank).

· Construct a bivariate (__paired__) data set – i.e. for each country you should have an observation for both variables. You can see in my example Excel workbook how I use some Excel formulas and the Replace tool to blank out cells for countries where there is only an observation for one of the two variables. If you find that the number of countries that you have left in the bivariate data set is low, say less than 30, it might be best to go back to the Database and replace the variable that is causing many countries to be dropped.

· In your report you should note any difficulties with the data preparation and implications of dropping countries from the data sets if such was required.
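The two preparation steps above can be sketched outside Excel as follows. This is only an illustration with made-up values and hypothetical names (`univariate`, `paired`, countries "A" to "D"); `None` stands for a blank cell, which is kept distinct from a recorded zero.

```python
from typing import Dict, Optional, Tuple

# Made-up toy data: None marks a missing observation (a blank cell),
# which is not the same thing as a recorded value of 0.0.
var_x: Dict[str, Optional[float]] = {"A": 40.0, "B": None, "C": 85.0, "D": 0.0}
var_y: Dict[str, Optional[float]] = {"A": 55.0, "B": 12.0, "C": None, "D": 100.0}


def univariate(series: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Keep countries with a recorded value; zeros stay, blanks are dropped."""
    return {country: v for country, v in series.items() if v is not None}


def paired(x: Dict[str, Optional[float]],
           y: Dict[str, Optional[float]]) -> Dict[str, Tuple[float, float]]:
    """Keep only countries that have an observation for both variables."""
    return {
        c: (x[c], y[c])
        for c in x
        if c in y and x[c] is not None and y[c] is not None
    }


x_only = univariate(var_x)
pairs = paired(var_x, var_y)
print(len(x_only), "univariate observations;", len(pairs), "paired observations")
```

Comparing the two counts gives a quick view of how many countries are lost when moving from the separate data to the paired data.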

**An educated guess**

Guess the average value for each variable.

· Run your eye down the column of univariate data you have for each variable (the separate data not the paired data), and make a guess what you think the cross-country average would be for each one. **Do not use Excel to calculate the averages here.**

· Just take a note of your guesses; you will use them later.

**Data description**

Use numerical summary measures and graphical representations to describe the two variables (using the __separate data__)

· You can use the “Descriptive Statistics” tool in the data analysis tool pack and also calculate quartiles, coefficients of variation etc.

· Draw a histogram, boxplot, etc. for each data set.

· You should __discuss__ the **important and interesting features** of the data revealed by your descriptive statistics and graphical representations in your report. In my example workbook you will see the CO2 data is strongly positively skewed, so much so that the boxplot is almost meaningless. Two options I had were to drop some of the largest observations, or to transform the data. I chose the latter – by taking the log of the data I end up with a data set distribution that can be usefully presented on a boxplot or histogram. Outliers and skewness are common features of cross-country data like this, so you should be prepared to drop observations or transform data if necessary, and explain why you did this in your report. (Just because data is skewed doesn’t mean you have to transform it! You’ll notice I did not transform the literacy data.)
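As a cross-check on the Excel output, the summary measures mentioned above can be sketched in Python's standard library. The sample is made up to be strongly right-skewed and is not taken from the database; the `"inclusive"` quantile method should correspond to Excel's inclusive quartile calculation, but treat that as an assumption to verify against your own workbook.

```python
import math
import statistics

# Made-up, strongly right-skewed sample (not real SDR 2021 values).
data = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 150.0]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
cv = sd / mean               # coefficient of variation

# A mean far above the median is a sign of positive (right) skew.
# Taking logs compresses the long upper tail; it only works for values > 0.
logged = [math.log(x) for x in data]

print(f"mean={mean:.1f} median={median:.1f} sd={sd:.1f} IQR={iqr:.1f} CV={cv:.2f}")
print(f"log scale: mean={statistics.mean(logged):.2f} "
      f"median={statistics.median(logged):.2f}")
```

On the log scale the mean and median sit much closer together, which is the kind of evidence that would justify a transformation in the report.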

Use numerical summary measures and graphical representations to consider if there might be a relationship between the two variables (using the __paired data__).

· Use the correlation coefficient and a scatterplot to see the strength and direction of the relationship (if any) between the two variables.

· In your report, __discuss__ the above and __explain__ why you think the relationship might be causative, spurious, or driven by a third factor.
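For reference, the linear correlation coefficient can be computed directly from its definition. The sketch below uses made-up paired observations and a hypothetical function name `pearson_r`.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up paired data, one (x, y) pair per country.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

The coefficient only measures linear association, so it should always be read alongside the scatterplot.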

**Data analysis**

Construct confidence intervals (using the __separate data__).

· Now assume that the data for each variable is a random sample and construct a confidence interval for the population mean of each variable. Since you don’t know the population standard deviations you should use **critical values from the Student t-distribution**.

· State your confidence interval in your report, explaining what it means (to a layperson) and also discuss if you have any doubts about the validity of the interval.
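The interval described above is the sample mean plus or minus a t critical value times the standard error. The sketch below is one way to reproduce it with the standard library only: the critical value is found by numerically integrating the t density, which is a stand-in chosen here for self-containment (a statistics library such as SciPy, or Excel itself, provides it directly). All function names and the sample are hypothetical.

```python
import math
import statistics


def t_pdf(x: float, df: int) -> float:
    """Density of the Student t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_central_area(t: float, df: int, steps: int = 2000) -> float:
    """P(-t < T < t): Simpson's rule on [0, t], doubled by symmetry."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(confidence: float, df: int) -> float:
    """Two-sided critical value, found by bisection on the central area."""
    lo, hi = 0.0, 100.0  # bracket is ample for df >= 2 at usual confidence levels
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_confidence_interval(sample, confidence: float = 0.95):
    """Confidence interval for the population mean, sigma unknown."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = t_critical(confidence, n - 1) * standard_error
    return mean - margin, mean + margin


# Made-up sample, for illustration only.
lower, upper = mean_confidence_interval([40.0, 45.0, 50.0, 55.0, 60.0])
print(f"95% CI for the mean: ({lower:.1f}, {upper:.1f})")
```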

Compute p-values (using the __separate data__).

· Now assume that your “educated guess” of the average for each variable is the true mean of that variable. How likely is it that you would observe the sample mean you have obtained, or something more extreme, if your parameter assumption for each variable is correct? I.e. find the **two-tail p-value** associated with each sample mean.

*You can obtain the p-value by doing a two-tail hypothesis test for the mean, for each data set.*

· State the p-values in your report and explain their meaning. Conclude by stating whether your educated guesses were probably right or wrong. (There is no penalty if your educated guesses are wrong!)
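The two-tail p-value can be sketched in the same spirit: compute the t statistic for the guessed mean, then take the probability in both tails beyond it. As an assumption made for self-containment, the tail area is obtained by numerical integration of the t density rather than from a statistics library; the function names and the sample are hypothetical.

```python
import math
import statistics


def t_pdf(x: float, df: int) -> float:
    """Density of the Student t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tail_p_value(sample, guessed_mean: float, steps: int = 2000):
    """Return (t statistic, two-tail p-value) for H0: mean == guessed_mean."""
    n = len(sample)
    df = n - 1
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.mean(sample) - guessed_mean) / standard_error

    # Central area P(-|t| < T < |t|) by Simpson's rule on [0, |t|], doubled.
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3

    # Whatever lies outside the central area is the two-tail probability.
    return t_stat, max(0.0, 1.0 - central)


# Made-up sample and guesses, for illustration only.
sample = [40.0, 45.0, 50.0, 55.0, 60.0]
t_far, p_far = two_tail_p_value(sample, guessed_mean=40.0)
t_same, p_same = two_tail_p_value(sample, guessed_mean=50.0)
print(f"guess 40: t={t_far:.2f}, p={p_far:.3f}")
print(f"guess 50: t={t_same:.2f}, p={p_same:.3f}")
```

The second call shows why the guess must be made before computing anything: a guess equal to the sample mean gives a t statistic of zero and a p-value of 1.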

**Report**

As noted above, your assignment output should consist of a report and a spreadsheet workbook. Imagine that the reader of your report is a busy executive with only a basic understanding of statistics. Your report should therefore be of professional appearance and be able to be fully understood without reference to the workbook. I.e. paste relevant charts into the report; __do not__ paste the full descriptive statistics table into the report but rather use an abridged table and/or discussion; __do not__ show the __computation__ of the confidence intervals and p-values in the report but do __state__ and __interpret__ them.

Remember the suggested word count is 700-1000 words but this is a guide only: if you accomplish everything required above with less, that is fine; ideally don’t go much over 1000 – this would indicate you are not being concise enough.

Finally, I have attached some collated feedback I provided to students last year. You may like to refer to this to see what I am hoping to see in your report.

**General:**

· Check that you included everything that was asked for in the report (not just the workbook). If you missed out computing or discussing e.g. the p-values I couldn’t give you marks for those!

**Data description: univariate**

· Things I looked for were

· A brief introduction to the variables: what do the quantities mean?

· Use of descriptive statistics to describe the main features of the data e.g. IQR, CV, mean, median, quartiles, standard deviation (together with empirical rule or Chebyshev theorem)

· A little discussion about outliers that were interesting or were removed

· Histograms, polygons, boxplots and/or normal probability plots and discussion of the data distribution shape

· If there was no observation in a data set for a country it should not be treated as zero. Also be suspicious of zeros that do appear in the raw data set; they might have ended up there in place of no observation.

· Be careful with units: state what the units of the variables are and keep using those units for things like mean, standard deviation etc. E.g. cereal yield is in tonnes per hectare, not percentage.

· Left skewed or negatively skewed data has a peak near the top of the distribution and a long lower tail.

· A data set does not have to fall into {left skewed, symmetric/normal, right skewed}. There are many other variations (without specific names, you could just call it asymmetric for instance).

· It wasn’t necessary to transform a data set for the univariate analysis if it was skewed, only if it was so skewed or outliers were so extreme that boxplots etc. were not useful. It could of course be useful for bivariate analysis if one or both variables seem to be lognormal, because then you could still see a linear relation in the scatterplot using the logged variable(s) and linear correlation and regression would be valid.

**Data description: bivariate**

· Things I looked for were

· A scatterplot and associated discussion (not a line chart or bar chart with series side by side)

· The correlation coefficient being stated and interpreted

· Discussion about why there might or might not be a relationship

· You shouldn’t just rely on the correlation coefficient: if the scatterplot indicates almost no relationship then you should say it is doubtful there is an actual relationship.

· If the scatterplot indicates there is a relationship, but it is more likely to be non-linear, you should mention this.

· Don’t just say there might be a third variable and leave it at that. Some discussion is important to show you know what this actually means.

**Confidence intervals**

· Things I looked for were

· Statement of each confidence interval in a sentence, in the context of the variable

· Discussion of why the confidence intervals were valid (appealing to CLT)

· Some said the sample mean was a good estimate because it was inside the confidence interval. Of course, the sample mean is at the center of the interval! This does not guarantee anything about the accuracy of the estimate.

· If you have dropped countries from the data set, the confidence interval only estimates the mean of the population you have sampled from, which may not be the same population that the countries you dropped come from. E.g. if you drop many African countries from the data set due to lack of data, then you should use caution saying that the confidence interval estimate for the mean still applies to African countries (i.e. only valid if you are quite sure there is no systematic difference between such countries and those included in your sample in the context of your variable).

**P-values**

· Things I looked for were

· Computation of two-tail p-values in the workbook

· An interpretation of the two-tail (not one-tail) p-value for each variable in terms of how likely it is to observe the sample estimate if the assumed population parameter was correct (not in terms of rejection of the null hypothesis or not).

· Don’t guess the parameter value to be equal to the sample mean, otherwise the p-value is guaranteed to be 1 (and is just cheating!)

**Professional appearance**

· The report should have been easy to read, well-structured, and include charts with titles and axis labels etc.

· You shouldn’t put bar charts of all the countries for each variable into the report. If you want to highlight certain countries, e.g. top 5 and bottom 5, just put them into a bar chart.

· You also shouldn’t paste in whole descriptive statistics tables or give values with all the decimal places Excel gives you by default. A good rule of thumb is two or three significant figures. E.g. 542,997 becomes 543,000 and 0.024897874 becomes 0.025 or 0.0249. Also don’t paste in the confidence interval or p-value calculation templates. Managers don’t have time or understanding for technical output. They won’t be impressed; they may just be confused or think you are showing off. They will be impressed if you can boil that technical material down in such a way that they can understand the key points enough to know what decisions need to be made.
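The significant-figures rule of thumb can be made concrete with a small helper. This is a sketch with a hypothetical function name, `round_sig`, reproducing the two examples given above.

```python
import math


def round_sig(value: float, sig: int = 3) -> float:
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit decides how many decimals to keep.
    digits = sig - 1 - math.floor(math.log10(abs(value)))
    return round(value, digits)


print(round_sig(542997, 3))        # 543000
print(round_sig(0.024897874, 2))   # 0.025
print(round_sig(0.024897874, 3))   # 0.0249
```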

## Sheet1

| Last Digit of Student ID | Variables | SDG # |
| --- | --- | --- |
| 0 | Poverty headcount ratio at $1.90/day (%) | 1 |
|  | Mean area that is protected in freshwater sites important to biodiversity (%) | 15 |
| 1 | Maternal mortality rate (per 100,000 live births) | 3 |
|  | Mean area that is protected in terrestrial sites important to biodiversity (%) | 15 |
| 2 | Poverty headcount ratio at $1.90/day (%) | 1 |
|  | Cereal yield (tonnes per hectare of harvested land) | 2 |
| 3 | Mortality rate, under-5 (per 1,000 live births) | 3 |
|  | Birth registrations with civil authority (% of children under age 5) | 16 |
| 4 | Neonatal mortality rate (per 1,000 live births) | 3 |
|  | Government revenue excluding grants (% of GDP) (countries other than high-income and OECD DAC) | 17 |
| 5 | Prevalence of obesity, BMI ≥ 30 (% of adult population) | 2 |
|  | Homicides (per 100,000 population) | 16 |
| 6 | Fish caught from overexploited or collapsed stocks (% of total catch) | 14 |
|  | Poverty headcount ratio at $1.90/day (%) | 1 |
| 7 | Corruption Perception Index (worst 0-100 best) | 16 |
|  | Population with access to clean fuels and technology for cooking (%) | 7 |
| 8 | Logistics Performance Index: Quality of trade and transport-related infrastructure (worst 1-5 best) | 9 |
|  | Mean area that is protected in freshwater sites important to biodiversity (%) | 15 |
| 9 | Government revenue excluding grants (% of GDP) (countries other than high-income and OECD DAC) | 17 |
|  | Municipal solid waste (kg/capita/day) | 12 |
| Example | CO2 emissions embodied in fossil fuel exports (kg/capita) | 13 |
|  | Literacy rate (% of population aged 15 to 24) | 4 |

List of SDGs

\* = description not given on website, but is available in the report from p75

| SDG | Note | Indicator |
| --- | --- | --- |
| SDG 1 |  | Poverty headcount ratio at $1.90/day (%) |
|  |  | Poverty headcount ratio at $3.20/day (%) |
|  |  | Poverty rate after taxes and transfers (%) |
| SDG 2 |  | Prevalence of undernourishment (%) |
|  |  | Prevalence of stunting in children under 5 years of age (%) |
|  |  | Prevalence of wasting in children under 5 years of age (%) |
|  |  | Prevalence of obesity, BMI ≥ 30 (% of adult population) |
|  |  | Human Trophic Level (best 2-3 worst) |
|  |  | Cereal yield (tonnes per hectare of harvested land) |
|  |  | Sustainable Nitrogen Management Index (best 0-1.41 worst) |
|  |  | Yield gap closure (% of potential yield) |
| SDG 3 |  | Maternal mortality rate (per 100,000 live births) |
|  |  | Neonatal mortality rate (per 1,000 live births) |
|  |  | Mortality rate, under-5 (per 1,000 live births) |
|  |  | Incidence of tuberculosis (per 100,000 population) |
|  |  | New HIV infections (per 1,000 uninfected population) |
|  |  | Age-standardized death rate due to cardiovascular disease, cancer, diabetes, or chronic respiratory disease in adults aged 30-70 years (%) |
|  | * | Age-standardized death rate attributable to household air pollution and ambient air pollution (per 100,000 population) |
|  |  | Traffic deaths (per 100,000 population) |
|  |  | Life expectancy at birth (years) |
|  |  | Adolescent fertility rate (births per 1,000 females aged 15 to 19) |
|  |  | Births attended by skilled health personnel (%) |
|  |  | Surviving infants who received 2 WHO-recommended vaccines (%) |
|  |  | Universal health coverage (UHC) index of service coverage (worst 0-100 best) |
|  |  | Subjective well-being (average ladder score, worst 0-10 best) |
|  | * | Gap in life expectancy at birth among regions (years) |
|  |  | Gap in self-reported health status by income (percentage points) |
|  |  | Daily smokers (% of population aged 15 and over) |
| SDG 4 |  | Net primary enrollment rate (%) |
|  |  | Lower secondary completion rate (%) |
|  | * | Literacy rate (% of population aged 15 to 24) |
|  |  | Participation rate in pre-primary organized learning (% of children aged 4 to 6) |
|  |  | Tertiary educational attainment (% of population aged 25 to 34) |
|  |  | PISA score (worst 0-600 best) |
|  |  | Variation in science performance explained by socio-economic status (%) |
|  |  | Underachievers in science (% of 15-year-olds) |
|  |  | Resilient students in science (% of 15-year-olds) |
| SDG 5 |  | Demand for family planning satisfied by modern methods (% of females aged 15 to 49) |
|  |  | Ratio of female-to-male mean years of education received (%) |
|  |  | Ratio of female-to-male labor force participation rate (%) |
|  |  | Seats held by women in national parliament (%) |
|  |  | Gender wage gap (% of male median wage) |
|  | * | Gender gap in time spent doing unpaid work (minutes/day) |
| SDG 6 |  | Population using at least basic drinking water services (%) |
|  |  | Population using at least basic sanitation services (%) |
|  | * | Freshwater withdrawal (% of available freshwater resources) |
|  | * | Anthropogenic wastewater that receives treatment (%) |
|  |  | Scarce water consumption embodied in imports (m³/capita) |
|  |  | Population using safely managed water services (%) |
|  |  | Population using safely managed sanitation services (%) |
| SDG 7 |  | Population with access to electricity (%) |
|  |  | Population with access to clean fuels and technology for cooking (%) |
|  |  | CO2 emissions from fuel combustion for electricity and heating per total electricity output (MtCO2/TWh) |
|  |  | Share of renewable energy in total primary energy supply (%) |
| SDG 8 | * | Adjusted GDP growth (%) |
|  | * | Victims of modern slavery (per 1,000 population) |
|  |  | Adults with an account at a bank or other financial institution or with a mobile-money-service provider (% of population aged 15 or over) |
|  |  | Unemployment rate (% of total labor force) |
|  |  | Fundamental labor rights are effectively guaranteed (worst 0-1 best) |
|  |  | Fatal work-related accidents embodied in imports (per 100,000 population) |
|  |  | Employment-to-population ratio (%) |
|  |  | Youth not in employment, education or training (NEET) (% of population aged 15 to 29) |
| SDG 9 |  | Population using the internet (%) |
|  |  | Mobile broadband subscriptions (per 100 population) |
|  |  | Logistics Performance Index: Quality of trade and transport-related infrastructure (worst 1-5 best) |
|  | * | The Times Higher Education Universities Ranking: Average score of top 3 universities (worst 0-100 best) |
|  |  | Scientific and technical journal articles (per 1,000 population) |
|  |  | Expenditure on research and development (% of GDP) |
|  |  | Researchers (per 1,000 employed population) |
|  |  | Triadic patent families filed (per million population) |
|  |  | Gap in internet access by income (percentage points) |
|  |  | Female share of graduates from STEM fields at the tertiary level (%) |
| SDG 10 |  | Gini coefficient adjusted for top income |
|  |  | Palma ratio |
|  |  | Elderly poverty rate (% of population aged 66 or over) |
| SDG 11 |  | Proportion of urban population living in slums (%) |
|  |  | Annual mean concentration of particulate matter of less than 2.5 microns in diameter (PM2.5) (µg/m³) |
|  |  | Access to improved water source, piped (% of urban population) |
|  |  | Satisfaction with public transport (%) |
|  |  | Population with rent overburden (%) |
| SDG 12 | * | Municipal solid waste (kg/capita/day) |
|  | * | Electronic waste (kg/capita) |
|  | * | Production-based SO2 emissions (kg/capita) |
|  | * | SO2 emissions embodied in imports (kg/capita) |
|  | * | Production-based nitrogen emissions (kg/capita) |
|  | * | Nitrogen emissions embodied in imports (kg/capita) |
|  | * | Non-recycled municipal solid waste (kg/capita/day) |
| SDG 13 |  | CO2 emissions from fossil fuel combustion and cement production (tCO2/capita) |
|  |  | CO2 emissions embodied in imports (tCO2/capita) |
|  |  | CO2 emissions embodied in fossil fuel exports (kg/capita) |
|  |  | Carbon Pricing Score at EUR60/tCO2 (%, worst 0-100 best) |
| SDG 14 |  | Mean area that is protected in marine sites important to biodiversity (%) |
|  |  | Ocean Health Index: Clean Waters score (worst 0-100 best) |
|  |  | Fish caught from overexploited or collapsed stocks (% of total catch) |
|  |  | Fish caught by trawling or dredging (%) |
|  | * | Fish caught that are then discarded (%) |