STAT20029 Module 1: Introduction to statistics


Introduction

This Study Guide serves two key purposes. Firstly, it explains the statistical techniques and concepts covered in this unit. Not everything in the textbook is in the syllabus: only material discussed or referred to in the Study Guide (and in lectures) is in the syllabus and examinable. Secondly, it complements the textbook. Some information in the Study Guide does not appear in the textbook, and some information in the textbook does not appear in the Study Guide, so it is essential that students read both the Study Guide and the textbook.

The Study Guide should be your first source of reference (after the unit website) since it provides details of what to read from the textbook. It is recommended that students complete all the required readings from the textbook each week and attempt the review questions. At the very least, the recommended problems should be attempted, but it is strongly suggested that students also complete a selection from the additional problems list. The lecture notes are prepared to help you understand the concepts; they will also guide you as to which topics to emphasise.

As with many subjects, this unit is one where understanding comes from practising the concepts through solving problems. Note that because this is a postgraduate unit, a greater depth of understanding is required than in undergraduate units. You will be tested not just on your recall of the study material, but also on the application and interpretation of the techniques. It is expected that you will learn the statistical techniques well enough to use them in real-life situations, analysing and explaining results, and then making sound business and financial decisions.

The importance of solving problems in learning statistics cannot be overemphasised. It is like learning to drive: you can read all the theories about driving, but you cannot pass a driving test unless you actually drive a car. Similarly, you can read and understand all the theories of statistics, but you will find it very difficult to answer the questions in the final exam unless you practise solving many problems. Discussing and solving problems in groups can be very useful. Distance students are expected to take advantage of the online Discussion Forum for this purpose.

This week we learn some of the basic terminology of statistics and consider issues relating to surveys. For those students who are not familiar with Microsoft Excel, this is also a time to learn some of the basics of this package so that in weeks to come, when it is used to perform statistical analyses, you will feel comfortable working with it. This basic understanding of Excel will be assumed knowledge for the remainder of the course, but please do watch the videos referred to in the Study Guide.

Objectives

On completion of this module you should be able to:

define and understand ‘statistics’ as a tool in business analytics,
explain some basic statistical concepts such as sample, sampling frame, population, statistic, parameter, descriptive statistics, and inferential statistics,
define data types and measurement scales - nominal, ordinal, interval and ratio,
understand the difference between statistics and data mining,
consider the basics of questionnaire design and
utilise basic Excel functions.

Readings: Suggested readings are boxed to the right. If there is any discrepancy between the textbook and the Study Guide, follow the Study Guide (and lecture notes) and not the textbook.

What is statistics?

The term statistics has a variety of interpretations in common use today. When many people think of statistics, they think about quoting figures. A common joke says that ‘80% of statistics are made up on the spot’. Although this may not be strictly true, it does give an indication of what the general public really thinks statistics is! Many people believe that statistics is just about quoting batting averages in cricket or points scored in footy matches. The media often use the word ‘statistic’ to refer to deaths. In that sense, becoming a statistic is definitely not a desirable outcome! Given that there are so many different ways this term is used, it is important to understand what statistics is and why students would need to study a course in statistics!

We’ll begin this module by defining the term statistics as we’ll be using it in this course. Like any field of study, statistics has its own language and terminology, so we’ll start with some basic definitions. These will be terms we’ll use frequently throughout the Term. We’ll also look at sampling and data collection and briefly explore good questionnaire design.

Statistics involves planning, collecting and analysing data, and reporting and interpreting results. It lets us take raw data and derive information that will help in informed decision making. Studying statistics means learning how to use and interpret statistical techniques, always keeping that goal of decision making in mind. Although statistics is often considered a branch of mathematics, studying statistics does not mean simply doing mathematical calculations; it also requires a thoughtful justification of the techniques chosen, careful interpretation and application of results, and consideration of any ethical issues involved. Given that complexity, it is important that statisticians are able to translate complex statistical analyses into a format that someone without statistical training can understand; they need to be able to do all the calculations and then summarise the result in simple terms. The goal of this course is that students gain a level of competency in all these areas. An important difference between mathematics and statistics is that statistics is designed specifically to deal with uncertainty, whereas mathematical calculation on its own does not account for it.

Statistics is often divided into three main areas: descriptive statistics, probability and inferential statistics. Descriptive statistics involves collecting, summarising and describing data sets; it is exploratory in nature, giving an overview of the main characteristics of the data set. Probability provides the tools for measuring uncertainty that underpin the analysis. Inferential statistics means using sample data to estimate characteristics of the population, discovering patterns and making inferences about the population (or the future). We will be exploring these areas in this course. In many cases, real-world problems require the use of all branches of statistics.

Some terminology

Element—an object on which a measurement is made. For example, a registered Australian voter or a student at CQU.

Population—the set of all possible elements that could be observed. For example, all the registered voters in Australia or all students enrolled at CQU.

Sample—a selected portion of the population. It is a collection of sampling units drawn from a sampling frame (see below for definitions of sampling units and frames). A sample is drawn and examined when it would be too expensive or time consuming to look at every single element in the population.

Parameter—a characteristic of the population, for example, the average age of all students at CQU.

Statistic—a characteristic of the sample (that is used to estimate a parameter). For example if we took a random sample of 100 CQU students and found the average age of this sample, that would be a statistic, and we could use that to estimate the average age of all students at CQU. Similarly, pollsters often take a sample of Australian voters, and ask them their preferred prime minister. The results from this sample are used to tell us something about the parameter (the preferred prime minister of all registered Australian voters).

Sampling unit—non-overlapping collections of elements from a population. Ideally the sampling units are the same as the elements, but sometimes it is cheaper to sample groups. For example, sampling households instead of individual voters might be more convenient and cheaper. Care needs to be taken that sampling larger units (rather than individual elements) does not lead to bias in the results.

Frame (or sampling frame)—a list of sampling units. For example, the electoral roll is a list of all registered Australian voters and the student records system contains a list of all students enrolled at CQU. The sample is therefore drawn by selecting sampling units from this sampling frame. The sampling frame may not contain the entire population, for example, some new students may not yet appear in the student records system.

Census—the measurement or observation of all possible elements from the population. In other words, the sample contains the entire population. For example, every five years the Australian Bureau of Statistics conducts a census of all people in Australia.

Variable—a characteristic of an item or an individual whose value varies from individual to individual.
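The distinction between a parameter and a statistic defined above can be illustrated with a small simulation. This is only a sketch: the population of student ages is randomly generated (it is not real CQU data), and the names population_ages and sample_ages are chosen purely for illustration.

```python
import random
import statistics

# Hypothetical population: ages of 5,000 students (simulated, not real data).
rng = random.Random(42)
population_ages = [rng.randint(18, 60) for _ in range(5000)]

# Parameter: a characteristic of the whole population.
parameter_mean = statistics.mean(population_ages)

# Statistic: the same characteristic computed from a random sample of 100.
sample_ages = rng.sample(population_ages, 100)
statistic_mean = statistics.mean(sample_ages)

print(f"Population mean age (parameter): {parameter_mean:.2f}")
print(f"Sample mean age (statistic):     {statistic_mean:.2f}")
```

In practice the parameter is usually unknown, which is why the statistic is used to estimate it; here both can be printed only because the whole population was simulated.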

Accurate data collection

Good quality data is essential for effective decision making in the business environment. Data is obtained from either:

a primary source where the data collector analyses the data. For example, a company analyses information from their own client database.
a secondary source where the data is collected by someone else and provided to you for analysis. For example, data is collected by the Australian Bureau of Statistics who then make it available for download (sometimes for a fee).

Some primary sources

An experiment

With an experiment, the effects of various treatments are compared in a controlled setting. For example, if two brands of air-bags were tested in new cars, crash test dummies might be placed in cars which have been fitted with the different brands, and then used to measure the potential damage to car occupants. An experiment normally requires careful experimental design and often advanced statistical techniques. It usually involves a number of repetitions to ensure that results accurately reflect the population. For example, testing the air-bags in only one type of car is not likely to give accurate results—that car may have been special or different in some way which would affect the outcome.

Personal interviews

With personal interviews, an interviewer asks questions of the respondent and notes the responses. Because of the personal interaction, there is usually a good response rate. The interviewer can also get extra (non-verbal) information by noting such things as body language. They can ensure that any misunderstandings of the questions are minimised. To properly conduct a personal interview requires training in interview techniques. An untrained interviewer can easily lead the respondent to a certain response (either intentionally or unintentionally). Questions need to be phrased in a neutral manner and the interviewer must be careful that their body language and voice intonation do not lead the respondent to certain answers. When noting the responses, the interviewer could easily make (minor) errors, but this can be minimised by recording the sessions electronically and reviewing them at a later time.

Telephone interviews

Telephone interviews are similar to personal interviews, but are usually cheaper to conduct (since the interviewer is not required to travel). However, the sampling frame may not accurately reflect the population, since not everyone has a telephone and not everyone will be available when phone calls are made. In addition, the respondent’s body language cannot be noted. Telephone interviews have a reasonably high non-response rate. Often people find the phone calls annoying or intrusive and do not wish to be involved. They may also have enlisted in the ‘do not call’ register, and therefore not be contactable. This may lead to a sample which poorly reflects the population.

Self-administered questionnaires (paper or web-based)

Self-administered questionnaires are very cheap to administer and are often the first method used by people who are not trained in statistics. They can be physically handed to respondents, mailed or emailed out, or made available via a website. Usually they have a low response rate and so follow up is often required (for example reminder phone calls, emails or letters asking that people complete and return their questionnaires). Often an incentive (such as a ‘free gift’) can be used to encourage participation, but care needs to be taken in selecting the form of this gift so that it does not bias the results.

Self-administered questionnaires tend to attract responses mainly from those who have very strong feelings (either positive or negative) on the issue being surveyed. Web-based surveys are particularly prone to receiving only extreme responses. The sample is self-selecting and so it is highly unlikely to reflect the population accurately. Because of these problems, self-administered questionnaires need to be very carefully designed to encourage participation by everyone in the identified sample group and to avoid leading or ambiguous questions. See the discussion on non-probability samples below for more information.

Direct observation

A person counts events as they occur (for example, counting cars as they cross a bridge to get an idea of traffic flow on the bridge). Electronic equipment is sometimes used to measure the events.

Focus groups

Focus groups are a form of direct observation that is often used in market research. These use open-ended questions and allow time for discussion of the issues raised. A focus group has a moderator who leads the discussion. Other group studies include brainstorming, the Delphi Method and the nominal-group technique (not covered in this course).

Survey errors

Coverage error

It is very important that the sampling frame is constructed in such a way that it accurately reflects the population. If this does not occur, such as if certain groups of elements are excluded, the result is coverage error. Coverage error creates selection bias, since certain kinds of elements will not be included in the sample. The sample would then estimate characteristics of the (faulty) sampling frame, rather than the population.

Nonresponse error

Nonresponse error occurs when the survey fails to collect data from all elements in the sample. This error can arise because not everyone is willing to respond to a survey (as discussed under self-administered questionnaires above). The result of nonresponse error is nonresponse bias in the survey results. Following up non-responding elements helps to minimise this error and bias.

Sampling error

When taking a sample, the goal is usually that it be drawn from a population in such a way that all the elements have an equal chance of being selected. Because this selection is random, each time a sample is drawn it is likely to contain different individual elements. Sampling error is due to this chance difference from sample to sample. Although sampling error can be reduced by increasing the sample size, this obviously comes with an increase in cost (in money and/or time).
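The effect of sample size on sampling error can be seen by drawing many samples and looking at how much the sample mean varies from one sample to the next. The sketch below uses a simulated population; the function name spread_of_sample_means and the sample sizes 25 and 400 are illustrative assumptions, not values from the textbook.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical population of 20,000 measurements (simulated).
population = [rng.gauss(50, 10) for _ in range(20000)]

def spread_of_sample_means(n, repeats=200):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(25)
spread_large = spread_of_sample_means(400)

print(f"Spread of sample means, n = 25:  {spread_small:.2f}")
print(f"Spread of sample means, n = 400: {spread_large:.2f}")
```

The larger samples give sample means that cluster more tightly together, which is the reduction in sampling error described above, bought at the cost of measuring more elements.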

Measurement error

Measurement errors are problems with the recorded responses. These can be the result of ambiguous wording of questions, errors by the respondent or the ‘halo effect’. The halo effect is when a respondent feels the need to please the interviewer. Proper interviewing technique (as discussed in the section on personal interviews above) can help alleviate this problem.

Ethical issues

The various survey errors only become ethical problems if the interviewer or survey designer intentionally causes these to occur. This might mean intentionally excluding a group of individuals from a survey (coverage error/selection bias) where it is believed they would respond in a way contrary to the purposes of the survey. When nonprobability samples (discussed shortly) are used to make inferences about the population, this also creates ethical problems.

Sampling methods

Non-probability samples

With non-probability samples, elements are chosen without considering the probability of occurrence. In many cases, the elements in these samples are self-selected (such as is the case with web-based surveys). Although non-probability samples have advantages such as being quick and cheap to conduct, most statistical techniques rely on probability samples and so cannot be used on data resulting from a non-probability sample. In addition, non-probability samples introduce selection bias into results. Examples of non-probability samples are convenience sampling, judgement sampling and quota sampling.

Probability samples

When a probability sample is drawn, elements are chosen based on knowledge of the probability of occurrence. Probability samples enable the inference of unbiased generalisations about the population. Although it is often difficult to achieve a true probability sample (i.e. it is difficult to ensure absolute randomness in your selection), this should always be the goal in sampling. We will be considering four kinds of probability samples in this course: simple random sampling, systematic sampling, stratified sampling and cluster sampling.

Sampling with replacement—randomly selected elements are returned to the frame after they are selected (and so could potentially be re-selected).

Sampling without replacement—randomly selected elements are not returned to the frame after selection.
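The two definitions above can be contrasted with Python's standard random module, where choices draws with replacement and sample draws without replacement. The frame of ten numbered units is a made-up example.

```python
import random

rng = random.Random(7)
frame = list(range(1, 11))  # hypothetical frame of 10 numbered sampling units

# With replacement: a unit goes back into the frame and may be drawn again.
with_replacement = rng.choices(frame, k=10)

# Without replacement: each unit can be selected at most once.
without_replacement = rng.sample(frame, k=10)

print("With replacement:   ", with_replacement)
print("Without replacement:", without_replacement)
```

Drawing ten units without replacement from a frame of ten must return every unit exactly once, whereas the draw with replacement will usually contain repeats.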

Simple random sample

A simple random sample (SRS) is one in which every item in the frame has the same chance of being selected. In addition, every sample of a given size has the same chance of being selected as every other sample of that size. We use n to represent the number of elements in the sample and N for the number of units in the frame (or population). As an example of SRS, write each person’s name on a separate sheet of paper, making sure that all sheets are of equal size and weight. Put all the names in a fishbowl, stir, pick a name while blindfolded, and repeat. If you pick 10 names this way, stirring and shaking the fishbowl after each draw, you have a simple random sample of size 10 from the population.

Random number tables

Random number tables usually contain the random digits 0 to 9. Numbers are read from these tables to select sample items (which must be numbered). Table 7.1 on page 198 of the textbook shows a small portion of a random number table within the range 00001 to 99999. Excel can also generate random numbers, either fractions between 0 and 1 or numbers between two values that you provide (assuming that you have numbered all your elements). Often a numbering system already exists for the elements in the sampling frame; for example, we might use invoice numbers or student ID numbers.

(Pseudo) random number generators

Random number generators are similar to random number tables. Most statistical software and many calculators will generate random numbers which can be used in the same way as those taken from a table. Random number generators are often prefaced with the word pseudo because they rely on deterministic algorithms, so the numbers they produce are not truly random (although modern algorithms produce sequences that are random enough for sampling purposes).
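The "pseudo" nature of these generators is easy to demonstrate: starting a generator from the same seed reproduces exactly the same "random" selection. The sketch below draws a simple random sample from a hypothetical frame numbered 1 to 200; the function name simple_random_sample and the seed value are assumptions made for illustration.

```python
import random

frame_ids = list(range(1, 201))  # hypothetical frame: units numbered 1 to 200

def simple_random_sample(ids, n, seed):
    """Draw n distinct IDs using a generator started from the given seed."""
    rng = random.Random(seed)
    return rng.sample(ids, n)

first = simple_random_sample(frame_ids, 20, seed=2024)
again = simple_random_sample(frame_ids, 20, seed=2024)

print(sorted(first))
print("Same seed gives the same sample:", first == again)
```

Recording the seed is a useful habit, since it allows someone else to reproduce the sample selection later.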

Systematic sample

Given N individuals in the frame and n in the sample, the frame is partitioned into k groups where $k = N/n$. An item is chosen randomly from the first k items and then every kth item after this is sampled. This method is frequently used in the production of goods and in street polls (among other places) where a simple random sample would be very difficult to draw (since we often won’t know how many goods are going to be produced until after it has happened and people on the street are constantly moving).

For example, if 200 customers (N) are in a sampling frame and a sample of 20 (n) is required, $k = N/n = 200/20 = 10$. An item is randomly chosen from the first 10 (say item 7) and then every 10th item is chosen after that. The sample would be items 7, 17, 27, etc.
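The customer example above can be sketched in a few lines of Python. This is a minimal version that assumes N is an exact multiple of n; the function name systematic_sample and the customer numbering are illustrative.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start among the first k items, then take every kth item."""
    k = len(frame) // n        # sampling interval k = N / n
    start = rng.randrange(k)   # 0-based position within the first k items
    return frame[start::k][:n]

customers = list(range(1, 201))  # hypothetical frame: customers numbered 1 to 200
sample = systematic_sample(customers, 20, random.Random(3))

print(sample)
```

Whatever the random start, consecutive selected customers are always 10 apart, matching the pattern 7, 17, 27, and so on described in the example.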

Stratified sample

The frame is divided into strata, and a simple random sample is drawn from within each stratum. The elements in a stratum share a similar characteristic (e.g., we might separate people into three groups: high, medium and low income earners). Stratified sampling is used when we have some knowledge about the composition of the population and we can divide the population into different groups.
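A stratified sample can be sketched as follows. The example assumes an equal number of units is drawn from every stratum, which is only one possible allocation; the frame, the income labels and the function name stratified_sample are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Group the frame into strata, then draw a simple random sample from each."""
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Hypothetical frame: (person ID, income group) pairs.
people = [(i, ("low", "medium", "high")[i % 3]) for i in range(300)]
sample = stratified_sample(people, lambda person: person[1], 5, random.Random(11))

for group, members in sample.items():
    print(group, [person_id for person_id, _ in members])
```

Because a separate sample is drawn inside each income group, every group is guaranteed to appear in the final sample.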

Cluster sample

The frame is divided into clusters so that each cluster is representative of the population. A random sample of clusters is drawn and every item within the chosen clusters is studied (a census is conducted within the selected clusters). Cluster sampling will bring savings in travelling time, but it is difficult to ensure that clusters are truly representative of the population (e.g., the suburbs of a city might be a logical choice for clusters, but they are not normally all representative of the entire city).
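By contrast with stratified sampling, a cluster sample selects whole clusters at random and then studies every item in them. The sketch below uses made-up suburbs and households; the function name cluster_sample is an assumption for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose clusters, then take every unit in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical frame: 6 suburbs (clusters), each containing 4 households.
suburbs = {
    f"suburb_{s}": [f"household_{s}_{h}" for h in range(4)] for s in range(6)
}
chosen, households = cluster_sample(suburbs, 2, random.Random(5))

print("Chosen suburbs:", chosen)
print("Households surveyed:", households)
```

Only the chosen suburbs need to be visited, which is where the saving in travelling time comes from.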

Types of variables

Variables are described as either qualitative (categorical) or quantitative (numerical):

Qualitative (categorical) random variables—responses are categories such as: yes or no, male or female, low, medium or high income earner, etc.

Quantitative (numerical) random variables—responses are numerical such as: height, weight, time, distance.

Quantitative variables can be either discrete or continuous:

Discrete random variables are numerical responses that result from counting, so they usually take whole-number values. Examples include the number of students in a classroom, the number of mobile phone models marketed by a company, etc. (A discrete variable can take only a finite number of values within a range. Money is often considered discrete because you can have $1.05 or $1.06 but you cannot have $1.05678, as no denomination smaller than a cent exists.)

Continuous random variables arise from a measuring process. Examples include the speed of a car travelling along a highway, your weight, etc. (Your weight can be 70 kg, 70.12 kg or 70.1201234 kg depending on the accuracy of the scale; there is no limit to the number of digits you can have after the decimal point. Money is an interesting example: it can be treated as continuous when many decimal places are used, such as a petrol price of $1.299 or a share price of $2.3456.)

Measurement scales

Qualitative variables can be:

Nominal—there is no particular order to the categories (e.g., male/female, yes/no, names of people)

Ordinal—there is an order to the categories (e.g., first year at uni, second year at uni, third year at uni, etc.)

Quantitative variables can be:

Interval—there is no true zero point (e.g., temperature—degrees Celsius & Fahrenheit have different zero points. Children’s dress sizes of 0 or 00 do not mean the cloth vanishes.).

Ratio—there is a true zero point (e.g., a person’s height, weight, etc. A height of 180 cm is double a height of 90 cm, so the ratio of the two heights is meaningful. A shoe size of 10, however, is not double a shoe size of 5, so the ratio of shoe sizes is not meaningful and shoe size is measured on an interval scale.).
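One way to check whether a ratio is meaningful is to see whether it survives a change of units. In the sketch below (the temperatures and heights are arbitrary illustrative values), the ratio of two heights is the same in centimetres and inches, but the ratio of two temperatures changes between Celsius and Fahrenheit because the two scales place zero at different points.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

def cm_to_inches(cm):
    return cm / 2.54

# Interval scale: the ratio depends on the units chosen.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50

# Ratio scale: the ratio is the same whatever the units.
ratio_cm = 180 / 90
ratio_inches = cm_to_inches(180) / cm_to_inches(90)

print(f"20 C vs 10 C:     {ratio_celsius:.2f} in Celsius, {ratio_fahrenheit:.2f} in Fahrenheit")
print(f"180 cm vs 90 cm:  {ratio_cm:.2f} in cm, {ratio_inches:.2f} in inches")
```

So 20 degrees Celsius cannot sensibly be called "twice as hot" as 10 degrees Celsius, while 180 cm is twice as tall as 90 cm in any unit of length.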

Example 1–1

For each of the following random variables, determine whether the variable is qualitative or quantitative. If the variable is quantitative, determine whether the variable of interest is discrete or continuous. In addition, determine the level of measurement.

(a) Number of mobile phones per household

(b) Mobile phone service provider

(d) Length (in minutes) of longest call made during a month

(e) Colour of mobile phone

(f) Monthly charge (in dollars and cents) for calls made

(g) Ownership of a car charge kit

(h) Number of calls made per month

(i) Whether there is a telephone line connected to a computer modem in the household

(j) Whether there is a fax machine in the household.

(k) The size of dress you need for a child.

Solution 1–1

(a) Quantitative, discrete (since we talk about whole mobile phones), ratio (since there is a true zero point).

(b) Qualitative, nominal (there will be no particular order to the service providers).

(d) Quantitative, continuous (time is measured on a continuous scale, although here we are rounding this measurement to the nearest minute and so could argue that it becomes discrete), ratio (true zero point, since a length of zero minutes means no call time at all).

(e) Qualitative (categories will be names of colours), nominal (there is no particular ordering to colour names).

(f) Quantitative, continuous (money is measured on a continuous scale, although here we are rounding to the nearest cent), ratio (true zero point).

(g) Qualitative (effectively only two options: own a car charge kit, don’t own a car charge kit), nominal (there is no particular order to own or don’t own).

(h) Quantitative, discrete (counting whole numbers of calls), ratio (true zero point).

(i) Qualitative (connected or not connected), nominal (no particular order to these options).

(j) Qualitative (fax machine present or not present), nominal (no particular order to these options).

(k) Quantitative, discrete, interval (since there is no true zero, dress sizes can be 0, 00 or 000).

Questionnaire design

Questionnaire design is an area where a large amount of information and guidance is available; however, we will only briefly consider it in this course. The way that survey questions are phrased determines whether the variable is quantitative or qualitative and determines the level of measurement. This choice is based on the kind of information that is required.

Quantitative or numerical questions are relatively easy to design and the results of these are easy to analyse. See Example 1–2 (c) below for some examples. It should always be made clear what the units of measurement are (for example, dollars and cents, hours, kilograms, months etc.). Listing categories (for example, age categories of ‘under 20’, ‘20 to 29’, ‘30 to 39’, etc.) transforms a quantitative variable into a qualitative variable. Although this often makes people feel more comfortable in responding (people might not want to reveal their exact age, income, etc.) it also results in poorer quality information and makes analysis of the data much more difficult. Because of this, where possible, categorising quantitative variables should be avoided.
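The loss of information caused by categorising a quantitative variable can be seen in a small sketch. The list of ages is invented, and the function name age_category is an assumption; the category labels follow the 'under 20', '20 to 29' pattern mentioned above.

```python
import statistics
from collections import Counter

ages = [19, 23, 27, 31, 34, 38, 45]  # hypothetical exact responses

def age_category(age):
    """Map an exact age to a ten-year category label."""
    if age < 20:
        return "under 20"
    lower = (age // 10) * 10
    return f"{lower} to {lower + 9}"

categories = [age_category(age) for age in ages]
mean_age = statistics.mean(ages)

print("Exact ages:", ages, "mean =", mean_age)
print("Category counts:", dict(Counter(categories)))
```

The exact ages allow a mean to be calculated directly, whereas the category counts alone only say how many respondents fell into each band.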

Qualitative or categorical questions require a little more thought. The most common kinds of qualitative questions are dichotomous, multiple choice, response scales and open-ended questions.

Dichotomous questions have only two possible outcomes. Examples are true or false, yes or no, male or female, etc. These kinds of questions need to be used carefully, as in the wrong situation they can oversimplify a problem. For example, ‘Do you believe that overseas earnings should be taxed?’ is a question which requires a yes or no response, but respondents might want to give a qualified yes or no (yes under certain circumstances or no under certain other circumstances). An example of a dichotomous question is:

What is your gender (please circle one)?

Male Female

Multiple choice questions are appropriate when a finite list of answers exists. Care needs to be taken that options are mutually exclusive (there is no overlap between categories) and exhaustive (every possible option is covered). An example of a multiple choice question is:

Which one of the following best describes your primary field of employment (please circle one)?

A. Medical or health profession

B. Education

C. Business or government

D. Information technology

E. Scientific or technical

F. Other

Response scales allow the respondent to indicate their position from a range of options. A couple of examples are:

On the following scale, circle the one value which best describes how you rate the ease of use of KaddStat.

Extremely easy						Extremely difficult
1	2	3	4	5	6	7

Please tick the one option below that best summarises your response to the following statement:

All major shopping centres should trade on Sundays.

Strongly disagree

Disagree

No opinion

Agree

Strongly agree

Open-ended questions produce results that are much more difficult to analyse and so should be used sparingly and only in certain circumstances. They have the benefit of allowing the respondent complete freedom in how they answer. Because of this, they are often used as a final question on a survey to allow the respondent to say anything they have not yet had the opportunity to express. An example of an open-ended question is:

What is your opinion of the current fringe benefit tax laws?

Guidelines for designing questions

Always keep the following guidelines in mind when designing questions:

keep questions simple
phrase questions so that their meaning is clear to every respondent (avoid ambiguous questions)
avoid leading questions
ensure spelling and grammar are correct
aim for an easy to read and attractive layout
for qualitative questions, offer an adequate choice of responses (mutually exclusive and exhaustive)
keep the questionnaire as short as possible
order questions carefully (early questions should be simple and build rapport with the respondent, group questions by topics and consider how early questions may impact on the thoughts or feelings of the respondent in answering later questions)
keep questions pertinent to the objectives of the survey and
pre-test the questionnaire on a small group of people (a pilot study) to observe any errors or shortcomings.

Example 1–2

The manager of an electronics company is interested in determining whether customers who purchased a digital camera over the past 12 months were satisfied with their purchase. The manager is planning to survey these customers using the contact information given on warranty cards submitted after the purchases.

(a) Describe the population and frame. What differences are there between the population and the frame? How might these differences affect the results?

(b) Develop three qualitative questions that you feel would be appropriate for this survey.

(d) How could a simple random sample of warranty cards be selected?

(e) If the manager wanted to select a sample of warranty cards for each brand of digital camera sold, how should the sample be selected? Explain.

Discussion points

Discussion point 1–1

Examine and complete Review Problems 1.1 to 1.3 from your textbook (p. 14). Where does statistics sit in these contexts?

Discussion point 1–2

Examine and complete Problem 1.4 from your textbook (p. 14). Compare your examples with those of other students. What conclusion might you reach?

Discussion point 1–3

Examine and complete Problems 1.8 and 1.9 from the textbook (pp. 14-15). Compare your answers with those of other students. What would the answers be if the question asked only for the measurement scale: nominal, ordinal, interval or ratio?

Additional readings

If you intend to use the Analysis ToolPak and/or KaddStat on your home computer for this unit, please ensure that you install them by following the instructions. Note, however, that KaddStat and other add-ons will not be covered in lectures or tutorials, and your knowledge of them will not be examined.

Summary

Now that you have completed this module, turn back to the objectives at the beginning of the module. Have you achieved these objectives?

Ensure that you attempt the recommended problems in the list of review questions below and at least a sample of problems from the optional list. This will help you to identify any areas of difficulty you have in achieving the module’s objectives.

It is often said that you need to solve at least 10 questions on a given concept in statistics to understand the concept properly.

Review questions