Statistical Thinking for Decision Making: Revealing Facts from Figures. By Dr. Hossein Arsham
source ref: stat.htm
Chapter 1: Towards Statistical Thinking for Decision Making
Chapter 2: Descriptive Sampling Data Analysis
Chapter 3: Probability for Statistical Inference and Modeling
Chapter 4: Necessary Conditions for Statistical Decision Making
Chapter 5: Estimators and Their Qualities
Chapter 6: Hypothesis Testing: Rejecting a Claim
Chapter 7: Hypotheses Testing for Means and Proportions
Chapter 8: Applications of the Chi-square Statistic
Chapter 9: Regression Modeling and Analysis
Chapter 10: Unified Views of Statistical Decision Tools
Chapter 11: Visualization of Statistics: Analytic-Geometry & Statistics
Chapter 12: Index Numbers with Applications

Chapter 1: Towards Statistical Thinking for Decision Making



Introduction to Statistical Thinking for Decision Making

This site builds up the basic ideas of business statistics systematically and correctly. It is a combination of lectures and computer-based practicals, whereby theory is firmly placed into practice. It provides an introduction to techniques for summarizing and presenting data, estimation, confidence intervals and hypothesis testing. The presentation focuses more on conceptual understanding of key concepts and on statistical thinking, and less on formulas and calculations, which can now be left to PCs using accessible and user-friendly Statistical JavaScript Applets.

Today's good decisions are driven by data. In all aspects of our lives, and importantly in the business context, an amazing diversity of data is available for inspection and enlightenment. Moreover, business managers and professionals are increasingly encouraged to justify decisions on the basis of data.

Business managers need statistical model-based decision support systems. Statistical skills enable you to intelligently collect, analyze and interpret data relevant to your decision-making. Statistical concepts and statistical thinking enable you:

  • to solve problems in a diversity of contexts
  • to add substance to decisions.
This Web site is a course in statistics appreciation; i.e., its aim is for you to acquire a feel for the statistical way of thinking. Appreciation of statistics is wonderful: it makes what is excellent in statistical thinking belong to you as well. It is an introductory course in statistics that is designed to provide you with the basic concepts and methods of statistical analysis for processes and products. Materials in this Web site are tailored to meet your needs in making good decisions, and they encourage you to think statistically. The cardinal objective for this Web site is to increase the extent to which statistical thinking is embedded in managerial thinking for decision making under uncertainties.

In a competitive environment, business managers must design quality into products, and into the processes of making the products. They must facilitate a process of never-ending improvement at all stages of manufacturing and service. This is a strategy that employs statistical methods, particularly statistically designed experiments, and produces processes that provide high yield and products that seldom fail. Moreover, it facilitates development of robust products that are insensitive to changes in the environment and internal component variation. Carefully planned statistical studies remove hindrances to high quality and productivity at every stage of production. This saves time and money. It is well recognized that quality must be engineered into products as early as possible in the design process. One must know how to use carefully planned, cost-effective statistical experiments to improve, optimize and make robust products and processes.

Business Statistics is a science that assists you in making business decisions under uncertainties based on some numerical and measurable scales. Decision-making processes must be based on data, not on personal opinion or belief.

The Devil is in the Deviations: Variation is inevitable in life! Every process, every measurement, every sample has variation. Managers need to understand variation for two key reasons: first, so that they can lead others to apply statistical thinking in day-to-day activities, and second, so that they can apply the concept for the purpose of continuous improvement. This course will provide you with hands-on experience in using statistical thinking and techniques to make educated decisions whenever you encounter variation in business data. You will learn techniques to intelligently assess and manage the risks inherent in decision-making. Therefore, remember that:

Just like weather, if you cannot control something, you should learn how to measure and analyze it, in order to predict it, effectively.

If you have taken statistics before, and have a feeling of inability to grasp concepts, it may be largely due to your former non-statistician instructors teaching statistics. Their deficiencies lead students to develop phobias for the sweet science of statistics. In this respect, Professor Herman Chernoff (1996) made the following remark:

"Since everybody in the world thinks he can teach statistics even though he does not know any, I shall put myself in the position of teaching biology even though I do not know any"

Inadequate statistical teaching during university education leads, even after graduation, to one or a combination of the following scenarios:

  1. In general, people do not like statistics and therefore they try to avoid it.
  2. There is pressure to produce scientific papers; however, the statistician is often confronted with "I need something quick."
  3. At many institutes in the world, there are only a few (mostly 1) statisticians, if any at all. This means that these people are extremely busy. As a result, they tend to advise simple and easy to apply techniques, or they will have to do it themselves.
  4. Communication between a statistician and decision-maker can be difficult. One speaks in statistical jargon; the other understands the monetary or utilitarian benefit of using the statistician's recommendations.

Plugging numbers into the formulas and crunching them have no value by themselves. You should continue to put effort into the concepts and concentrate on interpreting the results.

Even when you solve a small size problem by hand, I would like you to use the available computer software and Web-based computation to do the dirty work for you.

You must be able to read the logical secret in any formula, not memorize it. For example, in computing the variance, consider its formula. Instead of memorizing, you should start with some whys:

i. Why do we square the deviations from the mean?
Because if we add up all the deviations, we always get zero. To deal with this problem, we square the deviations. Why not raise them to the power of four (three will not work, because cubes keep their sign, so negative and positive deviations would still offset each other)? Squaring does the trick; why should we make life more complicated than it is? Notice also that squaring magnifies the deviations; therefore it works to our advantage in measuring the quality of the data.

ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point to compute the total sum of squared deviations.

iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large the sample is, so we must bring in the sample size. That is, in general, larger samples have a larger sum of squared deviations from the mean. Why n-1 and not n? The reason is that when you divide by n-1, the sample variance is, on average, closer to the population variance than when you divide by n, which tends to underestimate it. Note that for a large sample size n (say over 30), it makes little practical difference whether you divide by n or n-1. The results are almost the same, and they are acceptable. The factor n-1 is what we call the "degrees of freedom".
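
The three questions above can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the numbers are hypothetical and chosen only so the arithmetic is easy to follow). It shows that the raw deviations sum to zero, and compares dividing the sum of squares by n with dividing by n-1.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]
n = len(data)
mean = sum(data) / n

# (i) The raw deviations from the mean cancel each other out.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# (ii) Summing the squared deviations gives the total sum of squares.
sum_of_squares = sum(d ** 2 for d in deviations)

# (iii) Divide by n or by n - 1 (the degrees of freedom).
variance_n = sum_of_squares / n
variance_n_minus_1 = sum_of_squares / (n - 1)

print(sum_of_deviations)         # 0.0
print(sum_of_squares)            # 66.0
print(variance_n)                # 13.2
print(variance_n_minus_1)        # 16.5

# The standard library uses the same two divisors:
# pvariance divides by n, variance divides by n - 1.
print(statistics.pvariance(data), statistics.variance(data))
```

The two results differ by the factor n/(n-1), which is 1.25 for this sample of five but only about 1.03 when n is 30, which is why the choice matters little for large samples.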

This example shows how to question statistical formulas, rather than memorizing them. In fact, when you try to understand the formulas, you do not need to remember them; they become part of the way you think. Clear thinking is always more important than the ability to do arithmetic.

When you look at a statistical formula, the formula should talk to you, just as a musician who looks at a sheet of music hears the music.

Computer-assisted learning: Computer-assisted learning provides you with "hands-on" experience which will enhance your understanding of the concepts and techniques covered in this site.

JavaScript, once an esoteric programming language for animating Web pages, is now a full-fledged platform for building the E-labs' learning objects with useful applications. Just as you used to do experiments in physics labs to learn physics, computer-assisted learning enables you to use any online interactive tool available on the Internet to perform experiments. The purpose is the same; i.e., to understand statistical concepts by using statistical applets which are entertaining and educating.

The appearance of computer software, JavaScript Applets, Statistical Demonstration Applets, and Online Computation are the most important events in the process of teaching and learning concepts in model-based, statistical decision making courses. These e-lab tools allow you to construct numerical examples to understand the concepts, and to find their significance for yourself.

Unfortunately, most classroom courses are not learning systems. The way the instructors attempt to help their students acquire skills and knowledge has absolutely nothing to do with the way students actually learn. Many instructors rely on lectures and tests, and memorization. All too often, they rely on "telling." No one remembers much that's taught by telling, and what's told doesn't translate into usable skills. Certainly, we learn by doing, failing, and practicing until we do it right. The computer assisted learning serves this purpose.

A course in appreciation of statistical thinking gives business professionals an edge. Professionals with strong quantitative skills are in demand. This phenomenon will grow as the impetus for data-based decisions strengthens and the amount and availability of data increases. The statistical toolkit can be developed and enhanced at all stages of a career. The decision-making process under uncertainty is largely based on the application of statistics for probability assessment of uncontrollable events (or factors), as well as risk assessment of your decision.

The main objective of this course is to learn statistical thinking; to put more emphasis on concepts and less on theory and recipes; and finally to foster active learning using useful and interesting Web sites. It is already a known fact that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let's be ahead of our time.

Further Readings:
Chernoff H., A Conversation With Herman Chernoff, Statistical Science, Vol. 11, No. 4, 335-350, 1996.
Churchman C., The Design of Inquiring Systems, Basic Books, New York, 1971. Early in the book he stated that knowledge could be considered as a collection of information, or as an activity, or as a potential. He also noted that knowledge resides in the user and not in the collection.
Rustagi M., et al. (eds.), Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, Academic Press, 1983.


The Birth of Probability and Statistics

The original idea of "statistics" was the collection of information about and for the "state". The word statistics derives directly, not from any classical Greek or Latin roots, but from the Italian word for state.

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, who was a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized this data in the form we call descriptive statistics, which was published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter he was elected as a member of the Royal Society. Thus, statistics has to borrow some concepts from sociology, such as the concept of Population. It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.

Probability has a much longer history. Probability is derived from the verb to probe, meaning to "find out" what is not too easily accessible or understandable. The word "proof" has the same origin; a proof provides the necessary details to understand what is claimed to be true.

Probability originated from the study of games of chance and gambling during the 16th century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre de Fermat in the seventeenth century. Currently, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.

New and ever growing diverse fields of human activities are using statistics; however, it seems that this field itself remains obscure to the public. Professor Bradley Efron expressed this fact nicely:

During the 20th Century statistical thinking and methodology have become the scientific framework for literally dozens of fields including education, agriculture, economics, biology, and medicine, and with increasing influence recently on the hard sciences such as astronomy, geology, and physics. In other words, we have grown from a small obscure field into a big obscure field.

Further Readings:
Daston L., Classical Probability in the Enlightenment, Princeton University Press, 1988.
The book points out that early Enlightenment thinkers could not face uncertainty. A mechanistic, deterministic machine, was the Enlightenment view of the world.
Gillies D., Philosophical Theories of Probability, Routledge, 2000. Covers the classical, logical, subjective, frequency, and propensity views.
Hacking I., The Emergence of Probability, Cambridge University Press, London, 1975. A philosophical study of early ideas about probability, induction and statistical inference.
Peters W., Counting for Something: Statistical Principles and Personalities, Springer, New York, 1987. It teaches the principles of applied economic and social statistics in a historical context. Featured topics include public opinion polls, industrial quality control, factor analysis, Bayesian methods, program evaluation, non-parametric and robust methods, and exploratory data analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900, Princeton University Press, 1986. The author states that statistics has become known in the twentieth century as the mathematical tool for analyzing experimental and observational data. Enshrined by public policy as the only reliable basis for judgments as to the efficacy of medical procedures or the safety of chemicals, and adopted by business for such uses as industrial quality control, it is evidently among the products of science whose influence on public and private life has been most pervasive. Statistical analysis has also come to be seen in many scientific disciplines as indispensable for drawing reliable conclusions from empirical results. This new field of mathematics found an extensive domain of applications.
Stigler S., The History of Statistics: The Measurement of Uncertainty Before 1900, U. of Chicago Press, 1990. It covers the people, ideas, and events underlying the birth and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New York, 1984.
This work provides the detailed lives and times of theorists whose work continues to shape much of modern statistics.


Statistical Modeling for Decision-Making under Uncertainties:
From Data to the Instrumental Knowledge

In this diverse world of ours, no two things are exactly the same. A statistician is interested in both the differences and the similarities; i.e., both departures and patterns.

The actuarial tables published by insurance companies reflect their statistical analysis of the average life expectancy of men and women at any given age. From these numbers, the insurance companies then calculate the appropriate premiums for a particular individual to purchase a given amount of insurance.

Exploratory analysis of data makes use of numerical and graphical techniques to study patterns and departures from patterns. The widely used descriptive statistical techniques are: Frequency Distribution; Histograms; Boxplot; Scattergrams and Error Bar plots; and diagnostic plots.

In examining distribution of data, you should be able to detect important characteristics, such as shape, location, variability, and unusual values. From careful observations of patterns in data, you can generate conjectures about relationships among variables. The notion of how one variable may be associated with another permeates almost all of statistics, from simple comparisons of proportions through linear regression. The difference between association and causation must accompany this conceptual development.

Data must be collected according to a well-developed plan if valid information on a conjecture is to be obtained. The plan must identify important variables related to the conjecture, and specify how they are to be measured. From the data collection plan, a statistical model can be formulated from which inferences can be drawn.

As an example of statistical modeling with managerial implications, such as "what-if" analysis, consider regression analysis. Regression analysis is a powerful technique for studying the relationship between dependent variables (i.e., output, performance measure) and independent variables (i.e., inputs, factors, decision variables). Summarizing the relationships among the variables by the most appropriate equation (i.e., modeling) allows us to predict or identify the most influential factors and study their impacts on the output for any changes in their current values.
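
As a rough illustration of the "what-if" idea, the sketch below fits a straight line by ordinary least squares to a small made-up data set and then evaluates the fitted equation at a new input value. The function names `fit_line` and `predict`, the variable names, and the numbers are all hypothetical; simple least squares is just one common way to obtain "the most appropriate equation".

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at a (possibly new) input value."""
    return intercept + slope * x


# Hypothetical data: an input (e.g. advertising spend) and an output (e.g. sales).
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(inputs, outputs)
print(intercept, slope)   # approximately 2.2 and 0.6

# "What-if" question: what output does the model suggest if the input rises to 6?
what_if = predict(intercept, slope, 6.0)
print(what_if)            # approximately 5.8
```

The slope estimates how much the output changes per unit change in the input, which is the quantity a manager would look at when asking about the impact of changing a decision variable. Predictions far outside the observed range of inputs should be treated with caution.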

Marketing managers, for example, frequently face the question: What Sample Size Do I Need? This is an important and common statistical decision, which should be given due consideration, since an inadequate sample size invariably leads to wasted resources. The sample size determination section provides a practical solution to this risky decision.

Statistical models are currently used in various fields of business and science. However, the terminology differs from field to field. For example, the fitting of models to data is variously called calibration, history matching, and data assimilation; all of these are synonymous with parameter estimation.

Your organization database contains a wealth of information, yet the decision technology group members tap a fraction of it. Employees waste time scouring multiple sources for a database. The decision-makers are frustrated because they cannot get business-critical data exactly when they need it. Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all.

Knowledge is what we know. Information is the communication of knowledge. In every knowledge exchange, there is a sender and a receiver. The sender makes common what is private, does the informing, the communicating. Information can be classified as explicit and tacit forms. The explicit information can be explained in structured form, while tacit information is inconsistent and fuzzy to explain.

Data is known to be crude information and not knowledge by itself. The sequence from data to knowledge is: from Data to Information, from Information to Facts, and finally, from Facts to Knowledge. Data becomes information, when it becomes relevant to your decision problem. Information becomes fact, when the data can support it. Facts are what the data reveals. However the decisive instrumental knowledge is expressed together with some statistical degree of confidence.

Fact becomes knowledge, when it is used in the successful completion of a decision process. Knowledge needs wisdom. Wisdom is the power to put our time and our knowledge to the proper use. Once you have a massive amount of facts integrated as knowledge, then your mind will be superhuman in the same sense that mankind with writing is superhuman compared to mankind before writing. The following figure illustrates the statistical thinking process based on data in constructing statistical models for decision making under uncertainties.

From Data to Knowledge

The above figure depicts the fact that as the exactness of a statistical model increases, the level of improvements in decision-making increases. That's why we need Business Statistics. Statistics arose from the need to place knowledge on a systematic evidence base. This required a study of the laws of probability, the development of measures of data properties and relationships, and so on.

Statistical inference aims at determining whether any statistical significance can be attached to results after due allowance is made for random variation as a source of error. Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance.

The purpose of statistical thinking is to get acquainted with the statistical techniques, to be able to execute procedures using available JavaScript Applets, and to be conscious of the conditions and limitations of various techniques.


Statistical Decision-Making Process

Unlike the deterministic decision-making process, in decision making under pure uncertainty the variables are often more numerous and more difficult to measure and control. However, the steps are the same. They are:
  1. Simplification
  2. Building a decision model
  3. Testing the model
  4. Using the model to find the solution. A decision model has the following characteristics:
    • It is a simplified representation of the actual situation.
    • It need not be complete or exact in all respects.
    • It concentrates on the most essential relationships and ignores the less essential ones.
    • It is more easily understood than the empirical situation, and hence permits the problem to be solved more readily with minimum time and effort.
    • It can be used again and again for similar problems or can be modified.

Fortunately the probabilistic and statistical methods for analysis and decision making under uncertainty are more numerous and powerful today than ever before. The computer makes possible many practical applications. A few examples of business applications are the following:

  • An auditor can use random sampling techniques to audit the accounts receivable for clients.
  • A plant manager can use statistical quality control techniques to assure the quality of his production with a minimum of testing or inspection.
  • A financial analyst may use regression and correlation to help understand the relationship of a financial ratio to a set of other variables in business.
  • A market researcher may use tests of significance to accept or reject hypotheses about a group of buyers to which the firm wishes to sell a particular product.
  • A sales manager may use statistical techniques to forecast sales for the coming year.

Questions Concerning the Statistical Decision-Making Process:

  1. Objectives or Hypotheses: What are the objectives of the study or the questions to be answered? What is the population to which the investigators intend to refer their findings?
  2. Statistical Design: Is the study a planned experiment (i.e., primary data), or an analysis of records (i.e., secondary data)? How is the sample to be selected? Are there possible sources of selection bias that would make the sample atypical or non-representative? If so, what provision is to be made to deal with this bias? What is the nature of the control group, standard of comparison, or cost? Remember that statistical modeling means reflection before action.
  3. Observations: Are there clear definitions of the variables, including classifications, measurements (and/or counting), and the outcomes? Is the method of classification or of measurement consistent for all the subjects and relevant to Item No. 1? Are there possible biases in measurement (and/or counting) and, if so, what provisions must be made to deal with them? Are the observations reliable and replicable (to defend your findings)?
  4. Analysis: Are the data sufficient and worthy of statistical analysis? If so, are the necessary conditions of the methods of statistical analysis appropriate to the source and nature of the data? The analysis must be correctly performed and interpreted.
  5. Conclusions: Which conclusions are justifiable by the findings? Which are not? Are the conclusions relevant to the questions posed in Item No. 1?
  6. Representation of Findings: Are the findings represented clearly, objectively, and in sufficient but non-technical terms and detail to enable the decision-maker (e.g., a manager) to understand and judge them for himself or herself? Are the findings internally consistent; i.e., do the numbers add up properly? Can the different representations be reconciled?
  7. Managerial Summary: When your findings and recommendation(s) are not clearly put, or not framed in a manner understandable by the decision maker, the decision maker will not feel convinced by the findings and therefore will not implement any of the recommendations. You will then have wasted time, money, and effort for nothing.

Further Readings:
Corfield D., and J. Williamson, Foundations of Bayesianism, Kluwer Academic Publishers, 2001. Contains Logic, Mathematics, Decision Theory, and Criticisms of Bayesianism.
Lapin L., Statistics for Modern Business Decisions, Harcourt Brace Jovanovich, 1987.
Pratt J., H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision Theory, The MIT Press, 1994.


What is Business Statistics?

The main objective of Business Statistics is to make inferences (e.g., prediction, making decisions) about certain characteristics of a population based on information contained in a random sample from the entire population. The condition for randomness is essential to make sure the sample is representative of the population.

Business Statistics is the science of 'good' decision making in the face of uncertainty and is used in many disciplines, such as financial analysis, econometrics, auditing, production and operations, and marketing research. It provides knowledge and skills to interpret and use statistical techniques in a variety of business applications. A typical Business Statistics course is intended for business majors, and covers statistical study, descriptive statistics (collection, description, analysis, and summary of data), probability, the binomial and normal distributions, tests of hypotheses and confidence intervals, linear regression, and correlation.

Statistics is a science of making decisions with respect to the characteristics of a group of persons or objects on the basis of numerical information obtained from a randomly selected sample of the group. Statisticians refer to this numerical observation as a realization of a random sample. However, notice that one cannot see a random sample. What we observe is only a finite set of outcomes of a random process.

At the planning stage of a statistical investigation, the question of sample size (n) is critical. For example, one rule of thumb sets the sample size for sampling from a finite population of size N at √N + 1, rounded up to the nearest integer. Clearly, a larger sample provides more relevant information, and as a result a more accurate estimation and better statistical judgement regarding tests of hypotheses.
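
The rule quoted above is easy to compute. The sketch below is only an illustration of that one rule of thumb, not a general method for sample size determination. The function name `rule_of_thumb_sample_size` is hypothetical, and capping the result at N (so that the sample is never larger than the population) is an added assumption that the text does not state.

```python
import math


def rule_of_thumb_sample_size(population_size):
    """Sample size from the sqrt(N) + 1 rule, rounded up to an integer.

    Assumption (not in the text): the result is capped at N, since a
    sample cannot contain more items than the population.
    """
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    n = math.ceil(math.sqrt(population_size) + 1)
    return min(n, population_size)


print(rule_of_thumb_sample_size(10_000))  # 101
print(rule_of_thumb_sample_size(500))     # 24
print(rule_of_thumb_sample_size(100))     # 11
```

Note how slowly the suggested sample size grows: a population 100 times larger (100 versus 10,000) calls for a sample only about 9 times larger under this rule.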

What is statistics?

Activities Associated with the General Statistical Thinking

The above figure illustrates the idea of statistical inference from a random sample about the population. It also shows the population's parameters, namely the expected value µx, the standard deviation σx, and the cumulative distribution function (cdf) Fx, together with the corresponding sample statistics that estimate them: the sample mean x̄, the sample standard deviation Sx, and the empirical cumulative distribution function (cdf), respectively.

The major task of statistics is to study the characteristics of populations whether these populations are people, objects, or collections of information. For two major reasons, it is often impossible to study an entire population:

  • The process would be too expensive or too time-consuming.
  • The process would be destructive.

In either case, we would resort to looking at a sample chosen from the population and trying to infer information about the entire population by only examining the smaller sample. Very often the numbers which interest us most about the population are the mean µ and standard deviation σ. Any number, like the mean or standard deviation, which is calculated from an entire population is called a Parameter. If the very same numbers are derived only from the data of a sample, then the resulting numbers are called Statistics. Frequently, Greek letters represent parameters and Latin letters represent statistics (as shown in the above Figure).
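
The distinction between a parameter and a statistic can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the "population" is 10,000 simulated light-bulb lifetimes (in hours) drawn from a normal distribution with mean 1000 and standard deviation 100. All of these numbers are hypothetical. In practice the whole population is usually not available, which is the point of sampling.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: lifetimes (hours) of 10,000 light bulbs.
population = [rng.gauss(1000, 100) for _ in range(10_000)]

# Parameters: computed from the ENTIRE population.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)   # divides by N

# Statistics: computed from a random sample only.
sample = rng.sample(population, 50)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)            # divides by n - 1

print("parameters:", round(population_mean, 1), round(population_sd, 1))
print("statistics:", round(sample_mean, 1), round(sample_sd, 1))
```

The sample statistics will typically be close to, but not equal to, the parameters. Drawing a different sample (for example by changing the seed) gives different statistics, while the parameters stay fixed.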

Statistics is a tool that enables us to impose order on the disorganized cacophony of the real world of modern society. The business world has grown both in size and competition. Corporate executives must take risks in business, hence the need for business statistics.

Business statistics has grown with the art of constructing charts and tables! It is a science of basing decisions on numerical data in the face of uncertainty.

Business statistics is a scientific approach to decision making under risk. In practicing business statistics, we search for an insight, not the solution. Our search is for the one solution that meets all the business's needs with the lowest level of risk. Business statistics can take a normal business situation, and with the proper data gathering, analysis, and re-search for a solution, turn it into an opportunity.

While business statistics cannot replace the knowledge and experience of the decision maker, it is a valuable tool that the manager can employ to assist in the decision making process in order to reduce the inherent risk.

Business Statistics provides justifiable answers to the following concerns for every consumer and producer:

  1. What is your, or your customer's, Expectation of the product/service you sell or that your customer buys? That is, what is a good estimate for µ?
  2. Given the information about your, or your customer's, expectation, what is the Quality of the product/service you sell or that your customer buys? That is, what is a good estimate for σ?
  3. Given the information about your, or your customer's, expectation, and the quality of the product/service you sell or your customer buys, how does the product/service compare with other existing similar types? That is, comparing several µ's and several σ's.


Common Statistical Terminology with Applications

Like all professions, statisticians have their own keywords and phrases to ease precise communication. However, one must interpret the results of any decision making in a language that is easy for the decision-maker to understand. Otherwise, he/she will not believe in what you recommend, and therefore will not proceed to the implementation phase. This lack of communication between statisticians and managers is the major roadblock to using statistics.

Population: A population is any entire collection of people, animals, plants or things on which we may collect data. It is the entire group of interest, which we wish to describe or about which we wish to draw conclusions. In the above figure, the life of the light bulbs manufactured, say by GE, is the population concerned.

Qualitative and Quantitative Variables: Any object or event that can vary in successive observations, either in quantity or quality, is called a "variable." Variables are classified accordingly as quantitative or qualitative. A qualitative variable, unlike a quantitative variable, does not vary in magnitude in successive observations. The values of quantitative and qualitative variables are called "Variates" and "Attributes", respectively.

Variable: A characteristic or phenomenon that may take different values, such as weight or gender, since these differ from individual to individual.

Randomness: Randomness means unpredictability. The fascinating fact about inferential statistics is that, although each random observation may not be predictable when taken alone, collectively they follow a predictable pattern called their distribution function. For example, the distribution of a sample average is approximately normal for sample sizes over 30 (a common rule of thumb). In other words, an extreme value of the sample mean is less likely than an extreme value of a few raw data.
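The claim about sample averages can be checked with a small simulation. The following is only a sketch under stated assumptions: the skewed exponential population, the seed, and the helper name sample_means are illustrative choices, not part of the original text.

```python
import random
import statistics


def draw(rng):
    """One observation from a hypothetical skewed population (mean 1)."""
    return rng.expovariate(1.0)


def sample_means(sample_size, num_samples, rng):
    """Return the means of num_samples random samples of the given size."""
    return [
        statistics.fmean([draw(rng) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

raw = [draw(rng) for _ in range(2000)]                        # single observations
means = sample_means(sample_size=30, num_samples=2000, rng=rng)

# Individual observations vary widely; averages of 30 cluster near the mean.
print("spread of raw observations:", round(statistics.stdev(raw), 3))
print("spread of sample means    :", round(statistics.stdev(means), 3))
print("largest raw observation   :", round(max(raw), 3))
print("largest sample mean       :", round(max(means), 3))
```

The spread of the sample means should come out far smaller than the spread of the raw observations, which is the sense in which an extreme sample mean is less likely than an extreme raw value.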

Sample: A subset of a population or universe.

An Experiment: An experiment is a process whose outcome is not known in advance with certainty.

Statistical Experiment: An experiment in general is an operation in which one chooses the values of some variables and measures the values of other variables, as in physics. A statistical experiment, in contrast, is an operation in which one takes a random sample from a population and infers the values of some variables. For example, in a survey, we "survey", i.e., "look at", the situation without aiming to change it, such as in a survey of political opinions. A random sample from the relevant population provides information about the voting intentions.

In order to make any generalization about a population, a random sample from the entire population, which is meant to be representative of the population, is often studied. For each population, there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean μ.

It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.

Example: The population for a study of infant health might be all children born in the U.S.A. in the 1980's. The sample might be all babies born on 7th of May in any of the years.

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Example: Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receives the present one. So, the researcher has control over the subjects recruited and the way in which they are allocated to treatment.
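The random allocation described in this example can be sketched in a few lines. This is a minimal illustration, assuming twenty hypothetical subject identifiers and a helper named allocate; neither comes from the original text.

```python
import random


def allocate(subjects, rng):
    """Randomly split subjects into two equal-sized treatment groups."""
    shuffled = list(subjects)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(7)  # fixed seed so the allocation can be reproduced
subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # hypothetical IDs

new_drug, current_drug = allocate(subjects, rng)
print("new drug    :", new_drug)
print("current drug:", current_drug)
```

Shuffling first and then cutting the list in half gives every subject the same chance of receiving either treatment, which is what "chosen at random" requires.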

Design of experiments is a key tool for increasing the rate of acquiring new knowledge. Knowledge in turn can be used to gain competitive advantage, shorten the product development cycle, and produce new products and processes which will meet and exceed your customer's expectations.

Primary data and Secondary data sets: If the data are from a planned experiment relevant to the objective(s) of the statistical investigation, collected by the analyst, it is called a Primary Data set. However, if some condensed records are given to the analyst, it is called a Secondary Data set.

Random Variable: A random variable is a real function (yes, it is called "variable", but in reality it is a function) that assigns a numerical value to each simple event. For example, in sampling for quality control an item could be defective or non-defective; therefore, one may assign X = 1 and X = 0 for a defective and non-defective item, respectively. You may assign any other two distinct real numbers, as you wish; however, non-negative integer random variables are easy to work with. Random variables are needed since one cannot do arithmetic operations on words; the random variable enables us to compute statistics, such as average and variance. Any random variable has a distribution of probabilities associated with it.
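To make the quality-control coding concrete, here is a small sketch. The ten inspection outcomes are made-up illustrative data, and the function name X simply mirrors the notation above.

```python
import statistics

# Hypothetical inspection results for ten items (illustrative data only).
inspection = ["ok", "defective", "ok", "ok", "defective",
              "ok", "ok", "ok", "defective", "ok"]


def X(outcome):
    """Random variable: assigns 1 to a defective item and 0 otherwise."""
    return 1 if outcome == "defective" else 0


values = [X(item) for item in inspection]

# Arithmetic is now possible: the average of X is the proportion defective.
proportion_defective = statistics.fmean(values)
sample_variance = statistics.variance(values)   # uses the n - 1 divisor

print("coded values        :", values)
print("proportion defective:", proportion_defective)
print("sample variance     :", round(sample_variance, 4))
```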

Probability: Probability (i.e., probing for the unknown) is the tool used for anticipating what the distribution of data should look like under a given model. Random phenomena are not haphazard: they display an order that emerges only in the long run and is described by a distribution. The mathematical description of variation is central to statistics. The probability required for statistical inference is not primarily axiomatic or combinatorial, but is oriented toward describing data distributions. 

Sampling Unit: A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is executed. For example, a person; a sample of soil; a pot of seedlings; a zip code area; a doctor's practice.

Parameter: A parameter is an unknown value, and therefore it has to be estimated. Parameters are used to represent a certain population characteristic. For example, the population mean μ is a parameter that is often used to indicate the average value of a quantity.

Within a population, a parameter is a fixed value that does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean μ in the population from which that sample was drawn.

Statistic: A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

A statistic is a function of an observable random sample. It is therefore an observable random variable. Notice that, while a statistic is a "function" of observations, unfortunately, it is commonly called a random "variable" not a function.

It is possible to draw more than one sample from the same population, and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.

Statistics are often assigned Roman letters (e.g., x̄ and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g., μ, σ).

The word estimate means to esteem, that is giving a value to something. A statistical estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Example: Suppose the manager of a shop wanted to know μ, the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop; that is, the population mean μ. Instead she could use an estimate of this population mean μ by calculating the mean of a representative sample of customers. If this value were found to be $25, then $25 would be her estimate.
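The shop example can be imitated with simulated data. This is a sketch only: the population of 5000 expenditures, its uniform shape, the sample size of 100, and the seed are all assumptions made for illustration. In practice the manager would not know the population mean, which is why it is estimated.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Hypothetical population: last year's spend per customer, in dollars.
population = [round(rng.uniform(5, 45), 2) for _ in range(5000)]
mu = statistics.fmean(population)   # the parameter (normally unknown)

# Two independent random samples of 100 customers each.
sample_a = rng.sample(population, 100)
sample_b = rng.sample(population, 100)
estimate_a = statistics.fmean(sample_a)   # a statistic used as an estimate
estimate_b = statistics.fmean(sample_b)

print(f"population mean       : {mu:.2f}")
print(f"estimate from sample A: {estimate_a:.2f}")
print(f"estimate from sample B: {estimate_b:.2f}")
```

The two estimates will generally differ from each other and from the population mean, while the population mean itself stays fixed.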

There are two broad subdivisions of statistics: Descriptive Statistics and Inferential Statistics as described below.

Descriptive Statistics: The numerical statistical data should be presented clearly, concisely, and in such a way that the decision maker can quickly obtain the essential characteristics of the data in order to incorporate them into the decision process.

The principal descriptive quantity derived from sample data is the mean (x̄), which is the arithmetic average of the sample data. It serves as the most reliable single measure of the value of a typical member of the sample. If the sample contains a few values that are so large or so small that they have an exaggerated effect on the value of the mean, the sample is more accurately represented by the median, the value where half the sample values fall below and half above.
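A short sketch shows how one extreme value pulls the mean but barely moves the median. The salary figures below are hypothetical and chosen only for illustration.

```python
import statistics

salaries = [32, 35, 36, 38, 40, 41, 43]   # hypothetical, in $1000s
with_outlier = salaries + [400]           # add one extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier: mean =", round(mean_before, 2), " median =", median_before)
print("with outlier   : mean =", round(mean_after, 2), " median =", median_after)
```

With the extreme value included, the mean more than doubles while the median shifts by only one unit, so the median is the better description of a typical member here.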

The quantities most commonly used to measure the dispersion of the values about their mean are the variance s² and its square root, the standard deviation s. The variance is calculated by determining the mean, subtracting it from each of the sample values (yielding the deviation of the samples), and then averaging the squares of these deviations. The mean and standard deviation of the sample are used as estimates of the corresponding characteristics of the entire group from which the sample was drawn. They do not, in general, completely describe the distribution F(x) of values within either the sample or the parent group; indeed, different distributions may have the same mean and standard deviation. They do, however, provide a complete description of the normal distribution, in which positive and negative deviations from the mean are equally common, and small deviations are much more common than large ones. For a normally distributed set of values, a graph showing the dependence of the frequency of the deviations upon their magnitudes is a bell-shaped curve. About 68 percent of the values will differ from the mean by less than the standard deviation, and almost 100 percent will differ by less than three times the standard deviation.
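The variance recipe and the two percentages for the normal distribution can be verified as follows. This is a sketch under assumptions: the five data values are hypothetical, and the divisor n - 1 is used for the average of squared deviations, which is the usual choice when the sample variance estimates the population variance.

```python
import math
import statistics


def sample_variance(data):
    """Average of squared deviations from the mean, with divisor n - 1."""
    mean = sum(data) / len(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / (len(data) - 1)


data = [4, 8, 6, 5, 7]        # hypothetical sample
s2 = sample_variance(data)    # variance
s = math.sqrt(s2)             # standard deviation

# Share of a normal distribution within 1 and 3 standard deviations.
std_normal = statistics.NormalDist(mu=0, sigma=1)
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_3sd = std_normal.cdf(3) - std_normal.cdf(-3)

print("variance:", s2, " standard deviation:", round(s, 4))
print("within 1 sd:", round(within_1sd, 4))
print("within 3 sd:", round(within_3sd, 4))
```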

Inferential Statistics: Inferential statistics is concerned with making inferences from samples about the populations from which they have been drawn. In other words, if we find a difference between two samples, we would like to know whether this is a "real" difference (i.e., it is present in the population) or just a "chance" difference (i.e., it could just be the result of random sampling error). That's what tests of statistical significance are all about. Any conclusion inferred from sample data about the population from which the sample is drawn must be expressed in probabilistic terms. Probability is the language and a measuring tool for uncertainty in our statistical conclusions.

Inferential statistics could be used for explaining a phenomenon or checking for validity of a claim. In these instances, inferential statistics is called Exploratory Data Analysis or Confirmatory Data Analysis, respectively.

Statistical Inference: Statistical inference refers to extending the knowledge obtained from a random sample to the whole population from which it was drawn. This is known in mathematics as Inductive Reasoning, that is, knowledge of the whole from a particular. Its main application is in hypotheses testing about a given population. Statistical inference guides the selection of appropriate statistical models. Models and data interact in statistical work. Inference from data can be thought of as the process of selecting a reasonable model, including a statement in probability language of how confident one can be about the selection.

Normal Distribution Condition: The normal or Gaussian distribution is a continuous symmetric distribution that follows the familiar bell-shaped curve. One of its nice features is that the mean and variance uniquely and independently determine the distribution. It has been noted empirically that many measurement variables have distributions that are at least approximately normal. Even when a distribution is non-normal, the distribution of the mean of many independent observations from the same distribution becomes arbitrarily close to a normal distribution as the number of observations grows large. Many frequently used statistical tests assume that the data come from a normal distribution.

Estimation and Hypothesis Testing: Inference in statistics is of two types. The first is estimation, which involves the determination, with a possible error due to sampling, of the unknown value of a population characteristic, such as the proportion having a specific attribute or the average value μ of some numerical measurement. To express the accuracy of the estimates of population characteristics, one must also compute the standard errors of the estimates. The second type of inference is hypothesis testing. It involves the definition of a hypothesis as one set of possible population values and an alternative, a different set. There are many statistical procedures for determining, on the basis of a sample, whether the true population characteristic belongs to the set of values in the hypothesis or the alternative.
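As an illustration of attaching a standard error to an estimate, the sketch below computes the sample mean and its standard error, taken here as the sample standard deviation divided by the square root of n. The nine observations are hypothetical.

```python
import math
import statistics

sample = [21, 25, 29, 23, 27, 25, 24, 26, 25]   # hypothetical measurements

n = len(sample)
estimate = statistics.fmean(sample)          # point estimate of the mean
s = statistics.stdev(sample)                 # sample standard deviation (n - 1)
standard_error = s / math.sqrt(n)            # accuracy of the estimate

print("estimate of the mean:", estimate)
print("standard error      :", round(standard_error, 4))
```

A smaller standard error indicates that sample means from repeated sampling would cluster more tightly around the population mean.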

Statistical inference is grounded in probability, idealized concepts of the group under study, called the population, and the sample. The statistician may view the population as a set of balls from which the sample is selected at random, that is, in such a way that each ball has the same chance as every other one for inclusion in the sample.

Notice that to be able to estimate the population parameters, the sample size n must be greater than one. For example, with a sample size of one, the variation (s²) within the sample is 0/1 = 0. An estimate for the variation (σ²) within the population would be 0/0, which is an indeterminate quantity, meaning that the estimate is impossible.
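Python's standard statistics module behaves in the same way, which gives a quick check of this point. The single observation below is a made-up value; pvariance uses the divisor n and variance uses the divisor n - 1.

```python
import statistics

single = [25.0]   # a hypothetical sample of size one

# Variation within the sample (divisor n): 0/1 = 0.
within_sample = statistics.pvariance(single)

# Estimate of the population variance (divisor n - 1): 0/0, so it is refused.
try:
    estimate = statistics.variance(single)
except statistics.StatisticsError as err:
    estimate = None
    print("cannot estimate the population variance:", err)

print("variation within the sample:", within_sample)
```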
