Topical Paper on Sample Definition and Selection

 

We live in an age of statistics. We learn from the daily newspapers that the production of industry in the country is up by 4 percent over the previous year; the number of unemployed now stands at 3.5 million; the area under food crops has shrunk by 2 percent. A demographer claims that the population of the country has risen by 1.5 percent during the year while the number of suicides has more than doubled. It is broadcast on the radio that as many as 70 percent of adults smoke more than two packets of cigarettes a day, and so on. Confronted with this mass of statistics, the common man becomes bewildered and starts wondering how these figures are arrived at and at what cost.

It should be conceded at the outset that we do need figures to make the right type of decision. Government, business, and the professions all seek the broadest possible factual basis for decision making. In the absence of data on the subject, a decision taken is just like a leap into the dark. We do need statistics, in fact better and better statistics. Many of the statistics we find in newspapers are the by-product of day-to-day administration. There are some other basic facts about the nation which are ordinarily collected through periodic censuses. And there are statistics collected through sample surveys, in which just a fraction of the universe is used to provide information for the whole.

The basic idea is simple. Information is needed about a group or population of objects such as persons, farms, or firms. We examine only some of the objects and extend our findings to the whole group. Thus a few blocks may be taken from a city and lists of persons made in each sample block. This information is used for estimating the total number of persons in the city.

There are three elements in the process: selecting the sample, collecting the information, and making an inference about the population. The three elements cannot generally be considered in isolation from one another. Sample selection, data collection, and estimation are all interwoven, and each has an impact on the others. Sampling is not haphazard selection; it embodies definite rules for selecting the sample. But having followed a set of rules for sample selection, we cannot consider the estimation process independent of them; estimation is guided by the manner in which the sample has been selected.

It appears at first sight that sampling is a risky proposition. If a sample of blocks is used to estimate the total number of persons in the city, the blocks in the sample may be larger than the average. This will overstate the true population of the city. In sampling from a product, the sample can be free of defects; yet as much as 10 percent of the manufactured product may be defective. Thus we concede that the figure obtained from the sample may not be exactly equal to the true value in the population. The reason is that the estimate is based on a part and not on the whole. Another way of saying it is that the sample estimate is subject to sampling errors or sampling fluctuations. Modern sampling theory helps in designing the survey in such a manner that the sampling errors can be made small.

There are many kinds of errors involved when data are collected from a sample of objects or from the whole group. Suppose we want to estimate the average age of a group by asking a sample of persons. Some persons may not know their age exactly; some others may overstate it as a matter of prestige; a few others may refuse to give their age; in a few cases the enumerator may record the age wrongly; and so on. Thus errors of various types may creep into the results. These errors are present whether you take a sample or canvass every unit in the population.

Then there are errors arising from unclear definitions. Suppose we want to estimate the number of persons in an area. What do we mean by the term "the number of persons" in the area? Do we want to take in those who normally live there, or those who spent the previous night in the area? What about a guest in the house or the father on a trip to a foreign country or the boy at school in the neighboring commune? If you want to count only those who normally live there, how do you define the usual residents? You cannot define all your terms with mathematical exactitude, and so you do not know exactly what you want to measure. This brings about errors.

Some of the errors in the data are of the random type; that is, they average to zero over the sample. This happens when the errors are not deliberate or intentional. Some units will overstate, and others will understate, resulting in a net difference of zero. There are other errors, of a systematic type, which are more serious. In an inquiry conducted to estimate the average number of parcels operated by farmers, the farmers may deliberately understate the number for fear of taxation. This type of error will not cancel out over the sample but will rather persist. Such errors are called systematic errors.

We should reconcile ourselves to the situation that errors are always present when data are collected. We have to live with errors. To pursue the example of estimating the number of persons in an area, we cannot even define our terms rigorously. Even if we could, the population is changing at every moment. The population must have changed by the time the results become available. And then we are not so much concerned with what the position was at the time of the inquiry as with what it will be when action is taken on the basis of the inquiry. There is thus a strong case for reorienting ourselves toward this philosophy of data collection.

Of course the errors of a survey depend considerably on the essential conditions under which it is conducted. If resources are meager and skilled personnel are not available, the errors will be large; these errors can be reduced if money and the right kind of personnel are available.

Sampling methods are being increasingly used in almost all spheres of human activity. There are population studies for finding the number of persons in an area, their distribution by sex and age, the number of births and deaths, and the amount of internal migration. Then there are studies of labor problems, the number of hours worked and wages paid, whether women receive lower wages than men in the same industry, the occupational structure of the population, and the number of those actively seeking work. Another field is that of agriculture. The total area under cultivation, the pattern of cropping, the size and quality of livestock, and the number and area of holdings and their tenure are some of the problems in which sampling methods are being used.

Although sampling methods can be employed for any conceivable purpose, there are four characteristics of the population in which we are usually interested: population total (such as the total number of beggars in a city), population mean (the average number of persons in a household), population proportion (the percent of cultivated area devoted to corn), and population ratio (the ratio of expenditure on recreation to that on food). The populations encountered in practice are finite in that the number of objects contained in them is limited.
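To make these four characteristics concrete, here is a minimal Python sketch computing each one for a small invented population; all figures below are hypothetical and chosen only to mirror the examples in the text.

    # Four population characteristics, computed for invented toy data.
    persons = [3, 5, 2, 4, 6, 1, 4, 3]        # household sizes in a small area
    corn = [2.0, 0.0, 1.5, 3.0, 0.5]          # hectares under corn per farm
    cultivated = [4.0, 2.0, 3.0, 5.0, 1.0]    # total cultivated hectares per farm
    recreation = [10, 25, 5, 15]              # family expenditure on recreation
    food = [100, 200, 80, 120]                # family expenditure on food

    total = sum(persons)                       # population total
    mean = sum(persons) / len(persons)         # population mean
    proportion = sum(corn) / sum(cultivated)   # population proportion
    ratio = sum(recreation) / sum(food)        # population ratio
    print(total, mean, proportion, ratio)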

In addition to summary figures such as the mean or proportion, the entire distribution of a variate may be of interest, such as the distribution of broken homes in a community or of opinions on a political crisis on a university campus. In all these cases the goal is description of the population. There are also situations in which the goal is explanation: to find out why a distribution takes the form it does. What accounts for broken homes? What accounts for the support and opposition which a political movement on campus generates? There may be cases in which both description and explanation are of interest. For example, the purpose may be to find out both how students react to a situation and why they react the way they do.

Most surveys are descriptive in nature. Description is of two types: simple and differentiated. The distribution of responses to a question is an example of simple description. But when these responses are broken down by, say, age, sex, income, or education, we have an example of differentiated description. Differentiated description is used to see how the distribution varies among subdivisions of the population. Quite often differentiated description is an initial step to explanatory analysis.

When the inquiry is based on a sample, the results obtained will ordinarily differ from the true values aimed at. In fact different samples will produce different results. But we cannot judge the validity of a single sample by finding how far it differs from the true value, which is unknown. It is the entire apparatus used—the sampling procedure, the data collection, and the manner in which estimates are made—which will have to be judged as satisfactory or not. Fortunately the mathematical theory of sampling can be used for this purpose provided the sample is selected properly. The sample itself can give guidance as to how different samples will differ from each other and provide a measure of sampling errors. From the same sample we can make different types of calculations in order to determine the best way of making the estimate. It is possible to provide bounds on the magnitudes of the errors in the results and to devise means of lowering the error bounds as required. It is these aspects of the sampling method which have earned it an important place in scientific inference. Sampling is now considered a reliable and organized instrument of fact finding.
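The claim that the sample can provide its own measure of sampling error can be illustrated with a short Python sketch. It uses the textbook formula for the estimated standard error of a mean under simple random sampling without replacement; the sample values are invented.

    import math

    # Estimated standard error of the mean from a single simple random sample.
    N = 10                                  # population size (assumed known)
    sample = [31, 13, 48]                   # an invented sample of n = 3 units
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((v - mean) ** 2 for v in sample) / (n - 1)   # sample variance
    se = math.sqrt((1 - n / N) * s2 / n)    # includes finite population correction
    print(mean, se)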

There are occasions when the sample survey will not do the job. Suppose you want information for every village in the country and there are 586,000 villages in the population. You cannot possibly devise a sample of a reasonable size which can provide this information for each village. You will rather have to take a census in this case. A census is the only possibility when local data are needed for each subdivision of the country. And then there is the question of coverage. It is difficult to have complete coverage of the population in a sample survey. Even some of the best surveys have been found to cover only 95 percent of the population. It becomes difficult to check whether all those intended to be taken in the sample have actually been taken. It is easier to check on this in a census, since everyone is supposed to be included.

It might be thought that the quickest way of selecting a sample is by judgment. Thus if 20 books are to be selected from a total of 200 to estimate the average number of pages in a book, someone might suggest picking out those books which appear to be of average size. The difficulty with such a procedure is that, consciously or unconsciously, the sampler will tend to make errors of judgment in the same direction, selecting books which are mostly bigger than the average or mostly smaller. Such systematic errors lead to what are called biases. A second disadvantage is that the range of variation observed in such a sample does not give a good idea of the variability in the population. This happens because the sampler is unlikely to select units which are too small or too large, although such units do exist in the population. Furthermore, the situation does not improve by asking a larger number of persons to select the samples by judgment, and there is no objective method of preferring the judgment of one person to that of another. We cannot predict the type or the distribution of the results produced by a large number of samplers, nor can we predict the manner in which these will differ from the true value aimed at.

The basic reason for our inability to handle judgment samples is that we cannot calculate the probability that a specified unit is selected in the sample. We are therefore unable to determine the frequency distribution of the estimates produced by this process. In the absence of this information the sampling error cannot be objectively determined.

The situation is completely different when we use probability sampling methods in which each unit in the population has a known nonzero chance of being selected in the sample. Such samples are usually selected with the help of random numbers. Having selected a probability sample, we can use the theory of probability to determine the frequency distribution of the estimates derivable from the sampling and estimation procedure used. In this manner, a measure of the sampling variation can be obtained objectively from the sample itself. Valid inferences can be derived from the sample by making use of the statistical theory of inference.
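As a minimal sketch of probability sampling, the following Python fragment draws a simple random sample with the help of the computer's random numbers; every unit, and indeed every pair of units, has a known and equal chance of selection. The population size and unit labels are arbitrary.

    import random

    # Draw a simple random sample of n = 2 from N = 10 labeled units.
    N, n = 10, 2
    units = list(range(N))            # unit labels 0, 1, ..., 9
    sample = random.sample(units, n)  # each of the 45 possible pairs is equally likely
    print(sample)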

An estimate calculated from the sample is said to be precise if it is near its expected value, that is, the value that a complete census taken under identical conditions would give. It may not necessarily be near the true value aimed at; that is, it need not be accurate. Precision refers to closeness to the expected value, while accuracy refers to closeness to the true value. The expected value will differ from the true value when the errors present in the data do not average to zero. These errors arise from faulty measurement techniques, defective questionnaires, ill-defined concepts, and so on. Although our aim is to be as accurate as possible, the sample can only give guidance as to how precise we are. If nonsampling errors are unimportant, precision and accuracy do not differ, and the sample can throw light on the accuracy of the results.

The resources for conducting a sample survey are always limited. We have thus to look for procedures which are simple and efficient, procedures which can be completed within the time schedule and which take into account all administrative requirements. Only those procedures are ordinarily considered from which an objective estimate of the precision attained can be obtained and which can be carried through according to the desired specifications. If the cost of the survey is specified, the best procedure is that which gives the highest accuracy. When the level of accuracy is predetermined, the best procedure is that for which the cost of the survey is minimum. This is the guiding principle of sample design.

The following will present the basic principles of sampling, in particular how to identify a valid and reliable sample size and the correct methodology for obtaining the sample. A small hypothetical population will be used as the testing ground. Consider the following hypothetical population of 10 manufacturing establishments, along with the number of paid employees in each:

Establishment number       0   1   2   3   4   5   6   7   8   9
Number of paid employees  31  15  67  20  13  18   9  22  48  27

The purpose is to estimate the average employment per establishment from a random sample of two establishments. There are in all 45 samples, each containing two establishments. We can calculate the average employment from each sample and use it as an estimate of the population average.

It is clear that the sample estimates lie within the range 11.0 to 57.5, the true value being 27.0. Some samples give a very low figure while some others give a high estimate. But the average of all the sample estimates is 27.0, which is the true average in the population. We shall express this fact by saying that the sample mean is an unbiased estimate of the population mean; that is, the expected value of the sample mean is the population mean. But, although unbiased, the sample mean varies considerably around the population mean.
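These statements can be verified by complete enumeration, as in the following Python sketch, which lists all 45 samples of two establishments from the population above.

    from itertools import combinations

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]   # paid employees, as listed above
    estimates = [sum(pair) / 2 for pair in combinations(y, 2)]
    print(len(estimates))                    # 45 possible samples
    print(min(estimates), max(estimates))    # 11.0 and 57.5
    print(sum(estimates) / len(estimates))   # 27.0: the sample mean is unbiased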

If we increase the sample size to 3, there are 120 samples in all. We can calculate the average employment for each sample and prepare a frequency distribution showing the number of sample averages falling in the different classes. This procedure can be repeated with samples of size n = 4, 5, ..., 9. The concentration of sample estimates around the true mean is found to increase as the sample size is increased. Here we have observed a universal phenomenon, which is expressed by saying that the sample mean is a consistent estimate of the population mean.

Now we need a measure of the degree of concentration of the sample estimates around the expected value. This measure is provided by the variance of the estimates. If all the sample estimates are equal, the variance of the estimate is zero. The greater the variance, the less the concentration.
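Both consistency and the variance measure can be seen at work in a short Python sketch that enumerates every possible sample of each size from the establishment data and computes the variance of the estimates around the true mean of 27.0.

    from itertools import combinations

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]
    for n in range(2, 10):
        ests = [sum(s) / n for s in combinations(y, n)]
        var = sum((e - 27.0) ** 2 for e in ests) / len(ests)
        print(n, len(ests), round(var, 2))   # variance falls as n grows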

There is another method by which we can often achieve greater concentration of the sample estimates around the expected value: making use of auxiliary information on a character related to the one under study. When this is done, the spread of the sample estimates can be made fantastically small, and there is very little risk in using the sample estimate for the population parameter. The reason for the considerable reduction in variance in this case is that there is near-proportionality between the main variate and the auxiliary character. If the degree of proportionality is not so high, the results will be less spectacular.
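The paragraph does not spell the method out, but what it describes fits the classical ratio estimator. The Python sketch below assumes a hypothetical auxiliary character x, say last year's employment, known for every establishment and roughly proportional to current employment y; the x figures are invented.

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]
    x = [30, 16, 65, 21, 12, 19, 10, 21, 50, 26]   # invented auxiliary figures

    X_mean = sum(x) / len(x)        # known for the whole population
    sample = [0, 4]                 # establishments 0 and 4 drawn at random
    y_bar = sum(y[i] for i in sample) / 2
    x_bar = sum(x[i] for i in sample) / 2
    estimate = (y_bar / x_bar) * X_mean   # about 28.3, against a true mean of 27.0
    print(estimate)                       # the plain sample mean would give 22.0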

There is another method of making use of auxiliary information for improving the precision of the estimate. With this method the population is divided up into groups or strata which are relatively more homogeneous internally. This can be done by placing within the same stratum those units which are similar with respect to a given variable. Then a sample is selected from each stratum. The sample results from the different strata are pooled in order to arrive at an estimate for the whole. Greater precision arises from the fact that the strata are homogeneous, so that the stratum means can be estimated with smaller error.
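A minimal Python sketch of stratified sampling follows; the division of the ten establishments into a small and a large stratum is hypothetical.

    import random

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]
    small = [1, 3, 4, 5, 6, 7]      # establishments assumed to be small
    large = [0, 2, 8, 9]            # establishments assumed to be large

    estimate = 0.0
    for stratum in (small, large):
        picked = random.sample(stratum, 2)            # sample within the stratum
        stratum_mean = sum(y[i] for i in picked) / 2
        estimate += stratum_mean * len(stratum) / len(y)   # weight by stratum size
    print(estimate)   # pooled estimate of the overall mean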

Another technique is to select the sample in clusters or groups rather than to take individual units. Suppose, for instance, that you want to select a sample of households from a city. It is a very expensive job to make a list of all households in the city before selecting a random sample from it. It is far more convenient to get a map of the city, divide it into identifiable blocks, and select a sample of some blocks. All households in the selected blocks are the object of further inquiry. The sample is thus selected by taking a few blocks, which are clusters of households. The main reason for resorting to cluster sampling is the economy involved. Ordinarily clusters will be formed by putting together units which are physically near each other. When there is an opportunity to form clusters by picking out units which are not necessarily geographically contiguous, it is best to place dissimilar units in the same cluster. This is exactly the opposite of stratification, in which similar units are assigned to the same stratum. If each cluster is a miniature of the total universe, one can make good estimates by selecting just a few clusters.
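The following Python sketch illustrates cluster sampling; the city blocks and the households within them are invented.

    import random

    blocks = {                     # block label -> household sizes in the block
        "A": [3, 5, 2],
        "B": [4, 4],
        "C": [2, 6, 3, 1],
        "D": [5, 2, 4],
    }
    picked = random.sample(sorted(blocks), 2)   # select 2 of the 4 blocks
    households = [h for b in picked for h in blocks[b]]
    # every household in a selected block enters the inquiry
    print(picked, households)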

Suppose we believe that it is useful to make use of auxiliary information on a given variable x for improving the precision of the estimate of the population mean of another variable y, but information on x is not available. In that case two courses are open. We may decide to select a sample of n units, collect information on the variate y under study, and use the sample mean for y as an estimate of the population mean. Alternatively, a part of the total budget is used for collecting information on x for a fairly large sample taken from the universe. Then a smaller sample is taken for obtaining information on y alone. The two samples are used in the best possible manner to arrive at a better estimate of the population mean. The latter procedure is called double or two-phase sampling. This method will be useful when it is considerably cheaper to collect information on x than on y and when there is high correlation between y and x.
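A sketch of double sampling in Python, again with invented figures: a large first-phase sample measures only the cheap character x, a small second-phase subsample measures y as well, and the two are combined through a ratio.

    import random

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]    # expensive to measure
    x = [30, 16, 65, 21, 12, 19, 10, 21, 50, 26]   # cheap, invented auxiliary

    phase1 = random.sample(range(10), 6)   # large sample: observe x only
    phase2 = random.sample(phase1, 2)      # subsample: observe y as well

    x_bar1 = sum(x[i] for i in phase1) / len(phase1)
    x_bar2 = sum(x[i] for i in phase2) / len(phase2)
    y_bar2 = sum(y[i] for i in phase2) / len(phase2)
    print((y_bar2 / x_bar2) * x_bar1)      # two-phase ratio estimate of the mean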

It has been assumed all along that there are no response errors in the data collected. Actually, errors of measurement or ascertainment are almost always present when information is collected. The establishments may not have proper records to give information on y. They may deliberately understate the employment figures, assuming erroneously that the inquiry may be connected with taxation, and so on. This will bring about errors in the data. The problem is how to form usable estimates from the sample in the presence of response errors.

There may be another type of bias present in the results. This happens when some of the units in the sample do not respond and provide no information whatsoever. For example, the two smallest establishments may decide not to cooperate in the survey. If a sample of two establishments is taken and a random substitution is made for those establishments which do not cooperate, it is clear that the sample effectively refers only to the remaining establishments. In these circumstances the sample will overstate the true average and is biased in one direction. Some procedure will have to be found for diminishing this bias.
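The direction and size of this nonresponse bias can be checked by enumeration. In the Python sketch below the two smallest establishments never respond, so every sample of two effectively comes from the eight remaining units.

    from itertools import combinations

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]
    respondents = [v for v in y if v not in (9, 13)]   # the two smallest drop out
    ests = [sum(p) / 2 for p in combinations(respondents, 2)]
    print(sum(ests) / len(ests))   # 31.0, well above the true mean of 27.0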

There are many different ways of selecting a probability sample from a population. The basic requirement is that it should be possible to subdivide the population into what are called sampling units. For example, all persons in a commune may be considered to be grouped into households; all land in a village may be assumed to be divided up into fields. The household in the first case and the field in the second case are the sampling units. The existence of a list or a map showing the various sampling units is an essential prerequisite for the selection of the sample. For the same population there may be several kinds of sampling units to which the selection procedure is applied. The reporting unit and the unit of analysis may not be the same as the sampling unit.

In conclusion, the methods for sample selection are: simple random sampling; systematic sampling, a more convenient method of sample selection when the units are numbered from 1 to N; stratified sampling, in which the units in the population are allocated to groups or strata on the basis of information on a variate x; cluster sampling, in which the sample is selected in clusters or groups of elementary units, since frames listing elementary units are rarely available; and double sampling, in which, when auxiliary information is not available beforehand, it may be advantageous to conduct the inquiry in two phases.
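Of these methods, systematic sampling is the only one not sketched earlier; a minimal Python illustration follows, using the establishment data from above.

    import random

    y = [31, 15, 67, 20, 13, 18, 9, 22, 48, 27]
    N, n = len(y), 2
    k = N // n                       # sampling interval, here 5
    start = random.randrange(k)      # random start between 0 and k - 1
    sample = [y[i] for i in range(start, N, k)]   # every k-th unit thereafter
    print(sample, sum(sample) / n)   # e.g. establishments 3 and 8 if start is 3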