__Key Concepts__

The purpose of sampling is to create a representative subset of the data. Sampling allowed us to perform analyses and extrapolate results for the claims data in this engagement without examining the entire dataset. The reliability of the sample is measured by two statistical concepts: the confidence level and the margin of error, explained below.

The confidence level is the percentage of repeated samples for which the true population value would fall within a specified range around the sample estimate. The margin of error defines that range. For example, if the average cost of care were estimated at $2,000 with a +/- 5% margin of error at the 95% confidence level, we would conclude with 95% confidence that the true average cost lies between $1,900 and $2,100.
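The arithmetic behind the $1,900 to $2,100 range can be sketched in a few lines. The function name `interval_from_relative_margin` is ours and purely illustrative; it only converts an estimate and a relative margin of error into bounds and does not compute a confidence level.

```python
def interval_from_relative_margin(estimate: float, relative_margin: float) -> tuple[float, float]:
    """Return (lower, upper) bounds for an estimate with a +/- relative margin of error."""
    half_width = estimate * relative_margin  # 5% of $2,000 is $100
    return estimate - half_width, estimate + half_width


# Figures from the example in the text: $2,000 estimate, +/- 5% margin of error.
lower, upper = interval_from_relative_margin(2000.0, 0.05)
print(f"${lower:,.0f} to ${upper:,.0f}")  # $1,900 to $2,100
```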

The confidence level and margin of error are directly tied to the size of the sample. At a given confidence level, the more observations the sample includes, the smaller the margin of error. However, this relationship is non-linear: under simple random sampling, the margin of error shrinks roughly with the square root of the sample size, so halving it requires about four times as many observations. At high levels of precision, a small gain in precision may therefore require a large increase in sample size.
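The non-linear relationship can be illustrated with the textbook normal-approximation formula for the margin of error of a proportion. This is a sketch under stated assumptions, not the calculation used in this engagement: it assumes simple random sampling from a large population, a z-score of 1.96 for 95% confidence, and a worst-case proportion of 0.5. The name `margin_of_error` is illustrative.

```python
import math

Z_95 = 1.96  # approximate z-score for a 95% confidence level (assumption)


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Normal-approximation margin of error for a proportion p estimated from n observations."""
    return z * math.sqrt(p * (1 - p) / n)


# Each step quadruples the sample size; the margin of error only halves.
for n in (100, 400, 1600, 6400):
    print(f"n = {n:>5}: margin of error = {margin_of_error(n):.4f}")
```

Under these assumptions, going from 100 to 400 observations moves the margin of error from about 9.8% to about 4.9%, and another quadrupling is needed to halve it again.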

__Sample Design__

Designing a representative sample requires several choices that affect the sample size needed for statistical inference, including the sampling unit and the sampling method. Here the sampling unit is the individual claim, and each sampling unit has the same probability of being selected; that is, each claim has an equal chance of being drawn. Sampling at the claim level may allow one to examine, for example, the average cost per service.

In terms of sampling method, random sampling is the most common and simplest way to select a representative subset of a population. We identified two alternative approaches, depending on the question the sample is meant to answer. For each approach, the approximate sample size needed in the example was obtained using RAT-STATS, a software package created by the Office of Inspector General that is often used to sample and quantify improper claims.
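As a rough illustration of drawing claims with equal probability, the sketch below selects a simple random sample without replacement using Python's standard library. It is not how RAT-STATS generates its random numbers. The claim identifiers, the population of 10,000, the sample size of 400, and the seed are all hypothetical.

```python
import random


def draw_simple_random_sample(claim_ids, sample_size, seed):
    """Draw sample_size distinct claims, each with an equal chance of selection."""
    rng = random.Random(seed)  # fixed seed so the same draw can be reproduced later
    return rng.sample(list(claim_ids), sample_size)


# Hypothetical sampling frame: one identifier per claim.
claim_ids = [f"CLM-{i:05d}" for i in range(1, 10001)]
sample = draw_simple_random_sample(claim_ids, 400, seed=2024)
print(len(sample), "claims selected")
```

Recording the seed alongside the sampling frame lets a reviewer regenerate exactly the same selection.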

#### Option 1: Sampling by Error Rate

When the sample is intended to answer a yes-or-no question, for example whether a claim was overbilled, it provides a statistical basis for estimating how common that occurrence is (here, the rate of overbilling). The sample size required for a +/- 5% margin of error at the 95% confidence level is 401.
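For intuition about where a figure of this order comes from, the sketch below applies the common textbook sample-size formula for a proportion, with an optional finite population correction. It is not the RAT-STATS procedure and is not expected to reproduce 401 exactly, since the tool's result depends on inputs not stated here. The expected rate of 0.5, the z-score of 1.96, and the population of 10,000 are assumptions, and the function name is illustrative.

```python
import math
from typing import Optional


def sample_size_for_proportion(
    margin: float,
    expected_rate: float = 0.5,       # worst-case assumption; maximizes the sample size
    z: float = 1.96,                  # approximate z-score for 95% confidence
    population: Optional[int] = None, # None means "treat the population as very large"
) -> int:
    """Textbook sample size for estimating a yes/no rate within +/- margin."""
    n = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    if population is not None:
        # Finite population correction: smaller populations need fewer observations.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size_for_proportion(0.05))                     # 385
print(sample_size_for_proportion(0.05, population=10_000))  # 370
```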

#### Option 2: Sampling by Amount

When the question is of the form “what is the average amount billed?”, the objective of the sample is to estimate an amount of interest rather than a rate. The sample size required for a +/- 5% margin of error at the 95% confidence level is 548.
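Estimating an amount needs one more input than estimating a rate: how much the billed amounts vary. The sketch below uses the textbook formula for the sample size needed to estimate a mean within a relative margin, expressed through the coefficient of variation (standard deviation divided by mean). It is an approximation under simple random sampling from a large population, not the RAT-STATS calculation, and the coefficient of variation of 0.6 is a hypothetical value of the kind that would normally come from a pilot sample.

```python
import math


def sample_size_for_mean(relative_margin: float, cv: float, z: float = 1.96) -> int:
    """Textbook sample size for estimating a mean within +/- relative_margin.

    cv is the coefficient of variation (standard deviation / mean) of the amounts.
    """
    return math.ceil((z * cv / relative_margin) ** 2)


# Hypothetical variability: standard deviation equal to 60% of the mean billed amount.
print(sample_size_for_mean(0.05, cv=0.6))  # 554
```

The more the amounts vary from claim to claim, the larger the sample needed for the same margin of error; with a coefficient of variation of 1.0, the same formula gives 1,537.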