Steven P. Sanderson II, MPH
June 21, 2023
Sampling is a fundamental technique in data analysis and statistical modeling. It allows us to draw meaningful insights and make inferences about a larger population based on a representative subset. In the world of R programming, the sample() function stands as a versatile tool that enables us to create random samples efficiently. In this post, we will explore the sample() function and its various applications through a series of plain English examples.
First, let’s take a look at the syntax:
sample(x, size, replace = FALSE, prob = NULL)
where:
x is the dataset or vector from which to take the sample
size is the number of elements to include in the sample
replace is a logical value that indicates whether or not to allow sampling with replacement (the default is FALSE)
prob is a vector of probabilities that can be used to weight the sample (the default is NULL)
Let’s say we have a dataset containing the ages of 100 people. To create a random sample of 10 individuals, we can use the sample() function as follows:
ages <- 1:100
random_sample <- sample(ages, size = 10)
random_sample
[1] 53 13 84 50 55  9 12 38 79 15
The sample() function randomly selects 10 values from the ages vector, without replacement, resulting in a new vector named random_sample. This technique represents simple random sampling, where each individual in the population has an equal chance of being included in the sample.
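Because the draw is random, rerunning the example will generally print different values from the ones shown. Here is a minimal sketch of making a draw repeatable with base R's set.seed(). The seed value 123 is an arbitrary choice for illustration and is not part of the original example:

```r
# Population: the same ages vector used in the example
ages <- 1:100

# Fix the random number generator state so the draw can be repeated.
# The seed value 123 is arbitrary (an assumption for illustration).
set.seed(123)
first_draw <- sample(ages, size = 10)

# Resetting the same seed reproduces exactly the same sample
set.seed(123)
second_draw <- sample(ages, size = 10)

identical(first_draw, second_draw)  # TRUE

# Without replacement, no value can appear twice
anyDuplicated(first_draw) == 0      # TRUE
```

Setting a seed is mainly useful when you want someone else to be able to reproduce your exact sample.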
In some scenarios, we might want to allow repeated selections from the population. Let’s say we have a bag with colored balls, and we want to simulate drawing 5 balls with replacement. Here’s how we can achieve it:
colors <- c("red", "blue", "green", "yellow")
sample_with_replacement <- sample(colors, size = 5, replace = TRUE)
sample_with_replacement
[1] "yellow" "yellow" "green" "green" "red"
The sample() function, with the replace = TRUE argument, enables us to randomly select 5 colors from the colors vector, allowing duplicates. This approach represents sampling with replacement, where each selection is independent of the previous ones.
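The replace argument also decides whether a request for more draws than the population holds can succeed at all. The sketch below, using the same colors vector, asks for 5 draws from 4 colors both ways. The tryCatch() wrapper and the message string are only there so the script keeps running after the error:

```r
colors <- c("red", "blue", "green", "yellow")

# Without replacement, asking for more draws than there are elements fails.
# tryCatch() captures the error so the script can continue.
result <- tryCatch(
  sample(colors, size = 5),
  error = function(e) "error: sample larger than population"
)
result

# With replacement the same request succeeds, and a repeat is guaranteed
# because 5 draws from 4 colors must reuse at least one color.
draws <- sample(colors, size = 5, replace = TRUE)
length(draws)               # 5
anyDuplicated(draws) > 0    # TRUE
```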
In certain situations, we may want to assign different probabilities to elements in the population. Let’s assume we have a list of items and corresponding weights denoting their probabilities of being selected. We can use the sample() function with the prob parameter to achieve weighted sampling. Consider the following example:
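library(dplyr)  # for tibble(), group_by(), mutate(), ungroup()
items <- c("apple", "banana", "orange")
weights <- c(0.4, 0.2, 0.4)
# Draw a single item; apple and orange are each twice as likely as banana
weighted_sample <- sample(items, size = 1, prob = weights)
weighted_sample
[1] "apple"
# Draw one weighted item per row by sampling within each single-row group
tibble(x = 1:10) |>
  group_by(x) |>
  mutate(rs = sample(items, size = 1, prob = weights)) |>
  ungroup()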
# A tibble: 10 × 2
       x rs
   <int> <chr>
 1     1 orange
 2     2 apple
 3     3 apple
 4     4 apple
 5     5 apple
 6     6 orange
 7     7 orange
 8     8 orange
 9     9 apple
10    10 orange
With the prob argument set to the corresponding weights, the sample() function randomly selects a single item from the items vector. The probability of each item being chosen is proportional to its weight. In this case, “apple” and “orange” have a higher chance (40% each) of being selected compared to “banana” (20%). The tibble example repeats that single weighted draw once for each of the 10 rows.
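One way to check that the weights behave as probabilities is to draw many times and compare the observed shares with the weights. The following is a sketch only. The number of draws (n_draws) and the seed are assumptions chosen for illustration:

```r
items <- c("apple", "banana", "orange")
weights <- c(0.4, 0.2, 0.4)

set.seed(42)      # arbitrary seed, only so the run is repeatable
n_draws <- 10000  # hypothetical number of draws for the check

# replace = TRUE is required to draw more values than there are items
many_draws <- sample(items, size = n_draws, replace = TRUE, prob = weights)

# Observed share of each item; these should land near 0.4, 0.2 and 0.4
observed <- prop.table(table(many_draws))
round(observed, 2)
```

The same call with a smaller size is also a base R alternative for drawing several weighted values at once, without grouping a tibble row by row.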
Stratified sampling involves dividing the population into subgroups, or strata, and then sampling from each stratum, often in proportion to its size. Let’s assume we have a dataset of students’ grades in different subjects, and we want to select a sample that maintains the proportion of students from each subject. We can achieve this by combining the sample() function with by(), which applies a function to each group. Consider the following example:
subjects <- c("Math", "Science", "English", "History")
grades <- c(80, 90, 85, 70, 75, 95, 60, 92, 88, 83, 78, 91)
strata <- factor(subjects)
# rep(strata, 3) cycles through the four subjects, labelling each of the 12 grades
stratified_sample <- unlist(
  by(
    grades,
    rep(strata, 3),
    FUN = function(x) sample(x, size = 2)  # draw 2 grades within each subject
  )
)
stratified_sample
English1 English2 History1 History2    Math1    Math2 Science1 Science2
      78       60       92       91       80       75       90       95
In this example, we use the by() function to group the grades by subject (strata). Then, we apply the sample() function to each subgroup (subject) using the FUN argument. The result is a stratified sample of two grades from each subject. Because every subject has the same number of grades (three), drawing two from each keeps the subjects represented in the sample in the same proportions as in the population.
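Drawing a fixed number from every group only preserves proportions when the strata are the same size. For strata of unequal size, one possible approach is to sample the same fraction from each stratum. The sketch below is an illustration under stated assumptions: the students data frame, its values, the sampling fraction frac and the seed are all hypothetical and not part of the original example:

```r
# Hypothetical data: strata of unequal size (6 Math, 4 Science, 2 History)
students <- data.frame(
  subject = rep(c("Math", "Science", "History"), times = c(6, 4, 2)),
  grade   = c(80, 75, 88, 91, 67, 72, 90, 95, 83, 79, 70, 92)
)

set.seed(1)  # arbitrary seed for repeatability
frac <- 0.5  # assumed sampling fraction applied to every stratum

# Split the row numbers by subject, then take the same fraction of each group.
# Picking positions with sample.int() and indexing idx avoids sample()'s
# special case where a single number n is treated as the population 1:n.
rows_by_subject <- split(seq_len(nrow(students)), students$subject)
picked_rows <- unlist(lapply(rows_by_subject, function(idx) {
  n_take <- max(1, round(length(idx) * frac))
  idx[sample.int(length(idx), size = n_take)]
}))

stratified <- students[picked_rows, ]
table(stratified$subject)  # History 1, Math 3, Science 2
```

With frac = 0.5 the sample keeps the 6:4:2 ratio of the population as 3:2:1, whichever rows happen to be drawn.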
The sample() function in R provides a powerful tool for generating random samples for various purposes. Whether you need simple random sampling, sampling with replacement, weighted sampling, or even stratified sampling, the sample() function can cater to your needs. By understanding and utilizing its various parameters, you can leverage the capabilities of sampling to gain insights from your data and make informed decisions. So go ahead, experiment with different sampling techniques using the sample() function, and unlock the potential of your data!