Unleashing the Power of Sampling in R: Exploring the Versatile sample() Function

rtip
Author

Steven P. Sanderson II, MPH

Published

June 21, 2023

Introduction

Sampling is a fundamental technique in data analysis and statistical modeling. It allows us to draw meaningful insights and make inferences about a larger population based on a representative subset. In the world of R programming, the sample() function stands as a versatile tool that enables us to create random samples efficiently. In this post, we will explore the sample() function and its various applications through a series of plain English examples.

First, let’s take a look at the syntax:

sample(x, size, replace = FALSE, prob = NULL)

where:

  • x is the dataset or vector from which to take the sample
  • size is the number of elements to include in the sample
  • replace is a logical value that indicates whether or not to allow sampling with replacement (the default is FALSE)
  • prob is a vector of probabilities that can be used to weight the sample (the default is NULL)

Examples

Example 1: Simple Random Sampling

Let’s say we have a dataset containing the ages of 100 people. To create a random sample of 10 individuals, we can use the sample() function as follows:

ages <- 1:100
random_sample <- sample(ages, size = 10)
random_sample
 [1] 53 13 84 50 55  9 12 38 79 15

The sample() function randomly selects 10 values from the ages vector, without replacement, resulting in a new vector named random_sample. This technique represents simple random sampling, where each individual in the population has an equal chance of being included in the sample.

Example 2: Sampling with Replacement

In some scenarios, we might want to allow repeated selections from the population. Let’s say we have a bag with colored balls, and we want to simulate drawing 5 balls with replacement. Here’s how we can achieve it:

colors <- c("red", "blue", "green", "yellow")
sample_with_replacement <- sample(colors, size = 5, replace = TRUE)
sample_with_replacement
[1] "yellow" "yellow" "green"  "green"  "red"   

The sample() function, with the replace = TRUE argument, enables us to randomly select 5 colors from the colors vector, allowing duplicates. This approach represents sampling with replacement, where each selection is independent of the previous ones.

Example 3: Weighted Sampling

In certain situations, we may want to assign different probabilities to elements in the population. Let’s assume we have a list of items and corresponding weights denoting their probabilities of being selected. We can use the sample() function with the prob parameter to achieve weighted sampling. Consider the following example:

library(dplyr)

items <- c("apple", "banana", "orange")
weights <- c(0.4, 0.2, 0.4)
weighted_sample <- sample(items, size = 1, prob = weights)
weighted_sample
[1] "apple"
tibble(x = 1:10) |> 
  group_by(x) |> 
  mutate(rs = sample(items, size = 1, prob = weights)) |>
  ungroup()
# A tibble: 10 × 2
       x rs    
   <int> <chr> 
 1     1 orange
 2     2 apple 
 3     3 apple 
 4     4 apple 
 5     5 apple 
 6     6 orange
 7     7 orange
 8     8 orange
 9     9 apple 
10    10 orange

By specifying the prob argument with the corresponding weights, the sample() function randomly selects a single item from the items vector. The probability of each item being chosen is proportional to its weight. In this case, “apple” and “orange” have a higher chance (40% each) of being selected compared to “banana” (20%).

Example 4: Stratified Sampling

Stratified sampling involves dividing the population into subgroups or strata and then sampling from each stratum proportionally. Let’s assume we have a dataset of students’ grades in different subjects, and we want to select a sample that maintains the proportion of students from each subject. We can achieve this using the sample() function along with additional parameters. Consider the following example:

subjects <- c("Math", "Science", "English", "History")
grades <- c(80, 90, 85, 70, 75, 95, 60, 92, 88, 83, 78, 91)
strata <- factor(subjects)
stratified_sample <- unlist(
  by(
    grades, 
    rep(strata, 3), 
    FUN = function(x) sample(x, size = 2)
    )
  )
stratified_sample
English1 English2 History1 History2    Math1    Math2 Science1 Science2 
      78       60       92       91       80       75       90       95 

In this example, we use the by() function to group the grades by subject (strata). Then, we apply the sample() function to each subgroup (subject) using the FUN argument. The result is a stratified sample of two grades from each subject, maintaining the relative proportions of students in the final sample.

Conclusion

The sample() function in R provides a powerful tool for generating random samples for various purposes. Whether you need simple random sampling, sampling with replacement, weighted sampling, or even stratified sampling, the sample() function can cater to your needs. By understanding and utilizing its various parameters, you can leverage the capabilities of sampling to gain insights from your data and make informed decisions. So go ahead, experiment with different sampling techniques using the sample() function, and unlock the potential of your data!