Tools for the simulation of data in the context of small area estimation. Combine all steps of your simulation - from data generation over drawing samples to model fitting - in one object. This enables easy modification and combination of different scenarios. You can store your results in a folder or start the simulation in parallel.
Two external resources may be of interest in addition to this vignette:
Consider a linear mixed model. It contains two components. A fixed effects part, and an error component. The error component can be split into a random effects part and a model error.
library(saeSim)
setup <- sim_base() %>%
sim_gen_x() %>%
sim_gen_e() %>%
sim_gen_v() %>%
sim_resp_eq(y = 100 + 2 * x + v + e) %>%
sim_simName("Doku")
setup## idD idU x e v y
## 1 1 1 -2.5058152 -3.217326 0.2353485 92.00639
## 2 1 2 0.7345733 -4.226103 0.2353485 97.47839
## 3 1 3 -3.3425144 -4.141583 0.2353485 89.40874
## 4 1 4 6.3811232 -4.742241 0.2353485 108.25535
## 5 1 5 1.3180311 -2.001758 0.2353485 100.86965
## 6 1 6 -3.2818735 -2.099955 0.2353485 91.57165
sim_base() is responsible to supply a
data.frame to which variables can be added. The default is
to create a data.frame with indicator variables
idD and idU (2-level-model), which uniquely
identify observations. ‘D’ stands for the domain, i.e. the grouping
variable. ‘U’ stands for unit and is the identifier of single
observations within domains. sim_resp will add a variable
y as response.
The setup itself does not contain the simulated data but the
functions to process the data. To start a simulation use the function
sim. It will return a list containing
data.frames as elements:
You can coerce a simulation setup to a data.frame with
as.data.frame.
Components in a simulation setup should be added using the pipe
operator %>% from magrittr. A component defines ‘when’ a
specific function will be applied in a chain of functions. To add a
component you can use one of: sim_gen,
sim_resp, sim_comp_pop,
sim_sample, sim_comp_sample,
sim_agg and sim_comp_agg. They all expect a
simulation setup as first argument and a function as second and will
take care of the order in which the functions are called. The reason for
this is the following:
setup <- sim_base() %>%
sim_gen_x() %>%
sim_gen_e() %>%
sim_resp_eq(y = 100 + 2 * x + e)
setup1 <- setup %>% sim_sample(sample_fraction(0.05))
setup2 <- setup %>% sim_sample(sample_number(5))You can define a rudimentary scenario and only have to explain how
scenarios differ. You do not have to redefine them. setup1
and setup2 only differ in the way samples are drawn.
sim_sample will take care, that the sampling will take
place at the appropriate place in the chain of functions no matter how
setup was composed.
If you can’t remember all function names, use auto-complete. All
functions to add components start with sim_. And all
functions meant to be used in a given phase will start with the
corresponding prefix, i.e. if you set the sampling scheme you use
sim_sample – all functions to control sampling have the
prefix sample.
You will want to check your results regularly when working with
sim_setup objects. Some methods are supplied to do
that:
show - this is the print method for
S4-Classes. You don’t have to call show explicitly. You may
have noticed that only a few lines of data are printed to the console if
you evaluate a simulation setup. show will print the
head of the resulting data of one simulation run.plot - for visualizing the dataautoplot - Will imitate smoothScatter with
ggplot2saeSim has an interface to standard random number generators in R. If things get more complex you can always define new generator functions.
## idD idU x
## 1 1 1 -3.477177
## 2 1 2 -3.477177
## 3 1 3 -3.477177
## 4 2 1 13.219727
## 5 2 2 13.219727
## 6 2 3 13.219727
You can supply any random number generator to
gen_generic and since we are in small area estimation you
have an optional group variable to generate ‘area-level’ variables. Some
short cuts for data generation are sim_gen_x,
sim_gen_v and sim_gen_e which add normally
distributed variables named ‘x’, ‘e’ and ‘v’ respectively. Also there
are some function with the prefix ‘gen’ which will be extended in the
future.
library(saeSim)
setup <- sim_base() %>%
sim_gen_x() %>% # Variable 'x'
sim_gen_e() %>% # Variable 'e'
sim_gen_v() %>% # Variable 'v' as a random-effect
sim_gen(gen_v_sar(name = "vSp")) %>% # random-effect following a SAR(1)
sim_resp_eq(y = 100 + x + v + vSp + e) # Computing 'y'
setup## idD idU x e v vSp y
## 1 1 1 -3.261018 -5.797709 -1.293091 0.4803078 90.12849
## 2 1 2 4.256857 4.773686 -1.293091 0.4803078 108.21776
## 3 1 3 -3.071645 1.024299 -1.293091 0.4803078 97.13987
## 4 1 4 5.208898 2.322837 -1.293091 0.4803078 106.71895
## 5 1 5 1.908363 8.533583 -1.293091 0.4803078 109.62916
## 6 1 6 -2.660212 3.759640 -1.293091 0.4803078 100.28664
For contaminated data you can use the same generator functions,
however, instead of using sim_gen you use
sim_gen_cont which will have some extra arguments to
control the contamination. To extend the above setup with a contaminated
spatially correlated error component you can add the following:
contSetup <- setup %>%
sim_gen_cont(
gen_v_sar(sd = 40, name = "vSp"), # defining the model
nCont = 0.05, # 5 per cent outliers
type = "area", # whole areas are outliers, i.e. all obs within
areaVar = "idD", # var name to identify domain
fixed = TRUE # if in each iteration the same area is an outlier
)Note that the generator is the same but with a higher standard
deviation. The argument nCont controls how much
observations are contaminated. Values < 1 are interpreted as
probability. A single number as the number of contaminated units (can be
areas or observations in each area or observations). When given with
length(nCont) > 1 it will be interpreted as number of
contaminated observations in each area. Use the following example to see
how these things play together:
base_id(3, 4) %>%
sim_gen_x() %>%
sim_gen_e() %>%
sim_gen_ec(mean = 0, sd = 150, name = "eCont", nCont = c(1, 2, 3)) %>%
as.data.frame## idD idU x e eCont idC
## 1 1 1 2.1238928 2.85552209 0.00000 FALSE
## 2 1 2 1.8224797 -3.70320125 0.00000 FALSE
## 3 1 3 -0.7466757 -4.63509843 0.00000 FALSE
## 4 1 4 0.2864107 -0.51238608 174.16195 TRUE
## 5 2 1 -4.1711873 -1.64389880 0.00000 FALSE
## 6 2 2 -0.9948926 2.68706944 0.00000 FALSE
## 7 2 3 -4.4030395 0.08884468 131.30190 TRUE
## 8 2 4 1.4103819 0.77040266 -111.81580 TRUE
## 9 3 1 -0.3921937 0.88877112 0.00000 FALSE
## 10 3 2 -5.7884702 -1.54674283 47.70671 TRUE
## 11 3 3 -4.8635870 1.35621664 -306.40766 TRUE
## 12 3 4 -5.7893455 4.26584633 305.57991 TRUE
Here follow some examples how to add components for computation to a
sim_setup. Three points can be accessed with
sim_comp_pop - add a computation before samplingsim_comp_sample - add a computation after samplingsim_comp_agg - add a computation after aggregationbase_id(2, 3) %>%
sim_gen_x() %>%
sim_gen_e() %>%
sim_gen_ec() %>%
sim_resp_eq(y = 100 + x + e) %>%
# the mean in each domain:
sim_comp_pop(comp_var(popMean = mean(y)), by = "idD")## idD idU x e idC y popMean
## 1 1 1 0.5746668 6.6169891 FALSE 107.1917 108.2884
## 2 1 2 3.7688622 0.2257547 FALSE 103.9946 108.2884
## 3 1 3 12.6044689 1.0745909 FALSE 113.6791 108.2884
## 4 2 1 6.6417051 5.0116040 FALSE 111.6533 106.3608
## 5 2 2 -0.8421809 3.4132565 FALSE 102.5711 106.3608
## 6 2 3 4.4991934 0.3587781 FALSE 104.8580 106.3608
The function comp_var is a wrapper around
dplyr::mutate so you can add simple data manipulations. The
argument by is a little extension and lets you define
operations in the scope of groups identified by a variable in the data.
In this case the mean of the variable ‘y’ is computed for every group
identified with the variable ‘idD’. This is done before sampling, hence
the prefix ‘pop’ for population.
By adding computation functions you can extend the functionality to wrap your whole simulation. This will separate the utility of this package from only generating data.
comp_linearPredictor <- function(dat) {
dat$linearPredictor <- lm(y ~ x, dat) %>% predict
dat
}
sim_base_lm() %>%
sim_comp_pop(comp_linearPredictor)## idD idU x e y linearPredictor
## 1 1 1 -8.5041814 -0.6369861 90.85883 91.55255
## 2 1 2 -0.7317040 3.0870898 102.35539 99.25294
## 3 1 3 1.1077078 6.9927853 108.10049 101.07530
## 4 1 4 3.8226415 4.3782846 108.20093 103.76505
## 5 1 5 0.6150609 -0.1913978 100.42366 100.58722
## 6 1 6 0.3820305 -3.6001313 96.78190 100.35635
Or, should this be desirable, directly produce a list of
lm objects or add them as attribute to the data. The
intended way of writing functions is that they will return the modified
data of class ‘data.frame’.
## [[1]]
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept) x
## 100.0421 0.9872
comp_linearModelAsAttr <- function(dat) {
attr(dat, "linearModel") <- lm(y ~ x, dat)
dat
}
dat <- sim_base_lm() %>%
sim_comp_pop(comp_linearModelAsAttr) %>%
as.data.frame
attr(dat, "linearModel")##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept) x
## 100.017 1.032
If you use any kind of sampling, the ‘linearPredictor’ can be added after sampling. This is where small area models are supposed to be applied.
## idD idU x e y linearPredictor
## 1 1 1 -2.3521855 -2.075083 95.57273 97.97613
## 2 1 94 -6.8884013 -2.397076 90.71452 93.20934
## 3 1 68 5.0285137 -3.870309 101.15820 105.73200
## 4 1 19 3.0967386 4.419374 107.51611 103.70203
## 5 1 27 -0.4615954 8.400955 107.93936 99.96282
## 6 2 95 5.0186243 1.586789 106.60541 105.72161
Should you want to apply area level models, use
sim_comp_agg instead.
## idD x e y linearPredictor
## 1 1 -2.9161403 1.7102560 98.79412 96.76717
## 2 2 -0.2108174 -0.1187314 99.67045 99.52888
## 3 3 2.6877059 -1.5974012 101.09030 102.48782
## 4 4 0.8245103 -4.4691303 96.35538 100.58579
## 5 5 1.3990762 0.5101492 101.90923 101.17233
## 6 6 1.1361814 2.2870048 103.42319 100.90395
After the data generation you may want to draw a sample from the
population. Use the function sim_sample() to add a sampling
component to your sim_setup.
sample_number - wrapper around
dplyr::sample_nsample_fraction - wrapper around
dplyr::sample_frac## idD idU x
## 1 2 3 -2.117038
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## NA.3 NA NA NA
## NA.4 NA NA NA
## idD idU x
## 1 1 3 -5.349785
## 2 2 1 -9.711665
## 3 3 2 1.483272
## NA NA NA NA
## NA.1 NA NA NA
## NA.2 NA NA NA
## idD idU x e y
## 1 79 72 2.3706682 2.014138 104.38481
## 2 22 15 2.9998000 3.032253 106.03205
## 3 64 64 -0.2648422 -9.362967 90.37219
## 4 57 31 6.0910778 -3.184816 102.90626
## 5 53 27 -1.8359776 2.840800 101.00482
## 6 45 26 0.5487789 -9.372022 91.17676
## idD idU x e y
## 1 80 7 -3.7332804 7.142498 103.40922
## 2 100 74 9.6384784 1.852246 111.49072
## 3 95 49 -1.7732996 -6.654482 91.57222
## 4 62 95 5.9321181 1.967965 107.90008
## 5 76 80 -0.9007322 3.792310 102.89158
## 6 3 75 4.6494503 -3.962043 100.68741
# srs in each domain/cluster
sim_base_lm() %>% sim_sample(sample_number(size = 10L, groupVars = "idD"))## idD idU x e y
## 1 1 13 -1.0937285 3.430671 102.33694
## 2 1 23 -5.0083024 -12.542080 82.44962
## 3 1 26 -0.6701565 1.254721 100.58456
## 4 1 68 5.7348578 5.134796 110.86965
## 5 1 47 0.7213554 6.893006 107.61436
## 6 1 15 -3.4577262 3.535397 100.07767
## idD idU x e y
## 1 1 18 2.4307740 -2.500055 99.93072
## 2 1 25 -0.3409896 -1.197272 98.46174
## 3 1 51 3.5957433 3.326504 106.92225
## 4 1 33 -1.0225583 3.621166 102.59861
## 5 1 16 -0.5729424 -8.278747 91.14831
## 6 2 49 6.0450908 1.109298 107.15439