Title: | Simulation Tools for Small Area Estimation |
---|---|
Description: | Tools for the simulation of data in the context of small area estimation. Combine all steps of your simulation - from data generation over drawing samples to model fitting - in one object. This enables easy modification and combination of different scenarios. You can store your results in a folder or start the simulation in parallel. |
Authors: | Sebastian Warnholz [aut, cre], Timo Schmid [aut] |
Maintainer: | Sebastian Warnholz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.11.0 |
Built: | 2025-01-31 03:24:30 UTC |
Source: | https://github.com/wahani/saesim |
This is the 'pipe operator' from the package 'magrittr'. Use it to chain all operations for the simulation together. See the original documentation for details: %>%.
lhs %>% rhs
lhs |
The value to be piped |
rhs |
A function or expression |
This function is intended to be used with sim_agg and not interactively. This is one implementation for aggregating data in a simulation set-up.
agg_all(groupVars = "idD")
groupVars |
variable names as character identifying groups to be aggregated. |
This function follows the split-apply-combine idiom. Each data set is split by the defined variables. Then the variables within each subset are aggregated (reduced to one row). Logical variables are reduced by any; for characters and factors, dummy variables are created and the aggregate is the mean of each dummy; numerics are reduced by the mean (removing NAs).
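The reduction rules described above can be sketched in base R. This is an illustrative re-implementation of those rules, not the package's agg_all; the helper name agg_sketch and the toy data are made up for the example.

```r
# Hypothetical helper: split by groupVars, reduce each subset to one row.
agg_sketch <- function(dat, groupVars = "idD") {
  valueVars <- setdiff(names(dat), groupVars)

  # Expand characters and factors into 0/1 dummies on the full data, so
  # that every group ends up with the same set of columns.
  expanded <- lapply(valueVars, function(nm) {
    x <- dat[[nm]]
    if (is.character(x) || is.factor(x)) {
      x <- as.character(x)
      lev <- sort(unique(x))
      out <- lapply(lev, function(l) as.numeric(x == l))
      names(out) <- paste0(nm, lev)
      as.data.frame(out)
    } else {
      out <- data.frame(x)
      names(out) <- nm
      out
    }
  })
  expanded <- do.call(cbind, expanded)

  # Split the row indices by group, reduce each subset, then combine.
  rowIndex <- split(seq_len(nrow(dat)), dat[groupVars], drop = TRUE)
  reduced <- lapply(rowIndex, function(i) {
    values <- lapply(expanded[i, , drop = FALSE], function(x) {
      if (is.logical(x)) any(x) else mean(x, na.rm = TRUE)
    })
    cbind(dat[i[1], groupVars, drop = FALSE], as.data.frame(values))
  })
  out <- do.call(rbind, reduced)
  rownames(out) <- NULL
  out
}

dat <- data.frame(
  idD = c(1, 1, 2, 2),
  y = c(1, 3, 5, NA),
  flag = c(TRUE, FALSE, FALSE, FALSE),
  sex = c("f", "m", "f", "f"),
  stringsAsFactors = FALSE
)
res <- agg_sketch(dat)
```

Here res has one row per value of idD; the character column sex becomes the two share columns sexf and sexm.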
sim_base() %>% sim_gen_x() %>% sim_gen_e() %>% sim_agg(agg_all())
Use this method to get a single simulated data.frame out of a sim_setup object.
## S3 method for class 'sim_setup'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
a sim_setup |
row.names |
will have no effect |
optional |
will have no effect |
... |
will have no effect |
Use this function to produce plots for an object of class sim_setup if you would like plots based on ggplot2. At this time it is a ggplot2 implementation which mimics the behavior of smoothScatter, without all of its options.
## S3 method for class 'sim_setup'
autoplot(object, x = "x", y = "y", ...)
object |
a sim_setup |
x |
character of variable name in the data on the x-axis |
y |
character of variable name in the data on the y-axis |
... |
is not used |
## Not run:
autoplot(sim_base_lm())
## End(Not run)
Use this function to add id-variables to your data.
base_add_id(data, domainId)
data |
a data.frame. |
domainId |
variable names in |
This function constructs a data frame with grouping/id variables.
base_id(nDomains = 10, nUnits = 10)
base_id_temporal(nDomains = 10, nUnits = 10, nTime = 10)
nDomains |
The number of domains. |
nUnits |
The number of units in each domain. Can have |
nTime |
The number of time points for each unit. |
Returns a data.frame with variables idD as ID-variable for domains, and idU as ID-variable for units.
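A minimal base R sketch of such an id frame is given below. It is not the package source: the helper name make_ids is hypothetical, and it assumes that unit ids restart at 1 within each domain.

```r
# Hypothetical helper mirroring the described return value.
make_ids <- function(nDomains = 10, nUnits = 10) {
  # recycle a single unit count to all domains
  nUnits <- rep(nUnits, length.out = nDomains)
  data.frame(
    idD = rep(seq_len(nDomains), times = nUnits), # domain id
    idU = sequence(nUnits)                        # unit id within domain
  )
}

ids <- make_ids(2, c(2, 3))
```

With unequal unit counts, as in the call above, the first domain gets two rows and the second three.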
base_id(2, 2)
base_id(2, c(2, 3))
This function is intended to be used with sim_comp_pop, sim_comp_sample or sim_comp_agg and not interactively. This is a wrapper around mutate.
comp_var(...)
... |
variables interpreted in the context of that data frame. |
sim_comp_pop, sim_comp_sample, sim_comp_agg
sim_base_lm() %>% sim_comp_pop(comp_var(yExp = exp(y)))
These functions are intended to be used with sim_gen and not interactively. They are designed to draw random numbers according to the setting of grouping variables.
gen_norm(mean = 0, sd = 1, name = "e")
gen_v_norm(mean = 0, sd = 1, name = "v")
gen_v_sar(mean = 0, sd = 1, rho = 0.5, type = "rook", name)
gen_v_ar1(mean = 0, sd = 1, rho = 0.5, groupVar = "idD", timeVar = "idT", name)
gen_generic(generator, ..., groupVars = NULL, name)
mean |
the mean passed to the random number generator, for example
|
sd |
the standard deviation passed to the random number generator, for example rnorm. |
name |
name of variable as character in which random numbers are stored. |
rho |
the correlation used to create the variance covariance matrix for
a SAR process - see |
type |
either "rook" or "queen". See |
groupVar |
a variable name identifying groups. |
timeVar |
a variable name identifying repeated measurements. |
generator |
a function producing random numbers. |
... |
arguments passed to |
groupVars |
names of variables as character. They identify groups within which random numbers are constant. |
gen_norm is used to draw random numbers from a normal distribution where all generated numbers are independent. gen_v_norm and gen_v_sar will create an area-level random component. In the case of v_norm, the error component will be from a normal distribution and i.i.d. from an area-level perspective (all units in an area will have the same value, all areas are independent). v_sar will also be from a normal distribution, but the errors are correlated. The variance covariance matrix is constructed for a SAR(1) - spatial/simultaneous autoregressive - process. mvrnorm is used for the random number generation. gen_v_norm and gen_v_sar expect a variable idD in the data identifying the areas.
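The idea of a spatially correlated area-level component can be sketched in base R as follows. The package source is not shown here, so the covariance formula is the standard SAR(1) form and is an assumption, as are the row-standardised weight matrix, the helper rook_weights and the 3 x 3 grid of areas. The sketch draws via a Cholesky factor rather than mvrnorm.

```r
# Hypothetical helper: row-standardised rook neighbourhood matrix for
# areas on a regular grid (the package refers to cell2nb for this).
rook_weights <- function(nRows, nCols) {
  pos <- expand.grid(row = seq_len(nRows), col = seq_len(nCols))
  # rook neighbours share an edge, i.e. their Manhattan distance is 1
  d <- as.matrix(dist(pos, method = "manhattan"))
  W <- (d == 1) * 1
  W / rowSums(W)
}

set.seed(1)
W <- rook_weights(3, 3)
rho <- 0.5
sdV <- 1

# Assumed SAR(1) covariance: sdV^2 * ((I - rho W)' (I - rho W))^-1
A <- diag(nrow(W)) - rho * W
Omega <- sdV^2 * solve(crossprod(A))

# One correlated draw per area, then repeated for all units of an area
vArea <- drop(t(chol(Omega)) %*% rnorm(nrow(W)))
dat <- data.frame(idD = rep(seq_len(nrow(W)), each = 2))
dat$v <- vArea[dat$idD]
```

Setting rho to 0 reduces Omega to a diagonal matrix, which corresponds to the independent area effects described for gen_v_norm.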
gen_generic can be used if your world is not normal. You can specify 'any' function as generator, like rnorm. Arguments in ... are matched by name or position. The first argument of generator is expected to be the number of random numbers (not necessarily named n) and need not be specified.
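The contract described for generator can be illustrated with a small base R sketch. The function draw_generic is hypothetical and only mimics the described behaviour: the number of draws is supplied as the first argument, and with groupVars one value is drawn per group and repeated for all rows of that group.

```r
# Hypothetical stand-in for the described behaviour, not the package code.
draw_generic <- function(dat, generator, ..., groupVars = NULL, name) {
  if (is.null(groupVars)) {
    # one independent draw per row
    dat[[name]] <- generator(nrow(dat), ...)
  } else {
    # one draw per group, constant within the group
    groupId <- as.integer(interaction(dat[groupVars], drop = TRUE))
    draws <- generator(max(groupId), ...)
    dat[[name]] <- draws[groupId]
  }
  dat
}

set.seed(1)
dat <- data.frame(idD = rep(1:3, each = 2))
out <- draw_generic(dat, rchisq, df = 5, groupVars = "idD", name = "re")
```

Because rchisq takes the number of draws as its first argument, only df has to be passed through the dots.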
sim_gen, sim_gen_x, sim_gen_e, sim_gen_ec, sim_gen_v, sim_gen_vc, cell2nb
sim_base() %>% sim_gen_x() %>% sim_gen_e() %>% sim_gen_v() %>% sim_gen(gen_v_sar(name = "vSP"))

# Generic interface
set.seed(1)
dat1 <- sim(base_id() %>% sim_gen(gen_generic(rnorm, mean = 0, sd = 4, name = "e")))
set.seed(1)
dat2 <- sim(base_id() %>% sim_gen_e())
all.equal(dat1, dat2)
Use this function to produce plots for an object of class sim_setup.
## S3 method for class 'sim_setup'
plot(x, y, ...)
x |
a |
y |
will be ignored |
... |
Arguments to be passed to |
These functions are intended to be used with sim_sample and not interactively. They are wrappers around sample_frac and sample_n.
sample_fraction(size, replace = FALSE, weight = NULL, groupVars = NULL)
sample_number(size, replace = FALSE, weight = NULL, groupVars = NULL)
sample_numbers(size, replace = FALSE, groupVars = NULL)
sample_cluster_number(size, replace = FALSE, weight = NULL, groupVars)
sample_cluster_fraction(size, replace = FALSE, weight = NULL, groupVars)
size |
< |
replace |
Sample with or without replacement? |
weight |
< |
groupVars |
character with names of variables to be used for grouping. |
sample_numbers is a vectorized version of sample_number. sample_cluster_number and sample_cluster_fraction will sample clusters (all units in a cluster).
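Cluster sampling in the sense described above, where whole clusters are drawn and every unit of a drawn cluster is kept, can be sketched in base R. The helper sample_clusters is hypothetical and ignores the replace and weight arguments.

```r
# Hypothetical helper: draw `size` clusters and keep all of their units.
sample_clusters <- function(dat, size, groupVars) {
  clusterId <- interaction(dat[groupVars], drop = TRUE)
  chosen <- sample(levels(clusterId), size)
  dat[clusterId %in% chosen, , drop = FALSE]
}

set.seed(1)
dat <- data.frame(idD = rep(1:10, each = 5), y = seq_len(50))
smp <- sample_clusters(dat, size = 3, groupVars = "idD")
```

The sample then contains three complete domains, i.e. 15 rows in this toy population.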
sim_base_lm() %>% sim_sample(sample_number(5))
sim_base_lm() %>% sim_sample(sample_fraction(0.5))
sim_base_lm() %>% sim_sample(sample_cluster_number(5, groupVars = "idD"))
sim_base_lm() %>% sim_sample(sample_cluster_fraction(0.5, groupVars = "idD"))
sim_setup
This is the documentation for the show methods in the package saeSim. In case you don't know, show is for S4 classes what print is for S3 classes. If that does not mean anything to you, don't worry: there is no reason to call show directly, but it still needs to be documented.
## S4 method for signature 'sim_setup'
show(object)

## S4 method for signature 'summary.sim_setup'
show(object)
object |
Any R object |
Will print the head of a sim_setup to the console, after converting it to a data.frame.
This function will start the simulation. Use the printing method as long as you are testing the scenario.
sim( x, R = 1, path = NULL, overwrite = TRUE, ..., suffix = NULL, fileExt = ".csv", libs = NULL, exports = NULL )
x |
a |
R |
number of repetitions. |
path |
optional path in which the simulation results can be saved. They
will be coerced to a |
overwrite |
|
... |
arguments passed to |
suffix |
an optional suffix of file names. |
fileExt |
the file extension. Default is ".csv" - alternatively it can be ".RData". |
libs |
arguments passed to |
exports |
arguments passed to |
The package parallelMap is used as back-end for parallel computations.
Use the argument path to store the simulation results in a directory. This may be a good idea for long running simulations and for those using large data.frames. You can use sim_read_data to read them in. The return value will change to NULL in each run.
The return value is a list. The elements are the results of each simulation run, typically of class data.frame. In case you specified path, each element is NULL.
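The two return modes can be illustrated with a base R sketch. It does not call the package: run_sketch, the file naming scheme and the toy run function are assumptions made for the illustration, and the read-back step only mimics what reading stored csv results amounts to.

```r
# Hypothetical driver: R repetitions, kept in memory or written to disk.
run_sketch <- function(oneRun, R = 1, path = NULL) {
  lapply(seq_len(R), function(r) {
    dat <- oneRun()
    if (is.null(path)) return(dat)
    # file name pattern is an assumption of this sketch
    write.csv(dat, file.path(path, paste0("run", r, ".csv")), row.names = FALSE)
    NULL
  })
}

set.seed(1)
oneRun <- function() data.frame(x = rnorm(5))

# Without path: a list of data.frames
inMemory <- run_sketch(oneRun, R = 2)

# With path: a list of NULLs; the results live in the directory
outDir <- tempfile("simResults")
dir.create(outDir)
onDisk <- run_sketch(oneRun, R = 2, path = outDir)

# Reading the stored runs back into one data.frame
files <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE)
stored <- do.call(rbind, lapply(files, read.csv))
```

Writing each run to disk keeps memory use flat across repetitions, which is the motivation given above for long running simulations.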
setup <- sim_base_lm()
resultList <- sim(setup, R = 1)

# For parallel computations you may need to export objects
localFun <- function() cat("Hello World!")
comp_fun <- function(dat) {
  localFun()
  dat
}
res <- sim_base_lm() %>%
  sim_comp_pop(comp_fun) %>%
  sim(R = 2, mode = "socket", cpus = 2, exports = "localFun")
str(res)
One of the components which can be added to a simulation set-up. Aggregating the data is a simulation component which can be used to aggregate the population or sample. The aggregation is done after the sampling. If you haven't specified any sampling component, the population is aggregated (this makes sense if you draw samples directly from the model).
sim_agg(simSetup, aggFun = agg_all())
simSetup |
a |
aggFun |
function which controls the aggregation process. At the moment only |
Potentially you can define an aggFun yourself. Take care that it only has one argument, named dat, and returns the aggregated data as data.frame.
agg_all
, sim_gen
, sim_comp_pop
, sim_sample
, , sim_comp_sample
# Aggregating the population:
sim_base_lm() %>% sim_agg()

# Aggregating after sampling:
sim_base_lm() %>% sim_sample() %>% sim_agg()

# User aggFun:
sim_base_lm() %>% sim_agg(function(dat) dat[1, ])
Use the 'sim_base' functions to start a new sim_setup.
sim_base(data = base_id(100, 100))
data |
a |
# Example for a linear model:
sim_base() %>% sim_gen_x() %>% sim_gen_e()
sim_base_lm() will start a linear model: One regressor, one error component. sim_base_lmm() will start a linear mixed model: One regressor, one error component and one random effect for the domain. sim_base_lmc() and sim_base_lmmc() add outlier contamination to the scenarios. Use these as a quick start, then you probably want to configure your own scenario.
sim_base_lm()
sim_base_lmm()
sim_base_lmc()
sim_base_lmmc()
Additional information on the generated variables:
nDomains: 100 domains
nUnits: 100 in each domain
x: is normally distributed with mean of 0 and sd of 4
e: is normally distributed with mean of 0 and sd of 4
v: is normally distributed with mean of 0 and sd of 1, it is a constant within domains
e-cont: as e; probability of unit to be contaminated is 0.05; sd is then 150
v-cont: as v; probability of area to be contaminated is 0.05; sd is then 40
y = 100 + x + v + e
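For orientation, the data-generating process listed above can be written out in base R for the mixed model case without contamination. This is a sketch of the stated distributions only, not the code behind sim_base_lmm; the id construction is an assumption.

```r
set.seed(1)
nDomains <- 100
nUnits <- 100

dat <- data.frame(
  idD = rep(seq_len(nDomains), each = nUnits),
  idU = rep(seq_len(nUnits), times = nDomains)
)

dat$x <- rnorm(nrow(dat), mean = 0, sd = 4)       # regressor
dat$e <- rnorm(nrow(dat), mean = 0, sd = 4)       # unit-level error
vDomain <- rnorm(nDomains, mean = 0, sd = 1)      # one value per domain
dat$v <- vDomain[dat$idD]                         # constant within domains
dat$y <- 100 + dat$x + dat$v + dat$e              # response
```

Dropping v from the last line gives the plain linear model case.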
# The preconfigured set-ups:
sim_base_lm()
sim_base_lmm()
sim_base_lmc()
sim_base_lmmc()
sim_comp_n and sim_comp_N will add the sample and population size in each domain, respectively. sim_comp_popMean and sim_comp_popVar add the population mean and variance of the variable y. The data is expected to have a variable idD identifying domains.
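What these components compute can be sketched with ave in base R. The column names N, n, popMean and popVar are assumptions of this sketch, and the "sample" is simply the first two rows of each domain to keep the example deterministic.

```r
pop <- data.frame(
  idD = rep(1:3, times = c(4, 6, 10)),
  y = as.numeric(1:20)
)

# Population quantities per domain, attached to every row
pop$N <- ave(pop$y, pop$idD, FUN = length)
pop$popMean <- ave(pop$y, pop$idD, FUN = mean)
pop$popVar <- ave(pop$y, pop$idD, FUN = var)

# A toy sample: the first two units of each domain
smp <- do.call(rbind, lapply(split(pop, pop$idD), function(d) d[1:2, ]))
smp$n <- ave(smp$y, smp$idD, FUN = length)
```

Because the population quantities are computed before the sample is taken, they stay attached to the sampled rows and can be compared with estimates later.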
sim_comp_n(simSetup)
sim_comp_N(simSetup)
sim_comp_popMean(simSetup)
sim_comp_popVar(simSetup)
simSetup |
a |
One of the components which can be added to a sim_setup. These functions can be used for adding new variables to the data.
sim_comp_pop(simSetup, fun = comp_var(), by = "")
sim_comp_sample(simSetup, fun = comp_var(), by = "")
sim_comp_agg(simSetup, fun = comp_var(), by = "")
simSetup |
a |
fun |
a function, see details. |
by |
names of variables as character; identifying groups for which fun is applied. |
Potentially you can define a function for computation yourself. Take care that it only has one argument, named dat, and returns a data.frame. Use comp_var for simple data manipulation. Functions added with sim_comp_pop are applied before sampling; those added with sim_comp_sample after sampling; and those added with sim_comp_agg after aggregation.
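The role of the by argument, applying fun separately within groups, can be sketched in base R. The helper apply_by is hypothetical, and it assumes that the default by = "" means the function is applied to the whole data set at once.

```r
# Hypothetical helper: apply `fun` to the whole data or within groups.
apply_by <- function(dat, fun, by = "") {
  if (identical(by, "")) return(fun(dat))
  pieces <- split(dat, dat[by], drop = TRUE)
  out <- do.call(rbind, lapply(pieces, fun))
  rownames(out) <- NULL
  out
}

# A computation function with the required shape: one argument named
# dat, returning a data.frame.
center_y <- function(dat) {
  dat$yCentered <- dat$y - mean(dat$y)
  dat
}

dat <- data.frame(idD = rep(1:2, each = 3), y = c(1, 2, 3, 10, 20, 30))
overall <- apply_by(dat, center_y)
byDomain <- apply_by(dat, center_y, by = "idD")
```

In overall the variable is centred around the grand mean, whereas in byDomain it is centred around each domain mean.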
comp_var, sim_gen, sim_agg, sim_sample, sim_comp_N, sim_comp_n, sim_comp_popMean, sim_comp_popVar
# Standard behavior
sim_base() %>% sim_gen_x() %>% sim_comp_N()

# Custom data modifications
## Add predicted values of a linear model
library(saeSim)
comp_lm <- function(dat) {
  dat$linearPredictor <- predict(lm(y ~ x, data = dat))
  dat
}
sim_base_lm() %>% sim_comp_pop(comp_lm)

# or if applied after sampling
sim_base_lm() %>% sim_sample() %>% sim_comp_sample(comp_lm)
One of the components which can be added to a sim_setup.
sim_gen(simSetup, generator)
sim_gen_generic(simSetup, ...)
simSetup |
a |
generator |
generator function used to generate random numbers. |
... |
arguments passed to |
Potentially you can define a generator yourself. Take care that it has one argument, named dat, and returns a data.frame. sim_gen_generic is a shortcut to gen_generic.
gen_norm
, gen_v_norm
, gen_v_sar
, sim_agg
, , sim_comp_pop
, sim_sample
, sim_gen_x
, sim_gen_e
, sim_gen_v
, sim_gen_vc
, sim_gen_ec
# Data setup for a mixed model
sim_base() %>% sim_gen_x() %>% sim_gen_v() %>% sim_gen_e()

# Adding contamination in the model error
sim_base() %>% sim_gen_x() %>% sim_gen_v() %>% sim_gen_e() %>% sim_gen_ec()

# Simple user defined generator:
gen_myVar <- function(dat) {
  dat["myVar"] <- rnorm(nrow(dat))
  dat
}
sim_base() %>% sim_gen_x() %>% sim_gen(gen_myVar)

# And a chi-sq(5) distributed 'random-effect':
sim_base() %>% sim_gen_generic(rchisq, df = 5, groupVars = "idD", name = "re")
One of the components which can be added to a sim_setup. It is applied after functions added with sim_gen.
sim_gen_cont(simSetup, generator, nCont, type, areaVar = NULL, fixed = TRUE)
simSetup |
a |
generator |
generator function used to generate random numbers. |
nCont |
gives the number of contaminated observations. Values between 0
and 1 will be treated as probability. If type is 'unit' and length is
larger than 1, the expected length is the number of areas. If type is
'area' and length is larger than 1 the values are interpreted as area
positions; i.e. |
type |
"unit" or "area" - unit- or area-level contamination. |
areaVar |
character with variable name(s) identifying areas. |
fixed |
TRUE fixes the observations which will be contaminated. FALSE will result in a random selection of observations or areas. |
sim_base_lm() %>% sim_gen_cont(gen_norm(name = "e"), nCont = 0.05, type = "unit", areaVar = "idD") %>% as.data.frame
These are some preconfigured generation components; all of them are wrappers around sim_gen and sim_gen_cont.
sim_gen_x(simSetup, mean = 0, sd = 4, name = "x")
sim_gen_e(simSetup, mean = 0, sd = 4, name = "e")
sim_gen_ec(simSetup, mean = 0, sd = 150, name = "e", nCont = 0.05, type = "unit", areaVar = "idD", fixed = TRUE)
sim_gen_v(simSetup, mean = 0, sd = 1, name = "v")
sim_gen_vc(simSetup, mean = 0, sd = 40, name = "v", nCont = 0.05, type = "area", areaVar = "idD", fixed = TRUE)
simSetup |
a |
mean |
the mean passed to the random number generator, for example
|
sd |
the standard deviation passed to the random number generator, for example rnorm. |
name |
name of variable as character in which random numbers are stored. |
nCont |
gives the number of contaminated observations. Values between 0
and 1 will be treated as probability. If type is 'unit' and length is
larger than 1, the expected length is the number of areas. If type is
'area' and length is larger than 1 the values are interpreted as area
positions; i.e. |
type |
"unit" or "area" - unit- or area-level contamination. |
areaVar |
character with variable name(s) identifying areas. |
fixed |
TRUE fixes the observations which will be contaminated. FALSE will result in a random selection of observations or areas. |
x: fixed-effect component; e: model-error; ec: contaminated model error; v: random-effect (error constant for each domain); vc: contaminated random-effect. Note that for contamination you are expected to add both a non-contaminated component and a contaminated component.
Functions to read in simulation data from a folder. The files can be csv or RData files.
sim_read_data(path, ..., returnList = FALSE)
sim_clear_data(path, ...)
sim_read_list(path)
sim_clear_list(path)
path |
path to the files you want to read in. |
... |
arguments passed to |
returnList |
if |
One of the components which can be added to a sim_setup.
sim_resp(simSetup, respFun)
sim_resp_eq(simSetup, ...)
simSetup |
a |
respFun |
a function constructing the response variable |
... |
< The value can be:
|
Potentially you can define a respFun yourself. Take care that it only has one argument, named dat, and returns a data.frame.
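A respFun with the required shape, and an equation-style interface of the kind sim_resp_eq appears to offer, can be sketched in base R. Both functions are hypothetical illustrations: resp_eq_sketch only shows one way to evaluate named expressions in the context of the data and is not the package implementation.

```r
# A response function: one argument named dat, returns a data.frame.
resp_linear <- function(dat) {
  dat$y <- 100 + 2 * dat$x + dat$e
  dat
}

# Hypothetical equation-style helper: each named expression in ... is
# evaluated with the columns of dat in scope and stored under its name.
resp_eq_sketch <- function(dat, ...) {
  caller <- parent.frame()
  exprs <- eval(substitute(alist(...)))
  for (nm in names(exprs)) {
    dat[[nm]] <- eval(exprs[[nm]], dat, caller)
  }
  dat
}

dat <- data.frame(x = c(1, 2), e = c(0.5, -0.5))
out1 <- resp_linear(dat)
out2 <- resp_eq_sketch(dat, y = 100 + 2 * x + e)
```

Both calls produce the same y; the second avoids writing a function for simple equations.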
agg_all
, sim_gen
, sim_comp_pop
, sim_sample
, , sim_comp_sample
base_id() %>% sim_gen_x() %>% sim_gen_e() %>% sim_resp_eq(y = 100 + 2 * x + e)
One of the components which can be added to a sim_setup. This component can be used to add a sampling mechanism to the simulation set-up. A sample will be drawn after the population is generated (sim_gen) and variables on the population are computed (sim_comp_pop).
sim_sample(simSetup, smplFun = sample_number(size = 5L, groupVars = "idD"))
simSetup |
a |
smplFun |
function which controls the sampling process. |
Potentially you can define a smplFun yourself. Take care that it has one argument, named dat, being the data as data.frame, and returns the sample as data.frame.
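A sampling function that draws a fixed number of units within each domain, similar in spirit to the default sample_number(size = 5L, groupVars = "idD"), can be sketched in base R. The helper sample_per_group is hypothetical and samples without replacement and without weights.

```r
# Hypothetical smplFun factory: returns a function of one argument, dat.
sample_per_group <- function(size, groupVar = "idD") {
  function(dat) {
    rows <- split(seq_len(nrow(dat)), dat[[groupVar]])
    # sample.int avoids the pitfall of sample() on a length-one vector
    picked <- lapply(rows, function(i) i[sample.int(length(i), size)])
    dat[sort(unlist(picked)), , drop = FALSE]
  }
}

set.seed(1)
pop <- data.frame(idD = rep(1:4, each = 20), y = rnorm(80))
smplFun <- sample_per_group(size = 5)
smp <- smplFun(pop)
```

Returning a closure keeps the one-argument contract while still letting the sample size be configured.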
sample_number, sample_fraction
# Simple random sample - 5% sample:
sim_base_lm() %>% sim_sample(sample_fraction(0.05))

# Simple random sampling proportional to size - 5% in each domain:
sim_base_lm() %>% sim_sample(sample_fraction(0.05, groupVars = "idD"))

# User defined sampling function:
sample_mySampleFun <- function(dat) {
  dat[sample.int(nrow(dat), 10), ]
}
sim_base_lm() %>% sim_sample(sample_mySampleFun)
Use this function to add a name to a sim_setup in case you are simulating different scenarios. This name will be added if you use the function sim for simulation.
sim_simName(simSetup, name)
simSetup |
a |
name |
a character |
sim_base_lm() %>% sim_simName("newName")
Reports a summary of the simulation setup.
## S4 method for signature 'sim_setup'
summary(object, ...)
object |
a |
... |
has no effect. |
summary(sim_base_lm())