Simulate Data

The submodule simulate creates synthetic data from causal models. This is useful for examples, tutorials, simulation studies, checking estimators, and building reproducible workflows before using real data.

Each simulator is called as simulate.<model>. For instance, simulate.lsem() simulates data from a Linear Structural Equation Model (LSEM) implied by a Graphical Causal Model (GCM). The simulator uses the graph to determine the order in which the variables are generated and which variables are parents of each node. For each endogenous variable, it builds a linear equation in that variable's parents, with a randomly drawn intercept and coefficients and Gaussian noise.
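
The idea can be sketched with the Python standard library alone. This is a minimal illustration, not the causalinf implementation: the parent map parents, the helper simulate_lsem_sketch, and the uniform(-1, 1) distribution for the coefficients are all assumptions made for the example.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical parent map: each node lists its parents.
# It mirrors a two-confounder structure: Z1, Z2 -> D and Z1, Z2, D -> Y.
parents = {
    "Z1": [],
    "Z2": [],
    "D": ["Z1", "Z2"],
    "Y": ["Z1", "Z2", "D"],
}


def simulate_lsem_sketch(parents, n=1000, seed=None, noise_sd=1.0):
    """Draw n observations from a linear SEM implied by a parent map."""
    rng = random.Random(seed)
    # Parents must be generated before their children.
    order = list(TopologicalSorter(parents).static_order())

    # Assumed: intercept ("1") and slopes drawn uniformly from (-1, 1)
    # for every node that has at least one parent.
    coefs = {
        node: {term: rng.uniform(-1, 1) for term in ["1", *parents[node]]}
        for node in order
        if parents[node]
    }

    data = {node: [] for node in order}
    for _ in range(n):
        for node in order:
            value = rng.gauss(0.0, noise_sd)  # Gaussian disturbance
            if parents[node]:
                value += coefs[node]["1"]
                # data[p][-1] is the parent's value in the current row.
                value += sum(coefs[node][p] * data[p][-1] for p in parents[node])
            data[node].append(value)
    return data, coefs


data, coefs = simulate_lsem_sketch(parents, n=5, seed=1)
print(sorted(coefs))  # nodes that received an equation
```

Exogenous variables (Z1 and Z2 here) receive only noise, while D and Y are linear functions of their parents plus noise.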

The object returned by simulate.<model> stores both the simulated data and the true parameters used to generate them. Because the true data-generating process is known, estimates can be checked against the true values.

Basic example

Start with a GCM. Here, we use one of the built-in examples:

from causalinf import simulate
from causalinf import gcm

G = gcm.examples(which='Two confounders')
G.plot()

To simulate data from a linear structural equation model based on that graph, use:

sim = simulate.lsem(G, seed=1)
print(sim)

Linear Structural Equation Model (LSEM): 
D = (0.4071) + (0.7584)*Z2 + (-0.6624)*Z1
Y = (-0.2731) + (-0.3515)*Z1 + (-0.2122)*Z2 + (-0.8772)*D
shape: (1_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y         │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1.624345  ┆ -0.153236 ┆ -0.495579 ┆ 0.528705  │
│ -0.611756 ┆ -2.432509 ┆ -2.621482 ┆ 2.393703  │
│ -0.528172 ┆ 0.507984  ┆ 2.445209  ┆ -1.126385 │
│ -1.072969 ┆ -0.324032 ┆ -0.193344 ┆ -0.736317 │
│ 0.865408  ┆ -1.511077 ┆ -1.235186 ┆ -0.484971 │
│ …         ┆ …         ┆ …         ┆ …         │
│ -0.116444 ┆ 0.188583  ┆ 0.911065  ┆ -1.6025   │
│ -2.277298 ┆ 0.560918  ┆ 3.063715  ┆ -0.866042 │
│ -0.069625 ┆ -0.921659 ┆ 0.040004  ┆ 2.018582  │
│ 0.35387   ┆ 0.647375  ┆ 0.288054  ┆ 0.076519  │
│ -0.186955 ┆ 1.386826  ┆ 2.995921  ┆ -3.579248 │
└───────────┴───────────┴───────────┴───────────┘

The argument seed makes the simulation reproducible. If seed is omitted, a new random simulation is drawn each time.
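
The effect of a seed can be illustrated with Python's standard random module. This is a generic sketch of seeded pseudo-random draws, not the generator that causalinf uses internally; the helper draw is hypothetical.

```python
import random


def draw(n, seed=None):
    """Return n standard-normal draws from a generator seeded with `seed`."""
    # With seed=None the generator is seeded from system entropy or the clock.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = draw(3, seed=1)
same_b = draw(3, seed=1)
other = draw(3, seed=2)

print(same_a == same_b)  # True: identical seeds give identical draws
print(same_a == other)   # False: a different seed gives a different sequence
```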

Simulated data

The simulated data are stored in sim.data, a tidypolars4sci dataframe.

print(sim.data.head())

shape: (5, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y         │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1.624345  ┆ -0.153236 ┆ -0.495579 ┆ 0.528705  │
│ -0.611756 ┆ -2.432509 ┆ -2.621482 ┆ 2.393703  │
│ -0.528172 ┆ 0.507984  ┆ 2.445209  ┆ -1.126385 │
│ -1.072969 ┆ -0.324032 ┆ -0.193344 ┆ -0.736317 │
│ 0.865408  ┆ -1.511077 ┆ -1.235186 ┆ -0.484971 │
└───────────┴───────────┴───────────┴───────────┘

The number of observations is controlled by the argument n. As the output above shows, the default is 1,000.

sim_small = simulate.lsem(G, n=5, seed=1)
print(sim_small.data)

shape: (5, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y         │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1.624345  ┆ -2.301539 ┆ 1.453514  ┆ 1.564702  │
│ -0.611756 ┆ 1.744812  ┆ -1.841985 ┆ -0.359266 │
│ -0.528172 ┆ -0.761207 ┆ 0.938394  ┆ -1.402515 │
│ -1.072969 ┆ 0.319039  ┆ -1.399277 ┆ 1.431886  │
│ 0.865408  ┆ -0.24937  ┆ 1.095344  ┆ -0.449383 │
└───────────┴───────────┴───────────┴───────────┘
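
In the outputs above, the first five values of Z1 are the same for n=5 and for the default n, while Z2, D, and Y differ. One plausible explanation, stated here as an assumption rather than documented behavior, is that the draws for each variable are taken as a block of n values from a single random stream, so changing n shifts where every later block starts. The sketch below uses Python's random module and a hypothetical helper draw_columns to show the effect.

```python
import random


def draw_columns(names, n, seed):
    """Draw n values per column, one column after another, from one stream."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in range(n)] for name in names}


large = draw_columns(["Z1", "Z2"], n=1000, seed=1)
small = draw_columns(["Z1", "Z2"], n=5, seed=1)

# The first column starts at the same point of the stream in both runs.
print(large["Z1"][:5] == small["Z1"])    # True
# The second column starts after 1000 draws in one run and after 5 in the other.
print(large["Z2"][:5] == small["Z2"])    # False
# small["Z2"] is the continuation of the stream after the first 5 draws.
print(large["Z1"][5:10] == small["Z2"])  # True
```

The practical consequence is that a seed identifies a simulation only together with n, so both should be recorded.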

True parameters

The simulator stores the true coefficients in two formats:

sim.parameters, a nested dictionary keyed by variable and then by parent variable. The intercept is stored under the key '1', and exogenous variables map to empty dictionaries.
sim.parameters_tidy, a tidy dataframe with one row per parameter.

print(sim.parameters)

{'Z1': {}, 'Z2': {}, 'D': {'1': np.float64(0.40709546440285704), 'Z2': 0.758397121073191, 'Z1': -0.6623996198448128}, 'Y': {'1': np.float64(-0.2731382139476062), 'Z1': -0.3514895517458223, 'Z2': -0.21224372445969886, 'D': -0.8771756918836713}}

print(sim.parameters_tidy)

shape: (7, 5)
┌─────┬─────┬─────┬────────┬───────────┐
│ lhs ┆ op  ┆ rhs ┆ term   ┆ true      │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---       │
│ str ┆ str ┆ str ┆ str    ┆ f64       │
╞═════╪═════╪═════╪════════╪═══════════╡
│ D   ┆ ~   ┆ 1   ┆ D ~ 1  ┆ 0.407095  │
│ D   ┆ ~   ┆ Z2  ┆ D ~ Z2 ┆ 0.758397  │
│ D   ┆ ~   ┆ Z1  ┆ D ~ Z1 ┆ -0.6624   │
│ Y   ┆ ~   ┆ 1   ┆ Y ~ 1  ┆ -0.273138 │
│ Y   ┆ ~   ┆ Z1  ┆ Y ~ Z1 ┆ -0.35149  │
│ Y   ┆ ~   ┆ Z2  ┆ Y ~ Z2 ┆ -0.212244 │
│ Y   ┆ ~   ┆ D   ┆ Y ~ D  ┆ -0.877176 │
└─────┴─────┴─────┴────────┴───────────┘
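
The two formats carry the same information. As a sketch of how they relate, the nested dictionary can be flattened into one record per parameter with plain Python. The helper to_tidy_rows is hypothetical and is not part of causalinf; the values are copied from the printed sim.parameters above.

```python
# True coefficients copied from the printed sim.parameters above.
parameters = {
    "Z1": {},
    "Z2": {},
    "D": {
        "1": 0.40709546440285704,
        "Z2": 0.758397121073191,
        "Z1": -0.6623996198448128,
    },
    "Y": {
        "1": -0.2731382139476062,
        "Z1": -0.3514895517458223,
        "Z2": -0.21224372445969886,
        "D": -0.8771756918836713,
    },
}


def to_tidy_rows(parameters):
    """Flatten {lhs: {rhs: value}} into one record per parameter."""
    return [
        {"lhs": lhs, "op": "~", "rhs": rhs, "term": f"{lhs} ~ {rhs}", "true": value}
        for lhs, terms in parameters.items()
        for rhs, value in terms.items()
    ]


rows = to_tidy_rows(parameters)
for row in rows:
    print(row["term"], round(row["true"], 6))
```

Exogenous variables have empty dictionaries and therefore contribute no rows, which is why the tidy table has seven rows.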

The true parameters can be compared with estimates from the corresponding causal model. See "Estimation" and "Summary and Reporting" for examples of estimation and reporting workflows.

Noise

The argument noise sets the standard deviation of the Gaussian disturbance terms; the default is 1.0. A single number applies the same noise level to every variable, exogenous variables included.

sim_low_noise = simulate.lsem(G, seed=1, noise=0.25)
print(sim_low_noise.data.head())

shape: (5, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y         │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.406086  ┆ -0.038309 ┆ 0.181427  ┆ -0.340498 │
│ -0.152939 ┆ -0.608127 ┆ -0.350049 ┆ 0.125751  │
│ -0.132043 ┆ 0.126996  ┆ 0.916624  ┆ -0.754271 │
│ -0.268242 ┆ -0.081008 ┆ 0.256985  ┆ -0.656754 │
│ 0.216352  ┆ -0.377769 ┆ -0.003475 ┆ -0.593917 │
└───────────┴───────────┴───────────┴───────────┘

Use a dictionary to set variable-specific noise levels. Variables not included in the dictionary use the default value 1.0.

sim_custom_noise = simulate.lsem(
    G,
    seed=1,
    noise={"D": 0.25, "Y": 2.0},
)
print(sim_custom_noise.data.head())

shape: (5, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y         │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1.624345  ┆ -0.153236 ┆ -0.712708 ┆ 1.624715  │
│ -0.611756 ┆ -2.432509 ┆ -1.429734 ┆ 0.984361  │
│ -0.528172 ┆ 0.507984  ┆ 1.46796   ┆ 0.944635  │
│ -1.072969 ┆ -0.324032 ┆ 0.605727  ┆ -2.51593  │
│ 0.865408  ┆ -1.511077 ┆ -1.292906 ┆ -1.746182 │
└───────────┴───────────┴───────────┴───────────┘
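
The two forms of the noise argument can be thought of as resolving to one standard deviation per variable. The helper resolve_noise below is a hypothetical sketch of that rule, not the library's code. The last lines check the scaling against the printed outputs: with noise=0.25 the first Z1 and Z2 values are 0.25 times their values under the default noise.

```python
def resolve_noise(noise, variables, default=1.0):
    """Return a {variable: standard deviation} map covering every variable."""
    if isinstance(noise, dict):
        # Variables missing from the dictionary fall back to the default.
        return {v: float(noise.get(v, default)) for v in variables}
    # A single number applies to all variables.
    return {v: float(noise) for v in variables}


variables = ["Z1", "Z2", "D", "Y"]
print(resolve_noise(0.25, variables))
print(resolve_noise({"D": 0.25, "Y": 2.0}, variables))

# Scaling check against the printed tables (default noise vs. noise=0.25).
print(round(0.25 * 1.624345, 6))   # 0.406086, the first Z1 value with noise=0.25
print(round(0.25 * -0.153236, 6))  # -0.038309, the first Z2 value with noise=0.25
```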

Binary, ordinal, and categorical variables

By default, all variables are simulated as continuous. The arguments binary, ordinal, and categorical can be used to discretize selected variables.

Use binary for Bernoulli variables. The argument can be a string or a list of variable names.

sim_binary = simulate.lsem(G, seed=1, binary="Y")

print(sim_binary.binary)
print(sim_binary.data.head())

('Y',)
shape: (5, 4)
┌───────────┬───────────┬───────────┬─────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y   │
│ ---       ┆ ---       ┆ ---       ┆ --- │
│ f64       ┆ f64       ┆ f64       ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═════╡
│ 1.624345  ┆ -0.153236 ┆ -0.495579 ┆ 1   │
│ -0.611756 ┆ -2.432509 ┆ -2.621482 ┆ 1   │
│ -0.528172 ┆ 0.507984  ┆ 2.445209  ┆ 0   │
│ -1.072969 ┆ -0.324032 ┆ -0.193344 ┆ 0   │
│ 0.865408  ┆ -1.511077 ┆ -1.235186 ┆ 0   │
└───────────┴───────────┴───────────┴─────┘

Use ordinal to create ordered integer variables. Each dictionary value gives the lower and upper bounds of the scale for that variable.

sim_ordinal = simulate.lsem(G, seed=1, ordinal={"Y": [1, 5]})

print(sim_ordinal.ordinal)
print(sim_ordinal.data.head())

{'Y': (1, 5)}
shape: (5, 4)
┌───────────┬───────────┬───────────┬─────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y   │
│ ---       ┆ ---       ┆ ---       ┆ --- │
│ f64       ┆ f64       ┆ f64       ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═════╡
│ 1.624345  ┆ -0.153236 ┆ -0.495579 ┆ 4   │
│ -0.611756 ┆ -2.432509 ┆ -2.621482 ┆ 5   │
│ -0.528172 ┆ 0.507984  ┆ 2.445209  ┆ 2   │
│ -1.072969 ┆ -0.324032 ┆ -0.193344 ┆ 2   │
│ 0.865408  ┆ -1.511077 ┆ -1.235186 ┆ 2   │
└───────────┴───────────┴───────────┴─────┘

Use categorical to create variables with named categories.

sim_categorical = simulate.lsem(
    G,
    seed=1,
    categorical={"Y": ["low", "medium", "high"]},
)

print(sim_categorical.categorical)
print(sim_categorical.data.head())

{'Y': ('low', 'medium', 'high')}
shape: (5, 4)
┌───────────┬───────────┬───────────┬────────┐
│ Z1        ┆ Z2        ┆ D         ┆ Y      │
│ ---       ┆ ---       ┆ ---       ┆ ---    │
│ f64       ┆ f64       ┆ f64       ┆ str    │
╞═══════════╪═══════════╪═══════════╪════════╡
│ 1.624345  ┆ -0.153236 ┆ -0.495579 ┆ medium │
│ -0.611756 ┆ -2.432509 ┆ -2.621482 ┆ high   │
│ -0.528172 ┆ 0.507984  ┆ 2.445209  ┆ low    │
│ -1.072969 ┆ -0.324032 ┆ -0.193344 ┆ low    │
│ 0.865408  ┆ -1.511077 ┆ -1.235186 ┆ medium │
└───────────┴───────────┴───────────┴────────┘

A variable can use only one of these types. For example, the same variable cannot be both binary and ordinal in the same simulation.
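
The printed attributes above show the stored forms of these arguments: a string becomes a one-element tuple, and list values become tuples. The sketch below imitates that normalization and adds the one-type-per-variable check. The helper normalize_types and its error message are hypothetical; how causalinf reports a conflict is not shown in this page.

```python
def normalize_types(binary=None, ordinal=None, categorical=None):
    """Normalize the three arguments and reject variables declared twice."""
    if binary is None:
        binary = ()
    elif isinstance(binary, str):
        binary = (binary,)  # a single name becomes a one-element tuple
    else:
        binary = tuple(binary)
    ordinal = {k: tuple(v) for k, v in (ordinal or {}).items()}
    categorical = {k: tuple(v) for k, v in (categorical or {}).items()}

    seen = {}
    kinds = (("binary", binary), ("ordinal", ordinal), ("categorical", categorical))
    for kind, names in kinds:
        for name in names:
            if name in seen:
                raise ValueError(f"{name!r} is declared as both {seen[name]} and {kind}")
            seen[name] = kind
    return binary, ordinal, categorical


print(normalize_types(binary="Y"))
print(normalize_types(ordinal={"Y": [1, 5]}))

try:
    normalize_types(binary="Y", ordinal={"Y": [1, 5]})
except ValueError as err:
    print("rejected:", err)
```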

Use simulated data for estimation

Simulated data can be passed directly to an estimator. For example, the data generated from a GCM can be used to estimate a Structural Causal Model (SCM):

from causalinf import scm

G_scm = gcm.examples("Two confounders")
sim_scm = simulate.lsem(G_scm, seed=1)
mod = scm.estimate(G_scm, data=sim_scm.data)

Estimating LSEM...done!

The estimated parameters can then be compared with the true simulation parameters. Both tables have a term column, so estimates can be matched to true values term by term.

print(sim_scm.parameters_tidy)
mod.est.parameters.print()

shape: (7, 5)
┌─────┬─────┬─────┬────────┬───────────┐
│ lhs ┆ op  ┆ rhs ┆ term   ┆ true      │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---       │
│ str ┆ str ┆ str ┆ str    ┆ f64       │
╞═════╪═════╪═════╪════════╪═══════════╡
│ D   ┆ ~   ┆ 1   ┆ D ~ 1  ┆ 0.407095  │
│ D   ┆ ~   ┆ Z2  ┆ D ~ Z2 ┆ 0.758397  │
│ D   ┆ ~   ┆ Z1  ┆ D ~ Z1 ┆ -0.6624   │
│ Y   ┆ ~   ┆ 1   ┆ Y ~ 1  ┆ -0.273138 │
│ Y   ┆ ~   ┆ Z1  ┆ Y ~ Z1 ┆ -0.35149  │
│ Y   ┆ ~   ┆ Z2  ┆ Y ~ Z2 ┆ -0.212244 │
│ Y   ┆ ~   ┆ D   ┆ Y ~ D  ┆ -0.877176 │
└─────┴─────┴─────┴────────┴───────────┘
shape: (16, 9)
┌──────────────┬─────────────┬──────────┬─────┬──────┬───────┬───────┬───────────┬────────┐
│ term         ┆ label       ┆ estimate ┆ sig ┆ se   ┆ lo    ┆ hi    ┆ statistic ┆ pvalue │
│ ---          ┆ ---         ┆ ---      ┆ --- ┆ ---  ┆ ---   ┆ ---   ┆ ---       ┆ ---    │
│ str          ┆ str         ┆ f64      ┆ str ┆ f64  ┆ f64   ┆ f64   ┆ f64       ┆ f64    │
╞══════════════╪═════════════╪══════════╪═════╪══════╪═══════╪═══════╪═══════════╪════════╡
│ Y ~ 1        ┆ beta_0Y     ┆ -0.27    ┆ *** ┆ 0.03 ┆ -0.34 ┆ -0.20 ┆ -7.84     ┆ 0.00   │
│ Y ~ Z1       ┆ beta_Z1.Y   ┆ -0.38    ┆ *** ┆ 0.04 ┆ -0.46 ┆ -0.30 ┆ -9.69     ┆ 0.00   │
│ Y ~ D        ┆ beta_D.Y    ┆ -0.89    ┆ *** ┆ 0.03 ┆ -0.95 ┆ -0.83 ┆ -27.75    ┆ 0.00   │
│ Y ~ Z2       ┆ beta_Z2.Y   ┆ -0.20    ┆ *** ┆ 0.04 ┆ -0.28 ┆ -0.13 ┆ -5.22     ┆ 0.00   │
│ D ~ 1        ┆ beta_0D     ┆ 0.42     ┆ *** ┆ 0.03 ┆ 0.35  ┆ 0.48  ┆ 13.27     ┆ 0.00   │
│ D ~ Z1       ┆ beta_Z1.D   ┆ -0.70    ┆ *** ┆ 0.03 ┆ -0.76 ┆ -0.63 ┆ -21.84    ┆ 0.00   │
│ D ~ Z2       ┆ beta_Z2.D   ┆ 0.74     ┆ *** ┆ 0.03 ┆ 0.68  ┆ 0.80  ┆ 24.32     ┆ 0.00   │
│ Y ~~ Y       ┆             ┆ 1.00     ┆ *** ┆ 0.04 ┆ 0.91  ┆ 1.09  ┆ 22.36     ┆ 0.00   │
│ D ~~ D       ┆             ┆ 0.98     ┆ *** ┆ 0.04 ┆ 0.89  ┆ 1.06  ┆ 22.36     ┆ 0.00   │
│ Z1 ~~ Z1     ┆             ┆ 0.96     ┆     ┆ 0.00 ┆ 0.96  ┆ 0.96  ┆ null      ┆ null   │
│ Z1 ~~ Z2     ┆             ┆ 0.02     ┆     ┆ 0.00 ┆ 0.02  ┆ 0.02  ┆ null      ┆ null   │
│ Z2 ~~ Z2     ┆             ┆ 1.06     ┆     ┆ 0.00 ┆ 1.06  ┆ 1.06  ┆ null      ┆ null   │
│ Z1 ~ 1       ┆             ┆ 0.04     ┆     ┆ 0.00 ┆ 0.04  ┆ 0.04  ┆ null      ┆ null   │
│ Z2 ~ 1       ┆             ┆ 0.03     ┆     ┆ 0.00 ┆ 0.03  ┆ 0.03  ┆ null      ┆ null   │
│ Direct_effec ┆ Direct_effe ┆ -0.89    ┆ *** ┆ 0.03 ┆ -0.95 ┆ -0.83 ┆ -27.75    ┆ 0.00   │
│ t :=         ┆ ct          ┆          ┆     ┆      ┆       ┆       ┆           ┆        │
│ (beta_D.Y)   ┆             ┆          ┆     ┆      ┆       ┆       ┆           ┆        │
│ Total_effect ┆ Total_effec ┆ -0.89    ┆ *** ┆ 0.03 ┆ -0.95 ┆ -0.83 ┆ -27.75    ┆ 0.00   │
│ := Direct_ef ┆ t           ┆          ┆     ┆      ┆       ┆       ┆           ┆        │
│ fect         ┆             ┆          ┆     ┆      ┆       ┆       ┆           ┆        │
└──────────────┴─────────────┴──────────┴─────┴──────┴───────┴───────┴───────────┴────────┘
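
As a small worked comparison, the sketch below matches the two tables on term using plain Python, with the numbers copied by hand from the printed output above. It reports the estimation error and whether each true value lies inside the reported interval [lo, hi]. In a real workflow the two dataframes would be joined directly; the dictionaries here are only a stand-in.

```python
# True values copied from sim_scm.parameters_tidy above.
true = {
    "D ~ 1": 0.407095,
    "D ~ Z2": 0.758397,
    "D ~ Z1": -0.6624,
    "Y ~ 1": -0.273138,
    "Y ~ Z1": -0.35149,
    "Y ~ Z2": -0.212244,
    "Y ~ D": -0.877176,
}
# term: (estimate, lo, hi), copied from the estimates table above.
estimated = {
    "Y ~ 1": (-0.27, -0.34, -0.20),
    "Y ~ Z1": (-0.38, -0.46, -0.30),
    "Y ~ D": (-0.89, -0.95, -0.83),
    "Y ~ Z2": (-0.20, -0.28, -0.13),
    "D ~ 1": (0.42, 0.35, 0.48),
    "D ~ Z1": (-0.70, -0.76, -0.63),
    "D ~ Z2": (0.74, 0.68, 0.80),
}

comparison = []
for term, truth in true.items():
    estimate, lo, hi = estimated[term]
    comparison.append({
        "term": term,
        "true": truth,
        "estimate": estimate,
        "error": round(estimate - truth, 3),
        "covered": lo <= truth <= hi,  # is the true value inside the interval?
    })

for row in comparison:
    print(row)
```

In this run every true coefficient falls inside its reported interval, and the largest absolute error is below 0.05.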

Practical checklist

When using simulated data, record:

The graph used to define the data-generating process.
The number of observations n.
The random seed.
The noise level.
Which variables, if any, were simulated as binary, ordinal, or categorical.
The true parameters stored in sim.parameters or sim.parameters_tidy.

These details make the simulation reproducible and make it easier to compare estimators against the known data-generating process.
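
One way to keep these details together is to write them to a small JSON record next to the results. The sketch below uses only the standard library; the SimulationRecord class and its field names are hypothetical and are not part of causalinf. The parameter values are rounded copies of those printed earlier.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SimulationRecord:
    """Hypothetical container for the checklist items."""
    graph: str
    n: int
    seed: int
    noise: object = 1.0  # a single number or a {variable: sd} dictionary
    binary: list = field(default_factory=list)
    ordinal: dict = field(default_factory=dict)
    categorical: dict = field(default_factory=dict)
    parameters: dict = field(default_factory=dict)


record = SimulationRecord(
    graph="Two confounders",
    n=1000,
    seed=1,
    parameters={
        "D": {"1": 0.407095, "Z2": 0.758397, "Z1": -0.6624},
        "Y": {"1": -0.273138, "Z1": -0.35149, "Z2": -0.212244, "D": -0.877176},
    },
)

text = json.dumps(asdict(record), indent=2)
restored = SimulationRecord(**json.loads(text))
print(restored == record)  # True: the record survives a JSON round trip
```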