Skip to content

estimate

Signature/Parameters

class estimate
def __init__(self, G, formula = None, data = None, model = 'auto', family = 'auto', se_cluster = None, se = None, model_kws = {}, weights = 1, *args, **kws)

Structural Equation Model (SCM) estimation of Graphical Causal Models (GCM).

Parameters:

Name Type Description Default
G DAG

Causal graph describing relationships among observed and latent variables created by causalinf.gcm.DAG.

required
model

Models for the functions of the structural causal model.

  • “auto” : use LSEM (default)
  • “LSEM”: Estimate (parametric) generalized linear structural equation models.
  • “NPSEM-IE-BART: Estimate nonparametric structural equation models with independent errors using BART.
  • “NPSEM-IE-GAM: Estimate nonparametric structural equation models with independent errors using GAM.
'auto'
formula str or None

Formula for the functional form of the SCM. The formula depends on the model used and defined by the argument model. If None it creates a formula automatically based on the model used. Auto-generated formulas do not include interactions between variables in the DAG. If model='auto', the formula generated includes definitions for direct, indirect, and total effect parameters whenever exposure/outcome roles are defined in the DAG object provided to the G argument. To provide custom formulas:

  • If model='LSEM' or model='auto': Use the formula format as used for the sem() function of R’s lavaan. For additional model-specific documentation, see causalinf.models.lsem.
  • If model='NPSEM-IE-BART': TBD For additional model-specific documentation, see causalinf.models.bart.
  • If model='NPSEM-IE-GAM': TBD For additional model-specific documentation, see causalinf.models.gam.
None
data DataFrame - like

Observational dataset containing all variables referenced by the DAG or formula.

None
family str

Outcome distribution family. Defaults to 'auto', in which case the family is infered from the type of outcome in the data.

'auto'
se_cluster str

Name of the variable to cluster the std. errors.

None
se str or None

Options available depend on the model used. See comments in formula to find model-specific documentation and model-specific options for se. See also Notes below.

None
model_kws dict

Additional keyword arguments for the model defined in model. The accepted arguments for each respective model are those accepted by the function described in formula. See also Notes below.

{}
weights str or array - like

Observation weights passed through to the estimator. Defaults to 1 (equal weights).

1
*args

Additional positional arguments forwarded to the underlying estimator.

()
**kws

Additional keyword arguments forwarded to the underlying estimator.

{}
Notes

For documentation of model-specific arguments, see:

  • LSEM: ‘causalinf.models.lsem
  • NPSEM-IE-BART: ‘causalinf.models.bart
  • NPSEM-IE-GAM: ‘causalinf.models.gam

Attributes:

Name Type Description
G DAG

Original DAG used in the estimation.

formula str

SEM specification employed during fitting.

fit object

Raw estimator output (e.g., lavaan fit object) when available.

est dict

Summary bundle returned by the selected estimator, including parameter tables, fit statistics, and option metadata.

Examples:

>>> dag = gcm.DAG("X -> Y")
>>> data = tp.tibble({'X': [0, 1, 0], 'Y': [1.0, 2.5, 1.2]})
>>> est = estimate(G=dag, data=data, silent=True)
>>> est.est['fit']['N_obs']
3
Source code in causalinf/scm.py
class estimate:
    """
    Structural Equation Model (SCM) estimation of Graphical Causal Models (GCM).

    Parameters
    ----------
    G : gcm.DAG
        Causal graph describing relationships among observed and latent
        variables created by ``causalinf.gcm.DAG``.

    model: str
        Models for the functions of the structural causal model.

        * "auto" : use LSEM (default)
        * "LSEM": Estimate (parametric) generalized linear structural equation models. 
        * "NPSEM-IE-BART: Estimate nonparametric structural equation
                    models with independent errors using BART. 
        * "NPSEM-IE-GAM: Estimate nonparametric structural equation
                    models with independent errors using GAM. 

    formula : str or None, optional
        Formula for the functional form of the SCM. The formula depends on the
        model used and defined by the argument ``model``.
        If ``None`` it creates a formula automatically based on the
        ``model`` used. Auto-generated formulas do not include interactions
        between variables in the DAG. 
        If ``model='auto'``, 
        the formula generated includes definitions for direct, indirect,
        and total effect parameters whenever exposure/outcome roles are defined
        in the DAG object provided to the ``G`` argument. To provide custom formulas:

        * If ``model='LSEM'`` or ``model='auto'``: Use the formula format as used for the ``sem()`` function of R's lavaan.
          For additional model-specific documentation, see ``causalinf.models.lsem``.
        * If ``model='NPSEM-IE-BART'``:  TBD
          For additional model-specific documentation, see ``causalinf.models.bart``.
        * If ``model='NPSEM-IE-GAM'``:  TBD
          For additional model-specific documentation, see ``causalinf.models.gam``.

    data : DataFrame-like
        Observational dataset containing all variables referenced by the DAG or
        formula. 

    family : str, optional
        Outcome distribution family. Defaults to ``'auto'``, in which
        case the family is infered from the type of outcome in the data.
    se_cluster : str
        Name of the variable to cluster the std. errors. 
    se : str or None
        Options available depend on the ``model`` used. See comments in
        ``formula`` to find model-specific documentation and model-specific
        options for ``se``. See also ``Notes`` below.
    model_kws : dict, optional
        Additional keyword arguments for the model defined in ``model``.
        The accepted arguments for each respective ``model`` are those accepted
        by the function described in ``formula``. See also ``Notes`` below.
    weights : str or array-like, optional
        Observation weights passed through to the estimator. Defaults to
        ``1`` (equal weights).
    *args :
        Additional positional arguments forwarded to the underlying estimator.
    **kws :
        Additional keyword arguments forwarded to the underlying estimator.


    Notes
    -----
    For documentation of model-specific arguments, see:

    * LSEM: '``causalinf.models.lsem``'
    * NPSEM-IE-BART: '``causalinf.models.bart``'
    * NPSEM-IE-GAM: '``causalinf.models.gam``'

    Attributes
    ----------
    G : gcm.DAG
        Original DAG used in the estimation.
    formula : str
        SEM specification employed during fitting.
    fit : object
        Raw estimator output (e.g., lavaan fit object) when available.
    est : dict
        Summary bundle returned by the selected estimator, including parameter
        tables, fit statistics, and option metadata.

    Examples
    --------
    >>> dag = gcm.DAG("X -> Y")
    >>> data = tp.tibble({'X': [0, 1, 0], 'Y': [1.0, 2.5, 1.2]})
    >>> est = estimate(G=dag, data=data, silent=True)
    >>> est.est['fit']['N_obs']
    3
    """
    def __init__(self,
                 G,
                 formula=None,
                 data=None,
                 model='auto',
                 family = 'auto',
                 se_cluster=None,
                 se=None,
                 # 
                 model_kws={},
                 # 
                 weights=1,
                 *args,
                 **kws
                 ):
        assert data is not None, 'Data must be provided.'
        data = ut.data2tibble(data)
        self.model = "LSEM" if model=='auto' else model
        self.formula = formula or self._graph2sem(G)
        self.family = family
        # 
        self.G = G
        self.outcome = G.outcome[0]
        self.exposure =G.exposure
        #
        self.se_cluster = se_cluster
        self.se = se

        if self.model in 'LSEM':
            self._lsem(G, data=data, weights=weights, *args, **kws)

    @ut.copy_docstring(lsem)
    def _lsem(self, G, data, weights, *args, **kws):
        est =  lsem(formula=self.formula,
                    data=data,
                    weights=weights,
                    se=self.se,
                    se_cluster=self.se_cluster,
                    *args, **kws)
        self.fit = est.fit
        self.est = est.est

    @property
    def causalinf(self):
        print(self)

    @ut.copy_docstring(ut.summary)
    def summary(self, *args, **kws):
        formula = ("\n" + self.formula).replace('\n', '\n        ')
        kws |= {'formula': kws.get("formula", formula)}

        latex_replace = {"~~": "\\\\leftrightarrow ", "~" : "\\\\leftarrow "}
        kws |= {'latex_replace': kws.get("latex_replace", latex_replace)}

        res = ut.summary(
            model = self,
            id_strategy='SCM',
            *args, **kws
        )
        return res.res

    def get_fit_statsitics(self, stats):
        res = self.est['fit_stats'].get(stats, None)
        if res is None:
            print(f'Statistics {stats} not available.')
        return res

    @ut.copy_docstring(gcm.DAG.plot)
    def plot(self, *args, **kws):
        self.G.plot(estimates=self, *args, **kws)

    def _graph2sem(self, G, parameter_fmt="(beta_{cause}.{effect})"):
        # regression lines
        reg = ''
        for y, pa_y in G.nodes_parents.items():
            pa_y = " + ".join([f"(beta_{v}.{y})*{v}" for v in pa_y])
            reg += f"{y} ~ (beta_0{y})*1 + {pa_y}\n"
        reg = f"# LSEM:\n{reg}"

        # correlations
        corr = "\n".join([f"{n1[0]} ~~ {n2[0]}" for n1, n2 in G.bidirected])
        corr += "\n".join([f"{n1} ~~ {n2}" for n1, n2 in G.undirected])
        if corr:
            corr = f"# Correlations:\n{corr}"

        # indirect effects
        if G.exposure and G.outcome:
            ind = self._graph2sem_indirect_and_total_effects(G, parameter_fmt)
        else:
            ind = ''

        sem = f"{reg}{corr}{ind}"
        return sem

    def _graph2sem_indirect_and_total_effects(self, G, parameter_fmt):
        # """
        # edges    : list of (u, v) directed edges
        # exposure, outcome : compute indirect paths from exposure to outcome
        # parameter_fmt: how to name each coefficient (e.g., 'beta_{u}.{v}' or '(beta_{u}.{v})')
        # returns  : list of (path_nodes, effect_str)
        # """
        # build adjacency
        edges = G.directed
        exposure = G.exposure[0]
        outcome = G.outcome[0]

        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)

        dir_effect = []
        tot_effect = []
        ind_effects = []

        # direct effect
        if outcome in adj[exposure]:
            dir_effect = parameter_fmt.format(cause=exposure, effect=outcome)

        # enumerate simple paths (DFS) recursive function
        def dfs(node, visited, path):
            if node == outcome and len(path) >= 3:  # indirect: at least 2 edges
                # build product beta terms for edges along the path
                terms = [parameter_fmt.format(cause=path[i], effect=path[i+1]) for i in range(len(path)-1)]
                ind_effects.append( (path[:], " * ".join(terms)) )
                return
            for nxt in adj[node]:
                if nxt in visited:       # avoid cycles
                    continue
                visited.add(nxt)
                path.append(nxt)
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)
        # this wil fill in the info and save it in the variable ind_effects
        dfs(exposure, {exposure}, [exposure])

        res = ''
        ind_effect_str = ''
        if ind_effects:
            for i, ind in enumerate(ind_effects):
                parameter = f"Indirect_effect_{i+1}" # f"beta_{'.'.join(ind[0])}"
                ind_effect_str += f"{parameter} := {ind[1]}\n"
                tot_effect += [parameter]
            res += f"# Indirect effects:\n{ind_effect_str}"
        if dir_effect:
            parameter = "Direct_effect"
            res += f"# Direct effect:\n{parameter} := {dir_effect}"
            tot_effect += [parameter]
        if tot_effect:
            res += f"\n# Total effect:\nTotal_effect := {' + '.join(tot_effect)}"

        return res

    def __repr__(self):
        self.summary(style="full")
        return ''

Examples

Estimate LSEM

Take this GCM example:

1
2
3
4
5
6
from causalinf import gcm
from causalinf import scm
from causalinf import simulate

G = gcm.examples('Two confounders')
G.plot()

Simulate a data from a LSEM based on that DAG:

1
2
3
sim = simulate.lsem(G, seed=10)
df = sim.data
df.head().print()
shape: (5, 4)
┌───────────────────────────────┐
    Z1      Z2       D       Y 
   f64     f64     f64     f64 
╞═══════════════════════════════╡
  1.33    1.51   -2.58    0.50 
  0.72   -0.14   -0.42   -1.39 
 -1.55   -0.14    1.08   -0.59 
 -0.01    1.18   -0.90   -0.47 
  0.62    1.18    0.22    0.07 
└───────────────────────────────┘

Estimat the model:

res = scm.estimate(G, data=df)
print(res)
Estimating LSEM...done!

================================================================================
Model: Model 1
Identification: SCM
Outcome: Y
Exposure: D
Formula: 
# LSEM:
D ~ (beta_0D)*1 + (beta_Z1.D)*Z1 + (beta_Z2.D)*Z2
Y ~ (beta_0Y)*1 + (beta_D.Y)*D + (beta_Z1.Y)*Z1 + (beta_Z2.Y)*Z2
# Direct effect:
Direct_effect := (beta_D.Y)
# Total effect:
Total_effect := Direct_effect
Summary:
--------        
 term                  label          estimate    sig  se      lo       hi       statistic  pvalue 
 D ~ 1                 beta_0D        0.3129      ***  0.0312  0.2518   0.374    10.0358    0.0    
 D ~ Z1                beta_Z1.D      -0.8867     ***  0.0333  -0.952   -0.8214  -26.6224   0.0    
 D ~ Z2                beta_Z2.D      -0.9231     ***  0.0314  -0.9847  -0.8616  -29.3942   0.0    
 Y ~ 1                 beta_0Y        -0.2967     ***  0.0333  -0.3619  -0.2315  -8.9172    0.0    
 Y ~ D                 beta_D.Y       0.0179           0.0322  -0.0451  0.081    0.5571     0.5775 
 Y ~ Z1                beta_Z1.Y      -0.0581          0.0443  -0.1449  0.0287   -1.312     0.1895 
 Y ~ Z2                beta_Z2.Y      0.1846      ***  0.0436  0.0991   0.2701   4.2314     0.0    
 D ~~ D                               0.9715      ***  0.0434  0.8863   1.0566   22.3607    0.0    
 Y ~~ Y                               1.0054      ***  0.045   0.9173   1.0935   22.3607    0.0    
 Z1 ~~ Z1                             0.8798           0.0     0.8798   0.8798   --         --     
 Z1 ~~ Z2                             0.0632           0.0     0.0632   0.0632   --         --     
 Z2 ~~ Z2                             0.9896           0.0     0.9896   0.9896   --         --     
 Z1 ~ 1                               -0.0146          0.0     -0.0146  -0.0146  --         --     
 Z2 ~ 1                               -0.0111          0.0     -0.0111  -0.0111  --         --     
 Direct_effect := (be  Direct_effect  0.0179           0.0322  -0.0451  0.081    0.5571     0.5775 
 Total_effect := Dire  Total_effect   0.0179           0.0322  -0.0451  0.081    0.5571     0.5775 
 Model                 --             (footnote)  --   --      --       --       --         --     
 Outcome type          --             (footnote)  --   --      --       --       --         --     
 Estimator             --             ML          --   --      --       --       --         --     
 Std.Error             --             classic     --   --      --       --       --         --     
 N.obs                 --             1000        --   --      --       --       --         --     
 RMSE                  --             0.0         --   --      --       --       --         --     
 AIC                   --             5670.18     --   --      --       --       --         --     
 BIC                   --             5714.35     --   --      --       --       --         --     
 DF (model)            --             0           --   --      --       --       --         --     
================================================================================
*** p<0.001; ** p<0.01; * p<0.05; + p<0.1
Model 1: Endogenous variable types: Continuous (D, Y); Models: Linear (D, Y)