Challenges Raised by Mediation Analysis in a High-Dimension Setting.

Background: Mediation analysis is used in epidemiology to identify pathways through which exposures influence health. The advent of high-throughput (omics) technologies gives opportunities to perform mediation analysis with a high-dimension pool of covariates.

Objective: We aimed to highlight some biostatistical issues of this expanding field of high-dimension mediation.

Discussion: The mediation techniques used for a single mediator cannot be generalized in a straightforward manner to high-dimension mediation. Causal knowledge of the relations between covariates is required for mediation analysis, and such knowledge is expected to be more limited as dimension and system complexity increase. The methods developed in high dimension can be distinguished according to whether mediators are considered separately or as a whole. Methods considering each potential mediator separately do not allow efficient identification of the indirect effects when mutual influences exist among the mediators, which is expected for many biological (e.g., epigenetic) parameters. In this context, methods considering all potential mediators simultaneously, based, for example, on data-reduction techniques, are better adapted to the causal inference framework. Their cost is a possible lack of ability to single out the causal mediators. Moreover, the ability of the mediators to predict the outcome can be overestimated, in particular because many machine-learning algorithms are optimized to increase predictive ability rather than their aptitude to make causal inference. Given the lack of an overarching validated framework and the generally complex causal structure of high-dimension data, analysis of high-dimension mediation currently requires great caution and effort to incorporate a priori biological knowledge. https://doi.org/10.1289/EHP6240


Introduction
Mediation analysis is used to help in deciphering mechanisms that relate causes to their consequences. It has been used in many areas of research, including, for example, social psychology, to understand which factors can bridge the gap between intentions and behaviors; cognitive psychology, to analyze how information is transformed into a response; intervention research, to assess whether an intervention on a specific factor can trigger a positive outcome; or epidemiology, to quantify to what extent the total effect of a given (environmental or genetic) factor on a health or biological outcome is explained by a so-called indirect effect, through intermediate (e.g., biological) variables on the pathway between exposure and outcome (MacKinnon et al. 2007; VanderWeele 2015, 2016). In environmental epidemiology, for example, it could help in quantifying to what extent air pollution affects respiratory health through oxidative pathways (Romieu et al. 2004).
With the advent of high-throughput screening technologies, there are settings in which one aims to perform mediation analysis with high-dimension data, that is, a data set in which the number of potential mediators p is larger than the number of observations n. For example, in environmental epigenetics, there is increasing evidence that specific environmental exposures such as exposure to atmospheric pollutants could influence DNA methylation (Abraham et al. 2018; Gruzieva et al. 2017), which is currently typically assessed using chips measuring methylation at 10^5–10^6 cytosine-phosphate-guanine (CpG) dinucleotide sites on the genome. Given the role of DNA methylation in gene expression and, consequently, health, methylation at multiple CpG sites could contribute to (high-dimension) mediation of the effects of atmospheric pollutants on health. High-dimension mediation also may be relevant to analyses of other types of omics data, such as genomic, transcriptomic, metabolomic, and microbiota data.
From a statistical viewpoint, such analyses considering a high-dimension set of potential mediators raise challenges; in particular, approaches used in the case of a single mediator cannot be extended to higher dimensions in a straightforward way. After briefly reviewing the classical case of mediation in a low-dimension setting, we will discuss the issues of mediation analysis with a high-dimension set of covariates and of quantification of the mediated effects, mentioning some of the existing statistical tools, in particular, with regard to applications in environmental epidemiology.

Discussion
Mediation analysis was developed as path analysis in the genetic field, and later in the area of social sciences, and then further formalized in biomedical research in connection with regression modeling in the counterfactual outcome framework of causal inference. For a detailed presentation in the context of causal inference, the reader can refer to Chapters 2 and 5 of VanderWeele (2015), or other sources (Imai et al. 2010; VanderWeele 2016).
If one assumes that a part of the effect of exposure E on outcome Y is mediated by mediator M (Figure 1), then the part of the association between E and Y that occurs through M is termed the indirect (or mediated) effect of E. The fraction of the effect of E that occurs independently of M (represented by the direct arrow from E to Y in Figure 1) is called the direct effect. The sum of the direct and indirect effects is termed the total effect.

Mediation with a Single Mediator
Assuming that both Y and M are continuous variables, with further assumptions on the distribution of the error terms, one can write several linear regression models. One model relating the exposure E to the outcome Y, taking M into account, is as follows:

E(Y | E = e, M = m, C = c) = h0 + h1 e + h2 m + h3 em + h4 c, (Equation 1)

where E(·) is the mathematical expectation, E is the exposure variable, C is a matrix including all potential confounders of the exposure-outcome, exposure-mediator, and mediator-outcome associations (i.e., C1, C2, and C3 in Figure 1), em is the exposure-mediator interaction term, and h0, . . . , h4 are the estimated parameters.
One regression model relating the exposure to the mediator is as follows:

E(M | E = e, C′ = c′) = b0 + b1 e + b2 c′, (Equation 2)

where C′ represents the confounders of the exposure-mediator association. In this linear model setting, one can define the controlled direct effect, corresponding to the average change in the outcome for a one-unit increase in exposure, assuming that the mediator remains fixed at a given value identical in all subjects. In contrast, the natural direct effect corresponds to the average change in the outcome for a one-unit increase in exposure, assuming that the mediator level does not vary and is set in each subject at the value it would have in the absence of exposure. In the simpler case of a lack of interaction between the exposure and the mediator (h3 = 0), the controlled direct effect and the natural direct effect are identical. The natural indirect effect is defined as the average change in outcome, under a given fixed exposure, when the mediator level changes to the level it would have attained if the exposure had increased by one unit.
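For reference, these effect definitions admit closed forms under Equations 1 and 2; the expressions below are the standard ones (VanderWeele 2015), with e and e* denoting the two exposure levels being compared:

```latex
\begin{aligned}
\mathrm{CDE}(m) &= (h_1 + h_3 m)\,(e - e^{*}) \\
\mathrm{NDE} &= \left[\,h_1 + h_3\,(b_0 + b_1 e^{*} + b_2 c')\,\right](e - e^{*}) \\
\mathrm{NIE} &= (h_2 b_1 + h_3 b_1 e)\,(e - e^{*})
\end{aligned}
```

In the absence of exposure-mediator interaction (h3 = 0), these reduce to NDE = h1 (e − e*) and NIE = h2 b1 (e − e*), which is the product-of-coefficients quantity used in the tests discussed below.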
The estimation of direct and indirect effects requires several assumptions: a) a lack of exposure-outcome confounding (i.e., in the example of Figure 1, efficient adjustment for C1) and b) a lack of mediator-outcome confounding (efficient control for C3). These are the only conditions required for the estimation of the controlled direct effect.
Two additional assumptions, required for the identification of natural direct and indirect effects, are c) a lack of uncontrolled exposure-mediator confounding and d) the lack of any mediator-outcome confounder affected by the exposure. Under these hypotheses, and that of the absence of exposure-mediator interaction (h3 = 0), h1 provides an estimate of the direct effect of E on Y, whereas the product h2 b1 is an estimate of the indirect effect through M. This product is used in the Sobel mediation test, whose null hypothesis is H0: h2 b1 = 0.
This corresponds to a composite null hypothesis (Baron and Kenny 1986; MacKinnon et al. 2002), involving both h2 and b1. It can be handled either by testing for the product h2 b1 being null, corresponding to a product significance test, or by testing separately for both h2 and b1 being null, which is termed a joint significance test.
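As a concrete illustration, the following sketch (Python, simulated data; the effect sizes are arbitrary) estimates b1 from the mediator model and h2 from the outcome model, then forms both the Sobel statistic and the joint significance p-value:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 2000
E = rng.normal(size=n)
M = 0.5 * E + rng.normal(size=n)             # true b1 = 0.5
Y = 0.3 * E + 0.4 * M + rng.normal(size=n)   # true h1 = 0.3, h2 = 0.4

def ols(covs, y):
    """Least-squares fit with intercept; returns coefficients and standard errors."""
    X = np.column_stack([np.ones(len(y))] + list(covs))
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

bm, se_m = ols([E], M)        # mediator model (Equation 2)
by, se_y = ols([E, M], Y)     # outcome model (Equation 1, no interaction)
b1, se_b1 = bm[1], se_m[1]
h2, se_h2 = by[2], se_y[2]

indirect = h2 * b1            # product-of-coefficients estimate of the indirect effect
z = indirect / math.sqrt(h2**2 * se_b1**2 + b1**2 * se_h2**2)
p_sobel = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

# joint significance test: reject only if both legs are individually significant
p_leg1 = math.erfc(abs(b1 / se_b1) / math.sqrt(2))
p_leg2 = math.erfc(abs(h2 / se_h2) / math.sqrt(2))
p_joint = max(p_leg1, p_leg2)
```

With these effect sizes and sample size, both tests detect the indirect effect of about 0.4 × 0.5 = 0.2; the two tests differ in behavior mainly near the composite null.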

Mediation with Multiple Mediators
VanderWeele and Vansteelandt (2014) and VanderWeele (2015) have discussed the expansion of mediation analysis to the case of several mediators M1, . . ., Mp, where p is much lower than n. One option is to consider each mediator independently and to fit one exposure-outcome model (Equation 1) and one exposure-mediator model (Equation 2) for each mediator M1, . . ., Mp. This approach works well as long as there is no mediator Mi influencing another mediator Mj and no mediator-mediator interaction. If an influence of one mediator over another exists, the abovementioned assumption of the lack of confounding of the mediator-outcome association by a factor influenced by exposure no longer holds. Indeed, in the situation of Figure 2, in which mediator M1 influences M2, if one considers solely mediator M2, then M1 acts as a confounder, influenced by E, of the relation between mediator M2 and outcome Y. Several approaches have been proposed in a low-dimension setting to estimate direct effects in this case, such as marginal structural modeling and structural mean models (VanderWeele 2015).
An alternative to the mediator-by-mediator approach is to treat all mediators as a whole (VanderWeele and Vansteelandt 2014). In this case, any possible influence between mediators (such as that of M1 on M2), including interactions, can be ignored, as long as one is not interested in estimating path-specific effects of sequential mediators. In practice, for a continuous outcome, one can fit an outcome model without mediators:

E(Y | E = e, C = c) = h′0 + h′1 e + h′4 c, (Equation 3)

and the following outcome model including the p mediators M1, . . . , Mp:

E(Y | E = e, M1 = m1, . . . , Mp = mp, C = c) = h0 + h1 e + h21 m1 + . . . + h2p mp + h4 c, (Equation 4)

thus allowing us to estimate the indirect effect mediated by all mediators M1, . . . , Mp as a whole by the difference h′1 − h1. Causal interpretation of this quantity depends on correct model specification and lack of unmeasured confounding and of interaction, as in the single-mediator case, so that C should include all exposure-outcome, mediator-outcome, and exposure-mediator confounders.

[Figure 1. Example of the effect of a single exposure E whose effect on outcome Y is mediated by a single mediator M. Exposure-outcome (C1), exposure-mediator (C2), and mediator-outcome (C3) confounders need to be controlled for. Adapted from VanderWeele (2015).]
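This "difference method" can be sketched numerically (Python, simulated data with p = 5 mediators; effect sizes are illustrative): the total effect comes from the outcome model without mediators, the direct effect from the model with all mediators, and their difference estimates the effect mediated by the set as a whole.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3000, 5
E = rng.normal(size=n)
b1 = np.array([0.5, 0.4, 0.0, 0.0, 0.3])     # exposure -> mediator effects
h2 = np.array([0.4, 0.0, 0.2, 0.0, 0.3])     # mediator -> outcome effects
M = np.outer(E, b1) + rng.normal(size=(n, p))
Y = 0.2 * E + M @ h2 + rng.normal(size=n)    # true indirect effect = b1 @ h2 = 0.29

def slope(covs, y):
    """Coefficients of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(covs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

h1_prime = slope([E], Y)[1]                          # total effect (model without mediators)
h1 = slope([E] + [M[:, j] for j in range(p)], Y)[1]  # direct effect (model with all mediators)
indirect = h1_prime - h1                             # mediated by M1..Mp as a whole
```

Note that this estimate does not say which of the five mediators carry the effect; it only quantifies the pathway through the set as a whole.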
A third option, relying on inverse probability weighting, has been proposed that assumes exposure to be categorical, with few categories (VanderWeele and Vansteelandt 2014). This approach does not require models for the mediators, but a model predicting the outcome as a function of exposure and the mediators is necessary (VanderWeele and Vansteelandt 2014). Finally, estimating so-called interventional (in)direct effects, which requires weaker assumptions than identification of natural (in)direct effects, is also an option (Vansteelandt and Daniel 2017). Bellavia et al. (2019) have provided other examples of methods applied when considering the mediating effects of (low-to-intermediate dimension) chemical mixtures on health. We will now assume that the number of mediators p is of the same order of magnitude as, or larger than, the number of observations (Figure 3).

Identification of Mediators in High Dimension
Generally, several issues need to be considered when trying to translate the mediation analysis framework to the case of high-dimension mediation. These issues relate to the knowledge about the causal model underlying the data and, relatedly, the identification of the mediators, to the correction for multiple testing, to the consideration of composite tests, and to the estimation of the share of the effect of the considered exposure explained by mediators.
The approach to mediation analysis as developed in the framework of causal inference (VanderWeele 2015) makes the assumption that the causal structure underlying the data is known a priori, as opposed to inferred from the data. This means that the directions of the relations between E, M, Y and all potential confounders are known. However, a priori knowledge of the causal structure is less likely to be available as the dimension of the set of (potential) mediators increases, that is, as the biological system considered becomes more complex. Moreover, as in the low-dimension case, exposure-outcome, exposure-mediator, and mediator-outcome confounders need to be identified and controlled for. In Table 1, we list different approaches proposed to perform mediation analysis in a high-dimension setting.
Independent consideration of each potential mediator. One technically relatively simple option has been to consider all potential mediators independently, in separate models. Küpers et al. (2015), for example, used a two-step approach consisting of detecting associations between maternal smoking and genome-wide methylation levels using an epigenome-wide association study (corrected for multiple testing), and then performing a series of mediation analyses of the maternal smoking effects on birth weight mediated by each of the methylation hits (i.e., the potential mediators) identified in the first step. Such an approach makes, among others, the strong assumption of a lack of correlation or interaction between mediators, as discussed above for multiple mediators, which is unlikely to hold in many settings. It is tempting to try controlling for the potential mediators that may influence the mediator considered in the second step, but unfortunately the tools classically used in low-dimension settings to control for confounding (a priori identification of potential confounders and adjustment for these factors based on a multiple regression model) cannot be applied in a straightforward way when confounders need to be identified from a high-dimension vector, unless the causal relations within the biological layer corresponding to the mediators are known a priori. Trying to identify the confounders in a data-driven approach is challenging; indeed, in this setting, if some correlation exists among the mediators, then univariate approaches are expected to suffer from a high false detection rate, even when multiple-testing correction techniques are used. A simulation study aiming to relate a large number of (weakly) correlated factors to a health outcome showed that false discovery rate (FDR)-correction techniques yielded false detection rates far higher than the expected value of 5%, as a result of this correlation between disease predictors (Agier et al. 2016), a situation that may also happen when modeling associations of candidate mediators with disease risk.
An approach related to that of Küpers et al. (2015), used to identify whether epigenetic marks mediate the relationship between genotypes and disease status, is the causal inference test (CIT) (Liu et al. 2013). The CIT is related to the Baron and Kenny (1986) procedure in that it corresponds to a chain of mathematical conditions that must be satisfied to conclude that each potential mediator causally influences the outcome (Millstein et al. 2009). This approach can be seen as attempting to reconstruct the whole causal structure underlying the data, but it does so considering each potential mediator separately, which, again, may be challenging if there is correlation between mediators (Wang and Michoel 2017).
[Figure 3. High-dimension mediation. Hypothesized relation between an exposure E; a health outcome Y; an exposure-outcome confounder C1; a high-dimension mediator M = (Mi)i≤p; an exposure-mediator confounder C2; and a mediator-outcome confounder C3. Causal influences also exist among the candidate mediators (here, Mj influences Mi). p is typically much larger than the number of observations n in the data set.]

Permutation tests for mediators. In the two abovementioned approaches (Küpers et al. 2015; Liu et al. 2013), candidate mediators are tested separately, paralleling genome-wide association studies (GWAS). Boca et al. (2014) developed a permutation test that allows controlling the family-wise error rate while testing a large number of mediators, again under the assumption of a lack of unmeasured confounding. Permutation tests account for the underlying correlation between mediators and do not suffer from the problem of the Bonferroni correction, which is increasingly conservative as the correlation between mediators increases (Boca et al. 2014).
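A generic max-statistic permutation scheme in this spirit can be sketched as follows (Python, simulated data; this is a simplified illustration, not the exact Boca et al. algorithm). The per-mediator statistic is the weaker of the two association z-scores, joint-significance style, and permuting the exposure yields the null distribution of the maximum over all mediators:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 50
E = rng.normal(size=n)
M = rng.normal(size=(n, p))
M[:, 0] += 0.6 * E                       # mediator 0 is the only true mediator
Y = 0.5 * M[:, 0] + rng.normal(size=n)

def corr_z(x, y):
    """Fisher-transformed correlation, approximately N(0,1) under the null."""
    r = np.corrcoef(x, y)[0, 1]
    return np.arctanh(r) * np.sqrt(len(x) - 3)

def joint_stat(E, M, Y):
    # joint-significance flavor: the weaker of the two legs drives the test
    return np.array([min(abs(corr_z(E, M[:, j])), abs(corr_z(M[:, j], Y)))
                     for j in range(M.shape[1])])

obs = joint_stat(E, M, Y)
B = 200
null_max = np.array([joint_stat(rng.permutation(E), M, Y).max()
                     for _ in range(B)])

# family-wise-adjusted p-values: compare each observed statistic to the permutation maxima
p_adj = np.array([(null_max >= t).mean() for t in obs])
```

Because the null distribution of the maximum is learned from the data, the adjustment automatically reflects the correlation structure among the mediators, rather than assuming independence as Bonferroni does.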
Composite tests. Under specific hypotheses, several approaches commonly used to test mediation, such as the product test, are overly conservative (Barfield et al. 2017;MacKinnon et al. 2002). This issue can be addressed by computing empirical p-values based on bootstrapping, which provides an increased power to detect mediation for a given sample size (Barfield et al. 2017). Boca et al. (2014) and Sampson et al. (2018) developed several statistical procedures, which are implemented in the R package MultiMed, to increase the power of such composite analyses when testing multiple mediators. Huang (2018) developed a joint significance test in the context of multiple mediators. Huang (2019) also leveraged the composite nature of the null hypothesis to construct a new test statistic that is less conservative than the Sobel test.
Empirical estimation of the null distribution. The limited power of the Sobel test can also be viewed as related to wrong assumptions regarding the theoretical null distribution. When a test statistic assumes a given distribution under the null hypothesis (e.g., a chi-squared distribution) while the real distribution under the null hypothesis is modified (e.g., by unmeasured confounders), hypothesis testing and FDR-control procedures may be invalidated (Devlin and Roeder 1999; Efron 2004; François et al. 2016; Strimmer 2008). In a high-dimension setting, this issue can be identified by displaying the distribution of p-values or of z-scores (Efron 2004). Solutions have been reviewed in the GWAS context and include empirical null distributions and genomic inflation factors to calibrate p-values, that is, to correct p-values in a way ensuring that their distribution is flat under the null hypothesis (Efron 2004; François et al. 2016; Strimmer 2008). We illustrate this in Figure 4. Simulations were performed using the mediation model of Equation 4, assuming that, of 5,000 putative mediators, 500 variables were involved in the indirect path relating the exposure to the outcome. Raw p-values of the Sobel test for mediation (Figure 4, red histogram) were shifted toward values closer to 1 compared with the expected distribution, that is, a mixture of a uniform distribution and a distribution with an excess of small p-values. After application of empirical null hypothesis testing techniques, the (adjusted) distribution of p-values of the Sobel test (blue distribution) became closer to the expected one.
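The genomic-inflation-factor flavor of this calibration can be sketched as follows (Python; the over-dispersed null is simulated, and the constant 0.4549 is the median of a chi-squared variable with 1 degree of freedom):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
# z-scores whose null distribution is over-dispersed (e.g., by unmodeled confounding)
z = rng.normal(scale=1.4, size=5000)

CHI2_1_MEDIAN = 0.4549                      # median of the chi-squared(1) distribution
lam = np.median(z**2) / CHI2_1_MEDIAN       # genomic inflation factor (here ~1.96)
z_cal = z / math.sqrt(lam)                  # calibrated z-scores

def pvals(zs):
    """Two-sided normal p-values."""
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in zs])

p_raw, p_cal = pvals(z), pvals(z_cal)
# after calibration, roughly 5% of null p-values fall below 0.05, as expected
```

Using the median makes the estimate of the inflation factor robust to a modest fraction of true signals; empirical-null methods such as those in fdrtool generalize this idea beyond a simple variance rescaling.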
Mediators considered as a whole. In the context of high-dimension mediation, as outlined above with a few mediators, there can be mutual influences between mediators. For example, in the epigenetic field, the methylation level of one CpG site can influence the methylation level of other CpG sites. This is the case for alterations in the methylation of DNA-methyltransferase (DNMT) genes, which may alter the level of methyltransferase enzymes, which in turn impacts the methylation of several other genes (Zhang and Xu 2017). These relations among mediators can create confounding and hamper identification of the mediators causally linked with the health outcome when considering each mediator separately (VanderWeele and Vansteelandt 2014). For example, in Figure 3, if one tries to consider mediators separately, the proportion mediated by Mj must be identified for the proportion mediated by Mi to be properly quantified (VanderWeele 2015).
Identifying the mediator(s) responsible for this confounding bias is, as discussed above, challenging in a high-dimension setting. An alternative to the separate consideration of each candidate mediator is to follow the logic of the approach highlighted above for multiple mediators considering all (potential) mediators as a whole. In this situation, a subtle understanding of the causal relations within the high-dimension mediators is not necessary and only identification and control for confounders outside the set of mediators is required. In particular, mutual influences or interactions among mediators can be ignored in this setting, provided one does not aim to identify specific causal mediator(s), or, in the case of sequential mediators, the effect of a specific causal path.
When using classical (least squares or maximum likelihood for logistic regression) estimators, the exposure-outcome regression model including all potential mediators (Equation 4) provides estimation of regression parameters with a prohibitively large variance as p increases and reaches a fraction of n (Sur and Candès 2019; Vittinghoff and McCulloch 2007). Once p is larger than n, the model cannot be estimated by ordinary least squares or maximum likelihood anymore. Instead, one of the multivariate variable selection or dimension reduction techniques proposed to relate high-dimension variables to one or a few unidimensional variables can be used (Chadeau-Hyam et al. 2013). Several approaches have been developed in this spirit, which we describe below, starting with approaches relying on variable selection and then presenting those related to dimension reduction.
High-dimension mediation based on variable selection. Zhang et al. (2016) implemented in the R package HIMA a three-step approach that, first, excludes candidate mediators that are not strongly associated with the health outcome in a univariate approach; second, uses a regularized multivariate mediation model allowing further restriction to a smaller group of mediators; and, third, tests the mediation effect of the retained mediators in a last step. A limitation of this approach is that it assumes a lack of confounding or residual confounding, similarly to the abovementioned approaches considering each mediator independently (to which it is actually related). One way to overcome this would be to rely on iterative sure independence screening (or ISIS) in the first step, which was designed to cope with situations such as that of a covariate not marginally associated with the outcome but related with it conditionally on another covariate (Fan et al. 2009).
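The screening-plus-penalization logic can be sketched as follows (Python; this is a simplified caricature of the HIMA strategy, not the package's actual estimator, and the tuning constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2000                      # p >> n: high-dimension mediators
E = rng.normal(size=n)
M = rng.normal(size=(n, p))
M[:, 0] += 0.8 * E                    # mediators 0 and 1 are the true ones
M[:, 1] += 0.8 * E
Y = 0.2 * E + 0.6 * M[:, 0] + 0.5 * M[:, 1] + rng.normal(size=n)

# Step 1: marginal screening of candidate mediators on the outcome
r = np.array([np.corrcoef(M[:, j], Y)[0, 1] for j in range(p)])
keep = np.argsort(-np.abs(r))[:50]

# Step 2: L1-penalized outcome model on the screened (standardized) set
X = (M[:, keep] - M[:, keep].mean(0)) / M[:, keep].std(0)
y = (Y - Y.mean()) / Y.std()

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso: 0.5*||y - Xb||^2 + lam*||b||_1."""
    beta = np.zeros(X.shape[1])
    ss = (X**2).sum(0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rho = X[:, j] @ (y - X @ beta + X[:, j] * beta[j])
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / ss[j]
    return beta

beta = lasso_cd(X, y, lam=60.0)
selected = keep[beta != 0]            # step 3 would formally test these survivors
```

On this simulated data, the two true mediators survive both the marginal screen and the penalized fit; note that a mediator associated with the outcome only conditionally on another covariate would be lost at step 1, which is the situation ISIS is designed to handle.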
High-dimension mediation based on dimension reduction. Another approach, called the directions of mediation, was used to determine which brain locations mediate the relationship between the application of a thermal stimulus and self-reported pain (Chén et al. 2018). It does not attempt to identify true mediators but, rather, seeks linear combinations of mediators that capture the mediators' effect according to a criterion related to the ability to predict the outcome and to be explained by exposure E. This approach is therefore related to dimension reduction techniques such as (supervised) principal component analysis (PCA). Relatedly, Huang and Pan (2016) developed a test for mediation for high-dimension mediators relying on spectral decomposition of the set of mediators, followed by a series of univariate regression models with the independent components. As an extension, Zhao et al. (2020) introduced sparsity into the PCA-type analysis used by Huang and Pan (2016).
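A sketch of the dimension-reduction route (Python, simulated data; for simplicity the top-variance principal component is used as the one-dimensional summary, whereas supervised variants choose the combination using its relation to E and Y):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 300
E = rng.normal(size=n)
# a block of correlated mediators driven by a shared latent factor influenced by E
latent = 0.7 * E + rng.normal(size=n)
M = np.outer(latent, rng.uniform(0.5, 1.0, size=p)) + rng.normal(size=(n, p))
Y = 0.2 * E + 0.5 * latent + rng.normal(size=n)

# first principal component of the mediator matrix as a one-dimensional summary
Mc = M - M.mean(0)
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
pc1 = Mc @ Vt[0]

def slope(covs, y):
    """Coefficients of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(covs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# single-mediator mediation on the PC score (difference method)
total = slope([E], Y)[1]          # Y ~ E: total effect (~0.55 here)
direct = slope([E, pc1], Y)[1]    # Y ~ E + PC1: direct effect (~0.2 here)
indirect = total - direct         # effect mediated through the summary (~0.35)
```

Because the correlated mediators share one latent direction, the first component captures essentially all of the mediated signal; as noted above, this identifies a predictive combination of mediators, not the individual causal CpGs or analytes.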
Advantage can also be taken of the nature of the layer of mediators: if, for example, a distance can be defined among the potential mediators, then it may be used to decrease the dimension of the mediation layer. This approach has been applied to characterize to what extent the effect of diet on body mass index is mediated by changes in the composition of the gut microbiota (Zhang et al. 2018). Data on microbiota composition obtained from 16S rRNA gene sequencing can be classified according to operational taxonomic units, from which a distance based on DNA sequence divergence (or distance of species in the phylogenetic tree) can be calculated. Of course, a central assumption here relates to the relevance of the proposed distance for the health outcome considered: in the microbiota example, the assumption that microbiota diversity, as quantified from DNA sequences, influences the health outcome considered. As discussed by Zhang et al. (2018), approaches that accommodate different types of distances, without having to a priori choose one of them, have been proposed.
All of these techniques are limited when it comes to causal interpretation: these approaches will generally allow the identification of sets or combinations of covariates with predictive power, which does not imply that they are the causal agents.
Issues related to overfitting. Options to a priori reduce the dimension of the mediators' layer are all the more relevant because of issues related to overfitting. Indeed, in the context of high-dimension mediators, the ability of the mediators to predict the outcome can be overestimated; this is all the more a concern because many machine-learning algorithms tend to be optimized to increase their predictive ability rather than their aptitude to make causal inference (Hernán et al. 2019). Mistaking the predictive ability of a model for its value for causal inference may lead to overestimating the share of the mediated effect. For example, using an approach such as the least absolute shrinkage and selection operator, or LASSO, to relate the candidate mediators to the health outcome may lead to a model with a very high predictive value, owing to the fact that the ability to predict a unidimensional variable increases with the number of potential predictors, but one that may include noncausal predictors of the outcome among the candidate mediators (Leng et al. 2006; Meinshausen and Bühlmann 2010). In order to limit such overestimation of the estimated effect of the candidate mediators on the health outcome, algorithms and strategies targeted at counterfactual prediction and causal inference should be preferred over those favoring the model's predictive ability. This will be best achieved through some knowledge of the confounders external to the set of mediators (which may be more easily available a priori than information on the relations among the potential mediators) and of factors that are certainly not confounders of associations between the exposure and mediators or outcome, such as those influenced by exposure (Hernán et al. 2019).
Purely statistical considerations will also weigh on the efficiency of a strategy in terms of counterfactual inference; for example, maximizing the predictive ability of a model may not be the most relevant choice, whereas trying to maximize stability (Meinshausen and Bühlmann 2010) or specificity is expected to be more favorable. Finally, approaches relying on a targeted minimum loss estimator (TMLE), which provides efficient prediction while remaining in a causal inference framework, are certainly worth considering here (Lendle et al. 2013; Zheng and van der Laan 2018).

Estimation of the Proportion of Effect Mediated
Measures of mediated effects are an expected output of mediation analysis (MacKinnon et al. 1995). Again, translating to high dimension the approach used with a single mediator is not straightforward. Some analyses with high-dimension mediators may report mediated effects for each of the candidate markers, which is not informative about the overall mediated effect because single components of a set of high-dimension mediators can have opposite effects (Küpers et al. 2015; Zhang et al. 2016). In addition, because of correlation and interactions among mediators, the sum of the proportions mediated can be more than 100% (VanderWeele and Vansteelandt 2014).
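A toy illustration of how opposite-sign indirect effects make per-mediator proportions exceed 100% (Python; simulated data, effect sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
E = rng.normal(size=n)
M1 = 0.8 * E + rng.normal(size=n)
M2 = -0.8 * E + rng.normal(size=n)
# the two indirect paths have opposite signs and nearly cancel in the total effect
Y = 0.1 * E + 0.5 * M1 + 0.5 * M2 + rng.normal(size=n)

def slope(covs, y):
    """Coefficients of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(covs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = slope([E], Y)[1]                              # ~0.1: indirect paths cancel
ind1 = slope([E], M1)[1] * slope([E, M1, M2], Y)[2]   # ~ +0.4 through M1
ind2 = slope([E], M2)[1] * slope([E, M1, M2], Y)[3]   # ~ -0.4 through M2
prop1, prop2 = ind1 / total, ind2 / total             # each far beyond 100% in magnitude
```

Here each mediator "explains" roughly ±400% of a small total effect, which is why per-mediator proportions are a poor summary of the overall mediated effect when mediators are correlated or act in opposite directions.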
Alternatively, one might wish to estimate the global proportion of effect mediated by all potential mediators considered simultaneously. Insight may, again, be gained from the GWAS field. In GWAS, methodological efforts have been devoted to estimate how much of the population variability in a phenotypic trait is explained by genetic variation among individuals. Some statistical models to estimate the influence of genetic polymorphisms do not rely on identification of true causal markers but, rather, assume a polygenic model where each marker has a (possibly) infinitesimal effect (Yang et al. 2011). Similarly, a polymediator approach transposing this logic and considering mediators as a whole could be applied: If one relied on one of the abovementioned approaches allowing one to build a linear combination of the candidate mediators (Chén et al. 2018;Huang and Pan 2016), then the proportion mediated by this new (unidimensional) variable could be estimated as in the single-mediator case. This approach has been applied by Zhao et al. (2020), who relied on a sparse PCA of the candidate mediators to identify a linear combination of parameters quantifying the activity of various brain regions likely to mediate the effect of a learning task on reaction time; they provided an estimate of the indirect effect.

Conclusion
The increasing interest in omics data and the role of epigenetic marks, RNA, and protein levels or of the microbiota on disease phenotypes make it very appealing to try to quantify to what extent the effect of environmental (Bind et al. 2014), infectious, behavioral, social (Huang 2018), or genetic (Liu et al. 2013) factors on health is mediated by these biological layers. As we discussed, generalizing mediation analysis techniques developed for one or a few mediators to high-dimension mediation is not straightforward. A key conceptual issue relates to the fact that mediation analysis assumes a priori knowledge of the causal relations between the exposure, the mediator, the outcome, and confounders. In high dimension, the causal relations between all considered factors (e.g., among CpGs) are unlikely to be accurately known a priori. It is optimistic to expect the causal structure to be unraveled by machine-learning techniques because many models developed in data science tend to excel in predictive ability but may be of more limited use in making causal inference (Hernán et al. 2019), in particular in a context of a rather limited number of training samples. TMLE represents a way to try to accommodate both aims that is certainly worth further considering in a high-dimension context (Lendle et al. 2013). Although we took the example of high-dimension mediators, most of the issues discussed here also apply to problems in which the mediators have an intermediate dimension (with p lower than n but still relatively high), as would, for example, happen if someone wished to quantify to what extent the association of socioeconomic status and a health outcome is mediated by a set of several hundred exposures assessed in a population of a few thousand subjects.

[Figure 4. Raw distribution of the p-values of the Sobel mediation test for 5,000 simulated variables that are putative mediators (in red, not uniform) and corrected distribution (blue) after using the fdrtool package (R version 3.6.1; R Development Core Team). After correction, the distribution is closer to that expected under the simulated causal model, which assumes the presence of mediators, so that one observes a mixture of a uniform distribution and a distribution with an excess of small p-values. The distribution of the raw p-values should be uniform except for an excess of small p-values corresponding to true mediators; the fact that the (red) distribution is not uniform may indicate several deviations from the null model, such as confounding factors or poor standardization of the test statistic. The red histogram indicates that the Sobel test is too conservative (MacKinnon et al. 1995). Here we use the R package fdrtool, which implements an empirical null distribution approach to transform the initial p-values into uniformly distributed p-values and provides control of the false discovery rate (Strimmer 2008). To perform the simulations, we considered the mediation model of Equation 4, with 500 random mediators influenced by the environment that affect the simulated outcome according to Equation 4, and 4,500 additional putative mediators distributed according to a multivariate distribution that did not depend on the environment and outcome (see code on GitHub: https://github.com/mblumuga/opinion_mediation/blob/master/Simus_Sobel_FDR.R).]
The fact that we mostly discussed data-driven approaches should not lead the reader to believe that this is the only way forward. On the contrary, any effort to incorporate biological knowledge into the models should be undertaken. This may, for example, be done by restricting analyses to a subset of genes with high a priori plausibility of an effect on the outcome, or by reducing the dimension of the layer of potential mediators using a biologically relevant distance, as is done for microbiota data on the basis, for example, of the phylogenetic distance between microbiota species (Zhang et al. 2018). Once such options have been considered and, if relevant, implemented, one may then turn to more data-driven approaches.
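Distance-based dimension reduction of this kind can be sketched as follows; this is a minimal Python illustration (not the method of Zhang et al. 2018), in which the phylogenetic distance matrix is replaced by a random symmetric stand-in and the taxa counts are simulated:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
n, p = 100, 40                                   # subjects and taxa (illustrative)
abundances = rng.poisson(5, size=(n, p)).astype(float)

# Stand-in for a phylogenetic distance matrix between the p taxa
D = rng.uniform(0, 1, size=(p, p))
D = (D + D.T) / 2                                # symmetrize
np.fill_diagonal(D, 0.0)

# Cluster taxa on the (phylogenetic) distance, then aggregate abundances
# within clusters to obtain a lower-dimension layer of potential mediators.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=10, criterion="maxclust") # reduce to at most 10 groups
reduced = np.column_stack([abundances[:, labels == k].mean(axis=1)
                           for k in np.unique(labels)])
```

The mediation analysis would then be run on the `reduced` matrix, trading resolution at the level of individual taxa for a dimension compatible with causal modeling.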
High-dimension omics layers may have a complex and hard-to-identify causal structure, for example, in the case of mutual influences among the mediators (e.g., CpG sites or protein levels). Such situations, in which a mediator influenced by the exposure of interest is also a confounder of the relation between another mediator and the outcome, are hard to handle rigorously when each mediator is considered separately, and for such high-dimension omics layers, it is unlikely that molecular biology will soon unravel all causal relations among all variables. Currently, approaches considering each potential mediator separately coexist with approaches treating the potential mediators as a whole. Issues identified in the (low-dimension) case of multiple mediators tend to indicate that approaches considering mediators as a whole (rather than individually) should be preferred with a high-dimension mediator (VanderWeele and Vansteelandt 2014).
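A toy simulation makes the problem concrete; the structure (one mediator influencing another) and all effect sizes are illustrative assumptions, written in Python:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
E = rng.normal(size=n)                        # exposure
M1 = 0.6 * E + rng.normal(size=n)             # first mediator, influenced by E
M2 = 0.6 * E + 0.8 * M1 + rng.normal(size=n)  # second mediator, also influenced by M1
Y = 0.5 * M2 + rng.normal(size=n)             # outcome driven by M2 only

def coef(X_cols, y, idx):
    """OLS coefficient of the idx-th regressor (index 0 is the intercept)."""
    X = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[idx]

b1_separate = coef([E, M1], Y, 2)   # M1's apparent effect when M2 is ignored (~0.4)
b1_joint = coef([E, M1, M2], Y, 2)  # M1's effect with M2 in the model (~0)
```

A mediator-by-mediator analysis attributes M2's pathway to M1 (`b1_separate` is clearly nonzero), whereas the joint model correctly finds no effect of M1 on Y given M2, echoing the case for treating the mediators as a whole.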
To those mentioned above, one can add issues that are less specific to high-dimension mediation analysis; these include reverse causation (Liu et al. 2013), measurement error in the exposure or in the mediator(s) (Valeri et al. 2017), and reliance on observational studies to test mediation (Richmond et al. 2014).
Data collection should generally be guided by power calculations, which are challenging in high-dimension mediation given that the question of sample size requirements for mediation analysis is not completely solved even in a single-mediator setting (VanderWeele 2015). Simulation studies conducted before designing a new study should be considered but are challenging with complex data structures; a few examples exist (Barfield et al. 2017; Boca et al. 2014; Huang 2018).
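Even in the single-mediator setting, power can be estimated by Monte Carlo simulation; the following is a minimal Python sketch using the Sobel test, in which the effect sizes and sample sizes are illustrative assumptions (the outcome regression omits the exposure for brevity, which is valid here only because the simulated direct effect is zero):

```python
import numpy as np
from scipy import stats

def mediation_power(n, a=0.3, b=0.3, n_sim=500, alpha=0.05, seed=1):
    """Monte Carlo power of the Sobel test for a single mediator.
    a: exposure -> mediator effect; b: mediator -> outcome effect."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        E = rng.normal(size=n)
        M = a * E + rng.normal(size=n)
        Y = b * M + rng.normal(size=n)
        ra = stats.linregress(E, M)   # exposure -> mediator
        rb = stats.linregress(M, Y)   # mediator -> outcome (no direct effect simulated)
        z = (ra.slope * rb.slope) / np.hypot(ra.slope * rb.stderr,
                                             rb.slope * ra.stderr)
        hits += 2 * stats.norm.sf(abs(z)) < alpha
    return hits / n_sim
```

Calling `mediation_power` over a grid of sample sizes gives a crude power curve; extending the data-generating model to many correlated mediators is exactly where such simulations become challenging.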
Further extensions of the case that we discussed (Figure 3) could be worth considering. First, we did not single out the multimediator case with ordering, in which there may be several successive ordered layers of potential mediators, with each M1, …, Mk possibly being high-dimensional. In addition, not only the mediator but also the exposure could be multidimensional. This situation has been considered in a case-control study examining the mediating role of DNA methylation in the association between genetic polymorphisms (assessed from a genome-wide screening yielding about 300,000 genetic polymorphisms) and arthritis risk (Liu et al. 2013). Exposome studies in which methylome or metabolome data are available can lead to a similar data structure (Maitre et al. 2018). Further, the outcome could be multidimensional, corresponding to outcome- or diseasome-wide studies (VanderWeele 2017). This would imply a move from the rather simple three-variable system corresponding to the typical original mediation framework (MacKinnon et al. 2007) to a much more complex three-layer system. Finally, all these situations could be combined, for example, in a study assessing in the same population multiple interconnected omics layers, from a large exposome to, for example, microbiota, methylome, transcriptome, proteome, or diseasome data. Descriptive tools exist for exploring the relations within and between such layers, for example, in the literature on multimodal data, with approaches such as sparse generalized canonical correlation analysis (Garali et al. 2018). However, a rigorous causal analysis of such data, whose collection on hundreds or thousands of subjects is now feasible (Maitre et al. 2018), would require knowledge of the causal relations between (and possibly within) each data layer, which may, in many situations, be very difficult to attain.
Inferring causal structure from data without a strong a priori is an expanding field of research (Scanagatta et al. 2019; Uusitalo 2007). The approaches initially used have tended to have a complexity increasing at least exponentially with the number of possible nodes in the causal diagram to infer, but alternatives have recently been suggested that may make the problem tractable in high dimension as well.
Omics platforms generate a huge amount of data, opening the way for joint analysis of environmental exposures, intermediate biological layers of data, and health outcomes. High-dimension mediation analysis constitutes a promising framework to handle such multiple layers of data in a causal inference framework; however, it is still in its infancy and raises numerous challenges. These challenges should be tackled by biostatisticians, biologists, and epidemiologists in order to better understand the determinants of health, to make efficient use of the data generated beyond their use for prediction, and to avoid making a data cemetery out of the promised knowledge Eldorado (Hunter 2006).