Testing one-loop galaxy bias -- I. Power spectrum

We test the regime of validity of one-loop galaxy bias for a wide variety of biased tracers. Our most stringent test asks the bias model to simultaneously match the galaxy-galaxy and galaxy-mass spectrum, using the measured nonlinear matter spectrum from the simulations to test one-loop effects from the bias expansion alone. In addition, we investigate the relevance of short-range nonlocality and halo exclusion through higher-derivative and scale-dependent noise terms, as well as the impact of using co-evolution relations to reduce the number of free fitting parameters. From comparing validity and merit of these assumptions we find that a four-parameter model (linear, quadratic, cubic nonlocal bias, and constant shot noise) with fixed quadratic tidal bias provides a robust modeling choice for the auto power spectrum of the less massive halos in our set of samples and their galaxy populations (up to $k_{\mathrm{max}} = 0.35\,h/\mathrm{Mpc}$ for a sample volume of $6\,(\mathrm{Gpc}/h)^3$). For the more biased tracers it is most beneficial to include scale-dependent noise. This is also the preferred option when considering combinations of the auto and cross power spectrum, which might be relevant in joint studies of galaxy clustering and weak lensing. We also test the use of perturbation theory to account for matter loops through gRPT, EFT and the hybrid approach RESPRESSO. While all these have similar performance, we find the latter to be the best in terms of validity and recovered mean posterior values, in accordance with it being based partially on simulations.


I. INTRODUCTION
The varying degrees of clustering displayed by different types of galaxies, or clusters of galaxies, has led to the understanding that these objects cannot be unbiased tracers of the underlying matter distribution (e.g. [1][2][3][4][5][6], for a recent review see [7]). In order to utilize measurements from large-scale structure surveys for cosmological studies, it is therefore of critical importance to have a robust model of the galaxy-matter connection, commonly referred to as galaxy bias. The accuracy and consistency of these models will be challenged by great improvements in the statistical precision of upcoming survey generations, e.g. DESI [8] and Euclid [9], as well as the combination of multiple probes, such as clustering and weak lensing. In this light it is interesting to consider the range of scales over which we can trust our currently best descriptions of galaxy bias and how much freedom we need to allow for when analyzing the two-point clustering of galaxies -both largely unresolved questions.
We intend to address these questions in the context of the perturbative galaxy bias expansion, which relates the * alexander.eggemeier@durham.ac.uk galaxy density contrast δ g to a series of physically motivated terms that involve the matter density fluctuations δ and its tidal field. The well-known linear relationship δ g = b 1 δ [2], with linear bias parameter b 1 , represents the lowest order term in this expansion and is valid only on the largest scales. Additional terms with their own bias parameters become relevant on successively smaller scales, such as powers of the matter field [10,11], b 2 δ 2 , b 3 δ 3 etc., as expected from a spherically-symmetric gravitational collapse [12,13]. Similarly it has been argued that anisotropies in this process should lead to a dependence on the tidal field [14][15][16][17], which was confirmed by the inadequacy of the power series expansion to fully explain the clustering of dark matter halos [18,19] and inconsistencies between two-and three-point statistics [20,21]. The first direct evidence from simulated data was reported in [22,23]. Further developments have put these arguments on a more solid theoretical footing by identifying the matter density and tidal field as the leading, local gravitational effects that leave an imprint on galaxy formation as a result of the equivalence principle [24,25]. In addition, since the evolution of galaxies occurs over long timescales, one should not expect the galaxy density to depend on these quantities at only a single point in time, but rather on their entire past light-cone. It has been shown that this time dependence can be traded for a set of extra (nonlocal) terms at each order of perturbation theory that are generated by time evolution [16,22]. Systematic procedures to identify these terms have been presented in [7,26,27] and provide a complete basis at fixed time for the general bias expansion.
An important assumption that goes into the derivation of this basis is that galaxy formation is spatially local. It is well known that galaxies collect matter from an extended region of space, but as this is roughly limited to the Lagrangian radii R of their host halos, the spatially local assumption must be valid in the large-scale limit. However, corrections known as higher-derivative terms can become relevant on scales approaching R, starting with a term R 2 ∇ 2 δ [16,28,29]. Furthermore, whether a galaxy forms at a given point will not be solely determined by the large-scale fields, but also have a dependency on the very small scale modes. In the absence of strong primordial non-Gaussianities they are uncorrelated with any large-scale effects and so, from the perturbative point of view, look like a stochastic contribution [30][31][32]. This gives rise to a constant offset from Poisson shot noise in the galaxy power spectrum [33], which can be interpreted as being due to the halo exclusion effect [12,34,35]. On smaller scales halo-halo exclusion imprints a scale-dependence on the stochasticity, whose strength is controlled by the Lagrangian radius [7,36] like for the higher-derivative terms. Up to date it is not clear which of the two, if any, might have the more dominant effect on galaxy clustering.
Several tests of this bias modeling, to varying levels of detail and either in configuration or Fourier space, have already been conducted in the literature. One of these studies [37] applied the model to the cross power spectrum and bispectrum between the dark matter field and halos in various mass bins and at multiple redshifts. Across these various samples they found good agreement with their simulation measurements and consistency in the fitted bias parameters from the two statistics up to scales of k ∼ 0.1 h/Mpc. Furthermore, they demonstrated the need for a cubic nonlocal bias contribution and presented indications that its associated parameter as well as the quadratic one corresponding to the tidal field closely follow the so-called local-Lagrangian (LL) relations (see also [38]).
These relations arise by making the assumption that at some time in the far past the galaxy overdensity can be exhaustively described by powers of δ alone. Upon translation to later times, all remaining terms identified as part of the general bias expansion appear, but their amplitudes are fixed in terms of b 1 , b 2 etc. and so, if this assumption is merited, the LL relations provide a valuable reduction of the modeling degrees of freedom.
An analysis similar to [37] has been presented in [39] including the halo auto power spectrum and a higherderivative parameter that was not taken into account by [37]. Using the goodness-of-fit as an indicator, they reported an accurate match to the data up to scales of k = 0.3 h/Mpc. Building on this work, the authors of [40] also considered the possibility of scale-dependent stochasticity, but were neither able to claim a clear detection of such an effect nor a contribution from the higherderivative term. Moreover, none of these studies assessed how many free model parameters are actually necessary to describe their measurements -a subject that was addressed in [41,42] by means of a Bayesian model selection criterion. From fitting the auto halo power spectrum in redshift space, [41] thus found a slight preference for the LL model over leaving the corresponding bias parameters free, while [42] came to the conclusion that a four parameter model including b 1 , b 2 , as well as the tidal and higher-derivatives biases, performs best for their real-space halo cross power spectrum (without having considered application of the LL relations) and scales up to k = 0.2 h/Mpc. However, neither of these two analyses determined clearly over which ranges of scales their preferred models remain valid and both considered only a single sample of biased tracers.
The theoretical developments implemented in these simulation studies have also been successfully applied to clustering measurements from the BOSS galaxy survey [43][44][45][46][47] and more recently in [48][49][50]. The exact modeling choices differ in all these cases, e.g. [45] employs the LL relations for both the tidal second-order bias and the nonlocal third-order bias parameter, whereas only the former is fixed in this way by [43,46,50] and [48] sets the latter to zero, but leaves the tidal bias parameter free. Each of these analyses involved a considerable effort in validating their models, consisting of fits using large sets of mock catalogs and blinded mock challenges (e.g. [51]) in order to carefully test whether they obtain unbiased results in their cosmological parameters. However, there is little variety in the galaxy samples used in these tests as they are tuned to reproduce the clustering properties of the observed galaxies, and beyond a determination of the range of validity of the respective fiducial model no further systematic checks of the various assumptions are carried out. The fact that they still arrive at comparable cosmological constraints might mean that the differences in the galaxy bias modeling are negligible compared to the statistical uncertainties in the BOSS survey, but this is likely going to change with DESI and Euclid.
Given the large variety of results in the literature, our goal in this paper is to present a rigorous and systematic approach to testing the bias modeling that combines the strengths of several previous studies. We take a large pool of different tracers -three galaxy samples resembling the SDSS main galaxies, and the LOWZ and CMASS samples of BOSS, as well as four halo catalogs in different mass bins and redshifts -and analyze their two-point clustering with identical models and parameter priors using the same range of scales and with statistical uncertainties all corresponding to an effective volume of 6 (Gpc/h) 3 . We want to conduct a precise test of galaxy bias alone and for that reason do not take into account redshift space distortions. This would introduce further modeling uncertainties, whose potential deficiencies could be absorbed by the bias parameters, or vice versa. Moreover, we choose to keep the cosmological parameters fixed, which allows us to decouple the modeling of the bias contributions from the nonlinear matter power spectrum by replacing the latter with its simulation measurement. In this framework we evaluate the range of validity of different assumptions, such as the application of LL relations and/or inclusion of higher-derivative and scale-dependent stochasticity, from a combination of two performance metrics: 1) the unbiased-ness of recovered parameters, and 2) the goodness-of-fit. Since models with more degrees of freedom give rise to larger parameter uncertainties, we compare these validity estimates with a measure of the model's merit to discern how many and which parameters constitute an optimal choice, following [52]. We perform this analysis not only for the auto power spectrum, but also in combination with the cross power spectrum between galaxies/halos and matter, as consistency between the two statistics provides an even more stringent test of the bias model and is of great relevance for joint studies of galaxy clustering and weak lensing data.
Our paper is organized as follows. In Section II we start off with an overview of the galaxy power spectrum at next-to-leading order in perturbation theory including the various contributions mentioned above. We then present our simulated data sets in Section III, along with the measurements, our fitting procedure and prior choices, as well as a precise definition of the performance metrics employed in this work. Section IV reports our main findings on the model's range of validity and a determination of the importance of higher-derivative and scale-dependent stochastic terms using the measured nonlinear matter power spectrum. In Section V we explore how these findings are affected when the latter is calculated from different approaches using perturbation theory. Finally, we give our conclusions in Section VI.

II. ONE-LOOP PERTURBATION THEORY FOR BIASED TRACERS
Recent analyses of galaxy surveys typically rely on oneloop perturbation theory predictions to model the twopoint statistics of biased tracers. The following sections serve to provide an overview of the relevant results and various employed modeling assumptions, as well as to establish our notation to be used throughout the remainder of this paper.

A. Galaxy bias expansion
One-loop corrections to the linear theory predictions for two-point statistics arise from terms up to third order in perturbations. For our purposes it is therefore sufficient to write the galaxy bias expansion as (follow-ing [27], which also contains relations to other popular bias parametrizations) where, for brevity, we have omitted all dependencies on redshift. Since δ g is a scalar, the matter tidal field first enters at second order of this expansion described by the Galileon It is expressed here in terms of second derivatives of the normalized velocity potential Φ v , which is linked to the divergence of the matter velocity field θ via The nonlocal nature of gravitational collapse requires the existence of further terms in the bias expansion, the first of which appears at third order and is given by [22] where ϕ 1 is the linear Lagrangian perturbation theory (Zel'dovich approximation) potential, satisfying the Poisson equation ∇ 2 ϕ 1 = −δ. The nonlocality of gravitational evolution manifests in all higher-order corrections to the Zel'dovich approximation [53] starting from the second-order potential, defined by ∇ 2 ϕ 2 = −G 2 (ϕ 1 ). In addition to these terms, we have also included the leading higher-derivative contribution ∇ 2 δ in Eq. (1). Such terms are expected as the formation of dark matter halos and galaxies necessarily depends on the initial matter distribution within some finite region, which leads to a short-range nonlocality [4,16,29,32,54] that needs to be distinguished from that due to gravitational evolution alone. They are therefore tied to a particular length scale, which for halos is close to their Lagrangian radius [16,55], but can potentially be significantly altered due to baryonic physics for certain types of observable tracers [7]. The existence of such a length scale implies a crucial difference between the higher-derivative contributions and the remaining terms in Eq. (1), whose relative importance is entirely governed by the nonlinearity scale of the matter density. If we assume here that the nonlinearity scale is comparable to the (Lagrangian) size of halos, we can expect the leading higher-derivative term to be of similar order as the nonlinear bias corrections. Consequently, terms involving even higher numbers of derivatives, for instance ∇ 4 δ, must then be suppressed on the scales we are interested in. However, our approach in this paper is to precisely test this assumption, i.e. the relative importance of higher derivative bias compared to nonlinear bias corrections with a wide range of biased tracers.
Before we can proceed we need to address the renormalization of the bias parameters [56,57]. This issue arises when computing observable quantities (correlation functions) based on the bias expansion above, as this leads to dependencies on the variance σ 2 = δ 2 (x) . The variance is sensitive to the highly nonlinear regime where perturbation theory breaks down, but these sensitivities are not physical as they rely on the definition of the bias parameters and, in fact, can be completely absorbed by appropriate redefinitions of the bare parameters appearing in Eq. (1) [56].
However, it was shown in [27] that a convenient way to circumvent this problem altogether is to recast the bias expansion in terms of multi-point propagators. In analogy with renormalized perturbation theory [58] and related approaches [59], the galaxy multi-point propagators Γ (n) g are defined as ensemble averaged derivatives of δ g with respect to the linear matter density δ L . This correspond to associating the (renormalized) bias parameters with sums of reducible diagrams with a fixed number of external lines (denoted by the number of derivatives). Written in Fourier space 1 we thus have: where k 1···n = k 1 + . . . + k n . As the multi-point propagators are observables themselves, they are automatically renormalized and act as the scale-dependent bias parameters in an expansion of the form which is equivalent to Eq. (1) after the usual renormalization procedure. The product ⊗ is given by and the H n are the Wiener-Hermite functionals [27,60], whose first two representatives are H 1 = δ * L (k) and . The scale dependence of the multi-point propagators can be predicted based on the various bias contributions and their nonlinear evolution. Up to the order we are working at one can show that [27] at one loop, as well as to tree level, where P L is the linear matter power spectrum defined below, while F n and G n denote the standard perturbation theory (SPT) kernels (see e.g. [61]), and K(k 1 , k 2 ) = (k 1 · k 2 /k 1 k 2 ) 2 − 1 is the Fourier transform of the kernel describing G 2 (Φ v ). The bias parameters in Eqs. (7) and (8) are now the renormalized ones (indicated by the lack of an over-bar) and their relation to the bare parameters can be computed if desired [27]. We stress that the expressions for the multipoint propagators are free from contributions involving σ 2 . Propagators for the nonlinear matter field, Γ (n) m , can be derived from the expressions above in the limit that b 1 → 1 and all other bias parameters are set to zero, e.g. Eq. (7) gives the well known one-loop propagator Γ (1) m = 1 + P 13 /(2P L ), with P 13 the only one-loop power spectrum reducible diagram in SPT [61].

B. Power Spectra
We are interested in the auto and cross power spectra of galaxies and matter, which are defined as follows When δ and δ g are expressed in terms of multi-point propagators, we can exploit the orthogonality of the Wiener-Hermite functionals [27,60] to immediately obtain an expression for the cross power spectrum at oneloop order: keeping only terms that are at most quadratic in P L . An analogous result holds for P gg (k) and upon inserting Eqs. (7) and (8) we get where P mm (k) is the one-loop nonlinear matter power spectrum.

C. Matter modeling
The dominant contribution to Eqs. (12) and (13) in the nonlinear regime typically comes from the nonlinear matter power spectrum, i.e., the term that is multiplied by either b 1 or b 2 1 in P gm and P gg , respectively. Its correct modeling is therefore of particular importance and it has been shown to great detail that the SPT matter power spectrum that derives from the expressions above is subject to sizable inaccuracies, which spoil its agreement with measurements from simulations. This can be attributed to two main reasons: first, SPT does not properly take into account the effect from large-scale relative displacements, which give rise to a damping of the baryon acoustic oscillation (BAO) signature. Second, a further simplification made in the context of SPT is the assumption that the dark matter field behaves as a pressureless perfect fluid, which implies a vanishing stress-tensor, that can be described perturbatively. While this is a good approximation on sufficiently large scales, it is no longer applicable in the nonlinear regime and therefore needs to be accounted for by appropriate correction terms.

BAO damping from large-scale "infrared" modes
It has long been recognized that large-scale modes are responsible for the damping of the amplitude of BAO and can thus be treated perturbatively [62][63][64][65][66]. The relevant quantity for the power spectrum is the relative displacement field two-point function at the BAO scale, which smears the BAO signal and receives relatively large corrections from large-scale modes.
The large-scale displacement field gives the most important contribution to the decay of the matter propagator, and these contributions were first resummed in the context of renormalized perturbation theory (RPT) [62] 1 For Fourier space integrals we use the convention δ(x) = k exp (−ik · x) δ(k) with the short-hand notation k 1 ,...,kn ≡ d 3 k 1 /(2π) 3 · · · d 3 kn/(2π) 3 .
(see also [67][68][69][70]) and applied to BAO in [66]. Another useful way to think about BAO damping at large scales is to decompose the linear power spectrum into a smooth (P nw ) and wiggly component 2 (P w ), such that [72] P L (k) = P nw (k) + P w (k) .
then the effect of relative displacements is to "dewiggle" the spectrum, which can be described to leading order in PT by applying the smearing operator in the Zel'dovich approximation exp(−k 2 Σ 2 ) to the wiggly part, where Σ 2 is the relative displacement field two-point function [63] at the BAO scale. These two approaches (resummation and dewiggling) have been brought together and put on a much sounder footing more recently, as we now discuss. RPT and related methods, e.g. RegPT [69,70], resum the propagator but keep the mode-coupling contributions to the power spectrum at fixed order, and this asymmetry leads to breaking of the Galilean invariance (GI) 3 of the equal-time correlators [77], which results in unphysical damping of the broadband power, particularly at large k. This can be avoided to a large extent, a posteriori, by requiring that the power spectrum be invariant under large-scale displacements, effectively constructing a resummation of the mode-coupling contributions consistent with that of the propagator. However this procedure, known as gRPT, cannot generate all invariants in the mode-coupling contributions as they are not required to cancel the non-invariants from the propagator contributions. We will use gRPT as implemented in [43,50] as one of the ways to compute the nonlinear matter contributions in Section V. For a similar approach see [78].
Generalizing such resummations to preserve GI throughout is technically cumbersome, but has been done in [79] by mixing Eulerian PT with the Lagrangian description to compute the large-scale relative displacements effects on the density power spectrum BAO. This resummation procedure for the equal-time two-point correlator reduces to RegPT (restricted to infrared modes as in [80]) when one of the fields is at the initial conditions, i.e. when applied to the propagator, as expected. Taking advantage of the fact that large-scale relative displacements smear sharp features such as the BAO but not the broadband shape, [81] and [82] show that one can formulate the resummation of infrared modes systematically directly in Fourier space by using the decomposition in Eq. (14). The full expressions of the infrared resummed matter power spectrum at linear and to one-loop order are then given by [81] where P nw,1−loop is the SPT one-loop correction but evaluated using the smooth linear spectrum P nw and P w,1−loop = P SPT,1−loop − P nw,1−loop . The first term in the square bracket of Eq. (17) guarantees that the limit of small Σ 2 reduces to one-loop SPT, as it should. The damping kernel is given by the relative displacement twopoint function computed in the Zel'dovich approximation at the BAO scale [63]: where j n (x) are spherical Bessel functions of order n and k BAO corresponds to the BAO scale in frequency space, i.e., k BAO = π/ BAO with BAO scale taken as a fixed scale BAO = 110 Mpc/h. For q k BAO the square brackets suppress the integrand by (q/k BAO ) 2 , as expected since displacements longer than BAO do not contribute to the smear of the BAO, although that suppression does not affect the value of Σ 2 in practice due to the shape of the CDM spectrum. For q k BAO the square brackets are dominated by one-point displacement (first term). The other scale that enters Eq. (18) is k S , which serves to separate large-scale from small-scale modes and thus indicates the range of scales that are being resummed. Strictly speaking, to perform an infrared resummation one should sum over modes q < k, but in practice this cutoff is typically set to the fixed value k S = 0.2 h/Mpc e.g. [48,82], and we will follow this practice below in Section V. Equations (16)(17) then represent the infrared resummed one-loop power spectrum in SPT.

Small-scale corrections: non-trivial stress tensor
Orbit crossing generates a non-zero stress tensor even if dark matter is perfectly cold to begin with, i.e. if it has an initial distribution function corresponding to a delta function in momentum space. The width of this distribution function is characterized by the stress tensor σ, which gets generated by dynamics in multistreaming regions [83]. Estimates from numerical simulations show that for CDM spectra this effect is small compared to loops but not altogether negligible on large scales [83][84][85][86]. The leading order correction to the power spectrum is to add a k 2 P (k) suppression [83,87,88], with c 1 being a number that is linked to the timedependence of the stress-tensor and occasionally referred to as effective speed of sound in the context of effective field theory (EFT, [87,88]). The time-dependence of the stress-tensor is difficult to calculate for generic ΛCDM universes and so c 1 has to be treated as an additional free parameter that is determined from the data itself (see e.g. [89,90]). Moreover, Eq. (19) satisfies a second purpose: it allows to absorb the leading sensitivities to the highly nonlinear regime, which can arise as the loop integrals are nominally performed over the full range of scales, including those where we do not expect perturbation theory to hold. For typical linear ΛCDM power spectra these integrals converge rapidly, which implies that any such sensitivities must be dominated by the lowest order term in the high-q expansion 4 of a given loop integral [90][91][92][93][94], and which scales as ∼ k 2 P L (k).
Combining the IR resummed one-loop SPT spectrum with these small-scale corrections, the total matter power spectrum can thus be written as where we denote this as the EFT one-loop matter spectrum, and we have replaced the linear power spectrum from Eq. (19) with its infrared resummed equivalent 5 . Similarly, one should add P σ to the gRPT predictions discussed above, but as mentioned the broadband gRPT power is a bit lower at high-k compared to SPT, which means that c 1 would be even a smaller correction in this case, therefore we shall set c 1 = 0 for simplicity. This is also reasonable since we are ultimately interested here in computing the power spectrum of biased tracers, and bias loops also change broadband power; in particular, the γ 21 bias contribution is exactly degenerate with P σ in the low-k limit [43]. We discuss this issue further in Section V.
As a third approach beyond gRPT and EFT, we consider the combination of a simulation calibrated power spectrum at a fixed cosmology with a perturbative model based on RegPT for the so-called response function. The response function K(k, q) was introduced in [94,95] and quantifies the variation of the nonlinear matter power spectrum at scale k induced by a change of the linear power at scale q. More precisely and in similarity to the multipoint propagators, it is defined as a functional derivative of P mm (k) with respect to P L (q), A model for the response function using RegPT at twoloop order was presented in [96] and compared to measurements from simulations. This study revealed very good agreement for k q, but not as good otherwise. The deviations q k are due to the breaking of GI in the RegPT resummation as discussed above, whereas for q k the response function had too strong sensitivity to small scale modes, a sign of a non-trivial stress tensor that is neglected in RegPT. Based on these considerations, the authors of [96] provided a phenomenologically motivated modification to their model (without the need for additional free parameters) that was subsequently shown to reproduce the simulation measurements over a wide range of scales for k and q.
The response function approach becomes particularly powerful once the matter power spectrum for some fiducial set of cosmological parameters θ fid is known with high precision since the response function allows us to translate this to another cosmology θ as follows: This expression is only valid if the difference between the linear power spectra of the fiducial and target cosmologies is small, but as shown in [96] this limitation can be overcome by performing a multi-step reconstruction that takes into account the cosmology dependence of the response function. This procedure has been implemented in the RESPRESSO package [96], which is making use of a fiducial matter power spectrum at the Planck 2015 cosmology [97] obtained from a set of high-resolution simulations with suppressed variance [98].
Summarizing, in Section V we shall compute the nonlinear matter spectrum using gRPT, EFT and RESPRESSO and check whether our results on one-loop bias are sensitive to this choice, in particular when compared to using the measured nonlinear matter spectrum in the simulations as done in Section IV (our baseline, main results). Note that these three models are on a somewhat unequal footing, in the sense that gRPT and RESPRESSO have the same number of free parameters, but the latter has in it information from simulations already; on the other hand, EFT has one extra free parameter compared to the other two that should help fit the data, at the expense of perhaps a lower figure of merit. That provides a useful range of strategies that can be applied to data.

Degeneracy between stress-tensor and higher-derivative effects
The counter-term due to a non-vanishing stress-tensor is clearly degenerate with the higher-derivative bias contribution that was discussed in Section II A. However, since at one-loop order the counter-term enters only through P mm , the two effects leave different imprints on the auto and cross galaxy power spectrum: which would in principle allow us to break their degeneracy and constrain c 1 and β 1 simultaneously in a joint analysis, for instance, in combinations of galaxy clustering and weak lensing. This ceases to be true if there are significant sensitivities to the nonlinear regime stemming from the bias loop integrals, which would also be absorbed by the k 2 P L (k) dependent terms. As they are not guaranteed to be identical for P gm and P gg , this would lead to different values of c 1 in Eqs. (23) and (24). For that reason we choose the more conservative approach of collectively describing these effects by two independent parameters, β × P and β P , when performing model fits. If, on the other hand, only higher-derivative bias is relevant we note that in that case we would have β P = 2b 1 β × P . This is what happens, by definition, when we use the nonlinear matter spectrum measured from simulations to test one-loop galaxy bias in Section IV rather than the models described above, and thus it is a useful consistency check.

D. Stochasticity
We now turn to the final ingredient in modeling the clustering of biased tracers, namely, how their formation history is impacted by very short-wavelength fluctuations. While the description of these modes is beyond the conventional reach of perturbation theory, an important characteristic is that they are mostly uncorrelated with the long-wavelength perturbations. That means on large scales they can only contribute as a stochastic field 6 ε g , which does not correlate over long distances and is thus described by highly localized N -point functions in configuration space. In Fourier space this allows for the following expansion of the stochasticity power spectrum C gg (k) [7]: with the constant term, N 0 , representing deviations from purely Poissonian shot noise on large scales. The scaledependent piece is meant to account for the leading shortrange nonlocality of the stochastic field in analogy with the higher-derivative effects introduced above. Consequently, the parameter N 2 is also associated to some intrinsic length scale tied to the Lagrangian radius of the dark matter halos 7 . These shot noise corrections are qualitatively expected from the halo exclusion effect that arises from the condition that dark matter halos cannot overlap, which implies that their correlation function must approach −1 on scales below their radii [12,[34][35][36]. As has been shown in [99], for the power spectrum this leads to a modification of Poisson shot noise on large scales that transitions to the familiar value 1/n in the limit k → ∞ (n being the average tracer number density). Depending on the tracer of the matter field, N 0 can either be negative or positive, indicating large-scale sub-or super-Poissonian shot noise, respectively. The former occurs usually for galaxy populations with low satellite fractions that trace the centers of massive halos, whereas super-Poisson values have been observed in simulations for low-mass subhalos or galaxies that frequently appear as satellites [36,100].
While the first scale-dependent term in Eq. (25) is able to capture an emerging change in the shot noise value with increasing k, we stress that it needs to be strictly interpreted as a low-k expansion because the truncation at finite order introduces various evident shortcomings: 1) in the limit k → ∞ Eq. (25) does not approach Poisson noise as expected, and 2) it does not converge upon Fourier transformation to configuration space, so that the description of analogous effects in the correlation function cannot be described in this way. However, as long as one is interested in large scales, the nontrivial noise properties are all constrained to small scales in the correlation function, and therefore they can be ignored.
The stochastic terms are also required from a different point of view. Similar to the loop integral's sensitivity to the non-perturbative regime for growing values of k, they terms of the general bias expansion, such as δ etc. For the power spectrum at one-loop order these extra fields are of subleading importance, but they give a relevant contribution to the bispectrum [7]. 7 Throughout this paper we will refer to this interchangeably as scale-dependent stochasticity or scale-dependent noise.
can also leave an impact on very large scales. This becomes clear when taking the large-scale limit of Eq. (13), which gives but this value is highly dependent on the order at which we truncate the bias expansion as terms at two-loop order and beyond will add to this limit [27]. For that reason, we deal with such terms by subtracting Eq. (26) from Eq. (13) and absorb their contribution into the free parameter N 0 [16,56]. Subleading corrections to Eq. (26), if relevant, scale as ∼ k 2 and can similarly be absorbed by N 2 . Finally, it is also possible to have stochasticity in the cross power spectrum. While the field ε g cannot contribute since ε g δ = 0, it is expected that the matter field itself is described by a deterministic part, captured by perturbation theory, and a stochastic part ε m , which is due to small-scale modes leaving an imprint on large scales. Such a field, on the other hand, would correlate with ε g and on the grounds of mass and momentum conservation of the matter field the resulting cross [101][102][103]. Therefore, which shows that stochasticity in the cross power spectrum is already a higher-order effect.

E. Co-evolution relations
Nonlinear evolution contributes to the biasing of any class of tracers. It was shown in [22,104] that even when starting from a purely local relationship between the galaxy and matter density contrasts at the time of formation (i.e., only the local bias parameters b n are unequal to zero), this is no longer the case at any later point in time. Assuming conserved evolution (hereafter co-evolution) of the galaxies after formation one can derive the following co-evolution relations [22,23,27] where quantities with subscript L denote the Lagrangian bias parameters, i.e. their values at past infinity. The additional terms involved in Eqs. (28) and (29) are those induced by gravitational evolution and can be entirely expressed in terms of lower-order bias parameters. We see that having zero nonlocal bias parameters at late times requires highly fine-tuned Lagrangian parameters and various analyses that measured the tidal bias from simulated halo catalogs using various techniques  [22,23,37,[105][106][107] have conclusively ruled out this scenario. On the other hand, the local Lagrangian approximation (LL), γ 2,L = 0, provides a much more accurate description of these measurements, even though the most recent results from [106,107] indicate slightly lower values. This is shown in Fig. 1, where we plot their data obtained from various halo samples at different redshifts against the LL prediction (dashed line). An alternative estimation (strictly speaking, not a coevolution relation) of the tidal bias parameter in the context of the excursion set approach was discussed in [105]. Using a random walk with correlated steps to determine the probability for crossing the collapse barrier (see also [108,109]), they make a prediction for γ 2 , which can be represented by the following quadratic fit: This fit is shown as the solid line in Fig. 1 and provides a slightly better description of the measurements than the LL assumption for b 1 1.3. Hereafter we shall refer to Eq. (30) as the "excursion set relation" as a shorthand, but note this does not represent a first-principle calculation. The shaded region around Eq. (30) in Fig. 1 denotes a Gaussian distribution with uncertainty of 0.25 around γ 2,ex , which corresponds to our best constraints in fits to be discussed below when γ 2 is set to be free. Given these uncertainties the distinction between Eq. (30) and the LL relation is likely irrelevant for the purposes of this work, but should be reconsidered when γ 2 is significantly better constrained, as is the case in joint fits with the bispectrum [110].
Relations such as the ones above can be useful because they allow us to reduce the number of free model param-eters that have to be marginalized over, which in turn restores constraining power on other, potentially more interesting parameters. In general, fixing bias parameters via the co-evolution relations seems preferable over ignoring them altogether, but one should proceed with caution, as this can still be a potential source of bias in the analysis. In Section IV we try to determine in detail for what range of scales this is the case.

A. Galaxy and halo catalogs
In order to test the performance of the previously described model for a diverse range of bias properties, we use various populations of tracers at different redshifts, whose details are summarized in Table I.
We consider three galaxy catalogs at redshifts z = 0.132, 0.342 and 0.57 that were generated by assigning galaxies to selected halos and subhalos of dark matter only N-body simulations according to a halo occupation distribution (HOD). The catalogs for the lower two redshifts are based on the Carmen and Oriana boxes of the LasDamas simulation suite [111,112], whose cosmology is defined by the parameters given in Table II. The two boxes have a volume of (1000 Mpc/h) 3 and (2400 Mpc/h) 3 , respectively, mass resolution of 4.9 × 10 10 M /h and 4.6×10 11 M /h, and in total there are 40 realizations each that were set up with independent initial conditions using CMBFAST [113] and second-order Lagrangian perturbation theory [114]. The parameters of the HOD model have been tuned for these two cases to match the number densities and projected two-point clustering of the SDSS Main Galaxy Sample (MGS) with M r < −21 and the BOSS LOWZ sample, see [112] for more details. The high-redshift galaxy catalog derives from the Minerva simulations [115], a set of 100 boxes of volume (1500 Mpc/h) 3 (see Table II for cosmological parameters) with an HOD model that reproduces the properties of the BOSS CMASS galaxies. Even though the volumes of the individual boxes are different from those of the actually observed samples and our catalogs do not include the survey geometry or any systematic effects, for convenience we will still refer to them as MGS, LOWZ and CMASS.
In addition, we identified two low-and two high-mass halo samples at redshifts z = 0 and 0.974 from the Oriana boxes. The low-mass bin at z = 0 contains halos of masses 1 -10 × 10 13 M , while all of the more massive halos are grouped together in the high-mass bin. For the high redshift slice we split halos according to M halo ∈ [1.3, 2] × 10 13 M and M halo > 2 × 10 13 M . In the following we label these four samples as HALO1 to HALO4. All halos, including the ones in the Minerva simulations on which the HOD's are based, were identified using the friends-of-friends algorithm [3] with linking length equal to 0.2 times the mean interparticle separa- tion. The halos in the Minerva simulations are then further processed through the subfind algorithm [116].

B. Measurements of power spectra and their covariances
We measure the auto power spectrum, as well as the cross power spectrum with the nonlinear matter field, for each of the galaxy and halo catalogs in Table I using the estimator described in [117]. For the galaxy samples these measurements are carried out in bins of ∆k = k f over the range k min = ∆k to k max = 0.35 h/Mpc, whereas we use ∆k = 2k f and an according range of scales for all four halo samples. Here, k f = 2π/L box denotes the fundamental frequency of the simulation box, which is given by 6.3 × 10 −3 h/Mpc, 2.6 × 10 −3 h/Mpc and 4.2 × 10 −3 h/Mpc for Carmen, Oriana and Minerva, respectively. Finally, we correct the galaxy auto power spectra by subtracting the Poisson shot noise contribution 1/n.
From the measurements over the N R independent realizations we further estimate the covariance matrices for P gg , P gm and their correlation as follows: where X i , Y i = P gg (k i ) or P gm (k i ) and over-bars denote their averages. As the number of realizations is generally small compared to the total number of measurements, the estimated covariance matrices are noisy or not even invertible in case their dimensionality exceeds N R . However, with the exception of MGS, all samples have such low number densities that the covariance matrices receive strong contributions from shot noise, which boosts the diagonal elements and thus relatively decreases the correlation in the off-diagonal terms. For that reason we only consider the diagonal part of the covariance matrix when fitting P gg or P gm individually, but we retain the correlation between the two power spectra when performing a joint analysis. To reduce noise we compare the measured uncertainties to the Gaussian prediction and retain the maximum between the two, which provides a conservative estimate. The resulting diagonal or block-diagonal matrices can be inverted analytically.
As the various samples differ in terms of volume, number density and amplitude of the power spectrum, their statistical power to constrain model parameters can also vary significantly. To reduce the effect of some of these dependencies, we apply a volume scaling factor η chosen such that all samples have identical effective volume [118,119], which is given by when assuming a constant number densityn. We take the effective volume of the LOWZ galaxy sample evaluated at k = 0.1 h/Mpc as our reference, which yields V eff,LOWZ ≈ 6 (Gpc/h) 3 . The scaling factor for all other samples is thus determined by η = V eff,LOWZ /V eff,X and as the covariance matrices at leading order scale inversely with volume, we modify our estimates according to In Fig. 2 we show the cumulative signal-to-noise of the various samples as a function of maximum wavenumber k max using these rescaled covariances (see Figure legend for the value of η in each case). While the differing levels of shot noise lead to a more or less strong suppression for high k max , we note that the signal-to-noise is generally in good agreement over a large range of scales relevant to our analysis.

FIG. 2.
Cumulative signal-to-noise of galaxy and halo samples (solid and dotted lines, respectively) after rescaling the covariance matrices by the volume factor η (see Eq. 33).

C. Measurements of the linear bias parameter and large-scale shot noise
The simulations allow us to make precise measurements of the linear bias and the non-Poissonian correction to large-scale shot noise. As discussed in Section II D P gm is free of shot noise in the limit k → 0, so that taking the ratio with the matter auto power spectrum P mm (also measured directly from the simulations in the same k-bins) recovers b 1 : In practice we compute this ratio for each realisation before taking the average in order to cancel cosmic variance and afterwards fit a constant to the first few bins. Specifically, we use the cutoff scales k max = 0.025 h/Mpc, 0.028 h/Mpc and 0.021 h/Mpc for Carmen, Oriana and Minerva, respectively. The resulting values of b 1 are given in Table I and will be considered as the ground truth, when comparing to the output of the model fits below. Using these measurements of the linear bias parameter we can construct the fieldε g (k) = δ g (k) − b 1 δ(k), which corresponds to the stochastic field introduced in Section II D on large scales, where contributions from higher-order bias terms are negligible. The power spectrum ofε g can be expressed in terms of the measured galaxy and matter power spectra [36,120] , and should asymptote to Eq. (25) in the large-scale limit. This is demonstrated by Fig. 3, which plotsC gg in units of the Poisson noise of each sample and shows that the stochasticity power spectrum approaches a constant on large scales, as expected. Moreover, we see that this constant is negative for all but the low-mass halo sample at redshift z = 0, indicating that they have sub-Poissonian noise and are thus dominated by the halo exclusion effect [12,[34][35][36]. We measure the value of N 0 by fitting a constant to the data in Fig. 3 using the same cutoff scales as for the measurement of b 1 (approximately illustrated by the vertical dashed line). The results are given in Table I and are also shown by the solid lines. Unfortunately, it is not possible to determine the scaledependent component of the stochasticity directly from this data, as its effect is conflated with higher-order bias terms to produce the deviations from a constantC gg for scales beyond ∼ 0.03 h/Mpc.
In principle, the procedure followed here can be extended to measure higher-order bias parameters as shown in [106,107]. While this could be an interesting consistency check with the results from the model fits, we leave this possibility for future work. We now proceed to the description of how we infer model parameters from the measured power spectra. Firstly, since the different realizations are statistically independent, we define the overall likelihood as the product of all likelihood functions for a single ensemble member, which is just saying that we are simultaneously fitting a given number of realizations with the same model, see [121] for an identical approach. We assume that the likelihood function for a single realization is given by a multivariate Gaussian, so that where X (n) i is the measurement from the n-th realization in bin i, µ i the corresponding model prediction and C ij the rescaled covariance matrix as described in Section III B. The factor 1/N R ensures that after combining N R likelihood functions the sampling volume still corresponds to our desired volume of V eff,LOWZ and one can show that Eq. (36) is equivalent (up to a constant) to the likelihood for the mean of the data with covariance C X,ij . However, for all cases considered here our sampling volume is smaller than the combined volume of the N R simulation boxes, so the scatter in the data is less than statistically expected. We have checked that this has no significant impact on the parameter posteriors, but has to be accounted for when using the χ 2 as an indicator for the goodness-of-fit (see Section III E 2).
In a next step we minimize Eq. (36) by varying the model parameters using Markov chain Monte Carlo (MCMC) and a standard Metropolis-Hastings sampling algorithm, leaving the cosmological parameters fixed to their fiducial values. We generally adopt wide and flat prior distributions for all other parameters and the exact bounds used in our analysis are given in Table III. The bounds of the N 0 prior are determined from the condition that the overall stochasticity contribution to the power spectrum cannot be negative and that super-Poisson noise values (N 0 > 0), if they occur, tend to be small (see e.g. [100]). For the HALO4 sample we adapted the b 2 prior such that it is constrained to positive values in order to prevent a strongly bimodal posterior distribution. This behavior is caused by the b 2 2 term in Eq. (13), which comes to dominate on small scales because the high redshift and the highly biased nature of this particular sample lead to a generically large b 2 . Since the peak-background split prediction [122] strongly favors a positive b 2 given the linear bias parameter of this sample, we choose the prior to be positive. The only exception in our list of priors is the tidal bias parameter γ 2 , which we assume to be a Gaussian centered on the excursion set relation γ 2,ex (b 1 ) (see Eq. 30) with standard deviation 0.5. We stress that the mean of this prior depends on the linear bias parameter and therefore changes in each step of the Markov chain. This choice is motivated by 1) an otherwise strong degeneracy 8 between γ 2 and γ 21 , and 2) the good agreement (much better than the width of the prior) between the excursion set relation and independent measurements of γ 2 from halos.
For every fit that we perform we run several independent Markov chains and determine their convergence by means of the Gelman-Rubin criterion [123], specifically R − 1 < 0.01, but make sure that they run for at least a total of 120,000 accepted steps. After removing the burn-in these are combined into a single chain, which we pass to the software package getdist [124] to extract the parameter posteriors.

E. Performance metrics
Adding complexity to the model of the galaxy power spectrum can increase its range of validity down to smaller scales when compared to data, but the price to pay is a larger set of nuisance parameters that have to be marginalized over in order to arrive at the desired constraints on any cosmological parameters. Vice versa the application of co-evolution relations as discussed in Section II E can improve the merit of the model, but potentially only over a rather limited range of scales. Clearly, there is a compromise to be found between the validity and merit of our perturbative models. To quantify such a balance we compute various indicators from our Markov chains, following a similar analysis presented in [52] that focused on the matter power spectrum alone.

Figure of bias
One quality of a robust model must be its ability to return unbiased estimates of model parameters. Having determined the posterior means (symbolized by an overbar) of a set of parameters θ α along with their covariance matrix S αβ , we can define the following simple measure, which we denote as the Figure of Bias (FoB): Here we have additionally taken into account any uncertainty in the fiducial parameters by writingS αβ = S αβ + S fid,αβ . Since in this work we do not vary the cosmological parameters, we define the FoB solely with respect to the linear bias parameter and consider the measurements detailed in Section III C as the truth, in which case Eq. (37) simplifies to FoB = b 1 − b 1,fid / σ 2 b1 + σ 2 fid,b1 with σ fid,b1 denoting the uncertainties reported in Table I. As an amplitude parameter, which scales up and down contributions from different terms to the power spectrum, it can be regarded as a proxy for σ 8 . However, we caution that it is most likely not representative of other cosmological parameters, such as Ω m and h, which leave a stronger impact on the baryon acoustic oscillation signature and its overall shape. As a test of the galaxy bias modeling our definition of the FoB should certainly be adequate.

Goodness-of-fit
The FoB alone is insufficient to judge the validity of a model. It is easy to imagine a situation, particularly so when the FoB is only based on a subset of all model parameters, where one recovers the fiducial values, while the model is actually not a good description of the data. For that reason we also need to assess the goodness-offit, which we quantify in terms of the minimum standard χ 2 values that are computed as part of our likelihood, i.e., χ 2 tot = 1/N R N R n=1 χ 2 (n) (see. Eq. 36). However, as already mentioned in Section III D, when considering the total amount of data the measurement uncertainties used in our analysis are statistically too large by a factor of N R /η, so we have to rescale the χ 2 accordingly to get a meaningful value: This value can subsequently be compared to the 68 % or 95 % confidence limits of the χ 2 -distribution with dof degrees of freedom to determine the goodness-of-fit. The degrees of freedom are given by where N bin and N p are the number of bins and model parameters, respectively. In the following we will also perform MCMC runs where we replace the nonlinear matter power spectrum (i.e., the term that gets multiplied by b 2 1 or b 1 in the galaxy auto and cross power spectrum) with direct measurements from the underlying matter fields. In that case we inadvertently model part of the scatter in the data, which means that the difference between data X i = P gg (k i ), P gm (k i ) and model µ X,i is no longer given by a multivariate Gaussian with covariance ma- where n X = 2 (1) for X = P gg (P gm ) and C m,ij and C X×m,ij are the covariance of the matter field and its cross-covariance with X. This leads to a reduction of the inferred χ 2 value by the amount which we correct for using the fiducial value of b 1 before applying the rescaling according to Eq. (38). For joint analyses of the auto and cross power spectrum we proceed in the analogous fashion.

Figure of merit
Finally, we define the constraining power of a given model by the reciprocal of the posterior volume corresponding to the 68 % confidence limit. This is related to the determinant of the parameter covariance matrix and so we write our .
The inclusion of the additional factor θ fid,α θ fid,β ensures that the FoM is defined in terms of the relative parameter uncertainties, which yields more comparable results for our different samples. As for the FoB we focus on the linear bias parameter, such that we simply have FoM = b 1,fid /σ b1 .

IV. TESTING ONE-LOOP GALAXY BIAS
We now turn to the main goal of the paper, that is, to test the regime of validity of one-loop galaxy bias, and to see which effects in the bias expansion are most important. To do so we use the measured nonlinear matter spectrum in place of P mm , as this allows us to concentrate on bias independently of any issues related to the nonlinear evolution of matter description. As mentioned in the previous section, we test a variety of biased tracers at different redshifts to extract robust conclusions about which bias effects are generically important. Also note that by ignoring redshift-space distortions, we are avoiding extra parameters that can mask failures of the bias model. Furthermore, we check that the bias parameters we obtain from our MCMC chains satisfy basic sanity checks with independent measurements we can make and/or fitting formulae when available in the literature. Our most stringent test asks the bias model to simultaneously match the auto (galaxy-galaxy) and cross (galaxymass) spectrum, but we start with the most common (weaker) test of using the auto spectrum alone. In addition, we investigate the relevance of the higher-derivative and scale-dependent noise terms, as well as the impact of using co-evolution relations to reduce the number of free fitting parameters.
A. Validity of one-loop galaxy bias for the auto power spectrum In this section we analyze the performance of the galaxy auto power spectrum for three different modeling options: 1) only including terms from the one-loop galaxy bias expansion (also referred to as "standard" model in the following), 2) taking also into account short-range nonlocality and the resulting higher-derivative contribution, and 3) considering scale-dependent stochasticity instead of higher-derivative bias. While the first option has five free parameters in total (b 1 , b 2 , γ 2 , γ 21 and N 0 ), the other two have one extra parameter each.

Fiducial survey volume
We plot the derived FoB, reduced minimum χ 2 and FoM of these three models in Fig. 4, represented by the thick, solid lines. Our metrics are shown as a function of the maximum mode k max that is included in the fitting procedure up to k max = 0.35 h/Mpc, but for ease of comparison we introduce a scale k † at which point the model is deemed to fail and we truncate the FoM, which is indicated by a triangle. We define this scale as the combination of FoB and minimum χ 2 reaching a certain critical value, specifically with σ crit = 1.5 9 . This means we allow for a maximum bias of 1.5 σ if the minimum χ 2 matches the degrees of freedom (i.e., the expected χ 2 for a good fit to the data), and similarly, if the fit is completely unbiased, the discrepancy between χ 2 and dof can be as large as one and a half times the corresponding value for the 95 % confidence limit. The gray shaded areas in Fig. 4 mark the 68 % and 95 % limits for the FoB and χ 2 . We first notice that the standard model (black) performs very well and delivers unbiased constraints on b 1 as well as a good fit to the measurements for a broad range of scales. In fact, according to our criterion it does not break down before k max = 0.35 h/Mpc for all samples with the exception of the LOWZ and the two high-mass halo samples (HALO2 and HALO4), for which the model stops working at k max ∼ 0.2 h/Mpc. In these cases we can extend its range of validity by including either a higher-derivative (red) or scale-dependent noise term (blue) and whereas the former does not yield significantly smaller FoBs, the latter allows us to fit the data again up to k max = 0.35 h/Mpc. However, as anticipated, this comes with a penalty in the FoM and comparing the maximally achievable FoM for LOWZ, HALO1 and HALO2 the standard and scale-dependent noise models are about equal. In all other samples adding a scaledependent noise parameter has a bigger impact on the FoM than the higher-derivative term and leads to a decrease of ∼ 25 % in the FoM (at least within the tested range of scales).
Furthermore, we study how our performance metrics change when fixing γ 2 using the excursion set relation from Eq. (30) and also by additionally constraining γ 21 to co-evolution as a function of both b 1 and γ 2 (see Eq. 29). These cases are shown by the dashed and dotted lines in Fig. 6 for each of the three modeling options. Focusing on the dashed lines first, we observe that they tend to be valid over the same range of scales as when γ 2 is being varied under a Gaussian prior with the exception of HALO4, where fixing γ 2 gives rise to a significant increase in FoB. Moreover, they yield improvements of the FoM that are of the order 20-30 % for LOWZ and HALO2, but more modest in all other cases (some even have lower FoM), which seems to imply a dominance of the Gaussian prior. Remarkably, fixing γ 2 and γ 21 at the same time does not generally lead to any stricter limitations in the  FIG. 4. FoB, goodness-of-fit and FoM for the auto power spectrum as a function of the maximum k-mode included in the fit (note the differing y-axis ranges for the FoM). Grey shaded areas indicate the 68 % and 95 % confidence limits for the FoB and the χ 2 distribution, which are used to assess the breakdown of the model, at which point we stop plotting the FoM. Colors distinguish between the standard one-loop bias model (black, 5 free parameters) and extensions to higher derivatives (red, 6 free param.) and scale-dependent noise (blue, 6 free param.). Thick, solid lines only impose a Gaussian prior centered on γ2.ex(b1), while the thinner dashed and dotted lines have γ2, or both, γ2 and γ21 fixed in terms of the excursion set and co-evolution relations.
range of validity apart from the HALO1 sample, whose FoB and minimum χ 2 rise quickly for k max 0.2 h/Mpc. However, the benefit can be much higher and in particular for the standard model can result in improvements that are as large as a factor of three (e.g. MGS and CMASS).
We also note that for many samples γ 21 is clearly constrained to be non-zero, even for moderate scales k max ∼ 0.2 h/Mpc, when we fix γ 2 and thus break their degeneracy. This can be seen in Fig. 5, where we plot  Fig. 4. We therefore conclude that ignoring γ 21 in models of the galaxy power spectrum is highly disfavored opposed to employing the co-evolution relation.

Estimating the dependence on survey volume
We can now raise the question how these results depend on our criterion that defines the validity of the model and the adopted survey volume. To a first approximation these two questions are equivalent as a simple estimate reveals: increasing the volume by a factor V eff /V eff,ref (V eff,ref denotes our reference volume of V eff,LOWZ ) means that the parameter uncertainties decrease by the square root of that factor and hence the FoB increases accordingly. While the χ 2 grows linearly with volume, so do the degrees of freedom, such that the ratio χ 2 /dof must stay invariant. However, the 95 % confidence limit behaves in a different way, namely dof, which can be easily shown from the fact that the χ 2 distribution is well approximated by a Gaussian with mean dof and standard deviation √ 2dof for dof 1. Therefore, the second term in Eq. (43) also scales as the square root of the volume factor and we can write In reality this scaling will not be exactly satisfied because of parameter degeneracies and noise in the data, but it can be exploited to glean useful insights into how our results extrapolate to larger survey volumes. Using Eq. (44) we solve Eq. (43) for k † as a function of volume and determine the corresponding FoM, which we scale up by V eff /V eff,ref . The results, ranging from our nominal volume up to a ten-fold increase, are shown in Fig. 6 for the standard, higher-derivative and scaledependent noise models when only the Gaussian prior on γ 2 is applied. Note that we are limited in our k max range from 0.1 h/Mpc to 0.35 h/Mpc, so if the validity criterion is not met even for the smallest mode, we opt to compute the FoM for k † = 0.1 h/Mpc, which is clearly an over-estimation and thus marked by dashed lines in the figure.
As expected, we see that an increase in volume leads to a decrease in the range of validity for all three models with the extension to scale-dependent stochasticity consistently providing the largest k † values, followed by the higher-derivative and standard model. In particular, even for a ten-fold increase in volume the former still proves to be robust up to 0.2 h/Mpc and in several cases beyond. A special case is the HALO1 sample, which quickly reaches the lower limit for k † , caused by the FoB already being of order one for our fiducial volume. Since this is independent of k max (see Fig. 4) and we find smaller FoBs when combining with the cross power spectrum in Section IV B, this can likely be attributed to a projection effect when marginalizing over the posterior.
Furthermore, Fig. 6 demonstrates that the decrease in k † follows roughly a power law, whose slope is similar for the various models, but varies from sample to sample. However, for CMASS and HALO3 it is notably shallower when allowing for scale-dependent stochasticity, which has important consequences for its FoM, as the initial discrepancy compared to the standard model can be compensated at larger volumes. More generally we see that with increasing volume the scale-dependent noise model always gives rise to either equal or better FoMs than the standard model and so its extended k max range can overcome the penalty of having an extra free parameter -on the other hand, this is not true for the higher-derivative model.

B. Consistency between auto and cross power spectra
We now move on to check over which range of scales the various models already discussed in the previous section are able to make consistent predictions for the auto and cross power spectra. When dealing with a single observable whose model involves a large enough parameter space, it is possible that a failure of the model can be disguised by the nuisance parameters (meaning here parameters whose values are not being tracked by the FoB) absorbing any lacking contributions. Consistency in the auto and cross power spectra is therefore a much more stringent test of the one-loop galaxy bias model, especially because from a perturbation theory point of view one would expect these two statistics to be valid over the same range of scales.
In Fig. 7 we present the three performance metrics derived from jointly analyzing the measurements of the auto and cross power spectra. While the standard model still only has five free parameters, the other two now have seven each as we allow for independent higher-derivative or scale-dependent stochasticity effects in P gg and P gm . Focusing to begin with on the thick, solid lines again, we observe that the range of validity of the standard model is reduced by up to 30 % compared to the auto power spectrum alone. This can be attributed to a quick increase in the minimum reduced χ 2 , which happens for most samples at scales k max ∼ 0.2 h/Mpc (somewhat later for MGS and HALO1) and indicates an arising inconsistency between the auto and cross power spectra. Indeed, by comparing the parameter posteriors of the individual fits, we find that the scale at which the rise in χ 2 occurs is typically accompanied by a mismatch of marginalized parameter constraints, particularly for the second-order bias b 2 .
Introducing higher-derivative terms helps alleviating this inconsistency as the decrease in the χ 2 values shows, but it does not resolve it, which is especially evident for LOWZ and the two high-mass halo samples. On the other hand, the scale-dependent noise model brings about significant improvements for all samples without any indication of breaking down in the range of scales probed, except for HALO2 and HALO4, where it remains valid until k max ∼ 0.3 h/Mpc. This suggests that for our most massive objects either both, higher-derivative and scale-dependent noise effects, eventually become relevant, or the lack of two-loop terms from the general bias expansion. Comparing the various FoMs we come to similar conclusions as for the auto power spectrum alone: in cases where the standard model fails early on, the extended range of the more complex models can compen- sate for their extra free parameters, but does not lead to significantly larger FoMs. As discussed above, this might change when a larger survey volume (or stricter validity requirement) is considered.
As for the auto power spectrum alone, using the excursion set relation for γ 2 does generally not diminish any of the three model's range of validity (as before with the exception of HALO4 and to a lesser degree HALO1). However, since the combination of P gg and P gm reduces the impact of the Gaussian prior on the γ 2 posterior, the improvements in FoM are consistently larger and of the order ∼ 10 -20 % in most cases. Further and even more substantial improvements can be achieved by fixing γ 21 to its co-evolution relation, although we now observe that this leads to premature failures of the model for more samples than in case of P gg .
Finally, an interesting feature of the FoM graphs for various modeling options in Fig. 7 is their tendency to FIG. 8. Constraints on large-scale deviations from Poisson shot noise, as captured by the parameter N0. The red and blue shaded error bands show the 1-σ uncertainties obtained from a joint fit of the auto and cross power spectrum at a given kmax including either a higher-derivative or scale-dependent noise term. The hatched area represents the constraint from the large-scale (model-independent) fit as described in Section III C.
flatten off towards large k max . This suggests that the information that can be extracted from the nonlinear regime, at least for the linear bias parameter, is saturated for the combination of the auto and cross power spectrum beyond a certain value of k max .

C. Constraints on stochasticity and higher-derivative parameters
In the previous two sections we have seen that the scale-dependent noise model provides a more accurate and less biased description of the measurements in the nonlinear regime than the extension to higherderivatives. We now intend to further investigate this assertion by considering the derived constraints on the stochasticity and higher-derivative parameters.
To begin with we check the consistency between the results of our model fits and the (model-independent) determination of the constant shot noise parameter N 0 from large scale data only (see Section III C). This is shown in Fig. 8, where we plot the fully marginalized 1-σ uncertainties from the higher-derivative and scaledependent noise models as a function of k max (red and blue bands), obtained from the joint P gg and P gm fits with all parameters left free to vary. When compared to the large-scale measurement, depicted by the hatched band, we find in general good agreement with the model predictions. However, for scales k max 0.25 h/Mpc the higher-derivative model develops a dependence on k max that leads to a slight over-estimation of N 0 for all samples apart from HALO3. This is not the case when allowing for scale-dependent stochasticity instead, and we only identify a trend with k max for HALO2, but on scales where Fig. 7 already suggests a breakdown of the model.
Next, we consider the more interesting question whether we can put constraints on the scale-dependent noise or higher-derivative parameters that enter the auto and cross power spectra. Their 1-σ uncertainties, plotted against k max , are shown in Fig. 9, where we have again chosen the most conservative case in which none of the model parameters are held fixed. Each panel in the top half of the figure corresponds to either N 2 or N × 2 for a given sample, and similarly for β P and 2b 1 β × P in the bottom half. This choice of variables for β's is convenient since we are using the nonlinear matter spectrum to do the fits, in which case P σ = c 1 = 0, thus one should recover β P = 2b 1 β × P , see Eqs. (23)(24). The results derived from the individual P gg and P gm fits are shown as the gray shaded error bands, while the data points were obtained when jointly analyzing the two observables. Note that the scale-dependent noise and higher-derivative parameters have units of (Mpc/h) 5 and (Mpc/h) 2 , respectively, so we show them as the dimensionless numbers that multiply the factors k −2 HD /n or k −2 HD with the (arbitrary) normalization scale k HD = 0.4 h/Mpc.
First we note that the uncertainties on N 2 from the auto power spectrum alone are rather large, so that inclusion of even the smallest scales considered in our analysis does not yield a clear detection. The same is also true for β P in the higher-derivative model. On the other hand, the cross power spectrum, which does not contain N 0 and therefore has one free parameter less, gives non-zero values for N × 2 for all samples except MGS and HALO1 at a significance above the 1-σ level for k max = 0.27 h/Mpc. For the same samples we also find detections of β × P , but FIG. 9. Constraints on scale-dependent stochasticity in the auto and cross power spectra, N2 and N × 2 (upper half), and on higher-derivative effects, βP and β × P (lower half). Grey shaded areas represent the 1-σ uncertainties from the individual auto and cross power spectrum fits as a function of kmax, whereas the data points stem from their joint analysis. All results shown derive from fits in which all model parameter are allowed to vary.  Table I). The dashed line is a linear fit to this data with zero intercept.
their posterior means vary strongly with increasing k max , while those of N × 2 stay mostly constant beyond a certain scale.
For both models the parameter uncertainties shrink significantly once the auto and cross power spectrum are combined, which comes as no surprise given our conclusion in Section IV B that the stochasticity or higherderivative terms are required to restore consistency between the two statistics. Again it is noteworthy that the constraints on N 2 and N × 2 are much less sensitive to k max than those for β P and β × P , and additionally, they are in better agreement with the results from the individual fits, as is particularly evident for LOWZ, HALO2 and HALO4. This lends further support to the claim that scale-dependent stochasticity is the favored model extension for the samples under consideration.
Studying the recovered stochasticity constraints in more detail, we find that the N 2 parameter is consistently a factor of a few larger than N × 2 , meaning it is a stronger effect in the auto power spectrum than it is in the cross spectrum. Indeed, we obtain a similar outcome when we evaluate the relative contributions to the total auto or cross power spectrum from all combined galaxy bias loop corrections on the one hand and the scale-dependent noise term on the other 10 : while the two effects become nearly equal for P gg at k ∼ 0.3 h/Mpc, even on such small scales the stochastic term remains subdominant by at least an order of magnitude compared 10 For this we have used the best-fit parameter values obtained at kmax = 0.27 h/Mpc, though the precise kmax value is irrelevant.
to one-loop bias contributions in P gm , which is consistent across all samples. This seems reasonable if the stochasticity in the auto power spectrum is dominated by halo exclusion, which does not affect the cross power spectrum.
The halo exclusion effect should give rise to an additional signature in our constraints of N 2 . As discussed in Section II D we expect the stochasticity to approach the Poisson limit when k becomes large, which implies that for sub-Poissonian samples we should have N 2 < 0 and vice versa, N 2 > 0 for super-Poissonian populations of galaxies or halos. Among our samples only HALO1 has super-Poissonian shot noise on large scales and this is also the only case where we recover negative values of N 2 . Moreover, we find a strong (anti-) correlation between N 2 and the scale-independent shot noise parameter, which is shown in Fig. 10 where we plot the fiducial values of N 0 from Section III C versus the measured N 2 from the combination of P gg and P gm at k max = 0.27 h/Mpc. The data allows us to perform a simple linear, one-parameter fit, which gives and is shown by the dashed line. This suggests that the stronger the deviations from Poisson shot noise on large scales, the bigger will be the response from scaledependent stochastic contributions. The one outlier above the dashed line (and not included in the fit) corresponds to HALO4, the most extreme biased tracer in our sample. Even though the behavior of the measured N 2 and N × 2 parameters seems reasonable in the context of scaledependent stochasticity, it is difficult to ascertain that this is the correct interpretation. That is because its effect cannot be clearly distinguished from other higherderivative terms such as ∇ 4 δ. This term would give rise to a power spectrum contribution scaling as k 4 P L (k), which can appear identical to scale-dependent noise, as P L (k) ∼ 1/k 2 in the weakly nonlinear regime. For that reason we have repeated our analysis for the auto power spectrum where we implemented the exact k 4 P L (k) term in exchange for scale-dependent noise. We find indeed similar results that, however, display slightly larger χ 2 values consistently over all samples. Moreover, provided that the higher-derivative terms are a valid perturbative expansion in powers of (k R) 2 , higher order terms should be expected to become increasingly relevant at larger k, so it seems peculiar why the second one should dominate, while the first is mostly irrelevant. For these reasons we consider scale-dependent stochasticity as the more likely explanation. On the other hand, we caution that noise properties of halos/galaxies in our simulations may be unrealistic, in the sense that using a friends-of-friends halo finder imposes strong exclusion properties that may not be realized in nature. See e.g. [125] for the impact of halo finder in the small-scale clustering properties of halos.
FIG. 11. Same as Fig. 7, but only for the scale-dependent noise case with γ2 fixed to the excursion set relation, identified in Section IV as the bias model giving best results over a large range of scales when using the measured nonlinear matter power spectrum. Differently colored lines correspond to different choices of the nonlinear matter power spectrum, while the black solid line represents the previous result obtained with the measured nonlinear matter power spectrum.

V. RESULTS FOR SURROGATE MATTER MODELS
In this section we compare the results obtained using the measured nonlinear matter power spectrum with the various models presented in Section II C: the two fully perturbative predictions from gRPT and EFT, as well as the hybrid approach RESPRESSO. The aim of this analysis is to reveal how each of the three options affects the range of validity and the FoM.
In order to simplify this comparison we only focus on the case that includes scale-dependent noise and constrains γ 2 to follow the excursion set relation, which we previously identified as giving consistent results over a large range of scales, independent of the particular sample under consideration. Furthermore, we concentrate on the combination of the auto and cross power spectrum, as this allows for the most stringent test of the matter modeling. While gRPT and RESPRESSO do not contain any additional model parameters, we include the counterterm c 1 when performing fits using the EFT model and adopt a wide and flat prior in the range [−50, 50] k −2 HD for this extra parameter. Repeating all previous steps for analyzing the Markov chains, we arrive at the FoB, reduced χ 2 and FoM as a function of k max shown in Fig. 11, where the black, solid lines are for the true nonlinear matter power spectrum and the various colored ones correspond to its three surrogate models.
As expected, we first note that modeling the matter power spectrum introduces a further source of inaccuracy, which leads to a degradation of the reduced χ 2 for all samples. However, the decrease in the goodness-of-fit differs between the various models and we find that the EFT model is closer to the results obtained for the true nonlinear matter power for the halo samples, whereas RESPRESSO gives the best χ 2 behavior for the galaxies, but curiously it is somewhat worse for the two halo samples at redshift one, whereas gRPT has the largest χ 2 values for almost all cases and scales. These results are probably expected given the discussion in Section II C -EFT has the advantage of an extra free parameter compared to gRPT and RESPRESSO, while the latter has some information from simulations already built in. One could improve the χ 2 behavior of gRPT by including stress tensor corrections. In connection to this, [43] argued that having γ 21 free in the bias model partially compensates for the lack of stress tensor correction (as the two are exactly degenerate in the low-k limit), but our results indicate that this is probably not enough because these terms are important at nonlinear scales where the shape of the γ 21 contributions is not fully degenerate with k 2 P (k). Note also that there is mild evidence of overfitting caused by the extra parameter in the EFT, as it leads to better χ 2 than using the actual nonlinear spectrum measured in the simulations for the LOWZ and HALO2 samples, but overall this does not appear as a strong concern.
One may wonder to what extent are two-loop corrections in the matter field at play here. If this were the case, RESPRESSO should be uniformly the clear winner, particularly at low redshift (MGS, HALO1, HALO2) since it is the only method incorporating two-loop information. However, as seen in Fig. 11, the situation is not as clear cut, in particular for the halos. Another issue at play is that RESPRESSO uses perturbation theory to compute the difference in the power spectrum with respect to the Planck 2015 reference cosmology. The CMASS sample based on the Minerva simulations is the only sample whose cosmology is close to the reference cosmology, and we see that RESPRESSO performs clearly the best in that case, as expected. For the other samples, based on the LasDamas simulations, the cosmology is fairly different and this might be playing a role, since RESPRESSO assumes small deviations between target and reference cosmology.
The FoB of the surrogate models is generally similar to the true matter power with the two notable exceptions of MGS and HALO1, which become biased for the EFT model. In some other cases their FoB can also fall below that of the true matter power, which should not be interpreted as the surrogates being more accurate, but rather that the determination of the FoB is subject to some degree of noise. However, in total this means that the range of validity determined from our combined criterion following Eq. (43) is dominated by the increase in the reduced χ 2 . From the FoM panels of Fig. 11 (which terminate at the breakdown scale k † ) we see that RESPRESSO typically has the largest range of validity across all samples, closely followed by the EFT. Apart from the HALO1 sample where all models fail at k † ∼ 0.2 h/Mpc, only gRPT suffers more considerable reductions, in particular for LOWZ, CMASS and HALO2 with breakdown scales of the order k † ∼ 0.2−0.25 h/Mpc, compared to 0.3 h/Mpc and beyond for the true nonlinear matter power spectrum.
Finally, the FoM is not affected at all when employing the RESPRESSO and gRPT models, but not surprisingly decreases for the EFT due to its additional free counterterm parameter. Interestingly, although its FoM can be reduced by up to 40 % at low k max , it is able to compensate for the most part within its range of validity for all samples except MGS. This is in contrast with the modeling of the matter power spectrum alone, where it was shown in [52] that the FoM 11 derived from RESPRESSO was superior to the EFT.
A breakdown of the matter modeling can also be disguised by other model parameters absorbing the emerging differences. It is therefore interesting to check for any inconsistencies in the full parameter space compared to the true matter power spectrum results, which we present for the galaxy samples in Fig. 12 using again the scaledependent noise case with fixed γ 2 as an example. Each panel shows the 1-σ uncertainties at a given value of k max for the three models under consideration (colored bands), while the results using the true nonlinear matter power spectrum are depicted by the black data points. In general, the agreement is good for the majority of the parameter space and values of k max , but we see that gRPT tends to overestimate the large-scale shot noise amplitude N 0 and both, gRPT and EFT, consistently return larger values of the scale-dependent noise parameters N 2 and N × 2 . The latter is particularly evident for MGS, where the constraints suggest that the gRPT model compensates for a lack of matter power on small scales. Further, albeit less consistent deviations occur for b 2 and γ 21 in case of gRPT and EFT. To the contrary, RESPRESSO is an excellent match to the results from the true matter power spectrum for all samples and scales. In principle one could include two-loop information in the other calculations to make the comparison with RESPRESSO on a more equal footing, however this requires the introduction of further counter-terms (due to the increased sensitivity of the two-loop integrals to nonlinearities) and this probably will lead to a significant decrease in the overall FoM.
Finally, Fig. 12 also demonstrates nicely that for the measured matter power spectrum and, at least, RESPRESSO the posterior averaged mean parameter values are very insensitive to the fitted range of scales, when their uncertainties decrease. This implies that the bias model is not attempting to compensate for any unaccounted contributions and therefore is a further convincing point in favor of the robustness of the chosen model.

VI. CONCLUSIONS
This paper has addressed two leading questions: on which scales can one-loop perturbative models of galaxy bias accurately describe measurements of two-point statistics, and how much freedom in terms of unknown bias parameters do we need to allow for. In order to draw conclusions that are as general as possible, we have systematically analyzed a diverse collection of tracers, comprising three galaxy and four halo samples at different redshifts, each with statistical uncertainties corresponding to an effective volume of 6 (Gpc/h) 3 . To robustly test one-loop galaxy bias, we used the measured nonlinear matter spectrum as this allows us to concentrate on bias independently of any issues related to the nonlinear evolution of matter description. In addition, by ignoring redshift-space distortions we are avoiding extra parameters that can mask failures of the bias model.
Our most stringent test is based on MCMC fits to the auto power spectrum of galaxies (or halos) and their cross spectrum with the underlying matter field. We tested various modeling assumptions, and quantitatively assessed the model's performances by means of three metrics. Two of those, the figure of bias (FoB) and goodness-of-fit, jointly determine the range of validity by guaranteeing an unbiased recovery of model parameters (here measured in terms of the linear bias parameter b 1 , which can be thought of as a proxy for the ampli-tude of the matter power spectrum) and a good match to the data. The third metric, the figure of merit (FoM), quantifies the statistical uncertainty on model parameters (here also derived from b 1 ) and allows us to identify a potential compromise between a reduced parameter set (i.e., more constraining power) and range of validity.
The "standard" galaxy bias model for two-point statistics up to one-loop order contains five parameters: apart from b 1 , it depends on the quadratic and tidal biases, b 2 and γ 2 , a contribution to tides due to nonlocal gravitational evolution that appears at third order, γ 21 , as well as a constant stochastic term N 0 . In addition we have allowed for either higher-derivative or scale-dependent stochasticity effects, and tested for the impact of fixing γ 2 and γ 21 using excursion set and co-evolution relations. Our main findings from studying these various options can be summarized as follows: (i) The standard five-parameter model applied to the auto power spectrum performs remarkably well and is applicable to the full tested range of scales, k max = 0.35 h/Mpc, for all our tracers except the two massive halo samples and LOWZ (the galaxy sample hosted by the most massive halos) in which case it fails at k max ∼ 0.2 h/Mpc. In a joint analysis with the cross power spectrum the model develops inconsistencies in the nonlinear regime (as demonstrated by a quickly deteriorating goodness-of-fit), which lead to maximum validity scales that are reduced by up to 30 % with respect to the auto power spectrum alone.
(ii) The diminished model reliability for massive halos and the inconsistencies between auto and cross power spectra cannot be sufficiently resolved by the inclusion of the leading higher-derivative corrections. On the other hand, accounting for halohalo exclusion through scale-dependent stochasticity brings significant improvements for all samples, allowing us to fit the measurements (even in combination with the cross spectrum) nearly up to k max = 0.35 h/Mpc and likely beyond for some of the samples. From this we conclude that scale-dependent stochasticity has a stronger impact on galaxy/halo clustering than large-scale higherderivative effects.
(iii) This conclusion is further supported by our constraints on the scale-dependent stochasticity and higher-derivative parameters. From the joint analysis at k max = 0.25 h/Mpc and for the most conservative case (all model parameters are varied in the Markov chain), we find detections of the former at the level of 1.4 to 7-σ in the auto power spectrum (only our MGS sample is below the 1-σ threshold) and similar, but slightly less significant results in the cross spectrum. These detections become even more pronounced at larger wave numbers, while the mean parameter values are largely independent of k max . We also find clear detections of the higherderivative parameters, but in contrast these depend strongly on the range of scales included in the fit. We further find that scale-dependent stochasticity affects the auto power spectrum more substantially than the cross spectrum and a tight correlation between the scale-independent and -dependent parameters (see Fig. 10), both of which are in line with our interpretation in terms of halo-halo exclusion.
(iv) Application of the excursion set relation for γ 2 (see Fig. 1 of how this compares with the local Lagrangian relation and precise measurements from [106]) improves the FoM without diminishing the validity ranges for all samples and models (the only notable exception being the high-mass halos at redshift z = 1, the most biased tracer in our sample), which makes this a preferred choice for reducing the parameter space. We find that γ 21 is generally in good agreement with its co-evolution relation (and constrained to be non-zero, particularly when γ 2 is fixed) and when used in combination with the excursion set relation for γ 2 can give rise to an even more substantial enhancement of the FoM. Therefore, it typically provides the best compromise between constraining power and scales on which the model is applicable. However, we caution that in this case the latter can vary from sample to sample, so great care should be taken when choosing to apply this approximation to real survey data.
(v) Combination of all previous points suggests that the standard model with tidal bias fixed by the excursion set relation (four free parameters in total) provides a robust modeling choice for the auto power spectrum of the less massive halos in our set of samples and galaxy populations living in those (MGS, CMASS). For the more massive halos and the LOWZ galaxy sample it is most beneficial to include an extra parameter corresponding to scaledependent stochasticity. This is also the preferred option when considering combinations of the auto and cross power spectrum, which might be relevant in joint studies of galaxy clustering and weak lensing. In this case there would be six free bias parameters to account for independent stochastic contributions in the two spectra.
All of these results (Section IV) were derived using the measured power spectrum of the underlying matter field, which enters in the models of the galaxy auto and cross power spectra through the linear bias terms (see Eqs. 12 and 13). This was done explicitly to test one-loop galaxy bias independently of one-loop corrections to the matter field.
In Section V, we tested the impact of modeling the nonlinear matter spectrum for the complete set of biased tracers, using the "best choice" of bias priors, i.e. including scale-dependent noise and fixing the quadratic tidal tensor bias γ 2 to follow the excursion set relation, which as discussed above was identified as giving consistent results over a large range of scales. We considered two fully perturbative surrogates of the matter power spectrum, gRPT and EFT, as well as the hybrid approach RESPRESSO. While RESPRESSO contains no free parameters, the EFT model includes one free parameter associated to a counter-term arising from stress-tensor corrections, which is effectively built-in in RESPRESSO. In principle, gRPT should also include stress tensor corrections, but we left these out following the implementation in [43], which argued that having γ 21 free in the bias model partially compensates for this choice (as the two are exactly degenerate in the low-k limit). Compared to the results with the measured matter power spectrum all three models have a similar performance, but in some cases reduced validity ranges. We find that the best surrogate (in terms of validity and similarity of the recovered mean posterior values) is RESPRESSO, followed by the EFT (see Fig. 12). Due to its extra parameter the latter can lag behind in FoM, but can mostly compensate for this at larger k max .
As stated above, all MCMC fits carried out in this analysis made use of statistical measurement uncertainties corresponding to an effective volume of 6 (Gpc/h) 3 . This is significantly below the total volume that will be observed by upcoming galaxy surveys such as DESI and Euclid, but one should keep in mind that for the actual clustering analyses the total volume will be split into a number of redshift slices, for which our adopted volume here should be more than representative. Nonetheless, we have investigated how our derived model validity ranges scale with a more restrictive breakdown criterion, which we showed can to first order be interpreted as an increase in the effective volume. While the breakdown scales move towards lower wave numbers, this test revealed that they do so more slowly for the model that includes scale-dependent stochasticity. For that reason its maximum FoM values become level or even better than those of the standard model, suggesting that it might be the optimal modeling choice for any of our samples at larger volumes.
Although our collection of tracers span a large variety, we miss samples consisting of halos less massive than 10 13 M , as well as galaxies that match the clustering properties of emission line galaxies, which are the main targets of the DESI and Euclid surveys. There is no guarantee that our conclusions regarding the excursion set and co-evolution relations for γ 2 and γ 21 , respectively, would still be valid for these types of galaxies. In general, galaxies are expected to inherit their bias from their host halos and a deciding factor in the similarity between the two is the satellite fraction of the galaxy population. For all three of our mock galaxy samples the satellite fraction is rather low ( 10%), making them reliable tracers of the halo centers. As these obey the excursion set relation for the tidal bias, this likely explains why we have also obtained good results for our galaxy samples. On the other hand, for populations with higher satellite fractions one could consider using a given HOD model to reweight the tidal bias of the host halos according to the expected central and satellite fractions. We leave a more detailed exploration of this possibility for future work.
In forthcoming publications we will extend the results in this paper to account a variation of cosmological parameters [126], as well as extending the test of one-loop galaxy bias to the one-loop bispectrum [110].