Cubic-scaling iterative solution of the Bethe-Salpeter equation for finite systems

The Bethe-Salpeter equation (BSE) is currently the state of the art in the description of neutral electron excitations in both solids and large finite systems. It is capable of accurately treating charge-transfer excitations that present difficulties for simpler approaches. We present a local basis set formulation of the BSE for molecules where the optical spectrum is computed with the iterative Haydock recursion scheme, leading to a low computational complexity and memory footprint. Using a variant of the algorithm we can go beyond the Tamm-Dancoff approximation (TDA). We rederive the recursion relations for general matrix elements of a resolvent, show how they translate into continued fractions, and study the convergence of the method with the number of recursion coefficients and the role of different terminators. Due to the locality of the basis functions the computational cost of each iteration scales asymptotically as $O(N^3)$ with the number of atoms, while the number of iterations is typically much lower than the size of the underlying electron-hole basis. In practice we see that , even for systems with thousands of orbitals, the runtime will be dominated by the $O(N^2)$ operation of applying the Coulomb kernel in the atomic orbital representation

The Bethe-Salpeter equation (BSE) is currently the state of the art in the description of neutral electron excitations in both solids and large finite systems. It is capable of accurately treating charge-transfer excitations that present difficulties for simpler approaches. We present a local basis set formulation of the BSE for molecules where the optical spectrum is computed with the iterative Haydock recursion scheme, leading to a low computational complexity and memory footprint. Using a variant of the algorithm we can go beyond the Tamm-Dancoff approximation (TDA). We rederive the recursion relations for general matrix elements of a resolvent, show how they translate into continued fractions, and study the convergence of the method with the number of recursion coefficients and the role of different terminators. Due to the locality of the basis functions the computational cost of each iteration scales asymptotically as O(N 3 ) with the number of atoms, while the number of iterations is typically much lower than the size of the underlying electron-hole basis. In practice we see that , even for systems with thousands of orbitals, the runtime will be dominated by the O(N 2 ) operation of applying the Coulomb kernel in the atomic orbital representation

I. INTRODUCTION
Ab initio simulation of optical spectra is an essential tool in the study of excited state electronic properties of solids, molecules and nanostructures. For finite systems timedependent density functional theory (TDDFT) 1 based on local or semi local functionals is widely used. However, TDDFT fails in certain cases, notably for charge transfer excitations 2 which are essential in, e.g., photovoltaic applications. An alternative to TDDFT is Hedin's GW approximation 3 followed by the solution of the Bethe-Salpeter equation (BSE) 4 . Based on many-body perturbation theory 5,6 , the GW/BSE method is a more systematic approach than TDDFT, and it has been shown to give a qualitatively correct description of excitonic effects in solids 4,7 and charge transfer excitations 8,9 .
The Bethe-Salpeter equation is a Dyson-like equation for the two-particle Green's function, or equivalently for the fourpoint polarizability 10 . Within the field of electronic structure theory, developments of the BSE can be traced back to the beginning of sixties 6,11,12 , with the first ab initio implementations appearing a couple of decades later [13][14][15] . The GW/BSE method has been implemented using plane waves and real space grids, 10,[16][17][18][19][20][21][22][23][24] , linear combination of atomic orbitals (LCAO) [25][26][27][28][29] and within the FLAPW framework 30 . In practice, the standard way of solving the BSE is by converting it to an effective eigenvalue problem in a particle-hole basis. Since the size of the particle-hole basis scales quadratically with the number of atoms N, a straightforward diagonalization of the BSE Hamiltonian will scale like O(N 6 ). This very steep scaling makes it difficult to treat large scale systems like nanostructures and realistic models of organic photovoltaic devices. For such systems an improved scaling with the number of atoms would be highly beneficial.
Avoiding an explicit diagonalization of the BSE Hamiltonian can be done by using an iterative method to obtain a few low-lying transitions (e.g. the Davidsson method 31,32 ), or to directly aim for the spectrum, which can be done frequency by frequency using for example the GMRES method 31,33,34 or for the full spectrum with the Haydock recursion scheme 20,35,36 . Another option is to go over to the time domain and solve the equations of motion by time propagation 37,38 . These methods only require matrix-vector products to be performed, and assuming that the number of iterations, or time steps, is much smaller than the size of the particle-hole basis, the asymptotic scaling will be O(N 4 ). However, setting up the BSE Hamiltonian explicitly will still have the cost of O(N 5 ), and to avoid this, the matrix-vector products need to be performed on the fly, without explicitly constructing the matrix.
Benedict and Shirley made use of the Haydock recursion method to compute optical spectra in the Tamm-Dancoff approximation (TDA) without actually computing the whole BSE Hamiltonian 23 . This was achieved by using, in addition to the particle-hole basis, a real space grid product basis |x, y , in which the screened direct Coulomb interaction is diagonal (the exchange term is sparse in this representation). The scaling of the algorithm was reported to be O(N 4 ) with the number of atoms, however, a more careful analysis shows that it can be made to scale like O(N 3 ) by a proper ordering of the loops 39 .
This favorable scaling is heavily based on the use of a realspace representation for the particle-hole states. Similar gains can be obtained with the use of LCAO basis sets, where the same asymptotic scaling can be obtained by making use of the sparsity in both direct and exchange Coulomb interaction terms. It should be mentioned that by using additional assumptions of locality, which implies screening away Coulomb matrix elements between basis functions that are spatially far from each other, one could even achieve linear scaling 40 , however, the BSE has so far not been treated with these methods. Another linear scaling approach to many-body theory methods has recently been published by Baer and coworkers that make use of stochastic wave functions together with time propagation 41-43 . In the present publication, we will not venture into the realm of linear scaling but rather make use of the more standard iterative methods that, together with locality, lead to cubic scaling with the number of atoms. We present an iterative algorithm to obtain the BSE spectrum for molecules, making use of localized basis sets both for orbitals and products of orbitals. To go beyond the TDA a pseudo-Hermitian version the Haydock recursion scheme 20 is used. We derive the recursion relations for general matrix elements of a resolvent and show how they translate into continued fractions. Our method has been interfaced to the SESTA code 44 which is widely used for ground state density functional theory calculations (as an alternative, we can do all-electron calculation using numerical orbitals in an in-house implementation). For the case of the benzene molecule, as a prototypical example, we present a detailed study of the convergence properties of the iterative method, both within the Tamm-Dancoff approximation and for the full BSE. In particular, we study the effect of different termination schemes. Furthermore, for the sake of clarity, we provide a detailed account of the BSE method itself using our notation. Our algorithm scales asymptotically like O(N 3 ) with the number of atoms and uses O(N 2 ) memory. We present proof of principle calculations of our implementation, where the runtime is seen to be dominated by the O(N 2 ) scaling operations for systems up to several thousand orbitals, and discuss some of the bottlenecks and possible improvements of the scheme.

II. THEORY
A. Quasiparticles with the GW approximation Before the BSE can be set up and solved, the quasiparticle energies must be obtained from a preceding GW calculation 3 . Since the details of our GW implementation have been published elsewhere 45,46 , we will here only give a brief summary of the method. The poles of the one-particle Green's function G for an N-electron system occur at the ground and excited states of the corresponding N+1 and N-1 systems, that is at the electron addition and removal energies. Hedin's GW approximation connects the (irreducible) polarizability P, the non-interacting and interacting Green's functions (G 0 and G), the screened interaction W, and the self energy Σ in a set of closed equations G(r, r , ω) = G 0 (r, r , ω)+ G 0 (r, r 2 , ω)Σ(r 2 , r 3 , ω)G(r 3 , r , ω)d 3 r 2 d 3 r 3 .
In our implementation of the GW method the Green's function is expanded in a basis of numerical atomic orbitals (AO) of finite support { f a (r)} Here and in the following we explicitly write out the overlaps S ab = f * a (r) f b (r)d 3 r when they appear, the matrix quantities G ab (ω) are always contravariant and the placement of the indices as subscripts or superscript is arbitrary. With this representation of the Green's function G, we see that the polarizability (1) involves products of AOs f a (r) f * b (r). These products are expanded in an (auxiliary) product basis {F µ (r)} of localized numerical functions 45,46 where the expansion coefficients V ab µ and the product basis functions {F µ (r)} are determined by numerically expanding the products around a common center and removing redundant functions by a diagonalization based procedure 47 . Only overlapping pairs of orbitals are considered, making the matrix of expansion coefficients sparse when using AOs of local support. The indices {aa bb } will be reserved for atomic orbitals and {µ, ν} for product functions of atomic orbitals in the following. Using the product basis, the polarizability P(r, r , ω) is represented similarly to the Green's functions (5) where the overlap of the product functions S µν = F * µ (r)F ν (r)d 3 r appears. Similarly, it can be seen from equation (2) that the matrix elements of the bare v and screened W Coulomb interaction must be expanded in the product basis, while the self-energy Σ is expanded in the AO basis. For finite systems both the { f a (r)} and the product basis {F µ (r)} can be chosen as real.
The frequency-dependent quantities like G ab (ω) and P µν (ω) are represented on an even-spaced, real-axis, frequency grid via their corresponding spectral functions. An imaginary part of the energy is added in the Green's function G 0 (ω) and polarizability P(ω), that is sufficient to ensure their smoothness on the chosen frequency grid. The convolutions of spectral functions implied by equations (1) and (3) are computed via fast Fourier transforms. Due to the the fast convolutions and the locality of the product basis set, the asymptotic scaling of the algorithm is O(N 3 ) with the number of atoms N 45 . Finally the Dyson equation (4) is directly solved for each frequency to obtain G ab (ω). The quasiparticle energies are poles in G ab (ω) and can in certain cases be determined from inspection of the density of states. This does not give the quasiparticle wave function, however. In this paper we adopt the standard way of proceeding and assume that the Kohn-Sham 48 (KS) or Hartree-Fock (HF) eigenfunctions that are used to construct the zeroth order Green's function G 0 (ω) are good approximations to the quasiparticle states, so that they can be kept fixed and only the quasiparticle energy corrected. We will here only consider the so-called G 0 W 0 approximation where a single iteration of the GW equations is performed without self-consistency. We focus on the KS "starting point" in this subsection. The KS Hamiltonian is with T the kinetic energy, V ext (r) the external potential, V H (r) the Hartree potential and V xc (r) the exchange-correlation potential. The KS eigenfunctions are expanded in the AO basis where X ia = a S −1 aa a |i are the eigenvectors of the general- If we additionally assume that interacting Green's function G is diagonal in the KS eigenstates ψ i (r), the Dyson equation (4) reduces to a set of scalar equations where we have subtracted the exchange-correlation potential V xc ii in order to be able to work with the KS eigenvalues. The (assumed real) poles are then found by identifying the zeros of the denominator, either by a graphical solution if the full frequency-dependent quantities are available, or more commonly, by an expansion of Σ ii (ω) around KS i , which leads to Since we have access to the full frequency dependence of the self energy we can use the graphical method, which in principle is more accurate and also has the advantage that problems with satellite peaks can be avoided 49 . For comparison purposes we will also make use of the simpler equation (12).

B. Optical spectra with the Bethe-Salpeter equation
The directionally averaged absorption cross section of a molecule is given by where α mm (ω) is the dynamical dipole polarizability tensor given by α mm (ω) = − d 3 rd 3 r r m χ(r, r , ω) r m .
The interacting density response function, or reducible polarizability, χ(r, r , ω) is defined in the time domain as a functional derivative of the density with respect to the change of the external potential: χ(1, 2) ≡ δρ(1) δU(2) . Numbered bold indices, i = {r i , σ i , t i }, refer to space, spin, and time coordinates, whereas plain numbered indices contain space and spin, i = {r i , σ i }. χ(1, 2) is a two-point quantity and it is directly connected to the non-interacting density response χ 0 (1, 2) in RPA or in TDDFT with semi-local functionals 50 . However, when the Hamiltonian becomes non-local in space (as in the case of TDHF, TDDFT with hybrid functionals or Hedin's GW approximation) one must first find the retarded four-point polarizability L(1, 2, 3, 4), and then obtain the two-point one using the relation χ(1, 2) = L(1, 1 + , 2, 2) (see appendix A).
The four-point polarizability L(1, 2, 3, 4) satisfies the Bethe-Salpeter equation as derived in appendix A. In the frequency domain the BSE can be written L(1, 2, 3, 4 | ω) = L 0 (1, 2, 3, 4 | ω) with L 0 (1, 2, 3, 4 | ω) the non-interacting four-point polarizability and the BSE kernel. Already here the approximation has been made that the screened interaction W(1, 2) is independent of the frequency. Introducing an orthonormal two-particle basis |i j that has the representation 1, 2|i j = ψ i (1)ψ * j (2) in terms of the quasiparticle spin orbitals, we can expand L as with the matrix elements given by L 0 is expanded similarly. This leads to the matrix equation Equation (19) has to be inverted for each frequency which is computationally cumbersome. Fortunately, with certain approximations, it can be reformulated as an effective eigenvalue problem that only has to be solved once. To proceed with this we choose as our one-particle states the quasiparticle states in which the interacting Green's function G is assumed to be diagonal. This leads to L 0 being diagonal in the two-particle basis where f i denotes the occupation number of spin orbital ψ i . We put the expression (20) in equation (19), rearrange terms and get after some algebra where we introduced the frequency-independent BSE Hamiltonian The matrix H BSE is non-Hermitian. If we solve for its right eigenvectors and eigenvalues and define expansion coefficients of the eigenvectors in terms of the the two-particle basis A λ i j = i j|λ , we can obtain a spectral representation of the interacting polarizability as Here the overlap of the right eigenvectors S λ,λ = i j A λ * i j A λ i j appears because the eigenvectors of a non-Hermitian eigenvalue problem are generally not orthogonal. Using equations (14), (17) and a resolution of the identity in the quasiparticle product states, we can rewrite α mm (ω) in terms of L as with the transition dipoles Here ψ i (r) is the spatial part of ψ i (1), and x(σ) is the corresponding spin function. Here we denote the dipole operator as a ket, since in general a normal two-point operator A can be expanded as A = i j A i j |i j| ≡ i j |i j A i j . In the preceding analysis spin is explicit in the orbitals. However, H BSE is not diagonal in a spin orbital basis. If it is diagonalized in the spin indices (see appendix B), one singlet and three triplet product functions result, where the singlet one being the only one to have a non-vanishing transition dipole moment and so the one visible in the optical response. In the following we will suppress the spin indices and only work with the space quantities. Because of spin symmetry the coupling elements K are modified with the factor f s/t being 2 for the singlet and 0 for the triplet and the transition dipoles for the singlet get an additional factor of and the triplet transition dipole is zero. This means that the dynamic dipole polarizability effectively gets an additional factor of two for the singlet transition. Since we always consider the singlet for dipole transitions we can drop the "singlet" superscript and let D m i j refer to equation 28. An important simplification to the problem is that, due to the occupation factors, only particle-hole and hole-particle product states contribute to the polarizability (see appendix B) and we can write the eigenfunctions of H BSE as Here and in the following the indices {vv } will denote occupied (valence), {cc } empty (conduction, unoccupied) and {i jkl} general molecular orbitals. Projecting the eigenvalue equation (23) from the left with vc| and cv| we obtain a matrix equation with the following block structure where Using the symmetry properties of the BSE kernel K i j,kl = K * ji,lk = K * kl,i j and of the noninteracting Hamiltonian H 0 i j,kl = −H 0 ji,lk , we can also write The second form (31) is useful because it leads to computational savings when explicitly setting up the matrix. In the Tamm-Dancoff approximation the off-diagonal blocks in the H BSE (i.e. the couplings between hole-particle and particlehole states) are set to zero. This leads to two uncoupled Hermitian eigenvalue equations for A λ vc and A λ cv . Due to the symmetries displayed in equation (31) we see that the eigenvalues of the two blocks are related as vc λ = − cv λ , and the eigenvectors as A λ cv = A λ, * vc , where the superscript refers either to the {cv} or the {vc}-sector. Therefore, only one of the equations needs to be solved, for example the one for the the {vc}-sector: Using the fact that the eigenvectors are orthogonal for a Hermitian problem, the non-zero blocks of the the four-point polarizability are The TDA is a widely used approximation that, in addition to the computational advantages, often provide good agreement with experimental excitation energies for organic molecules [51][52][53] . At this point it is interesting to note the similarities of the BSE, TDDFT and time-dependent Hartree-Fock (TDHF). In TDDFT, although for semi-local functionals it is in principle sufficient to look at the response of the density, one can more generally look at the response of the density matrix as was done by Casida 54 . The resulting equations are very similar to the BSE, with the only difference that the GW eigenvalues are replaced by KS eigenvalues, and that the direct term is replaced by a TDDFT exchange-correlation kernel. For semi-local exchange-correlation functionals, and real orbitals, the resulting eigenvalue problem can be reduced to a Hermitian problem of half the size -the preferred formulation of TDDFT in quantum chemistry. However, when Hartree-Fock exchange is included (in hybrid functionals for example) the reduction to the Hermitian form does not simplify things quite as much, since one needs to take the square root of a full matrix which requires an additional diagonalization. The TDHF response equations have the same structure as the BSE ones with Hartree-Fock eigenvalues and an unscreened direct term. The Tamm-Dancoff approximation is also useful in TDDFT and TDHF. For TDHF with TDA one recovers the configuration interaction singles (CIS) equation.
To set up and diagonalize the BSE Hamiltonian (31) is feasible only for systems with a few thousand of particle-hole pairs. For larger matrices an iterative procedure is essential both for memory and runtime requirements. In the following we describe how the the dynamical dipole polarizability tensor (25) can be computed with a Lanczos-type iterative method.

Continued fraction expression for the BSE polarizability
Using equations (21) and (25) we can rewrite a matrix element of the dynamical dipole polarizability tensor (14) in a form involving the resolvent of the BSE Hamiltonian where In the last equation i j|D m refers to the singlet transition dipole in equation (28), and we denote the occupation difference matrix by In the Tamm-Dancoff approximation we only consider {vc} states which means that the transition dipoles become (the TDA-superscript since it will be clear from the context if the TDA is used or not). An attractive method of dealing with resolvents is the Haydock recursion scheme 35 , where a diagonal matrix element of a resolvent is efficiently computed from Lanczos recursion coefficients by means of continued fractions. Recently it has been shown that also non-diagonal matrix elements of the resolvent can be computed from the same Lanczos coefficients 20,36 . Usually the continued fraction representation of the resolvent is derived by determinant relations. Here we present an alternative derivation that only uses the power series expansion of the resolvent and the orthogonality among the Lanczos vectors. The off-diagonal matrix elements come out naturally in this formulation, and it is straight-forwardly extendible to block Lanczos, two-sided Lanczos and pseudo-Hermitian Lanczos schemes. Our derivation also connects to the theory of relaxation functions, also known as the Mori projection technique, first introduced to describe the Laplace transformed correlation function of dynamical systems 55 and later reformulated by Lee 56 in a form more closely related to the one we use here.
We want to compute i|(ω − H) −1 | j -a general matrix element of the resolvent of the Hermitian operator H, with the frequency ω in general a complex number. Let us define a frequency-dependent solution vector where |j = | j /|| j|| is the normalized | j . The matrix element of the resolvent in terms of the solution vector (37) reads Now we generate a set of orthonormal Lanczos vectors {| f n } with the starting state | f 0 = |j , using the standard recursion with the real coefficients a n = f n |H| f n and b n = f n−1 |H| f n . Next we expand the solution vector |j(ω) in the Lanczos basis where the frequency dependent expansion coefficients c n (ω) are given by projection onto the basis The expansion coefficients c n (ω) contain the information necessary to compute the sought matrix elements of the resolvent. The diagonal matrix element is especially simple (remembering that | f 0 = |j ) that is, only the zero-th coefficient c 0 (ω) is needed.
In the original Haydock recursion scheme only diagonal matrix element were computed. For our purposes we also need the off-diagonal elements, which can be computed using the higher expansion coefficients The projections i| f n of the vectors i| with the Lanczos basis can be computed and saved when the Lanczos vectors are available, thus avoiding the storage of more than the last two vectors. As we shall see, the coefficients c n (ω) can be computed from continued fractions. An advantage of using continued fractions is that one can terminate them in a physically sensible way which can reduce the number of Lanczos vectors one has to explicitly compute. Projecting the Hermitian transpose of equation (39) onto the solution vector |j(ω) gives b n+1 f n+1 |j(ω) = f n |H|j(ω) − a n f n |j(ω) Applying the operator H onto the solution vector gives which follows directly from the definition of the inverse together with the definition of the solution vector |j(ω) (37). Inserting equation (45) into equation (44) we obtain a recursion relation for the expansion coefficients c n (ω) For n = 0 the relation can be rearranged to give while for n > 0 we obtain We now introduce the relaxation functions of order n ϕ n (ω) 55,56 After inserting the expansion coefficients (50) in equations (48), (49) we obtain the continued fraction relations familiar from the Haydock recursion scheme After the relaxation functions have been computed for a certain frequency, the expansion coefficients c n (ω) can be recovered by inverting the relation (50) In summary, first the coefficients a n and b n , as well as the needed projections i| f n are obtained from equation (39), then for each ω (adding a small positive imaginary part, as appropriate for the retarded response), the relaxation functions ϕ n (ω) are computed from equations (51) using a properly chosen terminator. Then, the expansion coefficients c n (ω) are obtained from equation (52). Finally, the matrix elements are computed from equations (42) and (43).

Iterative non-TDA BSE
The full BSE Hamiltonian is non-Hermitian which means that the Lanczos procedure outlined above must be modified. A two-sided Lanczos procedure where both left and right eigenvectors are generated in the recursive procedure can be used, although it suffers from instability issues due to the loss of orthogonality between the Lanczos vectors, often requiring explicit reorthogonalization 58,59 . It also involves twice the number of applications of the Hamiltonian. Recently, a pseudo Hermitian algorithm was published that exploits the structure of the BSE eigenproblem to convert it into a Hermitian problem in a special scalar product 20 . In this algorithm one avoids the extra multiplication of the Hamiltonian that is present in the two-sided scheme. Below we summarize the pseudo-Hermitian algorithm in our notation.
An operator A is pseudo-Hermitian 60 with respect to the invertible Hermitian operator η, if This means that ηA is Hermitian, or equivalently that A is Hermitian under the scalar product ·|· η = ·|η· , provided that the metric η is positive definite so that the scalar product is well-defined. Furthermore, the eigenvalues of A are real if it is pseudo-Hermitian with respect to an operator that can be written like η = OO † with O an invertible operator 61 , and such a factorization can always be found for a positive definite η.
If A is a product of two Hermitian operators A = BC, then A is pseudo-Hermitian with B −1 and C, which can be checked using equation (53). The BSE Hamiltonian H BSE given by equation (22) can be written in matrix form with F given by equation (35). Since F 2 = I we can write Since FH 0 is diagonal and real, and K i j,kl = K * kl,i j , it follows thatH is Hermitian. From the preceding discussion it is clear that H BSE is pseudo-Hermitian with respect to η = F −1 or η =H. Since F is not positive definite it doesn't serve as a metric for a scalar product.H however, should be positive definite unless there exist singlet-triplet instabilities 51,52,62 . Such instabilities do occur for molecules, and especially for triplet excitationsH can lose its positive definiteness. This will make the pseudo-Hermitian algorithm fail. However, since in this case also direct diagonalization gives unphysical results one should not view this failure as a drawback of the method.
Within the pseudo-Hermitian Lanczos scheme the same steps are followed as in the Hermitian case. The only difference is that the scalar product is changed form the ordinary ·|· to ·|H· , with the Lanczos vectors orthonormal in this product. This means that equation (39) stays the same, but the Lanczos coefficients are modified to a n = f n |HH BSE | f n and b n = f n−1 |HH BSE | f n , which can be seen by multiplying equation (39) byH and using the orthogonality of the Lanczos vectors in the ·|H· scalar product. To make the starting vector normalized, it is chosen as | f 0 = |j = | j j|H| j −1/2 . Due to the metric introduced in our scalar product we effectively have right and left Lanczos vectors, related by | f L n = H| f R n , and | f R n = | f n , although only one set of vectors is necessary in the actual computation. The resolution of the identity in the Lanczos vectors is which means that the matrix element of the resolvent must be computed as = n i| f n cH n (ω).
Here cH n (ω) = f n |H|j(ω) replaces equation (41) -the other equations that are needed can be derived as in the Hermitian case, only replacing the scalar product. Here, even if we only want a diagonal matrix element we have to sum over the projections of all the Lanczos vectors, because the starting (right) vector is not orthogonal (in the ordinary scalar product) to the other Lanczos vectors.

C. Implementation of the Bethe-Salpeter equation
Having a general description of the BSE and of an iterative algorithm for solving it, we will describe our implementation using local basis functions.

Non-iterative algorithm
It is straightforward to compute the matrix in equation (31) and diagonalize it to obtain the four-point polarizability from equations (24) and (25). The matrix elements of the kernel K are computed using equation (27). The construction of the matrix requires O(N 5 ) operations (N being the number of atoms) and O(N 4 ) memory for storage. Solving the resulting eigenvalue problem using standard diagonalization techniques gives an even more prohibitive scaling of O(N 6 ) with the number of atoms. A way to avoid this excessive scaling is to limit the number of electron-hole pairs that are included in the calculation. However, the energy range covered by a constant number of pairs decreases with increasing system size, leading to a deteriorated description of the spectrum. In practice, the limit where explicit diagonalization is feasible is reached for a few tens of atoms: for larger systems iterative schemes are more efficient. Nevertheless, for small systems and for testing purposes straightforward diagonalization is a simple and useful alternative. Using our localized product basis set {F µ (r)}, the exchange and direct terms in equation (27) where the bare and screened Coulomb matrix elements in the local product basis are The expansion coefficientṼ i j µ of a product of two quasiparticle states is given byṼ where the expansion coefficients V ab µ are those appearing in equation (6). Unlike the local product coefficients V ab µ , the eigenstate product coefficientsṼ i j µ are not sparse and the equations (59) will scale like O(N 5 ) if the loops are ordered in the proper way as shown by the boxes (equation (61) costs O(N 4 ) operations). The singlet transition dipoles can also be calculated from the product functions where the dipole moments in the local product basis are D m µ = d 3 rF * µ (r)r m .

Iterative computation of the BSE
Let us first look at the TDA which is simpler than the full BSE. Because only the {vc}-sector needs to be solved, the eigenvalue problem is Hermitian. Moreover, because |D m = |D m in equation (36), we only need to calculate a diagonal matrix element of the resolvent to get the diagonal dynamical dipole polarizability Here |D m = |D m /||D m || in equation (34) is used as the starting vector in the Lanczos recursion. The dynamical dipole polarizability can directly be written as a continued fraction using equations (50) and (51): The Lanczos procedure for TDA is where first a non-normalized vector |f n+1 is computed and the b 2 n+1 coefficient is computed from its norm. The most timeconsuming step in computing the Lanczos coefficients is the application of the Hamiltonian to a vector. Generally, we express the Lanczos vector in the |vc , |cv basis, similarly to the BSE eigenvectors in equation (29) | f n = vc |vc f vc n + |cv f cv n , with the expansion coefficients f vc n = vc| f n , f cv n = cv| f n .
In the TDA we only make use of the |vc functions. We want to find the expansion coefficients of the vector resulting from the application of the Hamiltonian, that is vc|H BS E | f n The action of non-interacting part H 0 is evaluated in O(N 2 ) operations To exploit the sparsity in the kernels H ex and H dir , we also make use of an atomic orbital product basis |ab with real space representation rr |ab = f a (r) f * b (r ). Using the expansion of the quasiparticle states in AOs, equation (9) we have which allows us to rewrite the kernel as with the matrix elements of the kernel expressed in the AO basis The application of K to a Lanczos vector becomes The operation is separated in three steps: first the coefficient vector is transformed from the eigenstate basis to the local basis then the kernel K is applied in the local basis and the action on the coefficients becomes For the direct term we similarly get The Coulomb matrix elements v µν and W µν are given by equation (60). Because by construction the matrix of product coefficients V ab µ is sparse, a fixed number of atomic orbitals couple for each µ or ν and the operations in equations (77) and (79) scale asymptotically as O(N 2 ).
For the solution of the full BSE problem we use the pseudo-Hermitian Lanczos scheme with the scalar product ·|H· as explained in the previous section. A matrix element of the dynamical dipole polarizability computed with the iterative algorithm is given by where the coefficients cH n (ω) are computed from the continued fractions ϕ n (ω) as given by equations (51) and (52). Note that since we are already computing off-diagonal matrix elements there is little extra cost to obtain the full dynamical dipole polarizability tensor, and not just the diagonal matrix elements as is usually done in the TDA case. The Lanczos procedure in the pseudo-Hermitian case is In this scheme the intermediate vector | f n is saved between iterations in order to minimize the number of applications of the Hamiltonian. To perform a Lanczos iteration we need to applyH given by equation (56) to some vector | f n , now containing both particle-hole and hole-particle amplitudes. This in done much in the same way as in the TDA case. The term FH 0 is diagonal in the eigenstate basis and becomes (82) For the coupling matrix elements we transform to the atomic basis as in the TDA case, with the exception that we need to make use of both |vc and |cv vectors. The transformation is done for both sets of vectors as shown below After the auxiliary vector in equation (83) has been computed, the exchange and direct terms are applied in exactly the same way as in the TDA case, and after that the coefficients are back-transformed as (85) To conclude, we have shown that the application of the Hamiltonian onto a particle-hole state takes O(N 3 ) operations, both when using the Tamm-Dancoff approximation and solving the full BSE. If we use the continued fraction method with a given broadening we can assume that the number of Lanczos coefficients will be independent 23 of the number of atoms. This then leads to an overall O(N 3 ) complexity scaling of the algorithm.

III. TEST CALCULATIONS
A. Simple cases: Na 2 and CH 4 As a first test of the implementation we look at two simple test systems for which we can make accurate comparisons to other codes. The sodium dimer is simple in that it has only one valence orbital (filled with two electrons) which makes the spectrum dominated by single transitions. We computed the G 0 W 0 /BSE for this system starting from an all-electron HF ground state calculation performed with our code, using the cc-pVDZ Gaussian basis set treated as numerical atomic orbitals. The G 0 W 0 calculation was performed using fully frequency-dependent self energy in the range of the valence and semi-core states while the 1s core orbitals were treated with HF exchange only. The quasiparticle energies were computed using the standard first order expansion of the Re Σ ii (ω) around the initial HF eigenvalue i according to equation (12). This procedure is less accurate than solving for the quasiparticle energy graphically, but here we are more interested in a comparison rather than a fully converged result. The BSE was solved by direct diagonalization. For comparison we use the MOLGW code by Bruneval 28,63,64 where we as far as possible use the same parameters as in our code. In table I we compare the first ionization potential (IP) and electron affinity (EA) as well as the position of the first BSE transition obtained with the two codes. The agreement is excellent. Furthermore, the computed cross sections for both TDA and full BSE match very well even for higher transitions -the obtained optical spectra lie on top of each other, as can be seen in the figure 1. In the calculation of absorption cross section a Lorentzian broadening of 0.2 eV was used.
As a second example we chose to investigate the methane molecule, CH 4 . Similarly to the case of sodium dimer, we have chosen cc-pVDZ basis in both calculations, 0.2 eV Lorenzian broadening and also followed as much as possible the same procedure to extract G 0 W 0 eigenvalues. In table II we make the same comparison as in previous example, with the same excellent agreement for GW energies, first optical excitation in TDA and full BSE. The computed spectra are also in perfect agreement, as can be seen in figure 2.

B. Iterative method versus diagonalization
Confident that our BSE matrix is set up correctly we now turn to the iterative method. As a more suitable test case we chose the benzene molecule that is small enough for direct diagonalization (with a moderately large basis set) while still having many transitions that contribute to the spectrum. The ground state calculation was done with the SIESTA code 44 using the PBE functional and a DZP basis set, using an energy shift of 3 meV. Although this basis set is not fully converged for GW quasiparticle energies and optical properties it gives reasonable results for the IP (8.85 eV) and EA (-1.34 eV) compared to earlier obtained results 45 , and to experimental values 65 . The first visible optical transition in our calculations occurs at 6.95 eV for the TDA and 6.18 eV for the full BSE, compared to the experimental value of 6.92 eV (extracted from the experiment shown in ref. 66). We note that the effect of introducing the TDA here is quite large. A detailed account of the convergence properties of quasiparticle energies and BSE spectra for this system, as well as for larger  organic molecules, will appear in a forthcoming publication 49 . For the evaluation of the iterative method, the parameters we choose here are fully sufficient.
In figures 3 and 4, we show the comparison of the iterative method for TDA and non-TDA to direct diagonalization. A simple truncation of the continued fraction is used here. We see that the converged iterative spectrum is obtained with around 200 recursion coefficients for TDA and around 400 for the non-TDA spectrum for this broadening. Note that the full particle-hole space has a dimension of 1400 for TDA and 2800 for the full BSE.
Next we look at different terminators of the continued fraction. The last relaxation function in equation (51) is assumed to satisfy where ϕ T (ω) the terminator function. The simplest terminator is obtained by truncation, which means that the remaining coefficients that are not explicitly computed are set to zero. This gives ϕ T (ω) = 1/ω, and corresponds to a representation of the dynamical dipole polarizability as a sum of delta functions. However, often a more suitable terminator can be found by extrapolating the remaining coefficients according to some physical model suited to the system of study. In the first model we consider, the dynamical dipole polarizability is assumed to be a continuous distribution without gap, centered at a and with width 2E W . In this case the a n coefficients should converge to a, and b n should converge to b = E W /2 35,67 . At convergence, we get the the so called "self-consistent" terminator (SC) 35 which has the solution where the negative root was chosen. In the TDA case we only look at positive energies, so the dynamical dipole polarizability could be approximated (with sufficient broadening) to be a continuous distribution where the terminator (88) can be used. For the full BSE case however, both positive and negative frequencies are explicitly treated. Since the time-ordered polarizability (as well as the case without any imaginary convergence factor) is symmetric around ω = 0 it has at least two distributions separated by a gap.
The presence of the gap in the middle of the distribution is included in the second model we look at. Turchi et al analyzed the behavior of the recursion coefficients for densities of states with a gap and showed that if 2E G is the gap (a and E W defined as before) the a n coefficients oscillate with limits a ± = a ± E G , and b n with limits (b ± = E W ± E G )/2. 67 The period of the oscillations depends on the details of the density of states. If no gap is present, we have the situation of equation (88). For a symmetric distribution around the middle of a single gap, which could be a good approximation to the full BSE case, the period is two, and the terminator is which has the solution (for the negative root) We denote this model SC2. Because of symmetry around frequency ω = 0 the a n coefficients will oscillate around zero in the full BSE case. Indeed, since only the odd moments of the line shape contribute to a n , they should be zero 35,55 . However, in practice, orthogonality between the Lanczos vectors will eventually be lost due to numerical errors, and this introduces non-zero values of a n . In practice one can at any point in the recursion sequence make the assumption that the coefficients have converged and so put in the value of the last computed coefficients in equation (88) or (90). Another option is to make the assumption that the coefficients will converge to the average value of the already computed coefficients, removing some of the bias of the exact point in the chain the termination was made. We denote the averaged terminators by SC-av and SC2-av when the average is applied for the terminator in (88) or in (90) respectively. When a n is set to zero in the equations (88), (90), the terminator reduces to the one used in refs 20 and 36 (except for the signs of b 2 n and b 2 n+1 ) which is appropriate for the full BSE case. The consequence of the choice of terminator is illustrated in figures 5 and 6 for the TDA and full BSE case respectively. The dynamical dipole polarizability was computed with 20 and 100 iterations for TDA and full BSE correspondingly, while using simple truncation, or terminators defined by equations (88) or (90), with or without averaging. For TDA the self-consistent terminator SC gives a slight improvement while SC2 does better, although it introduces more broadening. When averaging the coefficients we introduce even more broadening in the continuum part of the dynamical dipole polarizability.
Looking at the a n and b n coefficients we see that they do not converge but oscillate, which is expected because our small basis set cannot give rise to a continuous dynamical dipole polarizability in the continuum. For an arbitrary stick-like distribution the behavior of the coefficients is complicated. If we look at the averages of the coefficients, we see that a n ≈ 72 eV which is close to the center of the spectrum, 65 eV, as estimated as half the range of the GW eigenvalues, while b n ≈ 33 eV which is close to a quarter of the range of the spectrum, as expected. Averages of the even and odd b n coefficients do not differ almost at all, hence the very similar appearance of the averaged versions of terminators SC and SC2. Using two following b n coefficients however, preserves some oscillations and gives a slightly better agreement to the converged spectrum.
In the non-TDA case, the SC terminator fails completely and gives negative intensities. Here it is clear that at least two oscillating coefficients must be used for a reasonable description. a n was confirmed to be zero, and the averages of the odd and even coefficients were seen to be 72 and 64 eV respectively. Their difference (8 eV) should correspond to half the gap E G , roughly 6 eV in our calculations, estimated from the GW eigenvalues. Here again, we observe that taking the average leads to a smoother spectrum that does not necessarily improve things from only using the last two coefficients. This is likely due to the complicated oscillations coming from the stick-like distribution obtained with our small basis set.

C. Demonstration of the low scaling with system size
To demonstrate the scaling properties of our algorithm we performed Lanczos iterations for alkane chains of increasing length. One-dimensional systems are the most favorable cases for algorithms that make use of sparsity, since the number of overlapping functions will be small. To demonstrate the asymptotic scaling of our algorithm this system is also ideal -the part that scales cubically depends on the number of AOs, while the dominant quadratic scaling operations involve the number of overlapping AOs. A sparse one-dimensional system maximizes the ratio of the former to the latter. The ground state calculation was done with SIESTA using the LDA functional and a minimal SZ basis set. Although scaling like O(N 3 ), our GW scheme turned out to be a bottleneck as the systems grow larger, and we therefore chose to bypass the GW step and directly do a TDHF benchmark starting from LDA eigenstates. For the purpose of testing the iterative BSE algorithm the choice of starting point makes no difference. In figure 7 we show the runtime, per Lanczos step, or alkane chains of different sizes divided by the runtime of the smallest chain, C 64 H 130 . The largest alkane chain we considered was C 1024 H 2050 with 6146 basis functions. The pseudo-Hermitian algorithm was used in this comparison. In the figure the part of the runtime coming from the basis transformation in equations (84), (84) that should scale cubically is contrasted to the remaining runtime contributions. For small systems the basis transform is negligible in comparison to the other terms but due to its cubic asymptotic scaling it will eventually start to dominate. We see that for the largest chain considered the basis transformation consumes around half the runtime, and we would need to go to even larger systems for the cubic terms to dominate completely. We must here stress the fact that we have used and almost artificially sparse system in order to demonstrate the cubic scaling of the algorithm. For more realistic systems that are less sparse and have more basis functions per atom, the onset where the cubic terms start to dominate will occur much later. We can thus expect that the quadratic and lower terms will dominate for systems with up to several thousands of basis functions, i.e., for most systems that can be practically treated with standard DFT methods.

IV. CONCLUSIONS
We have described, and implemented, an iterative scheme to compute the optical response of molecular systems at the Bethe-Salpeter level, using local basis sets. We go beyond the Tamm-Dancoff approximation by an extension of the Hermitian Haydock recursion scheme to the pseudo-Hermitian case and provide a derivation of this extension. We show that it is possible to develop an implementation with low scaling with the system size by exploiting the localization of the basis set of numerical atomic orbitals. Proof of principle calculations are shown, focusing on the case of benzene, and the influence of the number of recursion vectors is discussed, as is the effect of different terminators of the continued fractions on the obtained dynamical dipole polarizability.
The theoretical scaling of our method is O(N 3 ). However, calculations performed for alkane chains containing up to 1024 carbon atoms shows that the contribution of the cubic terms is small. Even for the largest systems considered the contribution of the cubic terms is of comparable magnitude to to that of the quadratic terms coming from the application of the Coulomb kernel in the atomic basis.
What we have presented here is a proof of principles of the method, plus an analysis of the convergence of our iterative BSE scheme. Our final goal, however, is to create a method (implemented in a suite of programs) capable of accurately investigating complex systems containing thousands of atoms. In order to reach this goal we are currently investigating ways to improve the performance of the method. These include an efficient parallelization scheme and a more optimized basis set for the expansion of atomic orbital products. It is also the case that the GW calculation needed to obtain the quasiparticle energies and states, as well as the screened interaction matrix elements, can benefit from similar improvements.

ACKNOWLEDGMENTS
We acknowledge support from the Deutsche Forschungsgemeinschaft (DFG) through the SFB 1083 project, the ANR ORGAVOLT project and the Spanish MINECO MAT2013-46593-C6-2-P project. Discussions with Mark Casida are gratefully acknowledged. DF thanks Lorin X.Benedict for discussions concerning the scaling of the algorithm. PK acknowledge financial support from the Fellows Gipuzkoa program of the Gipuzkoako Foru Aldundia through the FEDER funding scheme of the European Union, Una manera de hacer Europa. FF acknowledges support from the EXTRA programme of the "Università degli Studi di Milano-Bicocca" and from the Erasmus Placement programme for student mobility.
Appendix A: Derivation of the Bethe-Salpeter equation Here we derive the BSE equation following an approach similar to that given in refs 68 and 69. The purpose of this appendix is to derive the equations using our notation to avoid possible confusions with different notations and conventions that can be found in the literature.

(A13)
Up to this point the derivation has been exact. In order to obtain the working expression for the BSE kernel, K, we now make use of the GW approximation to the self energy. In this case both the Hartree potential v H and the self energy Σ can be expressed in terms of G: v H (1) = d(2)v(1, 2)ρ(2) = −i d(2)v(1, 2)G(2, 2 + ) , (A16) Here we note that the bare Coulomb interaction is instantaneous v(1, 2) = v(1, 2)δ(t 2 − t 1 ), but this is not in general the case for W. Equation (A13) still depends on four times. For our purposes, we want to look at the response at time t from a perturbation at time t , that is our perturbations are local in time. In this case we can express L in terms of the density matrix ρ(1, 2, t) as L(1, 2, 3, 4) = δρ(1, 2, t 1 ) δU(3, 4, t 3 ) δ(t 1 − t 2 )δ(t 3 − t 4 ) , where we identify t = t 1 and t = t 3 . Since the initial time is arbitrary for a system in equilibrium -the state of the system does not change in time when we are in the ground state -we furthermore only have to worry about the difference t − t. As in the Dyson equation for the Green's function G, we can then Fourier transform to get a dependence of only one frequency, thus giving L(1, 2, 3, 4 | ω) = L 0 (1, 2, 3, 4 | ω) + d(5678)L 0 (1, 2, 5, 6 | ω)K(5, 6, 7, 8 | ω)L(7, 8, 3, 4 | ω) .
(A18) which leads to one singlet solution, where H ex is included with a factor of 2, and three triplet solutions where H ex is absent. Knowing this, we work just with the real space quantities, remembering to include the correct scaling factor in front of H ex depending on if we want a singlet or a triplet solution The time-ordered four-point polarizability can be written in matrix form as From this expression it looks like we have to use all pairs, that is not only particle-hole and hole-particle pairs but also particle-particle and hole-hole pairs. But actually, only the particle-hole and hole-particle pairs contribute to L. To see this we set up the H BSE , and F matrices in blocks corresponding to the {vc}, {cv},{vv} and {cc} sectors (B10) This gives the following matrix to be inverted in equation (B8) The inverse of a matrix with this block structure is Looking at the polarizability L = [(ω + iγF)I − H BSE ] −1 F we see that due to the leftmost F matrix only the A-block, that is the {vc} and {cv}-sectors of H BSE , contribute to L. Diagonalizing H BSE and expanding in left and right eigenvectors gives the following expression Note that this is the time-ordered polarizability, the retarded one that we need for the response, that is equation (24), is obtained by setting the sign of the small imaginary part in the denominator to always be positive. We can also use the relations Im L t (ω) = sgn(ω) Im L r (ω) and Re L t (ω) = Re L r (ω), where the superscripts "t" denotes time-ordered and "r" retarded.