The preprocessor currently creates dynamic and static files for representing the model (leaving aside the bytecode case, which is slightly different). These files have the following two characteristics:
the Jacobian matrix that they return is a dense one
in the dynamic file, the column indices for the dynamic (endogenous) variables are organized so as to avoid columns of zeros; in other words, if there are n variables, there are fewer than 3×n columns (for endogenous variables) in the dynamic Jacobian. This design decision makes sense since the matrix is dense, but it complicates computations with the Jacobian matrix.
The proposal is to change the design and have the dynamic and the static files:
return a sparse Jacobian
for the dynamic file, use exactly 3×n columns for the endogenous variables in the Jacobian (and also for higher-order derivatives)
It is expected that having a sparse Jacobian will improve performance, at least for large models.
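For concreteness, here is a minimal Julia sketch of what the 3×n-column sparse dynamic Jacobian could look like; the ordering of the three blocks as [y_{t-1} y_t y_{t+1}] and all names are illustrative assumptions, not part of the proposal itself.

```julia
using SparseArrays

n = 3                        # number of endogenous variables (toy value)
col_lag(i)  = i              # assumed column of variable i at t-1
col_cur(i)  = n + i          # assumed column of variable i at t
col_lead(i) = 2n + i         # assumed column of variable i at t+1

# Toy dynamic Jacobian with exactly 3n columns: a variable that never appears
# at a given lead/lag simply leaves its column empty in the sparse structure.
rows = [1, 1, 2, 3]
cols = [col_lag(1), col_cur(1), col_cur(2), col_lead(3)]
vals = [0.9, -1.0, 2.5, 0.4]
g1 = sparse(rows, cols, vals, n, 3n)
```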
Since the static and dynamic files are used in many places in the code, the transition cannot be done overnight, and the idea is to have the old and the new interface coexist for some time. The old interface would be removed at some point (unless it turns out that there is a real use case for dense Jacobians).
The interaction with block decomposition also needs to be taken into account when doing this transition. Currently, when block decomposition is requested, the preprocessor outputs static and dynamic files that have the same name as in the case without block decomposition, but a different interface. This is bad design. The preprocessor should use different names for those files when using block decomposition. We could even go as far as always providing the two versions of those files (with and without block decomposition), so that the block option of the model block would no longer affect preprocessor output (but only the algorithms used for computations). An even further step could be to always use block decomposition for perfect foresight, and drop the other code paths (but we probably don’t want to do that for the stochastic case). In particular, this could help with large backward models, for which we only need (dynamic) derivatives with respect to contemporaneous variables, an optimization which is already automatically done when using block decomposition.
Steps:
Change the preprocessor so that it produces the new static and dynamic files in parallel to the old ones
Change the preprocessor so that it always outputs the block decomposed files (with the new interface) in parallel to the non-block decomposed ones? (see also preprocessor#41 (closed))
Start using the new interface in the MATLAB/Octave code
Drop the perfect foresight algorithms that do not use block decomposition? (i.e. block becomes the default, and only choice, for perfect foresight). This would probably require #1858 to be done first.
Drop the old interface (if it is confirmed that dense Jacobians have no real use case)
$y_{t-1}^*$ are predetermined and mixed variables (states), and $y_{t+1}^{**}$ are mixed and forward variables (jumpers). If I understand this issue correctly, the problem we want to tackle with the new interface is that states and jumpers do not necessarily occur in period $t$, so any partial derivative with respect to those variables in $y_t$ will be zero, and this makes dealing with the $y_t$ columns in the static and dynamic Jacobians a bit tedious, right?
BUT: why do we want to increase the number of columns to 3n and not just to (nspred+n+nsfwrd)? I know sparse matrices are designed to handle many columns of zeros well, but in almost all models (nspred+n+nsfwrd) is much smaller than 3n, and the smaller size would most likely also benefit sparse algorithms.
@wmutschl when the matrices are sparse, using 3n columns instead of nspred+n+nsfwrd only makes the vector containing the column pointers larger in the sparse matrix representation (when stored by column).
Using 3n columns greatly simplifies computations over the columns, since it avoids having to track, for a given variable, its separate indices in $y_{t-1}^*$, $y_t$ and $y_{t+1}^{**}$, and it almost surely makes building the big Jacobian of the perfect foresight problem faster. The same is true for the cycle reduction of the first-order problem.
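A quick Julia check of the storage point above (arbitrary toy sizes): padding a compact Jacobian out to 3n columns leaves rowval/nzval untouched and only lengthens colptr.

```julia
using SparseArrays

n, nspred, nsfwrd = 100, 20, 15
ncompact = nspred + n + nsfwrd          # columns without the structurally empty ones
nfull    = 3n

A = sprand(n, ncompact, 0.05)           # stand-in for a compact dynamic Jacobian
B = [A spzeros(n, nfull - ncompact)]    # pad with empty columns (interleaving them
                                        # would give the same storage counts)

length(A.nzval) == length(B.nzval)      # true: identical nonzero storage
length(B.colptr) - length(A.colptr)     # == nfull - ncompact extra column pointers
```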
One caveat: the larger column indices will make it more likely to hit the 32-bit limit for multi-indices in tensors for higher-order derivatives (see preprocessor#89 (closed) for the details). I don’t think this is a blocker for the move we’re considering, but it needs to be kept in mind as a drawback of the new approach.
I just tried to create a 2^40 x 2^40 empty sparse matrix in Julia. I didn't encounter a problem with the indices (they should be Int64 anyway), but I didn't have enough memory to finish the task. At some point, we should use 64-bit integers for these indices in the preprocessor and rethink what we put in the JSON for very large models.
Essentially the problem comes from the way we represent higher-order derivatives. They are stored in matrices whose columns are multi-indices (so for example, if you have a model with $n=10$ equations, then the 4th-order derivatives matrix will have $(3\times 10)^4 = 810{,}000$ columns; the 32-bit limit, corresponding to about 2 billion, is easy to reach). If we were to store those as true higher-order tensors, then the problem would mostly disappear.
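A back-of-the-envelope Julia check of where that limit bites under the current multi-index column layout:

```julia
# Does the multi-index column count (3n)^order exceed typemax(Int32)?
overflows_int32(n, order) = big(3n)^order > typemax(Int32)

overflows_int32(10, 4)     # false: 30^4 = 810_000
overflows_int32(10, 7)     # true:  30^7 ≈ 2.2e10
overflows_int32(200, 4)    # true:  600^4 ≈ 1.3e11
```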
I haven't started anything yet on higher order in Julia. We could decide that for Julia we switch to true tensors. Now would be a good time to make that decision.
If Julia has such an object type, then for sure you should opt for tensors. This could even begin with the Hessian (which is fundamentally a 3D tensor).
Actually, given the discussion about 32-bit limit for multi-indices, I’m wondering whether we should take the opportunity of this change of model representation to move to tensors (a.k.a. ND arrays) for representing higher order derivatives.
Alternatively, the higher-order derivatives could be represented as a matrix whose rows correspond to nonzero elements and which has as many columns as there are dimensions in the tensor (plus one column for the value itself). This is what is currently done for derivatives with respect to parameters. There could however be performance issues, since that representation is not efficient.
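For concreteness, a toy Julia version of that row-per-nonzero layout for a 3rd-order derivative (names and layout are purely illustrative):

```julia
# One row per nonzero element of ∂³f_i/∂x_j∂x_k∂x_l, with columns [i j k l value].
g3_rows = [
    1  2  2  5  0.7;
    3  1  4  4 -1.2;
]

# All nonzero third-order derivatives of equation 1:
eq1 = g3_rows[g3_rows[:, 1] .== 1, :]
# Note: in a plain Float64 matrix like this toy one, the indices are stored as
# floats too; integer index columns plus a separate value vector would avoid that.
```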
I think we need to consider the various representations and interfaces to prevent having to switch representations from one step to the next. After all, for higher order we go from the preprocessor to either Julia or Dynare++ and subsequently Matlab/Octave and potentially Fortran. Our goal in #1825 (closed) is to replace the Dynare++ routines by Fortran. Also, I don't see where in our Matlab codes we actually handle the big arrays at higher order. It seems we simply pass them through to dedicated mex-files.
Speed should be the primary criterion. These functions are used only once in the case of simulation, but many times for estimation.
Is the argument for not repeating the symmetric arguments that all following operations are conducted with folded tensors or that it is cheap to form the full tensor?
I imagine that lexicographic order of indices is the one used for Compressed Sparse Column (CSC) Sparse Matrix Storage.
Since the only consumer of higher-order derivatives (⩾3) is the k-order solver (that @normann is currently rewriting in Fortran), I am confident that this representation is a good choice. I am a bit less sure about the Hessian, since it is used from other places (at least the 2nd order solver in pure MATLAB/Octave). But if needed we can probably write a tool (maybe a MEX) that transforms it into another representation.
When Ondra wrote Dynare++ 20 years ago, we observed that folded tensors were slower than full ones, because of the need to keep track of the order of multiplicity of the symmetric elements in linear operations involving these tensors.
I'm curious about what changed that in the new Fortran code
What the preprocessor outputs is not necessarily what is used internally for computations. My understanding is @normann finds it easier to have a folded tensor as an input, but I don’t know how he uses it afterwards. I let him comment further.
The Fortran replacement of current code is still work in progress. The idea is to compute the orders of multiplicity of symmetric elements once for all before calling any tensor-vector multiplication.
You will need to benchmark it because even in this case, it is hard to beat a straight matrix multiplication with BLAS gemm even with repeated elements.
In Dynare++, tensors are stored in dense matrices (with multi-indices of variables in columns, which incidentally is vulnerable to the 32-bit limit). This also explains why it was using BLAS. But I don’t think that dense storage was a good choice. My understanding is that @normann is going to use only sparse storage, so the constraints are different.
@MichelJuillard I plan to extensively test and profile the new code. The purpose of the Fortran code is to have simpler code to maintain and faster execution as well.
Also note that the preprocessor already outputs higher-order derivatives in folded mode (except for the Hessian where the symmetric elements are written).
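To give an idea of what folded mode saves: a symmetric order-k tensor over d = 3n variables has binomial(d+k-1, k) distinct elements, versus d^k columns when fully unfolded.

```julia
full_size(d, k)   = big(d)^k
folded_size(d, k) = binomial(big(d) + k - 1, k)

full_size(30, 4), folded_size(30, 4)    # (810000, 40920)
```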
@sebastien dyn_second_order_solver.m never actually handles the Hessian. That's done in sparse_hessian_times_B_kronecker_C.mex.
@MichelJuillard Is there any reason for keeping dyn_second_order_solver.m as opposed to directly going for the k_order_solver at order>1?
Second-order computations are a bit simpler than those for k > 2. If the Fortran k-order code for k=2 is faster than dyn_second_order_solver.m, there is no reason to keep it.
Unfortunately benchmarking is a lot of work, because the code for using the new (sparse) representation does not yet exist. In any case I’m not going to remove the current representation. The new (sparse) representation comes in addition to the current one (even though I hope we can get rid of the current representation at some point, if the new representation turns out to be faster/more convenient, as I hope).
@MichelJuillard Yes, we need some benchmarking. But I remember going for a Fortran replacement of the k_order_solver because it saves the considerable Dynare++ overhead. dyn_second_order_solver calls mex-files for all serious computations, which makes it a good target for a full replacement.
@normann Is the goal of the Fortran routine to also do second order efficiently?
The Fortran replacement of the current code will be able to solve the model at any order. It relies on sparse matrices only. The question is whether resorting to sparse matrices and optimized tensor-vector operations is enough to beat the current specialized code at the second (and soon third) order. I don't know yet. Still, the ideas put forward in the k-order code may be of some use for improving the second-order and third-order specialized codes.
Another (minor) issue. The Jacobian is a 2D sparse tensor, i.e. a sparse matrix, so the preprocessor can directly construct a sparse matrix in the native MATLAB/Octave/Julia format (i.e. Compressed Sparse Column, a.k.a. CSC).
There are basically two ways of creating that matrix (at least for the use_dll case under MATLAB/Octave, and for Julia):
directly construct the CSC representation (using mxCreateSparseArray in the MEX, or constructing a SparseMatrixCSC object in Julia); in pure MATLAB/Octave, there does not seem to be a way to achieve that
create the 2 vectors of indices and the vector of nonzero values, and call the sparse function
The first option is probably faster at construction time. But the second one will give slightly more compact matrices, because elements which are symbolically nonzero but numerically zero will also be eliminated (since the preprocessor only knows about the symbolic expressions, not the numerical values).
I’m inclined to go for the first option. Your comments welcome.
Note that in Julia, the function dropzeros! can be used to remove the stored zeros. So maybe that offers a compromise between the two options above (by directly constructing the CSC then calling dropzeros!).
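A minimal Julia sketch of that compromise (toy sizes and values): build the CSC object directly from preprocessor-emitted vectors, then drop the numerically zero entries.

```julia
using SparseArrays

m, n = 3, 4
colptr = [1, 2, 3, 3, 4]            # length n + 1; column 3 is empty
rowval = [1, 2, 3]
nzval  = [1.5, 0.0, -2.0]           # 0.0 is symbolically nonzero, numerically zero
J = SparseMatrixCSC(m, n, colptr, rowval, nzval)
dropzeros!(J)                       # only 2 stored entries remain
```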
In the MEX case, I could add code that does a similar compression at runtime.
For iterative solutions and estimation it is better to focus on structural zeros, because the sparsity pattern won't depend on the particular values of some parameters, and the analysis of this pattern can be done once per algorithm call rather than once per iteration.
For Julia, it would be better to return the matrix/vectors of nonzero values rather than the sparse matrix. We want as little code as possible in the functions loaded at runtime, as opposed to the code of DynareJulia proper. For perfect foresight, it is better to concatenate the matrices/vectors of nonzero elements for each period and then build the large sparse Jacobian than to concatenate sparse matrices.
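A minimal sketch of that stacking idea, assuming each period's dynamic file returns (rows, cols, vals) triplets over 3n columns ordered [y_{t-1} y_t y_{t+1}]; all names here are hypothetical.

```julia
using SparseArrays

function stack_periods(triplets, n, T)
    I = Int[]; J = Int[]; V = Float64[]
    for (t, (r, c, v)) in enumerate(triplets)   # one (rows, cols, vals) triplet per period
        append!(I, r .+ (t - 1) * n)            # rows of period t live in block (t-1)n+1 .. tn
        append!(J, c .+ (t - 2) * n)            # one shift maps all three variable blocks
        append!(V, v)
    end
    keep = (J .>= 1) .& (J .<= n * T)           # initial/terminal conditions handled elsewhere
    sparse(I[keep], J[keep], V[keep], n * T, n * T)
end

# Toy usage: T = 2 periods of a model with n = 2 equations and 3n = 6 Jacobian columns.
trip = [([1, 2], [3, 5], [1.0, -0.5]),
        ([1, 2], [2, 3], [0.3, 2.0])]
stack_periods(trip, 2, 2)
```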
What I mean is that the preprocessor can construct the vectors colptr, rowval and nzval of the SparseMatrixCSC type, and then construct that object. So there will be no computation at runtime. This is more efficient than returning the matrix of non-zero values and calling sparse later.
As another change in the new representation, I intend to remove the it_ argument of the dynamic functions (and thus the argument x will be a vector instead of a matrix). This is because leads and lags on exogenous are now always replaced by auxiliary variables, and those on exogenous deterministic are now forbidden, so it_ no longer has a purpose.