The preprocessor currently creates dynamic and static files for representing the model (leaving aside the bytecode case, which is slightly different). These files have the following two characteristics:
the Jacobian matrix that they return is a dense one
in the dynamic file, the column indices for the dynamic (endogenous) variables are organized so as to avoid columns of zeros; in other words, if there are n variables, there are fewer than 3×n columns (for endogenous variables) in the dynamic Jacobian. This design decision makes sense since the matrix is dense, but it complicates computations with the Jacobian matrix.
The proposal is to change the design and have the dynamic and the static files:
return a sparse Jacobian
for the dynamic file, use exactly 3×n columns for the endogenous variables in the Jacobian (and also for higher-order derivatives)
It is expected that having a sparse Jacobian will improve performance, at least for large models.
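For concreteness, here is a minimal Julia sketch of what the 3×n-column sparse dynamic Jacobian could look like; the ordering of the three blocks as [y_{t-1} y_t y_{t+1}] and all names are illustrative assumptions, not part of the proposal itself.

```julia
using SparseArrays

n = 3                        # number of endogenous variables (toy value)
col_lag(i)  = i              # assumed column of variable i at t-1
col_cur(i)  = n + i          # assumed column of variable i at t
col_lead(i) = 2n + i         # assumed column of variable i at t+1

# Toy dynamic Jacobian with exactly 3n columns: a variable that never appears
# at a given lead/lag simply leaves its column empty in the sparse structure.
rows = [1, 1, 2, 3]
cols = [col_lag(1), col_cur(1), col_cur(2), col_lead(3)]
vals = [0.9, -1.0, 2.5, 0.4]
g1 = sparse(rows, cols, vals, n, 3n)
```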
Since the static and dynamic files are used in many places in the code, the transition cannot be done overnight, and the idea is to have the old and the new interface coexist for some time. The old interface would be removed at some point (unless it turns out that there is a real use case for dense Jacobians).
The interaction with block decomposition also needs to be taken into account when doing this transition. Currently, when block decomposition is requested, the preprocessor outputs static and dynamic files that have the same name as in the case without block decomposition, but a different interface. This is bad design. The preprocessor should use different names for those files when using block decomposition. We could even go as far as always providing the two versions of those files (with and without block decomposition), so that the block option of the model block would no longer affect preprocessor output (but only the algorithms used for computations). An even further step could be to always use block decomposition for perfect foresight, and drop the other code paths (but we probably don’t want to do that for the stochastic case). In particular, this could help with large backward models, for which we only need (dynamic) derivatives with respect to contemporaneous variables, an optimization which is already automatically done when using block decomposition.
Steps:
Change the preprocessor so that it produces the new static and dynamic files in parallel to the old ones
Change the preprocessor so that it always outputs the block decomposed files (with the new interface) in parallel to the non-block decomposed ones? (see also preprocessor#41 (closed))
Start using the new interface in the MATLAB/Octave code
Drop the perfect foresight algorithms that do not use block decomposition? (i.e. block becomes the default, and only choice, for perfect foresight). This would probably require #1858 to be done first.
Drop the old interface (if it is confirmed that dense Jacobians have no real use case)
$y_{t-1}^*$ are predetermined and mixed variables (states), and $y_{t+1}^{**}$ are mixed and forward variables (jumpers). If I understand this issue correctly, the problem we want to tackle with the new interface is that states and jumpers do not necessarily occur in period $t$, so any partial derivative with respect to those variables in $y_t$ will be zero, and this makes dealing with the $y_t$ columns in the static and dynamic Jacobians a bit tedious, right?
BUT: why do we want to increase the number of columns to 3n and not just to (nspred+n+nsfwrd)? I know sparse matrices are designed to handle many columns of zeros well, but in almost all models (nspred+n+nsfwrd) is much smaller than 3n, and the smaller size would most likely also benefit sparse algorithms.
@wmutschl when the matrices are sparse, using 3n columns instead of nspred+n+nsfwrd only makes the vector containing the column pointers larger in the sparse matrix representation (when stored by column).
Using 3n columns greatly simplifies computations over the columns, since it avoids having to track, for a given variable, its separate indices in $y_{t-1}^*$, $y_t$ and $y_{t+1}^{**}$, and it almost surely makes building the big Jacobian of the perfect foresight problem faster. The same is true for the cycle reduction of the first-order problem.
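A quick Julia check of the storage point above (arbitrary toy sizes): padding a compact Jacobian out to 3n columns leaves rowval/nzval untouched and only lengthens colptr.

```julia
using SparseArrays

n, nspred, nsfwrd = 100, 20, 15
ncompact = nspred + n + nsfwrd          # columns without the structurally empty ones
nfull    = 3n

A = sprand(n, ncompact, 0.05)           # stand-in for a compact dynamic Jacobian
B = [A spzeros(n, nfull - ncompact)]    # pad with empty columns (interleaving them
                                        # would give the same storage counts)

length(A.nzval) == length(B.nzval)      # true: identical nonzero storage
length(B.colptr) - length(A.colptr)     # == nfull - ncompact extra column pointers
```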
One caveat: the larger column indices will make it more likely to hit the 32-bit limit for multi-indices in tensors for higher-order derivatives (see preprocessor#89 (closed) for the details). I don’t think this is a blocker for the move we’re considering, but it needs to be kept in mind as a drawback of the new approach.
I just tried to create a 2^40 x 2^40 empty sparse matrix in Julia. I didn't encounter a problem with the indices (they should be Int64 anyway), but I didn't have enough memory to finish the task. At some point, we should use 64-bit integers for these indices in the preprocessor and rethink what we put in the JSON for very large models.
Essentially the problem comes from the way we represent higher-order derivatives. They are stored in matrices whose columns are multi-indices (so for example, if you have a model with $n=10$ equations, then the 4th-order derivatives matrix will have $(3\times 10)^4 = 810{,}000$ columns; the 32-bit limit, corresponding to about 2 billion, is easy to reach). If we were to store those as true higher-order tensors, then the problem would mostly disappear.
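A back-of-the-envelope Julia check of where that limit bites under the current multi-index column layout:

```julia
# Does the multi-index column count (3n)^order exceed typemax(Int32)?
overflows_int32(n, order) = big(3n)^order > typemax(Int32)

overflows_int32(10, 4)     # false: 30^4 = 810_000
overflows_int32(10, 7)     # true:  30^7 ≈ 2.2e10
overflows_int32(200, 4)    # true:  600^4 ≈ 1.3e11
```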
I haven't started anything yet on higher order in Julia. We could decide that for Julia we switch to true tensors. Now would be a good time to make that decision.
If Julia has such an object type, then for sure you should opt for tensors. This could even begin with the Hessian (which is fundamentally a 3D tensor).
Actually, given the discussion about 32-bit limit for multi-indices, I’m wondering whether we should take the opportunity of this change of model representation to move to tensors (a.k.a. ND arrays) for representing higher order derivatives.
Alternatively, the higher-order derivatives could be represented as a matrix whose rows correspond to nonzero elements and which has as many columns as there are dimensions in the tensor (plus one column for the value itself). This is what is currently done for derivatives with respect to parameters. There could however be performance issues, since that representation is not efficient.
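For concreteness, a toy Julia version of that row-per-nonzero layout for a 3rd-order derivative (names and layout are purely illustrative):

```julia
# One row per nonzero element of ∂³f_i/∂x_j∂x_k∂x_l, with columns [i j k l value].
g3_rows = [
    1  2  2  5  0.7;
    3  1  4  4 -1.2;
]

# All nonzero third-order derivatives of equation 1:
eq1 = g3_rows[g3_rows[:, 1] .== 1, :]
# Note: in a plain Float64 matrix like this toy one, the indices are stored as
# floats too; integer index columns plus a separate value vector would avoid that.
```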
I think we need to consider the various representations and interfaces to prevent having to switch representations from one step to the next. After all, for higher order we go from the preprocessor to either Julia or Dynare++ and subsequently Matlab/Octave and potentially Fortran. Our goal in #1825 (closed) is to replace the Dynare++ routines by Fortran. Also, I don't see where in our Matlab codes we actually handle the big arrays at higher order. It seems we simply pass them through to dedicated mex-files.
Speed should be the primary criterion. These functions are used only once in the case of simulation, but many times for estimation.
Is the argument for not repeating the symmetric arguments that all following operations are conducted with folded tensors or that it is cheap to form the full tensor?
I imagine that lexicographic order of indices is the one used for Compressed Sparse Column (CSC) Sparse Matrix Storage.
Since the only consumer of higher-order derivatives (⩾3) is the k-order solver (that @normann is currently rewriting in Fortran), I am confident that this representation is a good choice. I am a bit less sure about the Hessian, since it is used from other places (at least the 2nd order solver in pure MATLAB/Octave). But if needed we can probably write a tool (maybe a MEX) that transforms it into another representation.
When Ondra wrote Dynare++ 20 years ago, we observed that folded tensors were slower than full ones, because of the need to keep track of the order of multiplicity of the symmetric elements in linear operations involving these tensors.
I'm curious about what changed that in the new Fortran code
What the preprocessor outputs is not necessarily what is used internally for computations. My understanding is @normann finds it easier to have a folded tensor as an input, but I don’t know how he uses it afterwards. I let him comment further.
The Fortran replacement of current code is still work in progress. The idea is to compute the orders of multiplicity of symmetric elements once for all before calling any tensor-vector multiplication.
You will need to benchmark it because even in this case, it is hard to beat a straight matrix multiplication with BLAS gemm even with repeated elements.
In Dynare++, tensors are stored in dense matrices (with multi-indices of variables in columns, which incidentally is vulnerable to the 32-bit limit). This also explains why it was using BLAS. But I don’t think that dense storage was a good choice. My understanding is that @normann is going to use only sparse storage, so the constraints are different.
@MichelJuillard I plan to extensively test and profile the new code. The purpose of the Fortran code is to have simpler code to maintain and faster execution as well.
Also note that the preprocessor already outputs higher-order derivatives in folded mode (except for the Hessian where the symmetric elements are written).
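To give an idea of what folded mode saves: a symmetric order-k tensor over d = 3n variables has binomial(d+k-1, k) distinct elements, versus d^k columns when fully unfolded.

```julia
full_size(d, k)   = big(d)^k
folded_size(d, k) = binomial(big(d) + k - 1, k)

full_size(30, 4), folded_size(30, 4)    # (810000, 40920)
```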
@sebastien dyn_second_order_solver.m never actually handles the Hessian. That's done in sparse_hessian_times_B_kronecker_C.mex.
@MichelJuillard Is there any reason for keeping dyn_second_order_solver.m as opposed to directly going for the k_order_solver at order>1?
Second-order computations are a bit simpler than those for k > 2. If the Fortran k-order code for k=2 is faster than dyn_second_order_solver.m, there is no reason to keep it.
Unfortunately benchmarking is a lot of work, because the code for using the new (sparse) representation does not yet exist. In any case I’m not going to remove the current representation. The new (sparse) representation comes in addition to the current one (even though I hope we can get rid of the current representation at some point, if the new representation turns out to be faster/more convenient, as I hope).
@MichelJuillard Yes, we need some benchmarking. But I remember going for a Fortran replacement of the k_order_solver because it saves the considerable Dynare++ overhead. dyn_second_order_solver calls mex-files for all serious computations, which makes it a good target for a full replacement.
@normann Is the goal of the Fortran routine to also do second order efficiently?
The Fortran replacement of the current code will be able to solve the model at any order. It relies on sparse matrices only. The question is whether resorting to sparse matrices and optimized tensor-vector operations is enough to beat the current specialized code at the second (and soon third) order. I don't know yet. Still, the ideas put forward in the k-order code may be of some use for improving the second-order and third-order specialized codes.
Another (minor) issue. The Jacobian is a 2D sparse tensor, i.e. a sparse matrix, so the preprocessor can directly construct a sparse matrix in the native MATLAB/Octave/Julia format (i.e. Compressed Sparse Column, a.k.a. CSC).
There are basically two ways of creating that matrix (at least for the use_dll case under MATLAB/Octave, and for Julia):
directly construct the CSC representation (using mxCreateSparseArray in the MEX, or constructing a SparseMatrixCSC object in Julia); in pure MATLAB/Octave, there does not seem to be a way to achieve that
create the 2 vectors of indices and the vector of nonzero values, and call the sparse function
The first option is probably faster at construction time. But the second one will give slightly more compact matrices, because elements which are symbolically nonzero but numerically zero will also be eliminated (since the preprocessor only knows about the symbolic expressions, not the numerical values).
I’m inclined to go for the first option. Your comments welcome.
Note that in Julia, the function dropzeros! can be used to remove the stored zeros. So maybe that offers a compromise between the two options above (by directly constructing the CSC then calling dropzeros!).
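A minimal Julia sketch of that compromise (toy sizes and values): build the CSC object directly from preprocessor-emitted vectors, then drop the numerically zero entries.

```julia
using SparseArrays

m, n = 3, 4
colptr = [1, 2, 3, 3, 4]            # length n + 1; column 3 is empty
rowval = [1, 2, 3]
nzval  = [1.5, 0.0, -2.0]           # 0.0 is symbolically nonzero, numerically zero
J = SparseMatrixCSC(m, n, colptr, rowval, nzval)
dropzeros!(J)                       # only 2 stored entries remain
```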
In the MEX case, I could add code that does a similar compression at runtime.
For iterative solutions and estimation it is better to focus on structural zeros, because the sparsity pattern won't depend on the particular values of some parameters, and the analysis of this pattern can be done once per algorithm call rather than once per iteration.
For Julia, it would be better to return the matrix/vectors of nonzero values rather than the sparse matrix. We want as little code as possible in the functions loaded at runtime, as opposed to the code of DynareJulia proper. For perfect foresight, it is better to concatenate the matrices/vectors of nonzero elements for each period and then build the large sparse Jacobian than to concatenate sparse matrices.
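A minimal sketch of that stacking idea, assuming each period's dynamic file returns (rows, cols, vals) triplets over 3n columns ordered [y_{t-1} y_t y_{t+1}]; all names here are hypothetical.

```julia
using SparseArrays

function stack_periods(triplets, n, T)
    I = Int[]; J = Int[]; V = Float64[]
    for (t, (r, c, v)) in enumerate(triplets)   # one (rows, cols, vals) triplet per period
        append!(I, r .+ (t - 1) * n)            # rows of period t live in block (t-1)n+1 .. tn
        append!(J, c .+ (t - 2) * n)            # one shift maps all three variable blocks
        append!(V, v)
    end
    keep = (J .>= 1) .& (J .<= n * T)           # initial/terminal conditions handled elsewhere
    sparse(I[keep], J[keep], V[keep], n * T, n * T)
end

# Toy usage: T = 2 periods of a model with n = 2 equations and 3n = 6 Jacobian columns.
trip = [([1, 2], [3, 5], [1.0, -0.5]),
        ([1, 2], [2, 3], [0.3, 2.0])]
stack_periods(trip, 2, 2)
```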
What I mean is that the preprocessor can construct the vectors colptr, rowval and nzval of the SparseMatrixCSC type, and then construct that object. So there will be no computation at runtime. This is more efficient than returning the matrix of non-zero values and calling sparse later.
As another change in the new representation, I intend to remove the it_ argument of the dynamic functions (and thus the argument x will be a vector instead of a matrix). This is because leads and lags on exogenous are now always replaced by auxiliary variables, and those on exogenous deterministic are now forbidden, so it_ no longer has a purpose.