This is a rather philosophical matter. Historically, models were calibrated to long-run growth facts and then cross-validated by looking at the implied short- to medium-run implications for the business cycle, which is, in a sense, a different dataset.
When estimating a model, the parameters are chosen using the same dataset whose second moments you then try to match. You could argue that this is not a rigorous "out of sample" test.
People nevertheless do this, because when estimating you try to minimize the forecast error, so it is not guaranteed that the selected second moments are well matched. Checking whether the model matches them is a sensible test (in an informal sense, not a formal statistical test).
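For concreteness, here is a minimal sketch of that kind of eyeball comparison of second moments. The series below are made-up placeholders standing in for filtered data and a model simulation; the function name and numbers are purely illustrative.

```python
# Sketch: compare second moments of actual (filtered) data with those
# implied by a simulation from the estimated model. All data are placeholders.
import numpy as np

def second_moments(series):
    """Standard deviation and first-order autocorrelation of a series."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    std = s.std(ddof=1)
    autocorr = np.corrcoef(s[:-1], s[1:])[0, 1]
    return std, autocorr

rng = np.random.default_rng(0)
actual_output = rng.standard_normal(200)     # stand-in for filtered data
simulated_output = rng.standard_normal(200)  # stand-in for a model simulation

for name, series in [("data", actual_output), ("model", simulated_output)]:
    std, ac = second_moments(series)
    print(f"{name}: std = {std:.3f}, autocorr(1) = {ac:.3f}")
```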
What you could do is perform a test of the overidentifying restrictions, see e.g.
https://ideas.repec.org/a/eee/dyncon/v31y2007i8p2599-2636.html. This test will be a lot stricter than the eyeball econometrics performed on second moments. See also
http://delong.typepad.com/sdj/2011/10/calibration-and-econometric-non-practice.html.
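As a rough illustration only (not necessarily the specific test in the linked paper), a GMM-style test of overidentifying restrictions boils down to the J-statistic J = T · ḡ' W ḡ, which is asymptotically chi-squared with q − k degrees of freedom. The sketch below assumes you already have the stacked sample moment conditions, a weighting matrix, the sample size, and the number of estimated parameters; all values are hypothetical.

```python
# Sketch of a GMM-style J-test of overidentifying restrictions.
# g_bar: q sample moment conditions evaluated at the estimates,
# W: weighting matrix, T: sample size, k: number of estimated parameters.
import numpy as np
from scipy.stats import chi2

def j_test(g_bar, W, T, k):
    """J = T * g_bar' W g_bar, asymptotically chi2 with (q - k) dof."""
    g_bar = np.asarray(g_bar, dtype=float)
    J = T * g_bar @ W @ g_bar
    dof = len(g_bar) - k
    p_value = 1.0 - chi2.cdf(J, dof)
    return J, dof, p_value

# toy numbers purely for illustration
g_bar = np.array([0.01, -0.02, 0.005])  # q = 3 moment conditions
W = np.eye(3)                           # placeholder weighting matrix
J, dof, p = j_test(g_bar, W, T=200, k=1)
print(f"J = {J:.3f}, dof = {dof}, p-value = {p:.3f}")
```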
If you do Bayesian estimation, you should not be testing at all. Rather, you do model comparison and only reject your current model if you find a better one (the idea being that a poor model is still better than no model at all).
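To make the model-comparison point concrete, here is a minimal sketch, assuming you already have log marginal data densities for two candidate models (the numbers below are invented), of turning them into posterior model probabilities under equal prior model weights.

```python
# Sketch: Bayesian model comparison from log marginal data densities.
# The log marginal likelihoods below are made-up illustrative values.
import numpy as np

log_marginal_likelihoods = {"model_A": -1234.5, "model_B": -1230.1}

logs = np.array(list(log_marginal_likelihoods.values()))
# subtract the max before exponentiating for numerical stability
weights = np.exp(logs - logs.max())
posterior_probs = weights / weights.sum()  # equal prior model probabilities

for name, prob in zip(log_marginal_likelihoods, posterior_probs):
    print(f"{name}: posterior probability = {prob:.3f}")
```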