Address #412 holistic re-audit residuals (R2)#430
Conversation
A holistic codex review of the merged #412 + cleanup #422 state surfaced three documentation/test gaps that the per-PR cleanup review path could not see (it only scopes to the cleanup diff). All three are at most P3 in severity, but each is real claim-vs-coverage drift.

1. REGISTRY's top-level `Note (Phase 3 by_path ...)` `to_dataframe(level="by_path")` schema list omits the `het_*` columns (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`, `het_conf_int_lower`, `het_conf_int_upper`) that `_to_dataframe` has always emitted since the Phase 5 heterogeneity wave landed. Add them to the schema list so the registry contract matches the implementation.
2. The two new parity tests (`TestDCDHDynRParityHeterogeneity`, `TestDCDHDynRParityByPathHeterogeneity`) assert only `beta` and `se` from the R golden payload, leaving `t_stat`, `p_value`, `conf_int`, and `n_obs` unpinned. A regression in the inference extraction or the final `df_survey` refresh could ship while parity still passes. Pin `t_stat` at `SE_RTOL` (invariant to the critical-value distribution, since `t = beta / se`) and `n_obs` exactly.
3. While extending the parity assertions, this surfaced a real, undocumented Python-vs-R structural deviation: `_compute_heterogeneity_test` passes `df=None` to `safe_inference`, so Python uses the normal Z critical value (~1.96) for `p_value` and `conf_int`, while R `did_multiplegt_dyn(..., predict_het)` uses the t-distribution with df = n - k from the WLS regression. The structural gap produces ~0.1-2% rtol gaps on CIs and p-values that exceed `SE_RTOL` (verified empirically on the parity fixture: CI gap ~0.17% on h=1). Document the deviation in the heterogeneity R-parity Note; pin only `beta`, `se`, `t_stat`, and `n_obs` in the parity tests (`p_value` and `conf_int` parity intentionally skipped); add a TODO row tracking the optional df-threading work.
No estimator behavior, weighting, variance/SE computation, or default-statistical surface changed; documentation accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
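The pinning strategy described in item 2 can be sketched as a shared helper for the parity tests. This is a minimal sketch, assuming the golden payload and Python results are dicts keyed by horizon; `SE_RTOL`, the row shape, and the helper name are illustrative, not the suite's actual API:

```python
import numpy as np

SE_RTOL = 1e-6  # assumed stand-in for the suite's existing SE tolerance constant


def assert_heterogeneity_parity(py_effects, r_golden):
    """Pin beta/se/t_stat at SE_RTOL and n_obs exactly, per horizon.

    t = beta / se is invariant to the critical-value distribution, so
    pinning t_stat is safe even with the documented Z-vs-t deviation;
    p_value and conf_int are intentionally left unpinned here.
    """
    for h, r_row in r_golden.items():
        py_row = py_effects[h]
        np.testing.assert_allclose(py_row["beta"], r_row["beta"], rtol=SE_RTOL)
        np.testing.assert_allclose(py_row["se"], r_row["se"], rtol=SE_RTOL)
        np.testing.assert_allclose(py_row["t_stat"], r_row["t_stat"], rtol=SE_RTOL)
        assert py_row["n_obs"] == r_row["n_obs"]  # exact pin, no tolerance
```

A mismatch in any pinned field fails the run, so a regression in the inference extraction can no longer ship while parity still passes.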
---
Overall Assessment: ✅ Looks good

Executive Summary sections: Methodology · Code Quality · Performance · Maintainability · Tech Debt · Security · Documentation/Tests
---
Restored CI reviewer R0 on PR #430 flagged two follow-on items:

1. REGISTRY's refreshed heterogeneity Note described the non-survey variance branch in WLS terms ("standard WLS vcov for non-survey", "df = n - k from the WLS regression"). The implementation at `chaisemartin_dhaultfoeuille.py:5095-5106` is plain OLS via `solve_ols(design, dep_arr, return_vcov=True)` with no weights on the non-survey path; only the survey branch uses WLS. Adjust the wording so the deviation paragraph and the variance-machinery description both correctly attribute the df to the OLS regression. Same wording fix applied to the matching TODO row.
2. By intentionally skipping R parity for heterogeneity `p_value` and `conf_int` (Python Z vs R t deviation), the suite no longer covers a regression isolated to those fields' extraction or `_refresh_path_inference` ordering. Backfill that gap with a Python-side invariant: assert that stored `t_stat` / `p_value` / `conf_int` equal `safe_inference(beta, se, df=None)` on every populated horizon. Two tests:
   - `TestHeterogeneityTesting::test_heterogeneity_inference_matches_safe_inference` (global surface, all populated horizons)
   - `TestByPathHeterogeneity::test_per_path_heterogeneity_inference_matches_safe_inference` (by_path surface, all populated (path, horizon) entries)

Both close the regression-coverage gap without re-introducing the Z-vs-t structural mismatch in R parity. No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
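The invariant the two backfill tests assert can be sketched as follows. `safe_inference` here is a stand-in reimplementation of the project's helper on its `df=None` (normal-distribution) branch, and the effects shape and helper names are illustrative:

```python
from statistics import NormalDist

_Z = NormalDist()


def safe_inference(beta, se, df=None):
    """Stand-in for the project's helper: df=None means Z-based inference."""
    t_stat = beta / se
    p_value = 2.0 * (1.0 - _Z.cdf(abs(t_stat)))
    crit = _Z.inv_cdf(0.975)  # ~1.96 for a 95% CI
    return t_stat, p_value, (beta - crit * se, beta + crit * se)


def check_inference_invariant(effects):
    """Assert stored t_stat/p_value/conf_int were produced by
    safe_inference(beta, se, df=None) on every populated horizon.
    `effects` is assumed to be {horizon: row-dict}; shape illustrative."""
    for h, row in effects.items():
        t, p, ci = safe_inference(row["beta"], row["se"])
        assert abs(row["t_stat"] - t) < 1e-12, f"t_stat drift at h={h}"
        assert abs(row["p_value"] - p) < 1e-12, f"p_value drift at h={h}"
        assert abs(row["conf_int"][0] - ci[0]) < 1e-12
        assert abs(row["conf_int"][1] - ci[1]) < 1e-12
```

This covers a regression isolated to those fields' extraction or refresh ordering without re-introducing the Z-vs-t mismatch into R parity.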
---
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
---
…aces

R1 review caught three residual "WLS" references in the heterogeneity deviation paragraphs that escaped the first R1 commit:

1. REGISTRY R-parity note still attributed the SE tolerance to "small WLS denominator-and-cohort-recentering numerical drift". The non-survey path is OLS; the recentering observation applies to the OLS denominator. Adjust the wording.
2. The parity-test inline comment still said "uses t-distribution with df = n - k from the WLS regression". Same fix.
3. The TODO row still said "Either thread the WLS df into safe_inference". Same fix.

The legitimate "uses WLS" reference in the heterogeneity Note (under the "Note (survey support)" sub-paragraph) is intentionally retained: the SURVEY path IS WLS; only the non-survey heterogeneity branch is OLS. The remaining WLS occurrence at REGISTRY.md:1271 is in an unrelated estimator's note. No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
---
R2 review caught one more stale "WLS" reference in the result-class
docstring: `path_heterogeneity_effects` described `beta` as a WLS
coefficient, but the non-survey path is plain OLS (only the survey
branch is WLS). Clarify that `beta` is OLS on the non-survey path
and WLS-on-pweights only under `survey_design`.
The global `heterogeneity_effects` docstring is already generic
("Per-horizon heterogeneity test results") and needs no change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
---
The R1 wording fix replaced "WLS" with "plain OLS vcov" in the non-survey heterogeneity description, but `solve_ols(..., return_vcov=True)` defaults to `vcov_type="hc1"` (linalg.py:464), so the actual non-survey variance is HC1 heteroskedasticity-robust, not classical OLS. Adjust the registry wording to "HC1-robust OLS vcov for non-survey via solve_ols(..., return_vcov=True) (vcov_type=\"hc1\" default)".

The `beta` field is still plain OLS (the point estimate solves the normal equations); only the vcov/SE side uses HC1. The result-class docstring's "plain OLS on the non-survey path" wording refers to the point estimate and is correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
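The distinction being documented (plain-OLS point estimate, HC1-robust vcov) can be sketched numerically. `ols_with_hc1` is an illustrative stand-in, not the project's `solve_ols`; it shows that `beta` is identical either way while the two vcov matrices differ:

```python
import numpy as np


def ols_with_hc1(X, y):
    """OLS point estimate via the normal equations, plus both an
    HC1 sandwich vcov and the classical homoskedastic vcov for contrast."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y            # same beta under classical or HC1
    resid = y - X @ beta
    # HC1: sandwich with squared residuals in the meat, n/(n-k) scaling
    meat = X.T @ (X * resid[:, None] ** 2)
    vcov_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
    # classical OLS vcov: sigma^2 * (X'X)^{-1}
    vcov_classical = (resid @ resid) / (n - k) * XtX_inv
    return beta, vcov_hc1, vcov_classical
```

Under heteroskedasticity the HC1 and classical SEs diverge, which is why the registry wording needs to name HC1 explicitly rather than say "plain OLS vcov".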
---
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary
- Methodology: No findings. The diff does not change estimator math, weighting, variance/SE computation, or defaults; the updated registry/TODO text is consistent with the current implementation and with the project's documented handling of the heterogeneity R divergence.
- Code Quality: No findings.
- Performance: No findings.
- Maintainability: No findings.
- Tech Debt: No findings. The newly added deferred-work entry is correctly tracked in
- Security: No findings.
- Documentation/Tests: No findings. The registry schema list now matches

---
Summary
A holistic codex review of the merged #412 + cleanup #422 state surfaced three documentation/test gaps that the per-PR cleanup review path could not see (it only scopes to the cleanup diff). All three are at-most P3 in severity but each is real claim-vs-coverage drift.
REGISTRY schema list omits `het_*` columns — the top-level `Note (Phase 3 by_path ...)` `to_dataframe(level="by_path")` schema list still describes the pre-heterogeneity column set. Add the six `het_*` columns the `_to_dataframe` path has always emitted since Phase 5.
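A minimal guard for the documented schema could look like this sketch; the six column names come from the registry fix, while the helper name and frame construction are illustrative:

```python
import pandas as pd

# The six het_* columns the registry's by_path schema list should document.
HET_COLUMNS = [
    "het_beta", "het_se", "het_t_stat",
    "het_p_value", "het_conf_int_lower", "het_conf_int_upper",
]


def check_by_path_schema(df: pd.DataFrame) -> None:
    """Assert the by_path frame carries every documented het_* column;
    `df` is assumed to be to_dataframe(level="by_path") output."""
    missing = [c for c in HET_COLUMNS if c not in df.columns]
    assert not missing, f"by_path schema missing het_* columns: {missing}"
```

A guard like this keeps the registry contract and the `_to_dataframe` emission from drifting apart again.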
Parity tests under-asserted golden payload — `TestDCDHDynRParityHeterogeneity` and `TestDCDHDynRParityByPathHeterogeneity` assert only `beta` and `se`, leaving `t_stat`, `p_value`, `conf_int`, and `n_obs` unpinned. A regression in the inference extraction or final-`df_survey` refresh could ship while parity still passes. Pin `t_stat` at `SE_RTOL` (`t = beta / se` is invariant to the Wald-test critical-value distribution) and `n_obs` exactly.
Real undocumented Z-vs-t structural deviation — while extending the parity assertions, surfaced a real Python-vs-R deviation that was undocumented: `_compute_heterogeneity_test` passes `df=None` to `safe_inference`, so Python uses the normal Z critical value (~1.96) for `p_value` and `conf_int`. R `did_multiplegt_dyn(..., predict_het)` uses the t-distribution with df = n - k from the WLS regression. The structural gap produces ~0.1-2% rtol gaps on CIs and p-values that exceed `SE_RTOL` (verified empirically: CI gap ~0.17% on h=1). Document the deviation in the heterogeneity R-parity Note; pin only `beta`, `se`, `t_stat`, `n_obs` in the parity tests; `p_value` / `conf_int` parity intentionally skipped. Add a TODO row tracking the optional df-threading work.
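The magnitude of the Z-vs-t gap can be reproduced from the critical values alone. This sketch assumes SciPy is available for the t quantile; the df values are illustrative, not the fixture's actual n - k:

```python
from statistics import NormalDist
from scipy import stats  # used only for the t-distribution quantile

# What safe_inference(beta, se, df=None) effectively uses: the Z critical value.
Z_CRIT = NormalDist().inv_cdf(0.975)  # ~1.96


def rel_gap(df: int) -> float:
    """Relative CI half-width gap from using Z where R uses t_{n-k}."""
    return (stats.t.ppf(0.975, df) - Z_CRIT) / Z_CRIT


# The gap shrinks as df grows but stays in the rough 0.1-2% band
# for moderate regression sizes, which is why it exceeds SE_RTOL.
for df in (50, 200, 1000):
    print(f"df={df:5d}: Z-vs-t CI gap = {rel_gap(df):.4%}")
```

Because the gap is structural rather than numerical noise, documenting it and skipping `p_value`/`conf_int` parity is cleaner than loosening the tolerance.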
No estimator behavior, weighting, variance/SE computation, or default-statistical surface changed; documentation accuracy plus expanded regression coverage only.
Methodology references
Test plan
🤖 Generated with Claude Code