Skip to content

dCDH by_path Wave 5 #11: + heterogeneity (predict_het per-by_level)#412

Merged
igerber merged 4 commits into
mainfrom
dcdh-by-path-heterogeneity
May 11, 2026
Merged

dCDH by_path Wave 5 #11: + heterogeneity (predict_het per-by_level)#412
igerber merged 4 commits into
mainfrom
dcdh-by-path-heterogeneity

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 10, 2026

Summary

  • Lift gate at chaisemartin_dhaultfoeuille.py:1230-1234 so per-path event-study disaggregation composes with heterogeneity="<col>".
  • New _compute_path_heterogeneity_test helper + path_groups: Optional[Set[int]] filter on _compute_heterogeneity_test. Per-path beta / SE / inference computed on the path-restricted switcher subsample; cohort dummies absorb baseline by construction (no R-divergence warning needed).
  • Result schema: new path_heterogeneity_effects: Dict[Tuple[int, ...], Dict[int, Dict[str, Any]]] field; to_dataframe(level="by_path") extended with always-present het_* columns (NaN for placebo rows / when not requested).
  • R parity: introduces the FIRST predict_het parity baseline in the repo. Two new R generator scenarios (multi_path_reversible_predict_het global anchor + multi_path_reversible_by_path_predict_het per-path) using dont_drop_larger_lower=TRUE to match Python's drop_larger_lower=False requirement under reversal paths. Per-path beta / SE match R within rtol=1e-6.
  • Inference refresh hooks into the final R2 P1b block (per feedback_late_if_site_inference_refresh.md) so all per-path entries see the final df_survey after replicate-weight n_valid appends.
  • Multiplier bootstrap (n_bootstrap > 0) under by_path + heterogeneity + survey_design inherits the existing per-path multiplier-bootstrap-survey gate from PR Compose by_path / paths_of_interest with survey_design (Wave 4 #10) #408.

Methodology references (required if estimator / math changes)

  • Method name(s): de Chaisemartin & D'Haultfœuille reversible-treatment heterogeneity test (Web Appendix Section 1.5, Lemma 7) under per-path event-study disaggregation
  • Paper / source link(s): de Chaisemartin & D'Haultfœuille (2024) "Difference-in-Differences Estimators of Intertemporal Treatment Effects" — Web Appendix Section 1.5 (heterogeneity test); R did_multiplegt_dyn(..., by_path, predict_het) per-by_level dispatch (R/R/did_multiplegt_dyn.R:226-257)
  • Any intentional deviations from the source (and why): None on the per-path methodology — Python mirrors R's per-path predict_het exactly. The inherited cross-path cohort-sharing SE deviation from R for the OVERALL path_effects SE is unchanged and does not apply here (heterogeneity uses standard WLS coefficient IF, not the cohort-recentered IF allocator). R's predict_het dont_drop_larger_lower=TRUE flag is matched in fixture scenarios so reversal paths preserve cohort variation under heterogeneity testing.

Validation

  • Tests added/updated:
    • tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity — 12 tests across gate dispatch (5), behavior (4 incl. single-path telescope at atol=rtol=1e-14, zero-signal anti-regression, multi-baseline UserWarning anti-regression), DataFrame integration (1), edge cases (2)
    • tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneity — global anchor (FIRST predict_het parity baseline)
    • tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathHeterogeneity — per-path
    • benchmarks/R/generate_dcdh_dynr_test_values.R — 2 new scenarios (Add comprehensive test coverage for utils module #20 + Update roadmap with current implementation limitations #21), 2 new extraction helpers (extract_dcdh_predict_het, extract_dcdh_by_path_predict_het)
    • benchmarks/data/dcdh_dynr_golden_values.json regenerated (21 scenarios, 619 KB)
  • Backtest / simulation / notebook evidence (if applicable): N/A (additive estimator extension; full R parity in fixture suite)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

PR Review Report

Overall Assessment

⚠️ Needs changes: one unmitigated P1 finding from the required sibling-surface mirror audit.

Executive Summary

  • Affected method: ChaisemartinDHaultfoeuille dCDH reversible-treatment heterogeneity test, Web Appendix Section 1.5 / Lemma 7, under by_path and paths_of_interest.
  • Methodology registry now documents the per-path heterogeneity contract and the R-parity scope; I did not find an undocumented math or SE deviation in the estimator path.
  • Inference uses safe_inference() on the new path; pattern-wide grep found no new inline t_stat = effect / se anti-pattern.
  • P1: the new path_heterogeneity_effects surface is wired into the result dataclass and to_dataframe(level="by_path"), but not into summary(), unlike sibling per-path surfaces and global heterogeneity.
  • Could not run tests: pytest is not installed in this environment.

Methodology

Finding M-1: P3 Informational — documented partial/deviation surface, no action required

Severity: P3
Impact: The global heterogeneity implementation remains a documented partial predict_het port: post-treatment regressions only, no placebo regressions or joint null test. This is explicitly documented in docs/methodology/REGISTRY.md:L637, so it is not a defect under the review rules. The new per-path extension is also documented in the by-path note, including path-restricted switchers, unchanged variance machinery, survey composition, final df_survey refresh, and result schema at docs/methodology/REGISTRY.md:L643.
Concrete fix: None required.

Code Quality

Finding CQ-1: P1 — summary() does not render the new per-path heterogeneity surface

Severity: P1
Impact: The PR adds results.path_heterogeneity_effects and exposes it in to_dataframe(level="by_path"), but the sibling report surface is missing. Global heterogeneity has a dedicated summary renderer at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1253-L1283, and existing per-path sibling surfaces render placebos, cumulated effects, and sup-t bands inside _render_path_effects_section at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1308-L1431. The new heterogeneity field is only consumed by the DataFrame path at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1823-L1906. This violates the required sibling-surface mirror audit (schema ↔ renderer, summary ↔ full) and makes summary() silently omit a computed public result.
Concrete fix: Extend _render_path_effects_section() to read self.path_heterogeneity_effects.get(path, {}) and render a per-path heterogeneity sub-block with beta, se, t_stat, and p_value, mirroring _render_heterogeneity_section(). Add a regression test asserting res.summary() contains the per-path heterogeneity block and path/horizon rows when by_path + heterogeneity is fit.

Performance

No P0/P1/P2 findings.

Maintainability

Finding MA-1: P1 — same as CQ-1

Severity: P1
Impact: The result schema and renderer are now out of sync for one public surface, increasing the chance future users or downstream tests treat summary() as complete when it is not.
Concrete fix: Same as CQ-1.

Tech Debt

No mitigating TODO entry found under TODO.md “Tech Debt from Code Reviews” for the missing summary surface. The P1 remains unmitigated.

Security

No findings.

Documentation/Tests

Finding DT-1: P2 — summary coverage missing for the new public surface

Severity: P2
Impact: The new tests cover gate dispatch, behavior, DataFrame integration, and R parity, but do not cover summary() for path_heterogeneity_effects. Existing sibling features have summary tests, so this gap allowed CQ-1 through.
Concrete fix: Add a TestByPathHeterogeneity::test_per_path_heterogeneity_renders_in_summary test.

Verification note: I could not run the targeted tests because pytest is not installed (/bin/bash: pytest: command not found). I did run static greps for the inference anti-pattern and sibling surfaces.

Path to Approval

  1. Add summary rendering for path_heterogeneity_effects inside _render_path_effects_section().
  2. Add a regression test that fits by_path + heterogeneity and asserts summary() includes the per-path heterogeneity rows.
  3. Run the new test plus the existing TestByPathHeterogeneity and dCDH parity heterogeneity tests in an environment with pytest installed.

igerber added a commit that referenced this pull request May 10, 2026
Extends `_render_path_effects_section` in
chaisemartin_dhaultfoeuille_results.py to render a per-path
"Heterogeneity Test (Section 1.5, partial)" sub-block when
`path_heterogeneity_effects` is populated for the path. Mirrors the
global `_render_heterogeneity_section` block scoped to each per-path
section, alongside the existing per-path placebo / cumulated /
sup-t blocks.

Adds anti-regression test
`TestByPathHeterogeneity::test_per_path_heterogeneity_renders_in_summary`
that asserts both the section header and at least one rendered beta
value appear in `summary()` output when `by_path + heterogeneity` is
fit. Locks the schema↔renderer mirror so future per-path surfaces
don't silently omit from `summary()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f7e3d5bec9fb80aadcb80f66f0fb4a408514077


Overall Assessment

Looks good — no unmitigated P0 or P1 findings in this re-review.

Executive Summary

  • Prior P1 is resolved: summary() now renders per-path heterogeneity rows at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1421-L1443, with a regression test at tests/test_chaisemartin_dhaultfoeuille.py:L10331-L10370.
  • Affected method: dCDH reversible-treatment heterogeneity test, Web Appendix Section 1.5 / Lemma 7, under by_path / paths_of_interest.
  • Methodology is documented in docs/methodology/REGISTRY.md:L637 and docs/methodology/REGISTRY.md:L643; I found no undocumented estimator, weighting, or SE deviation.
  • New inference paths use safe_inference() and preserve NaN-consistent inference fields.
  • Completed sibling-surface, pattern-wide grep, and by_path ↔ paths_of_interest symmetry audits. No workflow paths: changes were in scope.
  • I could not run tests locally because pytest is unavailable in this environment.

Methodology

M-1: P3 Informational — documented partial R predict_het surface

Severity: P3
Impact: The global heterogeneity implementation remains a documented partial port: post-treatment regressions only, without R’s placebo regressions or joint null test. This is explicitly documented as a deviation in docs/methodology/REGISTRY.md:L637, so it is not a defect under the review rules. The new per-path extension is also documented, including path-restricted switchers, variance machinery, survey composition, df refresh, and result schema at docs/methodology/REGISTRY.md:L643.
Concrete fix: None required.

No P0/P1 methodology findings. The new per-path helper restricts eligibility through path_groups inside _compute_heterogeneity_test at diff_diff/chaisemartin_dhaultfoeuille.py:L5025-L5027, and dispatches per selected path at diff_diff/chaisemartin_dhaultfoeuille.py:L6459-L6535.

Code Quality

No P0/P1/P2 findings.

The prior missing renderer is fixed in the sibling report surface at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1421-L1443. Pattern-wide grep found no new inline t_stat = effect / se inference anti-pattern in the changed dCDH path; the new and refreshed fields route through safe_inference() at diff_diff/chaisemartin_dhaultfoeuille.py:L5106-L5223 and diff_diff/chaisemartin_dhaultfoeuille.py:L4058-L4070.

Performance

No findings.

Maintainability

No P0/P1/P2 findings.

Sibling surfaces are now aligned: result schema at diff_diff/chaisemartin_dhaultfoeuille_results.py:L597-L607, summary rendering at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1421-L1443, and to_dataframe(level="by_path") het_* columns at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1781-L1932.

Tech Debt

No untracked blocking tech debt found for this PR. Existing TODO.md review debt entries appear unrelated to the new per-path heterogeneity surface.

Security

No findings. A lightweight secret-pattern scan over the changed text did not find candidate secrets.

Documentation/Tests

DT-1: P3 Informational — local test execution unavailable

Severity: P3
Impact: I could not execute the targeted tests because this environment lacks pytest (/usr/bin/python: No module named pytest). Static review shows coverage was added for gates, behavior, DataFrame output, summary rendering, and R parity at tests/test_chaisemartin_dhaultfoeuille.py:L9977-L10370 and tests/test_chaisemartin_dhaultfoeuille_parity.py:L1324-L1472.
Concrete fix: Run the targeted dCDH heterogeneity and parity tests in CI or a full dev environment with test dependencies installed.

@igerber igerber added ready-for-ci Triggers CI test workflows and removed ready-for-ci Triggers CI test workflows labels May 10, 2026
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 10, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6c8a68ca40f85d2b2cbc6155e5ad0c9e5149de0f


Overall Assessment

Looks good — no unmitigated P0/P1 findings in this re-review.

Executive Summary

  • Prior P1 is resolved: summary() now renders per-path heterogeneity rows at diff_diff/chaisemartin_dhaultfoeuille_results.py:L1421-L1443, with coverage at tests/test_chaisemartin_dhaultfoeuille.py:L10534-L10572.
  • Affected method: dCDH reversible-treatment heterogeneity test, Web Appendix Section 1.5 / Lemma 7, now under by_path and paths_of_interest.
  • Methodology is documented in docs/methodology/REGISTRY.md:L637 and docs/methodology/REGISTRY.md:L643; I found no undocumented estimator, weighting, or SE deviation.
  • New inference paths use safe_inference() and refresh replicate-weight df for per-path heterogeneity at diff_diff/chaisemartin_dhaultfoeuille.py:L4058-L4070.
  • Completed sibling-surface, pattern-wide grep, by_path ↔ paths_of_interest symmetry, and workflow-scope audits. No GH Actions workflow paths: changes are in scope.
  • I could not run tests locally because this environment lacks numpy, pandas, and pytest.

Methodology

M-1: P3 Informational — documented partial R predict_het surface

Severity: P3
Impact: The global heterogeneity implementation remains a documented partial port: post-treatment regressions only, without R’s placebo regressions or joint null test. This is explicitly documented in docs/methodology/REGISTRY.md:L637, so it is not a defect. The new per-path extension is documented at docs/methodology/REGISTRY.md:L643.
Concrete fix: None required.

No P0/P1 methodology findings. The path restriction is implemented through path_groups in _compute_heterogeneity_test() at diff_diff/chaisemartin_dhaultfoeuille.py:L5025-L5027 and dispatched by _compute_path_heterogeneity_test() at diff_diff/chaisemartin_dhaultfoeuille.py:L6459-L6535.

Code Quality

No P0/P1/P2 findings.

The new result schema is wired into the result dataclass at diff_diff/chaisemartin_dhaultfoeuille_results.py:L597-L607, summary rendering at L1421-L1443, and to_dataframe(level="by_path") at L1819-L1937.

Performance

No findings.

Maintainability

No P0/P1/P2 findings.

Sibling surfaces are aligned: estimator output, summary rendering, DataFrame columns, docs, registry, and parity tests all reference the same {path: {l: ...}} path_heterogeneity_effects contract.

Tech Debt

No untracked blocking tech debt found for this PR. Existing TODO/registry limitations are unrelated or explicitly documented.

Security

No findings. A lightweight secret-pattern scan over the changed files found no candidate secrets.

Documentation/Tests

DT-1: P3 Informational — local test execution unavailable

Severity: P3
Impact: I could not execute the targeted tests because this environment lacks required test dependencies. Static review shows coverage for gates, behavior, summary, DataFrame output, survey composition, and R parity.
Concrete fix: Run the targeted dCDH heterogeneity and parity tests in CI or a full dev environment.

DT-2: P3 Informational — multi-baseline no-warning test is not actually multi-baseline

Severity: P3
Impact: tests/test_chaisemartin_dhaultfoeuille.py:L10276-L10302 claims to cover the multi-baseline no-warning anti-regression, but _by_path_het_data() uses paths that all start from baseline 0. This does not block approval because the behavior is documented and the methodology path is not silently wrong.
Concrete fix: Add a small fixture with switchers on at least two baseline values and assert no multi-baseline warning under by_path + heterogeneity.

igerber and others added 3 commits May 10, 2026 20:55
Lifts the gate at chaisemartin_dhaultfoeuille.py:1230-1234 so per-path
event-study disaggregation composes with heterogeneity="<col>" (Web
Appendix Section 1.5, Lemma 7), mirroring R did_multiplegt_dyn(...,
by_path, predict_het) per-by_level dispatch.

Per-path heterogeneity is computed by re-running the Lemma 7 regression
on each path-restricted switcher subsample. New `path_groups`
(Optional[Set[int]]) parameter on _compute_heterogeneity_test restricts
eligibility to switchers ON path p; the variance machinery (standard
WLS vcov for non-survey, cell-period IF allocator for Binder TSL,
group-level allocator for Rao-Wu replicate) is unchanged from the
global heterogeneity path. Cohort dummies absorb baseline by
construction, so multi-baseline switcher panels do not produce
R-divergence (no parallel UserWarning like controls / trends_linear).

Surfaces on results.path_heterogeneity_effects keyed
{path: {l: {beta, se, t_stat, p_value, conf_int, n_obs}}} and on
to_dataframe(level="by_path") via new always-present het_* columns,
populated for positive horizons and NaN otherwise (mirrors cband_* /
cumulated_* convention). Per-(path, horizon) inference is refreshed
in the final R2 P1b block so all surfaces use the same df_survey
after replicate-weight n_valid appends.

R parity: introduces the FIRST predict_het R-parity baseline in the
repo. Two new scenarios (multi_path_reversible_predict_het global
anchor + multi_path_reversible_by_path_predict_het per-path) use
dont_drop_larger_lower=TRUE to match drop_larger_lower=False and
provide cohort variation under reversal paths. Per-path beta and SE
match R within rtol=1e-6.

Multiplier bootstrap (n_bootstrap > 0) under by_path + heterogeneity
+ survey_design inherits the existing per-path multiplier-bootstrap
gate from PR #408.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends `_render_path_effects_section` in
chaisemartin_dhaultfoeuille_results.py to render a per-path
"Heterogeneity Test (Section 1.5, partial)" sub-block when
`path_heterogeneity_effects` is populated for the path. Mirrors the
global `_render_heterogeneity_section` block scoped to each per-path
section, alongside the existing per-path placebo / cumulated /
sup-t blocks.

Adds anti-regression test
`TestByPathHeterogeneity::test_per_path_heterogeneity_renders_in_summary`
that asserts both the section header and at least one rendered beta
value appear in `summary()` output when `by_path + heterogeneity` is
fit. Locks the schema↔renderer mirror so future per-path surfaces
don't silently omit from `summary()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 3 slow regression tests for `by_path + heterogeneity +
survey_design` composition that the original PR claimed via
REGISTRY/CHANGELOG but lacked automated coverage:

1. `test_per_path_heterogeneity_under_survey_finite` — analytical
   Binder TSL SE finite per (path, l) under PSU=group survey design.
2. `test_per_path_heterogeneity_replicate_weights_propagates_n_valid`
   — BRR replicate-weight fit drives `_replicate_n_valid_list`
   through per-(path, l) heterogeneity calls and final
   `survey_metadata.df_survey == n_replicates - 1`.
3. `test_survey_design_plus_n_bootstrap_with_heterogeneity_still_raises`
   — confirms heterogeneity composition does not accidentally re-route
   around the existing per-path multiplier-bootstrap-survey gate.

Also extends `to_dataframe(level="by_path")` docstring at
`chaisemartin_dhaultfoeuille_results.py:1527` to list the new `het_*`
columns alongside `cband_*` and `cumulated_*`, mirroring the
always-present NaN-when-None convention. The empty-DataFrame schema
already includes them; this is a docstring-only sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the dcdh-by-path-heterogeneity branch from 6c8a68c to a57f583 Compare May 11, 2026 00:55
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 11, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a57f5834601f5b31e7561a7db1e45094acd3323b


Overall Assessment

⚠️ Needs changes — one unmitigated P2 claim-vs-test mismatch remains.

Executive Summary

  • Affected method: ChaisemartinDHaultfoeuille dCDH heterogeneity test (Web Appendix Section 1.5 / Lemma 7), newly composed with by_path / paths_of_interest.
  • Methodology implementation is broadly consistent with the visible REGISTRY contract: per-path heterogeneity filters switchers by selected path and reuses the existing Lemma 7 regression / inference path.
  • Prior summary-rendering concern appears resolved: summary() now renders per-path heterogeneity rows.
  • Public surfaces are mostly wired: result dataclass field, summary(), and to_dataframe(level="by_path") all expose path_heterogeneity_effects / het_*.
  • Remaining blocker is a test-claim mismatch: the PR/CHANGELOG/REGISTRY claim multi-baseline per-path heterogeneity behavior is regression-tested, but the visible “multi-baseline” test fixture uses only baseline-0 switcher paths.

Methodology

M-1 — P3 Informational: documented partial heterogeneity implementation

Severity: P3
Impact: The dCDH heterogeneity test remains a partial implementation relative to R / the paper: post-treatment regressions only, no placebo regressions or joint null test. This is explicitly documented in the Methodology Registry under the dCDH heterogeneity notes, so it is not a defect.
Concrete fix: None required.

M-2 — No unmitigated methodology mismatch found in implementation

Severity: N/A
Impact: The new implementation path matches the visible REGISTRY claim for per-path heterogeneity: _compute_path_heterogeneity_test() enumerates selected paths and calls _compute_heterogeneity_test(..., path_groups=...), while _compute_heterogeneity_test() restricts eligible switchers via if path_groups is not None and g not in path_groups: continue (diff_diff/chaisemartin_dhaultfoeuille.py:_compute_heterogeneity_test, _compute_path_heterogeneity_test). This is consistent with the documented “path-restricted switcher subsample” contract.
Concrete fix: None.

Code Quality

No P0/P1/P2 code-quality findings in the visible changed source.

The new field is threaded through:

  • Fit result construction: diff_diff/chaisemartin_dhaultfoeuille.py near path_heterogeneity_effects=...
  • Result dataclass: diff_diff/chaisemartin_dhaultfoeuille_results.py near path_heterogeneity_effects
  • Summary rendering: ChaisemartinDHaultfoeuilleResults._render_path_effects_section
  • DataFrame rendering: ChaisemartinDHaultfoeuilleResults.to_dataframe(level="by_path")

Performance

No findings.

The path heterogeneity helper re-enumerates treatment paths after path_effects already did so. This is documented in-code and is likely acceptable given the feature scope; no performance blocker.

Maintainability

No unmitigated maintainability findings.

The schema shape {path: {l: {...}}} is documented in source docstrings and handled separately from path_effects[path]["horizons"], which avoids overloading the existing path-effect schema.

Tech Debt

No untracked blocking tech debt found in the visible context.

Security

No findings in the visible diff. I did not run a secret scanner or shell grep; this is based only on the loaded prompt content.

Documentation/Tests

DT-1 — P2: claimed multi-baseline behavior is not actually tested

Severity: P2
Impact: The PR/CHANGELOG/REGISTRY explicitly claim that per-path heterogeneity on multi-baseline switcher panels does not produce the R-divergence warning used for controls / trends_linear, because cohort dummies absorb baseline. The visible test named test_per_path_heterogeneity_no_multi_baseline_warning does not actually construct a multi-baseline switcher panel: _by_path_het_data() defines paths [(0, 1, 1, 1), (0, 1, 0, 0), (0, 1, 1, 0)], all starting from baseline 0 (tests/test_chaisemartin_dhaultfoeuille.py:_by_path_het_data). As a result, the shipped test does not cover the explicitly claimed multi-baseline behavior.

This is a claim-vs-test mismatch under the single-pass audit rules. No TODO mitigation is visible in the prompt, and shipped-behavior test gaps are P2.

Concrete fix: Add or correct a test with at least two switcher baseline values, e.g.:

  • Include baseline-0 joiner paths such as (0, 1, 1, 1).
  • Include baseline-1 leaver paths such as (1, 0, 0, 0) or another valid reversible path beginning with 1.
  • Fit with ChaisemartinDHaultfoeuille(drop_larger_lower=False, by_path=2) or paths_of_interest=[(0,1,1,1), (1,0,0,0)], L_max=3, heterogeneity="het_x".
  • Assert:
    1. res.path_heterogeneity_effects contains both selected paths or the expected observed subset.
    2. No warning matching the controls/trends-linear multi-baseline divergence text is emitted.
    3. At least one per-path heterogeneity entry is finite where the design has sufficient observations.

Path to Approval

  1. Add a real multi-baseline per-path heterogeneity test as described in DT-1, using switcher paths with different first elements / baseline treatment values.
  2. Ensure the new test filters out unrelated expected warnings, but fails on any warning containing the multi-baseline R-divergence language used by by_path + controls / by_path + trends_linear.
  3. Re-run the targeted dCDH heterogeneity tests and parity tests in CI.

The prior `test_per_path_heterogeneity_no_multi_baseline_warning`
used `_by_path_het_data` whose paths all start with `D=0`, so it
never exercised the multi-baseline switcher panel regime claimed in
REGISTRY ("cohort dummies absorb baseline by construction"). The
new CI reviewer flagged this as a P2 claim-vs-test mismatch.

Replaces the test with a TRUE multi-baseline DGP via new
`_multi_baseline_het_data` static fixture: 60 joiners (D_{g,1}=0,
path (0,1,1,1)) + 60 leavers (D_{g,1}=1, path (1,0,0,0)), F_g in
{3,4,5} for both baselines. Selected via
`paths_of_interest=[(0,1,1,1), (1,0,0,0)]`. Test now asserts:

1. Fixture sanity: switcher baselines actually span {0, 1}.
2. Both selected paths populate `path_heterogeneity_effects`.
3. Each path has ≥1 finite (path, l) entry — confirms the regression
   is non-degenerate under multi-baseline switchers.
4. No multi-baseline UserWarning emitted (cohort dummies absorb).
5. No verbatim `by_path / paths_of_interest + controls/trends_linear`
   divergence warning emitted (strict text-fragment check).

Empirically verified: this fixture produces finite beta/SE on every
in-window horizon for both paths and emits zero baseline warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 11, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cdbc6bb85f445d8b115de093d6a36bff1f5bb183


Overall Assessment

Looks good — no unmitigated P0/P1/P2 findings in the visible context.

Executive Summary

  • Affected method: ChaisemartinDHaultfoeuille dCDH heterogeneity test (Web Appendix Section 1.5 / Lemma 7), newly composed with by_path / paths_of_interest.
  • The prior P2 test-gap finding appears resolved: the new test_per_path_heterogeneity_no_multi_baseline_warning fixture now uses both baseline-0 joiners and baseline-1 leavers.
  • Implementation matches the visible Methodology Registry contract: per-path heterogeneity reuses the existing Lemma 7 regression with a path-restricted switcher filter, and public result/rendering surfaces are wired.
  • Claim-vs-shipped surfaces are covered in the visible diff: implementation, public docstrings, summary(), to_dataframe(level="by_path"), R parity tests, survey/replicate tests, and no-warning multi-baseline test are all present.
  • No security issues or accidental secrets are visible in the loaded diff. I did not run shell scans or inspect files outside the prompt.

Methodology

M-1 — P3 Informational: documented partial heterogeneity implementation

Severity: P3
Impact: The heterogeneity test remains a partial implementation relative to the full R/paper surface: post-treatment regressions are implemented, while placebo regressions and a joint null test are not. This is explicitly documented in the Methodology Registry under dCDH heterogeneity testing, so it is not a defect.
Concrete fix: None required.

M-2 — Previous P2 resolved: real multi-baseline behavior is now tested

Severity: Resolved / N/A
Impact: The previous review flagged that the “multi-baseline” no-warning test only used baseline-0 switcher paths. The updated test now constructs a true multi-baseline panel: joiners with baseline 0 and path (0, 1, 1, 1), plus leavers with baseline 1 and path (1, 0, 0, 0) (tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity._multi_baseline_het_data, test_per_path_heterogeneity_no_multi_baseline_warning). It also asserts both paths surface finite heterogeneity entries and no multi-baseline warning is emitted.
Concrete fix: None required.

M-3 — Methodology implementation matches visible Registry contract

Severity: N/A
Impact: The Registry’s new per-path heterogeneity contract says the path filter restricts eligible switchers inside the Lemma 7 regression while leaving the variance machinery unchanged. The implementation does this by adding path_groups to _compute_heterogeneity_test() and skipping groups not in that set (diff_diff/chaisemartin_dhaultfoeuille.py::_compute_heterogeneity_test), with _compute_path_heterogeneity_test() enumerating paths and calling the same heterogeneity helper per selected path. This is consistent with the documented “path-restricted switcher subsample” semantics.
Concrete fix: None required.


Code Quality

No P0/P1/P2 code-quality findings in the visible changed source.

Notable positive checks:

  • The old NotImplementedError gate for by_path / paths_of_interest + heterogeneity was removed while retaining gates for design2 and honest_did (diff_diff/chaisemartin_dhaultfoeuille.py:fit).
  • The new path_groups filter is localized and does not duplicate the Lemma 7 regression logic.
  • The final replicate-weight df refresh includes the new {path: {l: ...}} schema separately, avoiding accidental misuse of the existing path-effect schema refresher.

Performance

No blocking performance findings.

P3 Informational: path enumeration is repeated for heterogeneity

Severity: P3
Impact: _compute_path_heterogeneity_test() re-enumerates treatment paths even though _compute_path_effects() already did so earlier in fit(). The helper documents this and suppresses duplicate warnings. Given the feature scope and path enumeration cost relative to estimation/bootstrap work, this is acceptable.
Concrete fix: Optional future refactor: pass the already-computed selected paths and masks into _compute_path_heterogeneity_test() to avoid repeated enumeration.


Maintainability

No unmitigated maintainability findings.

The new result schema is documented consistently:

  • ChaisemartinDHaultfoeuilleResults.path_heterogeneity_effects dataclass field and class docstring (diff_diff/chaisemartin_dhaultfoeuille_results.py).
  • Public estimator docstrings for by_path, paths_of_interest, and heterogeneity (diff_diff/chaisemartin_dhaultfoeuille.py).
  • Registry and API docs updated (docs/methodology/REGISTRY.md, docs/api/chaisemartin_dhaultfoeuille.rst).
  • Rendering surfaces added in summary() and to_dataframe(level="by_path").

Tech Debt

No untracked blocking tech debt found in the visible context.

P3 Informational: existing documented heterogeneity limitations remain

Severity: P3
Impact: The dCDH heterogeneity feature is still explicitly documented as partial: no placebo heterogeneity and no joint null test. This limitation is already captured in the Registry with a Note/Deviation-style explanation and is not introduced as an undocumented silent behavior here.
Concrete fix: None required for this PR.


Security

No findings in the visible diff.

I did not run a secret scanner or shell grep; this assessment is limited to the prompt-provided diff and full changed source files.


Documentation/Tests

No unmitigated documentation/test findings.

Claim-vs-shipped audit based on visible context:

  • Implementation: Present via _compute_path_heterogeneity_test() and path_groups filtering in _compute_heterogeneity_test().
  • Tests: Present for gate removal, mutex preservation, finite signal, single-path telescope, zero-signal anti-regression, true multi-baseline no-warning behavior, survey TSL, replicate-weight df propagation, survey+bootstrap gate, DataFrame integration, summary rendering, unobserved path omission, and R parity (tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity, tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneity, ::TestDCDHDynRParityByPathHeterogeneity).
  • Public docstrings: Updated in estimator and result dataclass docstrings.
  • Rendering surfaces: summary() now renders per-path heterogeneity rows; to_dataframe(level="by_path") includes always-present het_* columns.
  • Cross-doc consistency: CHANGELOG, Registry, API docs, guide, source docstrings, implementation, and visible tests are aligned for the new per-path heterogeneity behavior.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 11, 2026
@igerber igerber merged commit 2f77f55 into main May 11, 2026
28 of 29 checks passed
@igerber igerber deleted the dcdh-by-path-heterogeneity branch May 11, 2026 23:32
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 12, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cdbc6bb85f445d8b115de093d6a36bff1f5bb183


Overall Assessment

⚠️ Needs changes — highest unmitigated finding is P1 (the new Registry note overstates per-path predict_het parity with R), with additional P2 claim-vs-test gaps.

Executive Summary

  • Affected method: dCDH heterogeneity testing (Web Appendix Section 1.5 / Lemma 7) newly composed with by_path / paths_of_interest.
  • The estimator plumbing itself is mostly complete: the old fit-time gate is gone, path_heterogeneity_effects is wired into results, and both summary() and to_dataframe(level="by_path") were extended.
  • The feature-specific Methodology Registry note currently overclaims R parity on the placebo side.
  • The new replicate-weight test does not actually prove the final-df_survey refresh path added for path_heterogeneity_effects.
  • The public SE-parity claim (rtol=1e-6) is stronger than what the new parity tests enforce (SE_RTOL = 1e-5).
  • Survey coverage is only exercised through by_path, not the reciprocal paths_of_interest selector that the docs also advertise.

Methodology

  • M1 (P1) docs/methodology/REGISTRY.md (new “Per-path heterogeneity testing” paragraph) says placebo rows stay NaN because “R does not ship per-path predict_het on placebos either,” while the implementation hard-codes het_* placebo rows to NaN in diff_diff/chaisemartin_dhaultfoeuille_results.py:L1885-L1896. The visible DIDmultiplegtDYN docs say predict_het also shows placebo regression tables, and the by_path dispatcher forwards both placebo and predict_het into each per-path did_multiplegt_main call. Impact: the Methodology Registry currently presents this as parity-preserving when the visible upstream contract suggests it is still a Python-side limitation/deviation. Concrete fix: either implement per-path placebo heterogeneity / joint-null output, or rewrite the Registry / changelog / API text to call this an explicit deviation from R rather than parity. (rdrr.io)
  • M2 (P3) diff_diff/chaisemartin_dhaultfoeuille.py:L950-L957 still documents heterogeneity as a partial implementation (post-treatment regressions only, no placebo regressions or joint null test). Impact: this limitation itself is already documented and is not a blocker. Concrete fix: none, beyond keeping the new per-path docs aligned with that existing caveat.

Code Quality

No P0/P1/P2 findings in the changed Python implementation.

Performance

  • PF1 (P3) _compute_path_heterogeneity_test() re-enumerates treatment paths even though fit() already did that work for path_effects (diff_diff/chaisemartin_dhaultfoeuille.py:L6454-L6500). Impact: extra work and duplicate warning-suppression state, but not correctness-critical. Concrete fix: thread the selected path list / masks into the helper instead of recomputing them.

Maintainability

No P0/P1/P2 findings beyond the documentation/test issues below.

Tech Debt

No separate P0/P1/P2 tech-debt finding beyond the untracked test gaps in this PR.

Security

No security or secret-exposure issue is visible in the provided diff.

Documentation/Tests

  • T1 (P2) diff_diff/chaisemartin_dhaultfoeuille.py:L4047-L4065 adds a dedicated final-df_survey refresh loop for path_heterogeneity_effects, but tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity.test_per_path_heterogeneity_replicate_weights_propagates_n_valid only checks survey_metadata.df_survey and finite se. Impact: this test would still pass if the new heterogeneity branch failed to append n_valid or left stale t_stat / p_value / conf_int, because earlier survey code paths already populate the shared accumulator. Concrete fix: assert that at least one (path, horizon) entry matches safe_inference(beta, se, df=res.survey_metadata.df_survey) after fit, ideally in a fixture where the final df differs from the compute-time snapshot.
  • T2 (P2) CHANGELOG.md:9 says per-path SE matches R within rtol=1e-6, but the new parity classes tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneity and ::TestDCDHDynRParityByPathHeterogeneity both set SE_RTOL = 1e-5. Impact: a 1e-6-to-1e-5 SE drift would still ship despite the public claim. Concrete fix: either tighten the tests to 1e-6, or relax the published claim to the tested tolerance.
  • T3 (P2) The code and docs claim both by_path and paths_of_interest compose with heterogeneity under survey_design (diff_diff/chaisemartin_dhaultfoeuille.py:L3872-L3895, CHANGELOG.md:9), but all new survey-specific tests in TestByPathHeterogeneity use by_path=...; none use paths_of_interest=[...]. Impact: the reciprocal selector surface is advertised but not regression-tested under analytical Binder TSL, replicate weights, or the survey+bootstrap gate. Concrete fix: add at least one analytical survey test and one replicate/gate test using paths_of_interest.

Path to Approval

  1. Correct the per-path heterogeneity Registry/changelog wording around R placebo parity, or implement the missing per-path placebo/joint-test surface.
  2. Add a replicate-weight regression that proves path_heterogeneity_effects inference fields are recomputed from the final survey_metadata.df_survey.
  3. Align the stated SE-parity tolerance and the tests.
  4. Add paths_of_interest + heterogeneity + survey_design regression coverage.

igerber added a commit that referenced this pull request May 13, 2026
Address residual P1+P2s from re-audit of PR #412
igerber added a commit that referenced this pull request May 13, 2026
The CHANGELOG paths_of_interest entry referenced
`chaisemartin_dhaultfoeuille.py:1118` for the by_path precondition
gate, but later edits (lifting the survey_design mutex in #408 and
composing in heterogeneity in #412) have drifted the actual gate
location to ~L1216. Drop the numeric anchor entirely and describe
the gate textually so future edits do not silently invalidate the
line reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 13, 2026
The restored CI reviewer surfaced findings the degraded reviewer
missed across all 5 prior rounds on PR igerber#412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path
predict_het on placebos either, so parity is preserved by deferral"
contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)`
dispatcher actually does - it forwards `predict_het` into each
per-path `did_multiplegt_main` call along with `placebo`, so R may
emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite
both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and
REGISTRY.md DataFrame-integration paragraph) as an explicit Python-
side deferral rather than a verified R-parity. Add a TODO row to
track validating R's actual placebo predict_het output and either
implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity
paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the
parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser
on SE). Split the claim into the two separate tolerances and note
the WLS-denominator/cohort-recentering numerical drift that motivates
the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only
checked finite SE; it would have passed if the new dedicated
heterogeneity refresh loop failed to recompute t_stat / p_value /
conf_int at the final df_survey. Strengthen the test to call
`safe_inference(beta, se, df=df_survey)` on the first finite entry
and assert the stored inference fields match - this anti-regression
covers the dedicated post-call refresh added for path_heterogeneity_
effects.

P2 (paths_of_interest survey gap): the documented composability of
`paths_of_interest + heterogeneity + survey_design` was not regression-
locked - all existing survey-specific tests used `by_path=k`. Add
test_paths_of_interest_heterogeneity_survey_design_analytical (verify
analytical Binder TSL fits, selector ordering preserved, finite SE
per populated (path, l)) and test_paths_of_interest_heterogeneity_
survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies
under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or
default statistical surface changed in source - documentation
accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 14, 2026
Restored CI reviewer caught one actionable docs-drift item: stale
mentions of "plug-in-only SE" and a `survey_design` mutex that igerber#408
itself lifted.

1. CHANGELOG paths_of_interest entry listed `survey_design` in the
   by_path precondition gate alongside `heterogeneity` / `design2` /
   `honest_did`. Both `survey_design` (lifted by igerber#408) and
   `heterogeneity` (composed in by igerber#412) have been removed from that
   gate. Update the parenthetical to reflect the shipped gate and
   note the lifted mutexes explicitly to avoid misleading readers
   about current behavior.

2. `_compute_path_effects` docstring described SE only as
   "plug-in via _plugin_se(...)". Add a dedicated SE-formation
   section listing the non-survey plug-in path and the new survey
   path (path-restricted per-period IF routed through
   `_survey_se_from_group_if`, replicate-weight df propagation via
   `replicate_n_valid_list`, and the shared post-call
   `_refresh_path_inference` reconciling stored inference fields
   against the final `df_survey`).

3. `_compute_path_placebos` docstring mirrored the same plug-in-only
   description. Add a parallel SE-formation section pointing back
   at `_compute_path_effects` for the shared survey contract.

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 14, 2026
Address #412 holistic re-audit residuals (R2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant