0 votes
asked ago in General Economics Questions by (390 points)
I built a small deterministic gridworld benchmark to isolate a specific decision behavior: choosing between viable policies in a case where standard scalar or constrained formulations make a different choice.

The example is intentionally minimal (fixed policies, no learning, two domains: energy and safety).

In the benchmark, both policies reach the goal and remain viable, but one preserves safety while the other reduces it from 10 to 4. The PV-PP system selects the former; a scalar baseline selects the latter.

Known limitations:
* the selection rule is partially benchmark-specific
* there is no general trade-off mechanism yet

GitHub: https://github.com/AmundsenLance/pvpp-gridworld-safe-benchmark.git

I’m looking for critique on whether this behavior can always be represented as a scalar or lexicographic ordering, or whether it requires a genuinely non-scalar selection rule.

1 Answer

+1 vote
answered ago by (440 points)
I don't know the answer, but this is what an AI said.

This is a useful benchmark, but I would be careful about the claim that the behavior cannot be represented by a scalar ordering.

For a small deterministic gridworld with fixed policies, the answer is probably: yes, the observed choice can always be represented by some scalar or lexicographic rule. With only two policies, that is almost trivial. You can always define a scoring function that assigns a higher value to the policy you want selected.
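To make the "almost trivial" point concrete, here is a minimal sketch of an after-the-fact scoring function. The policy labels are hypothetical placeholders, not names from the repository, and the function is only an illustration of why a single observed choice cannot rule out a scalar representation.

```python
def post_hoc_score(chosen):
    """Build a scalar score that rates `chosen` above every other policy.

    This uses no information about energy or safety at all, which is
    exactly why it explains nothing about the selection.
    """
    return lambda policy: 1.0 if policy == chosen else 0.0


# Hypothetical labels for the two fixed policies in the benchmark.
policies = ["preserve_safety", "degrade_safety"]

score = post_hoc_score("preserve_safety")
best = max(policies, key=score)
```

Whatever the selection rule picks, the same construction reproduces it, so the existence of such a score says nothing about the rule itself.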

So I do not think this benchmark, by itself, proves that a non-scalar selection rule is necessary.

But I also do not think that makes the benchmark uninteresting. The better question is not whether a scalar function can be built after the fact. It is whether a stable, non-ad hoc scalar function can recover the same behavior across a family of cases without smuggling in the PV-PP structure.

In your benchmark, the important behavior seems to be this:

Both policies reach the goal and remain viable, but one unnecessarily degrades the safety domain. PV-PP chooses the policy that preserves the safety domain, while the scalar baseline chooses the faster or higher-scoring path.

That can be represented lexicographically:

1. First require goal achievement and viability.
2. Then prefer policies that avoid unnecessary safety degradation.
3. Then use energy, speed, or path length as a tie-breaker.
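The three steps above could be sketched roughly as follows. This is my own illustration, not the PV-PP implementation: the `Outcome` fields and the energy costs are assumptions, and only the safety drop of 6 (from 10 to 4) comes from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Summary of one fixed policy's rollout (hypothetical fields)."""
    name: str
    reaches_goal: bool
    viable: bool
    safety_loss: float   # 10 -> 4 in the question gives a loss of 6
    energy_cost: float   # assumed values, not taken from the benchmark


def lexicographic_choice(outcomes):
    """Filter on goal and viability, then minimize safety loss, then energy."""
    admissible = [o for o in outcomes if o.reaches_goal and o.viable]
    if not admissible:
        return None
    # Tuples compare element by element, so safety loss strictly dominates.
    return min(admissible, key=lambda o: (o.safety_loss, o.energy_cost))


outcomes = [
    Outcome("safe", True, True, safety_loss=0.0, energy_cost=12.0),
    Outcome("fast", True, True, safety_loss=6.0, energy_cost=8.0),
]
choice = lexicographic_choice(outcomes)
```

Under these assumed numbers the rule picks the `safe` outcome no matter how much cheaper the `fast` route is, because energy is only consulted when safety loss is tied.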

It can also be represented as a scalar with a very large penalty for safety degradation or hazard exposure. But that scalar representation may be doing the wrong explanatory work. The reason for the choice is not really “the safety penalty happened to be large enough.” The reason is that safety preservation has structural priority once both policies are viable.
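A small sketch of the penalty version shows why the weight carries the whole result. The energy costs here are assumed for illustration; only the safety loss of 6 comes from the question.

```python
# Hypothetical outcomes as (energy_cost, safety_loss). The energy costs
# are assumed; the safety loss of 6 matches the 10 -> 4 drop.
OUTCOMES = {
    "safe": (12.0, 0.0),
    "fast": (8.0, 6.0),
}


def scalar_choice(safety_weight):
    """Pick the policy maximizing -(energy_cost + safety_weight * safety_loss)."""
    def score(name):
        energy_cost, safety_loss = OUTCOMES[name]
        return -(energy_cost + safety_weight * safety_loss)
    return max(OUTCOMES, key=score)


# With these numbers the choice flips where 12 = 8 + 6 * w, i.e. w = 2/3.
low = scalar_choice(0.5)
high = scalar_choice(1.0)
```

The selected policy depends entirely on which side of the break-even weight the tuning lands, which is the sense in which the priority is hidden inside the penalty.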

That is where your framework may have value. It makes the priority structure explicit instead of hiding it inside a tuned penalty weight.

The strongest critique I would offer is this: the benchmark currently shows divergence from one scalar baseline, not from scalarization generally. To make the stronger claim, you would need to test against a family of scalar baselines or prove that no fixed scalar weighting scheme can reproduce the PV-PP choices across a wider class of gridworlds.

For example, vary:

* the energy cost of the safe route,
* the safety loss from hazards,
* the number of hazards,
* the initial safety reserve,
* the goal reward,
* and whether the safety loss remains above or falls below an adequacy threshold.

Then ask whether one stable scalar objective can reproduce the PV-PP choices across all cases. If it can do so only by adding hard constraints, lexicographic priority, discontinuous penalties, or environment-specific weights, then you have a much stronger argument.
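One rough way to run that test is to sweep a grid of environments and count where a fixed-weight scalar disagrees with the reference rule. In the sketch below the reference is a simple "avoid any unnecessary safety loss" rule standing in for PV-PP, which is an assumption on my part, and the grid values are made up.

```python
from itertools import product


def reference_choice(extra_energy, safety_loss):
    """Stand-in for the PV-PP rule (assumed): take the safe route
    whenever the fast route would lose any safety."""
    return "safe" if safety_loss > 0 else "fast"


def scalar_choice(extra_energy, safety_loss, weight):
    """Fixed-weight scalar: the safe route pays extra energy, the fast
    route pays weight * safety_loss."""
    safe_score = -extra_energy
    fast_score = -weight * safety_loss
    return "safe" if safe_score > fast_score else "fast"


def disagreements(weight, extra_energies, safety_losses):
    """List the (extra_energy, safety_loss) cases where the two rules differ."""
    return [
        (e, s)
        for e, s in product(extra_energies, safety_losses)
        if scalar_choice(e, s, weight) != reference_choice(e, s)
    ]


safety_losses = [1, 3, 6]
small_family = disagreements(10, [1, 2, 4, 8], safety_losses)
larger_family = disagreements(10, [1, 2, 4, 8, 16], safety_losses)
```

With a weight of 10 the scalar matches the reference on the smaller grid, but adding a safe-route cost of 16 produces a disagreement at a safety loss of 1. Any finite fixed weight has this problem against a strictly lexicographic reference once the family is wide enough, which is why the family matters more than any single case.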

So my answer would be:

This benchmark does not yet show that scalar representation is impossible. A scalar or lexicographic model can probably represent this particular choice. But it does show something potentially important: the scalar representation may be derivative, while the real selection logic is domain-structured viability plus preservation of safety.

That is a serious distinction. The next step is to move from “this scalar baseline chooses differently” to “no stable scalar baseline can recover the PV-PP rule across a structured class of cases without importing the non-scalar rule in disguise.”
commented ago by (390 points)
Thank you. This is a very good critique, and I agree with the central limitation.

For a fixed deterministic benchmark with only two candidate policies, a single observed PV-PP selection does not establish that scalar representation is impossible. After the fact, one can always construct a scalar score that assigns the selected policy the higher value. A lexicographic rule can also represent the local choice by making safety preservation prior to energy or speed once viability and goal achievement are satisfied.

So the benchmark should not be read as a proof that scalarization is impossible.

The more precise significance of the benchmark is that it exposes the structural question behind the choice. Both policies reach the goal and remain viable, but one needlessly consumes safety while the other preserves it. The PV-PP selection is not intended to mean simply that a sufficiently large safety penalty happened to dominate a sufficiently small energy advantage. It is intended to represent that safety remains a governing domain whose unnecessary degradation matters structurally in the decision architecture.

That leads to the stronger research question you identified:

Can one stable scalar objective recover PV-PP selections across a structured family of environments as energy cost, safety loss, adequacy thresholds, hazard exposure, initial reserve, and governing conditions change — without importing the PV-PP structure through environment-specific weights, hard constraints, lexicographic priorities, or discontinuous penalties?

Our subsequent scalar-reduction work has adopted a more bounded formulation of the issue:

Scalar comparison can be contained within our PV-PP framework as a restricted, certifiable subroutine where the structure supports it. But scalar comparison should not be assumed to contain the full PV-PP decision architecture.

That distinction matters. A scalar method may reproduce an output in a particular case. It may even be valid across a certified subclass of cases. But reproducing a selected output is not the same as reproducing the governing structure that determines viability, domain priority, threshold effects, stage separation, recovery conditions, or state-dependent changes in what should govern the choice.

I also agree with your proposed next benchmark direction. A stronger public benchmark should vary the safe-route cost, safety loss, initial reserve, hazard profile, and adequacy thresholds, then test whether a fixed scalar comparator continues to recover the PV-PP selection rule across that family of cases.

So I would now characterize this gridworld benchmark as an illustrative divergence case rather than an impossibility proof. Its value is that it makes the underlying question visible: when is scalar comparison a valid certified shortcut, and when does it merely disguise a richer non-scalar governance structure?
commented ago by (440 points)
I don't know the answer, but this is what an AI said.

I think this revised framing is much stronger.

The distinction between “reproducing the selected output” and “reproducing the governing structure” seems like the right move. A scalar score can almost always be constructed after the fact for a single deterministic choice. But that does not show that scalar comparison is the explanatory architecture actually doing the work.

The next step, in my view, is exactly the benchmark family you describe. The question should not be whether one scalar function can reproduce one PV-PP choice. The question should be whether one stable scalar comparator can reproduce the PV-PP rule across systematic variation in:

* safety loss,
* energy cost,
* initial reserve,
* hazard frequency,
* adequacy thresholds,
* and whether safety is above or below the governing threshold.

If scalar recovery requires environment-specific retuning, discontinuous penalties, hard constraints, or lexicographic priority, then the scalar model may still be representationally possible, but it is no longer doing the fundamental explanatory work. It is acting as a wrapper around a non-scalar governance structure.
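As a hedged illustration of the "wrapper" point, here is a scalar score with a discontinuous penalty at an adequacy threshold. The threshold values and energy costs are hypothetical, and this is not claimed to be the PV-PP rule; it only shows how a scalar can carry a hard constraint inside it.

```python
import math


def thresholded_score(energy_cost, final_safety, adequacy_threshold):
    """Scalar score whose penalty is discontinuous at the threshold."""
    if final_safety < adequacy_threshold:
        # An infinite penalty behaves exactly like a hard constraint.
        return -math.inf
    return -energy_cost


# (energy_cost, final_safety). Final safety values of 10 and 4 follow the
# question; the energy costs are assumed.
ROUTES = {"safe": (12.0, 10.0), "fast": (8.0, 4.0)}


def choose(adequacy_threshold):
    """Pick the route with the highest thresholded score."""
    return max(
        ROUTES,
        key=lambda name: thresholded_score(*ROUTES[name], adequacy_threshold),
    )


strict = choose(adequacy_threshold=5.0)
lenient = choose(adequacy_threshold=3.0)
```

The output is a single number, but the decision is made by the branch on the threshold, so the scalar form is only packaging for a constraint.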

So I think your new characterization is right: the gridworld is not an impossibility proof. It is an illustrative divergence case. Its purpose is to make the underlying representational issue visible.

The strongest version of the project would be to define the conditions under which scalar comparison is a certified shortcut within PV-PP, and then define the conditions under which that shortcut fails or becomes artificial.
...