*(This is a basic point about utility theory which many will already be familiar with. I draw some non-obvious conclusions which may be of interest to you even if you think you know this from the title -- but the main point is to communicate the basics. I'm posting it to the alignment forum because I've heard misunderstandings of this from some in the AI alignment research community.)*

I will first give the basic argument that the utility quantities of different agents aren't directly comparable, and a few important consequences of this. I'll then spend the rest of the post discussing what to do when you need to compare utility functions.

# Utilities aren't comparable.

Utility isn't an ordinary quantity. A utility function is a device for expressing the preferences of an agent.

Suppose we have a notion of *outcome*.\* We could try to represent the agent's preferences between outcomes as an ordering relation: if we have outcomes A, B, and C, then one possible preference would be A<B<C.

However, a mere ordering does not tell us how the agent would decide between *gambles,* ie, situations giving A, B, and C with some probability.

With just three outcomes, there is only one thing we need to know: is B closer to A or C, and by how much?

We want to construct a utility function U() which represents the preferences. Let's say we set U(A)=0 and U(C)=1. Now consider the gamble G which gives A or C with probability 1/2 each, so that its expected utility is 1/2. If the agent is indifferent between B and G (B=G), we can represent this as U(B)=1/2. If not, we would look for a different gamble which *does* equal B, and then set B's utility to the expected value of that gamble. By assigning real-numbered values to each outcome, we can fully represent an agent's preferences over gambles. (Assuming the VNM axioms hold, that is.)

But the initial choices U(A)=0 and U(C)=1 were arbitrary! We could have chosen any numbers so long as U(A)<U(C), reflecting the preference A<C. In general, a valid representation of our preferences U() can be modified into an equally valid U'() by adding/subtracting arbitrary numbers, or multiplying/dividing by positive numbers.

So it's just as valid to say someone's expected utility in a given situation is 5 or -40, provided you shift everything *else* around appropriately.
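A minimal sketch of this point, with made-up outcomes, probabilities, and constants: the helper names (`rescale`, `prefers`) are hypothetical, and the only thing being illustrated is that a positive rescaling plus a shift leaves every choice between gambles unchanged.

```python
from fractions import Fraction

# Utilities from the A < B < C example above.
U = {"A": Fraction(0), "B": Fraction(1, 2), "C": Fraction(1)}

def rescale(u, a, b):
    """Return the equivalent representation a*u + b (requires a > 0)."""
    assert a > 0
    return {o: a * v + b for o, v in u.items()}

def expected_utility(u, gamble):
    """gamble maps outcomes to probabilities summing to 1."""
    return sum(p * u[o] for o, p in gamble.items())

def prefers(u, g1, g2):
    """True if an agent with utility u strictly prefers gamble g1 to g2."""
    return expected_utility(u, g1) > expected_utility(u, g2)

# Same preferences, different numbers: now U2(A) = -40 and U2(B) = 5.
U2 = rescale(U, Fraction(90), Fraction(-40))

sure_B = {"B": Fraction(1)}
risky = {"A": Fraction(2, 5), "C": Fraction(3, 5)}  # 40% A, 60% C

same_choice = prefers(U, risky, sure_B) == prefers(U2, risky, sure_B)
```

Both representations prefer the risky gamble to a sure B, even though the numbers attached to each outcome look completely different.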

Writing $U_1 \sim U_2$ to mean that two utility functions represent the same preferences, what we have in general is: $U_1 \sim U_2$ if and only if $U_1 = \alpha U_2 + \beta$ for some $\alpha > 0$. (I'll call $\alpha$ the *multiplicative constant* and $\beta$ the *additive constant*.)

This means that we can't directly compare the utility of two different agents. Notions of fairness should not directly say "everyone should have the same expected utility". Utilitarian ethics cannot directly maximize the sum of everyone's utility. Both of these operations should be thought of as a type error.

# Some non-obvious consequences.

The game-theory term "zero sum" is a misnomer. You shouldn't directly think about the sum of the utilities.

In mechanism design, *exchangeable utility* is a useful assumption which is often needed in order to get nice results. The idea is that agents can give utils to each other, perhaps to compensate for unfair outcomes. This is *kind of* like assuming there's money which can be exchanged between agents. However, the non-comparability of utility should make this seem *really weird*. (There are also other disanalogies with money; for example, utility is closer to logarithmic in money, not linear.)

This could (should?) also make you suspicious of talk of "average utilitarianism" and "total utilitarianism". However, beware: only one kind of "utilitarianism" holds that the term "utility" in decision theory means the same thing as "utility" in ethics: namely, preference utilitarianism. Other kinds of utilitarianism can distinguish between these two types of utility. (For example, one can be a hedonic utilitarian without thinking that what everyone wants is happiness, if one isn't a preference utilitarian.)

Similarly, for preference utilitarians, talk of *utility monsters* becomes questionable. A utility monster is, supposedly, someone who gets much more utility out of resources than everyone else. For a hedonic utilitarian, it would be someone who experiences much deeper sadness and much higher heights of happiness. This person supposedly merits more resources than other people.

For a preference utilitarian, incomparability of utility means we can't simply posit such a utility monster. It's meaningless *a priori* to say that one person simply has much stronger preferences than another (in the utility function sense).

All that being said, we *can* actually compare utilities, sum them, exchange utility between agents, define utility monsters, and so on. We just need *more information.*

# Comparing utilities.

The incomparability of utility functions *doesn't mean* we can't trade off between the utilities of different people.

I've heard the non-comparability of utility functions summarized as the thesis that we can't say anything meaningful about the relative value of one person's suffering vs another person's convenience. Not so! Rather, the point is just that *we need more assumptions in order to say anything.* The utility functions alone aren't enough.

## Pareto-Optimality: The Minimal Standard

Comparing utility functions suggests putting them all onto one scale, such that we can trade off between them -- "this dollar does more good for Alice than it does for Bob". We formalize this by imagining that we have to decide policy for the whole group of people we're considering (e.g., the whole world). We consider a *social choice function* which would make those decisions on behalf of everyone. Supposing it is VNM rational, its decisions must be comprehensible in terms of a utility function, too. So the problem reduces to combining a bunch of individual utility functions, to get one big one.

So, how do we go about combining the preferences of many agents into one?

The first and most important concept is the **Pareto improvement**: our social choice function should endorse changes which benefit someone and harm no one. An option which allows no such improvements is said to be **Pareto-optimal.**

We might also want to consider **strict Pareto improvements**: a change which benefits everyone. (An option which allows no strict Pareto improvements is **weakly Pareto-optimal.**) Strict Pareto improvements can be more relevant in a bargaining context, where you need to give everyone something in order to get them on board with a proposal -- otherwise they may judge the improvement as unfairly favoring others. However, in a bargaining context, individuals may refuse even a strict Pareto improvement due to fairness considerations.

In either case, a version of Harsanyi's utilitarianism theorem implies that the utility of our social choice function *can be understood as some linear combination of the individual utility functions.*

So, Pareto-optimal social choice functions can always be understood by:

- Choosing a scale for everyone's utility function -- IE, set the multiplicative constant. (If the social choice function is only weakly Pareto optimal, some of the multiplicative constants might turn out to be zero, totally cancelling out someone's involvement. Otherwise, they can all be positive.)
- Adding all of them together.

(Note that the *additive constant* doesn't matter -- shifting a person's utility function up or down doesn't change what decisions will be endorsed by the sum. However, it *will* matter for some other ways to combine utility functions.)

This is nice, because we can always combine everything linearly! We just have to set things to the right scale and then sum everything up.
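As a rough illustration of the scale-then-sum recipe (the options, the two agents, and all the numbers are invented for the example), the sketch below picks the option with the highest weighted sum. Shifting one agent's utilities by a constant leaves the choice alone, while changing a multiplicative weight can flip it.

```python
options = ["park", "library", "stadium"]   # hypothetical policy options

# Hypothetical individual utilities over the options.
alice = {"park": 1.0, "library": 0.6, "stadium": 0.0}
bob = {"park": 0.0, "library": 0.5, "stadium": 1.0}

def social_choice(utilities, weights):
    """Pick the option maximizing the weighted sum of individual utilities."""
    def total(option):
        return sum(w * u[option] for u, w in zip(utilities, weights))
    return max(options, key=total)

def shift(u, b):
    """Add the constant b to every utility (changes the additive constant only)."""
    return {o: v + b for o, v in u.items()}

base = social_choice([alice, bob], [1.0, 1.0])
shifted = social_choice([shift(alice, 100.0), bob], [1.0, 1.0])
rescaled = social_choice([alice, bob], [1.0, 3.0])
```

Here `base` and `shifted` agree, but `rescaled` does not: the whole question of fairness is hiding in the choice of weights.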

However, it's far from the end of the story. How do we choose multiplicative constants for everybody?

## Variance Normalization: Not Too Exploitable?

We could set the constants any way we want... totally subjective estimates of the worth of a person, draw random lots, etc. But we do typically want to represent some notion of fairness. We said in the beginning that the problem was, a utility function $U$ has many equivalent representations $\alpha U + \beta$. We can address this as a problem of *normalization:* we want to take a $U$ and put it into a canonical form, getting rid of the choice between equivalent representations.

One way of thinking about this is *strategy-proofness*. A utilitarian collective should not be vulnerable to members strategically claiming that their preferences are stronger (a larger multiplicative constant $\alpha$), or that they should get more because they're worse off than everyone (a smaller additive constant $\beta$ -- although, remember that we haven't talked about any setup which actually cares about that, yet).

**Warm-Up: Range Normalization**

Unfortunately, some obvious ways to normalize utility functions are not going to be strategy-proof.

One of the simplest normalization techniques is to squish everything into a specified range, such as [0,1]:

$$U'(x) = \frac{U(x) - \min_y U(y)}{\max_y U(y) - \min_y U(y)}$$

This is analogous to range voting: everyone reports their preferences for different outcomes on a fixed scale, and these all get summed together in order to make decisions.

If you're an agent in a collective which uses range normalization, then you may want to strategically mis-report your preferences. For example, picture an agent whose utility has a big hump around outcomes they like, and a small hump on a secondary "just OK" outcome. The agent might want to get rid of the second hump, forcing the group outcome into the more favored region.

I believe that in the extreme, the optimal strategy for range voting is to choose some utility threshold. Anything below that threshold goes to zero, feigning maximal disapproval of the outcome. Anything above the threshold goes to one, feigning maximal approval. In other words, under strategic voting, range voting becomes approval voting (range voting where the only options are zero and one).
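Here is a small sketch of that incentive, under invented assumptions: three voters, three outcomes, made-up utilities, and a threshold I picked by hand. It sums range-normalized reports and compares the result when one voter reports honestly against the result when that voter reports approval-style zeros and ones.

```python
def range_normalize(u):
    """Rescale a utility dict so its minimum is 0 and its maximum is 1."""
    lo, hi = min(u.values()), max(u.values())
    return {o: (v - lo) / (hi - lo) for o, v in u.items()}

def winner(reports):
    """Sum range-normalized reports and return the top-scoring outcome."""
    normalized = [range_normalize(r) for r in reports]
    outcomes = list(reports[0].keys())
    return max(outcomes, key=lambda o: sum(n[o] for n in normalized))

# Hypothetical true preferences over three outcomes.
me = {"x": 10.0, "y": 6.0, "z": 0.0}      # x is my favorite, y is "just OK"
second = {"x": 0.0, "y": 7.0, "z": 10.0}
third = {"x": 8.0, "y": 10.0, "z": 0.0}

honest = winner([me, second, third])

# Strategic report: anything I value at 8 or more gets 1, everything else gets 0.
strategic_me = {o: (1.0 if v >= 8.0 else 0.0) for o, v in me.items()}
strategic = winner([strategic_me, second, third])
```

With honest reports the compromise outcome `y` wins; once I flatten my "just OK" hump to zero, my favorite `x` wins instead. The other voters are assumed to stay honest, which is of course the unrealistic part.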

If it's not possible to mis-report your preferences, then the incentive becomes to *self-modify to literally have these extreme preferences.* This could perhaps have a real-life analogue in political outrage and black-and-white thinking. If we use this normalization scheme, that's the closest you can get to being a utility monster.

**Variance Normalization**

We'd *like* to avoid *any* incentive to misrepresent/modify your utility function. Is there a way to achieve that?

Owen Cotton-Barratt discusses different normalization techniques in illuminating detail, and argues for *variance normalization:* divide utility functions by their standard deviation, making the variance one. (*Geometric reasons for normalizing variance to aggregate preferences,* O Cotton-Barratt, 2013.) Variance normalization is strategy-proof under the assumption that everyone participating in an election shares beliefs about how probable the different outcomes are! (Note that *variance of utility* is only well-defined under some assumption about *probability of outcome.*) That's pretty good. It's probably the best we can get, in terms of strategy-proofness of voting. Will MacAskill also argues for variance normalization in the context of normative uncertainty (*Normative Uncertainty,* Will MacAskill, 2014).

Intuitively, variance normalization directly addresses the issue we encountered with range normalization: an individual attempts to make their preferences "loud" by extremizing everything to 0 or 1. This increases variance, so, is directly punished by variance normalization.
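A sketch of the mechanics, assuming a made-up shared probability distribution over three outcomes. Rescaling so that the variance is one means dividing by the standard deviation; subtracting the mean as well is optional here, since additive constants don't affect a sum. The function names are mine, not from the cited papers.

```python
import math

def variance(u, prob):
    """Variance of utility under the (assumed shared) outcome probabilities."""
    mean = sum(prob[o] * u[o] for o in u)
    return sum(prob[o] * (u[o] - mean) ** 2 for o in u)

def variance_normalize(u, prob):
    """Shift to mean 0 and divide by the standard deviation, so variance is 1."""
    mean = sum(prob[o] * u[o] for o in u)
    sd = math.sqrt(variance(u, prob))
    return {o: (u[o] - mean) / sd for o in u}

prob = {"x": 0.25, "y": 0.5, "z": 0.25}     # hypothetical shared beliefs
honest = {"x": 1.0, "y": 0.6, "z": 0.0}
extreme = {"x": 1.0, "y": 0.0, "z": 0.0}    # the "loud" approval-style report

h = variance_normalize(honest, prob)
e = variance_normalize(extreme, prob)

# How strongly each report pushes for x over z after normalization.
honest_push = h["x"] - h["z"]
extreme_push = e["x"] - e["z"]
```

Both reports end up with variance one, so the extremized report cannot simply be louder everywhere: it buys a stronger push for `x` over `y` at the price of a weaker push for `x` over `z`.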

However, Jameson Quinn, LessWrong's resident voting theory expert, has warned me rather strongly about variance normalization.

- The assumption of shared beliefs about election outcomes is far from true in practice. Jameson Quinn tells me that, in fact, the strategic voting incentivized by quadratic voting is *particularly bad* amongst normalization techniques.
- Strategy-proofness isn't, after all, the final arbiter of the quality of a voting method. The final arbiter should be something like the utilitarian quality of an election's outcome. This question gets a bit weird and recursive in the current context, where I'm using elections as an analogy to ask how we should define utilitarian outcomes. But the point still, to some extent, stands.

I didn't understand the full justification behind his point, but I came away thinking that range normalization was probably better in practice. After all, it reduces to approval voting, which is actually a pretty good form of voting. But if you want to do the best we can with the state of voting theory, Jameson Quinn suggested 3-2-1 voting. (I don't think 3-2-1 voting gives us any nice theory about how to combine utility functions, though, so it isn't so useful for our purposes.)

**Open Question:** *Is there a variant of variance normalization which takes differing beliefs into account, to achieve strategy-proofness (IE honest reporting of utility)?*

Anyway, so much for normalization techniques. These techniques ignore the broader context. They attempt to be fair and even-handed *in the way we choose the multiplicative and additive constants.* But we could also explicitly try to be fair and even-handed *in the way we choose between Pareto-optimal outcomes*, as with this next technique.

## Nash Bargaining Solution

It's important to remember that the Nash bargaining solution is a solution *to the Nash bargaining problem*, which isn't quite our problem here. But I'm going to gloss over that. Just imagine that we're setting the social choice function through a massive negotiation, so that we can apply bargaining theory.

Nash offers a very simple solution, which I'll get to in a minute. But first, a few words on how this solution is derived. Nash provides two separate justifications for his solution. The first is a game-theoretic derivation of the solution as an especially robust Nash equilibrium. I won't detail that here; I quite recommend his original paper (*The Bargaining Problem,* 1950); but, just keep in mind that there is at least some reason to expect selfishly rational agents to hit upon this particular solution. The second, unrelated justification is an axiomatic one:

- *Invariance to equivalent utility functions.* This is the same motivation I gave when discussing normalization.
- *Pareto optimality.* We've already discussed this as well.
- *Independence of Irrelevant Alternatives (IIA).* This says that we shouldn't change the outcome of bargaining by removing options which won't ultimately get chosen anyway. This isn't even technically one of the VNM axioms, but it *essentially* is -- the VNM axioms are posed for binary preferences (a > b). IIA is the assumption we need to break down multi-choice preferences to binary choices. We can justify IIA with a kind of money pump.
- *Symmetry.* This says that the outcome doesn't depend on the order of the bargainers; we don't prefer Player 1 in case of a tie, or anything like that.

Nash proved that *the only way to meet these four criteria* is to maximize the **product** of gains from cooperation. More formally, choose the outcome $x$ which maximizes:

$$(U_1(x) - U_1(d))(U_2(x) - U_2(d))$$

The $d$ here is a "status quo" outcome. You can think of this as what happens if the bargaining fails. This is sometimes called a "threat point", since strategic players should carefully set what they do *if negotiation fails* so as to maximize their bargaining position. However, you might also want to rule that out, forcing $d$ to be a Nash equilibrium in the hypothetical game where there is no bargaining opportunity. As such, $d$ is also known as the *best alternative to negotiated agreement (BATNA)*, or sometimes the "disagreement point" (since it's what players get if they can't agree). We can think of subtracting out $U(d)$ as just a way of adjusting the additive constant, in which case we really are just maximizing the product of utilities. (The BATNA point is always (0,0) after we subtract out things that way.)
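To make the product rule concrete, here is a minimal sketch for two players splitting a dollar in whole cents, assuming utility linear in money and a disagreement outcome in which the dollar is destroyed. The function `nash_solution` and the label `"destroyed"` are illustrative names, not anything from Nash's paper.

```python
def nash_solution(options, u1, u2, d):
    """Return the option maximizing the product of gains over the disagreement outcome d.

    Only options at least as good as d for both players are considered.
    """
    d1, d2 = u1(d), u2(d)
    feasible = [x for x in options if u1(x) >= d1 and u2(x) >= d2]
    return max(feasible, key=lambda x: (u1(x) - d1) * (u2(x) - d2))

# An option is the number of cents given to player 1; player 2 gets the rest.
options = list(range(0, 101))

def u1(x):
    return 0.0 if x == "destroyed" else float(x)

def u2(x):
    return 0.0 if x == "destroyed" else float(100 - x)

def u2_loud(x):
    """An equivalent representation of u2: bigger numbers, same preferences."""
    return 1000.0 * u2(x) + 7.0

split = nash_solution(options, u1, u2, "destroyed")
split_loud = nash_solution(options, u1, u2_loud, "destroyed")
```

Both calls return an even split: rescaling or shifting one player's utility function changes the product by a constant factor, so it cannot move the maximizer.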

The Nash solution differs significantly from the other solutions considered so far.

- Maximize the *product??* Didn't Harsanyi's theorem guarantee we only need to worry about sums?
- This is the first proposal where the additive constants matter. Indeed, now the *multiplicative* constants are the ones that don't matter!
- Why wouldn't *any* utility-normalization approach satisfy those four axioms?

Last question first: how do normalization approaches violate the Nash axioms?

Well, both range normalization and variance normalization violate IIA! If you remove one of the possible outcomes, the normalization may change. This makes the social choice function display inconsistent preferences across different scenarios. (But how bad is that, really?)

As for why we can get away with maximizing the product, rather than the sum:

The Pareto-optimality of Nash's approach guarantees that it *can be seen* as maximizing a linear function of the individual utilities. So Harsanyi's theorem is still satisfied. However, Nash's solution points to a very *specific* outcome, which Harsanyi doesn't do for us.

Imagine you and I are trying to split a dollar. If we can't agree on how to split it, then we'll end up destroying it (ripping it during a desperate attempt to wrestle it from each other's hands, obviously). Thankfully, John Nash is standing by, and we each agree to respect his judgement. No matter which of us claims to value the dollar more, Nash will allocate 50 cents to each of us.

Harsanyi happens to see this exchange, and explains that Nash has chosen a social choice function which normalized our utility functions to be equal to each other. That's the only way Harsanyi can explain the choice made by Nash -- the value of the dollar was precisely tied between you and me, so a 50-50 split was as good as any other outcome. Harsanyi's justification is indeed *consistent* with the observation. But why, then, did Nash choose 50-50 *precisely?* 49-51 would have had exactly the same collective utility, as would 40-60, or any other split!

Hence, Nash's principle is far more useful than Harsanyi's, even though Harsanyi can justify any rational outcome retrospectively.

However, Nash does rely somewhat on that pesky IIA assumption, whose importance is perhaps not so clear. Let's try getting rid of that.

## Kalai–Smorodinsky

Although the Nash bargaining solution is the most famous, there are other proposed solutions to Nash's bargaining problem. I want to mention just one more, Kalai-Smorodinsky (I'll call it KS).

KS throws out IIA as irrelevant. After all, the set of alternatives *will* affect bargaining. Even in the Nash solution, the set of alternatives may have an influence by changing the BATNA! So perhaps this assumption isn't so important.

KS instead adds a *monotonicity* assumption: being in a better position should never make me worse off after bargaining.

Here's an illustration, due to Daniel Demski, of a case where Nash bargaining fails monotonicity:

I'm not that sure monotonicity really should be an axiom, but it does kind of suck to be in an apparently better position and end up worse off for it. Maybe we could relate this to strategy-proofness? A little? Not sure about that.

Let's look at the formula for KS bargaining.

Suppose there are a couple of dollars on the ground: one which you'll walk by first, and one which I'll walk by. If you pick up your dollar, you can keep it. If I pick up my dollar, I can keep mine. But also, if you *don't* pick up yours, then I'll eventually walk by it and can pick it up. So we get the following:

(The box is filled in because we can also use mixed strategies to get values intermediate between any pure strategies.)

Obviously in the real world we just both pick up our dollars. But, let's suppose we bargain about it, just for fun.

The way KS works is, you look at the maximum *one* player can get (you can get $1), and the maximum the *other* player could get (I can get $2). Then, although we can't usually jointly achieve those payoffs (I can't get $2 at the same time as you get $1), KS bargaining insists we achieve the same *ratio* (I should get twice as much as you). In this case, that means I get $1.33, while you get $0.67. We can visualize this as drawing a bounding box around the feasible solutions, and drawing a diagonal line. Here are the Nash and KS solutions side by side:

As in Daniel's illustrations, we can visualize maximizing the product as drawing the largest hyperbola we can that still touches the orange shape. (Orange dotted line.) This suggests that we each get $1; exactly the same solution as Nash would give for splitting $2. (The black dotted line illustrates how we'd continue the feasible region to represent a dollar-splitting game, getting the full triangle rather than a chopped off portion.) Nash doesn't care that one of us can do better than the other; it just looks for the most equal division of funds possible, since that's how we maximize the product.

KS, on the other hand, cares what the max possible is for both of us. It therefore suggests that you give up some of your dollar to me.
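The two solutions can be checked numerically on a grid of whole-cent payoffs. This is only a sketch: I'm assuming the feasible set from the story (you get at most a dollar, and together we get at most two), a disagreement point of (0, 0), and a discrete stand-in for KS that picks the point giving both players the largest common fraction of their best-case gain.

```python
from fractions import Fraction

# Feasible payoffs (yours, mine) in cents.
feasible = [(x, y) for x in range(0, 101) for y in range(0, 201) if x + y <= 200]
d = (0, 0)  # assumed disagreement point: nobody picks anything up

def nash(points, d):
    """Point maximizing the product of gains over d."""
    return max(points, key=lambda p: (p[0] - d[0]) * (p[1] - d[1]))

def kalai_smorodinsky(points, d):
    """Point where both players get the largest common share of their best-case gain."""
    best_you = max(p[0] for p in points) - d[0]
    best_me = max(p[1] for p in points) - d[1]

    def common_share(p):
        return min(Fraction(p[0] - d[0], best_you), Fraction(p[1] - d[1], best_me))

    return max(points, key=common_share)

nash_point = nash(feasible, d)
ks_point = kalai_smorodinsky(feasible, d)
```

On this grid Nash picks a dollar each, while the KS stand-in picks 67 cents for you and 133 for me, matching the 1:2 ratio up to rounding.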

I suspect most readers will *not* find the KS solution to be more intuitively appealing.

Note that the KS monotonicity property does NOT imply the desirable-sounding property "if there are more opportunities for good outcomes, everyone gets more or is at least not worse off." (I mention this mainly because I initially misinterpreted KS's monotonicity property this way.) In my dollar-collecting example, KS bargaining makes you worse off simply because there's an opportunity for me to take your dollar if you don't.

Like Nash bargaining, KS bargaining ignores multiplicative constants on utility functions, and can be seen as normalizing additive constants by treating the BATNA $d$ as (0,0). (Note that, in the illustration, I assumed $d$ is chosen as (minimal achievable for one player, minimal achievable for the other). This need not be the case in general.)

A peculiar aspect of KS bargaining is that it doesn't really give us an obvious quantity to maximize, unlike Nash or Harsanyi. It only describes the optimal point. This seems far less practical, for realistic decision-making.

OK, so, should we use bargaining solutions to compare utilities?

My intuition is that, because of the need to choose the BATNA point, bargaining solutions end up rewarding destructive threats in a disturbing way. For example, suppose that we are playing the dollar-splitting game again, except that I can costlessly destroy $20 of your money, so now the BATNA involves both the destruction of the $1, and the destruction of $20. Nash bargaining now hands the entire dollar to me, because you are "up $20" in that deal, so the fairest possible outcome is to give me the $1. KS bargaining splits things up a little, but I still get most of the dollar.
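The threat example can be worked through in whole cents. As before this is a sketch under assumptions: utility linear in money, a disagreement point where you are down 2000 cents and I am at zero, and the same discrete stand-in for KS (largest common fraction of each player's best-case gain).

```python
from fractions import Fraction

# Payoffs in cents: (yours, mine). Options split one dollar between us.
options = [(x, 100 - x) for x in range(0, 101)]
d = (-2000, 0)   # disagreement: the dollar is destroyed and you also lose $20

def gains(p):
    """Each player's improvement over the disagreement point."""
    return (p[0] - d[0], p[1] - d[1])

# Nash: maximize the product of gains.
nash_pick = max(options, key=lambda p: gains(p)[0] * gains(p)[1])

# KS stand-in: maximize the smaller of the two normalized gains.
best = (max(gains(p)[0] for p in options), max(gains(p)[1] for p in options))
ks_pick = max(
    options,
    key=lambda p: min(Fraction(gains(p)[0], best[0]), Fraction(gains(p)[1], best[1])),
)
```

Nash gives you nothing and me the whole dollar; the KS stand-in leaves you 4 cents. Nothing about the dollar changed, only the threat.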

If utilitarians were to trade off utilities that way in the real world, it would benefit powerful people, especially those willing to exploit their power to make credible threats. If X can take everything away from Y, then Nash bargaining sees everything Y has as already counting toward "gains from trade".

As I mentioned before, sometimes people try to define BATNAs in a way which excludes these kinds of threats. However, I see this as ripe for strategic utility-spoofing (IE, lying about your preferences, or self-modifying to have more advantageous preferences).

So, this might favor normalization approaches.

On the other hand, Nash and KS both do way better in the split-the-dollar game than any normalization technique, because they can optimize for fairness of outcome, rather than just fairness of multiplicative constants chosen to compare utility functions with.

Is there any approach which combines the advantages of bargaining and normalization??

# Animals, etc.

An essay on utility comparison would be incomplete without at least mentioning the problem of animals, plants, and so on.

- Option one: some cutoff for "moral patients" is defined, such that a utilitarian only considers preferences of agents who exceed the cutoff.
- Option two: some more continuous notion is selected, such that we care more about some organisms than others.

Option two tends to be more appealing to me, despite the non-egalitarian implications (e.g., if animals differ on this spectrum, then humans could have some variation as well).

As already discussed, bargaining approaches do seem to have this feature: animals would tend to get less consideration, because they've got less "bargaining power" (they can do less harm to humans than humans can do to them). However, this has a distasteful might-makes-right flavor to it.

This also brings to the forefront the question of how we view something as an agent. Something like a plant might have quite deterministic ways of reacting to environmental stimulus. Can we view it as making choices, and thus, as having preferences? Perhaps "to some degree" -- if such a degree could be defined, numerically, it could factor into utility comparisons, giving a formal way of valuing plants and animals *somewhat,* but "not too much".

# Altruistic agents.

Another puzzling case, which I think needs to be handled carefully, is accounting for the preferences of altruistic agents.

Let's proceed with a simplistic model where agents have "personal preferences" (preferences which just have to do with themselves, in some sense) and "*cofrences*" (co-preferences; preferences having to do with other agents).

Here's an agent named Sandy:

**Sandy**

| Personal Preferences | Value | Cofrences | Coefficient |
|---|---|---|---|
| Candy | +.1 | Alice | +.1 |
| Pizza | +.2 | Bob | -.2 |
| Rainbows | +10 | Cathy | +.3 |
| Kittens | -20 | Dennis | +.4 |

The cofrences represent coefficients on other agents' utility functions. Sandy's preferences are supposed to be understood as a utility function representing Sandy's *personal* preferences, plus a weighted sum of the utility functions of Alice, Bob, Cathy, and Dennis. (Note that the weights can, hypothetically, be negative -- for example, screw Bob.)

The first problem is that utility functions are not comparable, so we have to say more before we can understand what "weighted sum" is supposed to mean. But suppose we've chosen some utility normalization technique. There are still other problems.

Notice that we can't totally define Sandy's utility function until we've defined Alice's, Bob's, Cathy's, and Dennis'. But any of those four might have cofrences which involve Sandy, as well!

Suppose we have Avery and Briar, two lovers who "only care about each other" -- their only preference is a cofrence, which places 1.0 value on the other's utility function. We could ascribe *any values at all* to them, so long as they're both the same!

With some technical assumptions (something along the lines of: your cofrences always sum to less than 1), we can ensure a unique fixed point, eliminating any ambiguity from the interpretation of cofrences. However, I'm skeptical of just taking the fixed point here.

Suppose we have five siblings: Primus, Secundus, Tertius, Quartus, et Quintus. All of them value each other at .1, except Primus, who values all siblings at .2.

If we simply take the fixed point, Primus is going to get the short end of the stick all the time: because Primus cares about everyone else more, everyone else cares about Primus' personal preferences *less* than anyone else's.

Simply put, I don't think more altruistic individuals should be punished! In this setup, the "utility monster" is the perfectly selfish individual. Altruists will be scrambling to help this person while the selfish person does nothing in return.

A different way to do things is to interpret cofrences as *integrating only the personal preferences of the other person.* So Sandy wants to help Alice, Cathy, and Dennis (and harm Bob), but does *not* automatically extend that to wanting to help any of their friends (or harm Bob's friends).

This is a little weird, but gives us a more intuitive outcome in the case of the five siblings: Primus will more often be voluntarily helpful to the other siblings, but the other siblings won't be prejudiced *against* the personal preferences of Primus when weighing between their various siblings.
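The five-siblings comparison can be checked numerically. The sketch below assumes some utility normalization has already been chosen, writes each sibling's overall utility as a weighted sum of everyone's *personal* preferences, and computes those weights under both readings of cofrences. The matrix layout and function names are my own framing of the setup.

```python
names = ["Primus", "Secundus", "Tertius", "Quartus", "Quintus"]
n = len(names)

# C[i][j] is the cofrence coefficient sibling i places on sibling j.
# Primus (index 0) values each sibling at .2; everyone else uses .1.
C = [[0.0 if i == j else (0.2 if i == 0 else 0.1) for j in range(n)] for i in range(n)]

def fixed_point_weights(C, iterations=200):
    """Weights W with U_i = sum_j W[i][j] * P_j, where U solves U = P + C U.

    Iterates W <- I + C W, which converges when every row of C sums to less than 1.
    """
    size = len(C)
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    W = [row[:] for row in identity]
    for _ in range(iterations):
        W = [[identity[i][j] + sum(C[i][k] * W[k][j] for k in range(size))
              for j in range(size)] for i in range(size)]
    return W

def personal_only_weights(C):
    """Alternative reading: U_i = P_i + sum_j C[i][j] * P_j."""
    size = len(C)
    return [[(1.0 if i == j else 0.0) + C[i][j] for j in range(size)] for i in range(size)]

W = fixed_point_weights(C)
V = personal_only_weights(C)

# Secundus's weight on Primus's personal preferences vs. on Tertius's.
fixed_point_gap = W[1][2] - W[1][0]
personal_only_gap = V[1][2] - V[1][0]
```

Under the fixed point, Secundus ends up weighting Primus's personal preferences at about .161 and Tertius's at about .176, so the gap is positive and Primus loses out. Under the personal-only reading the two weights are equal.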

I realize altruism isn't *exactly* supposed to be like a bargain struck between selfish agents. But if I think of utilitarianism like a coalition of all agents, then I don't want it to punish (the selfish component of) the most altruistic members. It seems like utilitarianism should have better incentives than that?

(Try to take this section as more of a problem statement and less of a solution. Note that the concept of *cofrence* can include, more generally, preferences such as "I want to be better off than other people" or "I don't want my utility to be too different from other people's in either direction".)

# Utility monsters.

Returning to some of the points I raised in the "non-obvious consequences" section -- now we can see how "utility monsters" are/aren't a concern.

On my analysis, a utility monster is just an agent who, according to your metric for comparing utility functions, has a very large influence on the social choice function.

This might be a bug, in which case you should reconsider how you are comparing utilities. But, since you've hopefully chosen your approach carefully, it could also not be a bug. In that case, you'd want to bite the bullet fully, defending the claim that such an agent should receive "disproportionate" consideration. Presumably this claim could be backed up, on the strength of your argument for the utility-comparison approach.

# Average utilitarianism vs total utilitarianism.

Now that we have given some options for utility comparison, can we use them to make sense of the distinction between average utilitarianism and total utilitarianism?

No. Utility comparison doesn't really help us there.

The average vs total debate is a debate about population ethics. Harsanyi's utilitarianism theorem and related approaches let us think about altruistic policies for a fixed set of agents. They don't tell us how to think about a set which changes over time, as new agents come into existence.

Allowing the set to vary over time like this feels similar to allowing a single agent to change its utility function. There is no rule against this. An agent can prefer to have different preferences than it does. A collective of agents can prefer to extend its altruism to new agents who come into existence.

However, I see no reason why population ethics needs to be *simple*. We can have relatively complex preferences here. So, I don't find paradoxes such as the Repugnant Conclusion to be especially concerning. To me there's just this complicated question about what everyone collectively wants for the future.

"Average vs total?" shouldn't be one of the basic questions about utilitarianism. To me, this is a type error. It seems to me that more basic questions for a (preference) utilitarian are:

- How do you combine individual preferences into a collective utility function?
- How do you compare utilities between people (and animals, etc)?
  - Do you care about an "objective" solution to this, or do you see it as a subjective aspect of altruistic preferences, which can be set in an unprincipled way?
  - Do you range-normalize?
  - Do you variance-normalize?
  - Do you care about strategy-proofness?
  - How do you evaluate the bargaining framing? Is it relevant, or irrelevant?
  - Do you care about Nash's axioms?
  - Do you care about monotonicity?
  - What distinguishes humans from animals and plants, and how do you use it in utility comparison? Intelligence? Agenticness? Power? Bargaining position?
- How do you handle cofrences?

\*: Agents need not have a concept of outcome, in which case they don't really have a utility function (because utility functions are functions *of outcomes*). However, this does not significantly impact any of the points made in this post.

I'm not sure why you think this is a problem. Supposing you want to satisfy the group's preferences as much as possible, shouldn't you care about Primus less since Primus will be more satisfied just from you helping the others? I agree that this can create perverse incentives in practice, but that seems like the sort of thing that you should be handling as part of your decision theory, not your utility function.

I feel like the solution of having cofrences not count the other person's cofrences just doesn't respect people's preferences—when I care about the preferences of somebody else, that includes caring about the preferences of the people they care about. It seems like the natural solution to this problem is to just cut things off when you go in a loop—but that's exactly what taking the fixed point does, which seems to reinforce the fixed point as the right answer here.
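One hedged way to read "taking the fixed point" is as solving a small self-referential system. The sketch below evaluates a single outcome and assumes each agent's total utility is its own base utility plus cofrence-weighted copies of everyone's total utility. The linear form, the function name, and the example coefficients are all illustrative assumptions.

```python
def cofrence_fixed_point(base, coeff, tol=1e-12, max_iter=10_000):
    """Iterate total[i] = base[i] + sum_j coeff[i][j] * total[j] to convergence.

    base:  each agent's utility for the outcome, ignoring cofrences.
    coeff: coeff[i][j] is how much agent i cares about agent j's total utility.
    Assumption: coefficients are small enough for the iteration to converge.
    """
    n = len(base)
    total = list(base)
    for _ in range(max_iter):
        updated = [
            base[i] + sum(coeff[i][j] * total[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, total)) < tol:
            return updated
        total = updated
    raise RuntimeError("did not converge; coefficients may be too large")


# Two hypothetical agents who each weight the other's total utility at 0.5.
# Only the first agent directly benefits from the outcome.
totals = cofrence_fixed_point(base=[1.0, 0.0], coeff=[[0.0, 0.5], [0.5, 0.0]])
print(totals)  # approximately [4/3, 2/3]
```

The loop of mutual caring does not blow up here. It settles at a finite value, which is the sense in which the fixed point "cuts things off".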

I'm mainly worried about the perverse incentives part.

I recognize that there's some weird level-crossing going on here, where I'm doing something like mixing up the decision theory and the utility function. But it seems to me like that's just a reflection of the weird muddy place our values come from?

You can think of humans a little like self-modifying AIs, but where the modification took place over evolutionary history. The utility function which we eventually arrived at was (sort of) the result of a bargaining process between everyone, and which took some accounting of things like exploitability concerns.

In terms of decision theory, I often think in terms of a generalized NicerBot: extend everyone else the same cofrence-coefficient they extend to you, plus an epsilon (to ensure that two generalized NicerBots end up fully cooperating with each other). This is a pretty decent strategy for any game, generalizing from one of the best strategies for Prisoner's Dilemma. (Of course there is no "best strategy" in an objective sense.)
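A toy version of the generalized NicerBot rule can be sketched as follows. The cap at 1.0 (standing in for "full cooperation"), the simultaneous update, and the value of epsilon are assumptions made for illustration.

```python
def nicerbot_response(their_coefficient, epsilon=0.1, cap=1.0):
    """Extend the other agent's cofrence-coefficient plus epsilon, up to a cap."""
    return min(cap, their_coefficient + epsilon)


def play(rounds, start_a=0.0, start_b=0.0, epsilon=0.1):
    """Two generalized NicerBots repeatedly respond to each other."""
    a, b = start_a, start_b
    for _ in range(rounds):
        # Both bots update simultaneously, based on the other's last coefficient.
        a, b = nicerbot_response(b, epsilon), nicerbot_response(a, epsilon)
    return a, b


print(play(rounds=50))              # both bots ratchet up to the cap
print(nicerbot_response(0.0))       # against a pure defector: only epsilon
```

Two such bots ratchet each other up to the cap, while an agent who extends nothing gets only epsilon in return.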

But a decision theory like that *does* mix levels between the decision theory and the utility function!

I totally agree with this point; I just don't know how to balance it against the other point.

A crux for me is the coalition metaphor for utilitarianism. I think of utilitarianism as sort of a natural endpoint of forming beneficial coalitions, where you've built a coalition of all life.

If we imagine forming a coalition incrementally, and imagine that the coalition simply averages utility functions with its new members, then there's an incentive to join the coalition as late as you can, so that your preferences get the largest possible representation. (I know this isn't the *same* problem we're talking about, but I see it as analogous, and so a point in favor of worrying about this *sort* of thing.)

We can correct that by doing 1/n averaging: every time the coalition gains members, we make a fresh average of all member utility functions (using some utility-function normalization, of course), and everybody voluntarily self-modifies to have the new mixed utility function.
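The difference between the two averaging schemes can be seen by tracking how much weight each member's original utility function ends up with. This is a minimal sketch that assumes "simply averages" means a 50/50 mix between the coalition's current utility function and the newcomer's. The function names are hypothetical.

```python
from fractions import Fraction


def incremental_weights(n):
    """Final weights if the coalition mixes 50/50 with each new member in turn."""
    weights = [Fraction(1)]
    for _ in range(n - 1):
        # Existing members are diluted by half; the newcomer gets the other half.
        weights = [w / 2 for w in weights] + [Fraction(1, 2)]
    return weights


def equal_weights(n):
    """Final weights under 1/n averaging: a fresh average every time."""
    return [Fraction(1, n)] * n


print(incremental_weights(4))  # weights 1/8, 1/8, 1/4, 1/2 in joining order
print(equal_weights(4))        # weight 1/4 for every member
```

Under incremental averaging the last member to join holds half of the total weight, which is the incentive to join late. Under 1/n averaging the joining order no longer matters.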

But the problem with this is, we end up punishing agents for self-modifying to care about us *before* joining. (This is more closely analogous to the problem we're discussing.) If they've already self-modified to care about us more before joining, then their original values just get washed out even more when we re-average everyone.

So really, the implicit assumption I'm making is that there's an agent "before" altruism, who "chose" to add in everyone's utility functions. I'm trying to set up the rules to be fair to *that* agent, in an effort to reward agents for making "the altruistic leap".

I agree, though it's unclear whether that's an actual level crossing or just a failure of our ability to be able to properly analyze that strategy. I would lean towards the latter, though I am uncertain.

This is how I think about preference utilitarianism but not how I think about hedonic utilitarianism—for example, a lot of what I value personally is hedonic-utilitarianism-like, but from a social perspective, I think preference utilitarianism is a good Schelling point for something we can jointly agree on. However, I don't call myself a preference utilitarian—rather, I call myself a hedonic utilitarian—because I think of social Schelling points and my own personal values as pretty distinct objects. And I could certainly imagine someone who terminally valued preference utilitarianism from a personal perspective—which is what I would call actually being a preference utilitarian.

Furthermore, I think that if you're actually a preference utilitarian vs. if you just think preference utilitarianism is a good Schelling point, then there are lots of cases where you'll do different things. For example, if you're just thinking about preference utilitarianism as a useful Schelling point, then you want to carefully consider the incentives that it creates—such as the one that you're pointing to—but if you terminally value preference utilitarianism, then that seems like a weird thing to be thinking about, since the question you should be thinking about in that context should be more like what is it about preferences that you actually value and why.

One thing I will say here is that usually when I think about socially agreeing on a preference utilitarian coalition, I think about doing so from more of a CEV standpoint, where the idea isn't just to integrate the preferences of agents as they currently are, but as they will/should be from a CEV perspective. In that context, it doesn't really make sense to think about incremental coalition forming, because your CEV (mostly, with some exceptions) should be the same regardless of what point in time you join the coalition.

I guess this just seems like the correct outcome to me. If you care about the values of the coalition, then the coalition should care less about your preferences, because they can partially satisfy them just by doing what the other people in the coalition want.

It certainly makes sense to reward agents for choosing to instrumentally value the coalition—and I would include instrumentally choosing to self-modify yourself to care more about the coalition in that—but I'm not sure why it makes sense to reward agents for terminally valuing the coalition—that is, terminally valuing the coalition independently of any decision theoretic considerations that might cause you to instrumentally modify yourself to do so.

Again, I think this makes more sense from a CEV perspective—if you instrumentally modify yourself to care about the coalition for decision-theoretic reasons, that might change your values, but I don't think that it should change your CEV. In my view, your CEV should be about your general strategy for how to self-modify yourself in different situations rather than the particular incarnation of you that you've currently modified to.

The problem in your example is that you failed to identify a reasonable disagreement point. In the situation you described (1,1) is the disagreement point since every agent can guarantee emself a payoff of 1 unilaterally, so the KS solution is also (1,1) (since the disagreement point is already on the Pareto frontier).

In general it is not that obvious what the disagreement point should be, but maximin payoffs is one natural choice. Nash equilibrium is the obvious alternative, but it's not clear what to do if we have several.
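As a small illustration of using maximin payoffs as the disagreement point, the sketch below computes each player's pure-strategy security level in a two-player game. The payoff table is a made-up example rather than the game from the post, and restricting to pure strategies is a simplifying assumption (mixed strategies can guarantee more in some games).

```python
def maximin_payoffs(payoffs, row_actions, col_actions):
    """Pure-strategy maximin payoff for each of two players.

    payoffs[(r, c)] is a (row_payoff, col_payoff) pair.
    """
    row_value = max(
        min(payoffs[(r, c)][0] for c in col_actions) for r in row_actions
    )
    col_value = max(
        min(payoffs[(r, c)][1] for r in row_actions) for c in col_actions
    )
    return row_value, col_value


# Hypothetical game: "safe" guarantees a payoff of 1 whatever the other does.
actions = ["safe", "risky"]
payoffs = {
    ("safe", "safe"): (1, 1),
    ("safe", "risky"): (1, 0),
    ("risky", "safe"): (0, 1),
    ("risky", "risky"): (2, 0.5),
}

print(maximin_payoffs(payoffs, actions, actions))  # (1, 1)
```

Here each player can unilaterally secure a payoff of 1, so (1, 1) is the disagreement point that a bargaining solution would then be measured against.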

For applications such as voting and multi-user AI alignment that's less natural since, even if we know the utility functions, it's not clear which action spaces we should consider. In that case a possible choice of disagreement point is maximizing the utility of a randomly chosen participant. If the problem can be formulated as partitioning resources, then the uniform partition is another natural choice.
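The "randomly chosen participant" disagreement point can be sketched as follows, assuming a finite set of options and a uniformly random choice of participant. The names and numbers are hypothetical.

```python
def random_dictator_point(utilities):
    """Expected utility of each participant when a uniformly random
    participant gets to pick their favourite option.

    utilities[name][option] is that participant's utility for the option.
    Ties between favourite options are broken by dictionary order.
    """
    favourites = {name: max(u, key=u.get) for name, u in utilities.items()}
    n = len(utilities)
    return {
        name: sum(u[favourites[dictator]] for dictator in favourites) / n
        for name, u in utilities.items()
    }


# Two hypothetical participants with opposed favourites and a compromise "z".
utilities = {
    "alice": {"x": 1.0, "y": 0.0, "z": 0.6},
    "bob": {"x": 0.0, "y": 1.0, "z": 0.6},
}

print(random_dictator_point(utilities))  # {'alice': 0.5, 'bob': 0.5}
```

In this example the compromise option gives both participants 0.6, so there is something to gain over the disagreement point of 0.5 each.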

Ahh, yeahh, that's a good point.

This jumps from mathematical consistency to a kind of opinion when Pareto improvement enters the picture. Sure, if we have a choice between two social policies and everyone prefers one over the other because their personal lot is better, there is no conflict on the order. This could be warranted if for some reason we needed consensus to get a "thing passed". However, where there is true conflict, it seems to say that a "good" social policy can't be formed.

To be somewhat analogous with "utility monster", construct a "consensus spoiler". He exactly prefers what everyone anti-prefers, having a cofrence of -1 for everyone. If someone would gain something, he is of the opinion that he loses. So no Pareto improvements are possible. If you have a community of 100 agents that would agree to pick some states over others, and you construct a new community of 101 by adding the consensus spoiler, then they can't form any choice function. The consensus spoiler is in effect maximally antagonistic towards everything else. The question of whether it is warranted, allowed, or forbidden for the coalition of 100 to just proceed with the policy choice that screws the spoiler over doesn't seem to be a mathematical kind of claim.

And even in less extreme cases, I don't get how you could use this setup to judge values that are in conflict. And if you encounter an unknown agent, it seems ambiguous whether you should take heed of its values in a compromise or treat it as a possible enemy and just adhere to your personal choices.

Yeah, I like your "consensus spoiler". Maybe needs a better name, though... "Contrarian Monster"?

This way of defining the Consensus Spoiler seems needlessly assumption-heavy, since it assumes not only that we can already compare utilities in order to define this perfect antagonism, but furthermore that we've decided how to deal with cofrences.

A similar option with a little less baggage is to define it as having the opposite of the preferences of our social choice function. They just hate whatever we end up choosing to represent the group's preferences.

A simpler option is just to define the Contrarian Monster as having opposite preferences from one particular member of the collective. (Any member will do.) This ensures that there can be no Pareto improvements.
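The simpler Contrarian Monster construction can be checked directly on a toy example. The sketch below uses the usual definition of a Pareto improvement (nobody worse off, somebody strictly better off) and assumes the member being mirrored is never indifferent between two options. If that member were indifferent, improvements for the others could still slip through. The utilities are made up.

```python
from itertools import permutations


def is_pareto_improvement(utilities, old, new):
    """True if nobody is worse off at `new` and somebody is strictly better off."""
    gains = [u[new] - u[old] for u in utilities]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


def pareto_improvements(utilities, options):
    """All ordered pairs (old, new) where moving to `new` is a Pareto improvement."""
    return [
        (old, new)
        for old, new in permutations(options, 2)
        if is_pareto_improvement(utilities, old, new)
    ]


options = ["x", "y", "z"]
collective = [
    {"x": 0, "y": 1, "z": 2},
    {"x": 0, "y": 2, "z": 3},
]
# The monster has exactly the opposite preferences of one member.
monster = {o: -u for o, u in collective[0].items()}

print(pareto_improvements(collective, options))              # three improvements
print(pareto_improvements(collective + [monster], options))  # []
```

With the monster included, every option is Pareto optimal, so the Pareto criterion no longer rules anything out.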

Actually, the conclusion is that you can form *any* social choice function. *Everything* is "Pareto optimal".

If we think of it as bargaining to form a coalition, then there's never any reason to include the Spoiler in a coalition (especially if you use the "opposite of whatever the coalition wants" version). In fact, there is a version of Harsanyi's theorem which allows for negative weights, to allow for this -- giving an ingroup/outgroup sort of thing. Usually this isn't considered very seriously for definitions of utilitarianism. But it could be necessary in extreme cases.

(Although putting zero weight on it seems sufficient, really.)

Pareto-optimality doesn't really give you the tools to mediate conflicts, it's just an extremely weak condition on how you do so, which says essentially that we shouldn't put negative weight on anyone.

Granted, the Consensus Spoiler is an argument that Pareto-optimality may not be *weak enough*, in extreme situations.

Oh no! The two images starting from this point are broken for me:

How about now?

Weird, given that they still look fine for me!

I'll try to fix...

Yep, fixed. Thank you!

Judging from the URL of those links, those images were hosted on a domain that you could access, but others could not, namely they were stored as Gmail image attachments, to which of course you as the recipient have access, but random LessWrong users do not.

Type theory for utility hypothesis: there are a certain distinct (small) number of pathways in the body that cause physical good feelings. Map those plus the location, duration, intensity, and frequency dimensions and you start to have comparability. This doesn't solve the motivation/meaning structures built on top of those pathways which have more degrees of freedom, but it's still a start. Also, those more complicated things built on top might just be scalar weightings and not change the dimensionality of the space.

Yeah, it seems like in practice humans should be a lot more comparable than theoretical agentic entities like I discuss in the post.

Planned summary for the Alignment Newsletter:

One thing I'd also ask about is: what about ecology / iterated games? I'm not very sure at all whether there are relevant iterated games here, so I'm curious what you think.

How about an ecology where there are both people and communities - the communities have different aggregation rules, and the people can join different communities. There's some set of options that are chosen by the communities, but it's the people who actually care about what option gets chosen and choose how to move between communities based on what happens with the options - the communities just choose their aggregation rule to get lots of people to join them.

How can we set up this game so that interesting behavior emerges? Well, people shouldn't just seek out the community that most closely matches their own preferences, because then everyone would fracture into communities of size 1. Instead, there must be some benefit to being in a community. I have two ideas about this: one is that the people could care to some extent about what happens in all communities, so they will join a community if they think they can shift its preferences on the important things while conceding the unimportant things. Another is that there could be some crude advantage to being in a community that looks like a scaling term (monotonically increasing with community size) on how effective they are at satisfying their people's preferences.