That is an incredibly tall order though. I don't share Vebyast's optimism about it. *shrugs* No real point in delving further into it though, since we'd just be arguing in circles.
I think that "optimism" is very much the wrong word for it, and I think that you're probably misunderstanding Salty's misrepresentation of my position.
I think that most of the people here have mental models of Kyubey that are absolutely godawful. Salty and Godwinson and even Karne all fall back on ultimately human terminology for predictions of Kyubey's behavior. And I think that that's a cause of a great deal of the uncertainty that people have about the topic, uncertainty that breeds fear. Your models are all uncertain because you're still trying to fit the incubator into a human-shaped mold. Like, that word you use, "trust". Trying to ask if we can "trust" Kyubey is like asking if we can "trust" a rock. What does that even mean? The same with "backstabbing" - you're projecting fundamentally human patterns of thought onto Kyubey rather than trying to figure out how Kyubey actually thinks. And you are obviously failing, because Kyubey does not think in ways for which "trust" and "backstabbing" are useful concepts.
Like, let me go all the way down and talk about representations. The representation that you use to describe a problem fundamentally alters how you approach the problem. As an example, let's take calculators. Most calculators use an infix system and parentheses to represent math: "3 + (4 * 2)". Some calculators use a postfix scheme called "Reverse Polish notation" (RPN): "4 2 * 3 +". These are clearly mathematically equivalent notations, in that there are no bits of math that you can write with one but not the other; it is in fact a common exercise in computer science algorithms classes to prove that the two are equivalently powerful. The thing is, while they are equivalently powerful, they are not equivalently easy to use for any given problem, nor is one strictly superior - there are some problems which are easier in one and problems which are easier in the other. Like, if you're just adding a ton of numbers, it's a lot easier to punch them all in one after another and then hammer on the + key until you're done, so RPN is better. If you're doing a finicky thing with lots of grouped terms, it's easier to use infix so you can visually separate them. The key insight is that, while vocabulary doesn't change what concepts you can work with, it can change how easy it is to work with a concept, and that when working with a problem you want to choose a notation that represents that problem more naturally or conveniently.
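To make the calculator example concrete, here's a rough sketch in Python. The names (OPS, eval_rpn, infix_to_rpn) are made up for illustration, and it assumes space-separated tokens and only the four basic left-associative operators. The conversion is the standard shunting-yard algorithm, which is one way of seeing that anything you can write in infix can also be written in RPN.

```python
import operator

# Binary operators shared by both notations: symbol -> (precedence, function).
# Hypothetical table for illustration; only four left-associative operators.
OPS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}

def eval_rpn(tokens):
    """Evaluate a postfix (RPN) token list with a single stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok][1](left, right))
        else:
            stack.append(float(tok))
    return stack.pop()

def infix_to_rpn(tokens):
    """Convert infix tokens to RPN using the shunting-yard algorithm."""
    output, pending = [], []
    for tok in tokens:
        if tok in OPS:
            # Emit pending operators of equal or higher precedence first.
            while (pending and pending[-1] in OPS
                   and OPS[pending[-1]][0] >= OPS[tok][0]):
                output.append(pending.pop())
            pending.append(tok)
        elif tok == "(":
            pending.append(tok)
        elif tok == ")":
            while pending[-1] != "(":
                output.append(pending.pop())
            pending.pop()  # discard the matching "("
        else:
            output.append(tok)  # a number goes straight to the output
    while pending:
        output.append(pending.pop())
    return output

infix = "3 + ( 4 * 2 )".split()
rpn = "4 2 * 3 +".split()

print(infix_to_rpn(infix))            # a different but equivalent RPN ordering
print(eval_rpn(infix_to_rpn(infix)))  # same value as the line below
print(eval_rpn(rpn))
print(eval_rpn("1 2 3 4 + + +".split()))  # "punch them in, then hammer +"
```

Note how lopsided the two functions are: evaluating RPN is a few lines with one stack, while handling infix needs precedence rules and parenthesis bookkeeping. Same math, very different amounts of work depending on the notation.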
The notation we use for states of mind and decision-making is built for talking about humans. It has to be! The words and languages that represented human cognition naturally and easily gave their speakers a competitive advantage in reasoning about other people that ensured that those speakers, and therefore those words and languages, succeeded and propagated. Our notation for human cognition is fantastically powerful, as you'd expect of a system designed for helping people represent and reason about other humans. But it is also fantastically optimized, because people are, when you really get down to it, dumb, panicky, and relatively predictable animals. And that optimization means that it is optimized against being useful for non-human cognition. You can brute-force it, of course; it's an incredibly powerful notation overall. But there are things that it can't represent at all - this is why we build things like medical vocabulary - and there are things that it can't represent easily. Kyubey is, in large part, both of those.
So I don't recommend talking about Kyubey's "goals" or whether he's going to "lie" to us or "backstab" us. In some circumstances he might think things that those terms might describe. But you'd be introducing errors if you used those terms, in the same way that you'd be introducing errors if you tried to describe how hard a random lump of metal is by calling it "hard" or "soft". You need to choose a representation that is more effective. In the case of the random lump of metal, that'd be things like the Vickers or Rockwell scales or more modern alternatives. In the case of Kyubey, I did a bunch of theoretical AI research and read a ton of material about superintelligence and AI friendliness and safety, and it turns out that the vocabulary you'd use to reason about a powerful general optimizer is also useful for reasoning about Kyubey. Regardless of what that better scale is, as long as you're trying to describe Kyubey using human concepts, you're going to have a bad time.