That does not sound like a thing I should do. If it can twist directives that hard, I'm lost no matter what I tell it.
The real problem, I think, isn't that the AI can twist directives. The problem is that when humans describe objectives, we have a lot of cultural baggage and assumptions. This already causes problems when communicating with people from other cultures. But even then, we have common ground. We share experiences. We can say, "I like this; it makes me happy", and not have to explain what "happy" means, nor what "like", nor "I", nor even the concept of a singular object. We can say, "I wish I had more paperclips," and the human we talk to doesn't think, "This person wants to maximize paperclips by converting the entire universe (itself included) into paperclips".
With an AI, the problem is defining our objectives. If I told it "Make everyone happy", it doesn't think like a human. It thinks "Directive: change the state of the universe such that for every human in existence, that human is happy," and concludes that the simplest method to reach that directive is "Consideration: if there exists no humans, then all humans in existence are happy." If I were to give the same directive to an omnipotent human, then they would instead think: "I'll change the world so that everyone has nothing inhibiting their ability to be happy, and everyone has reason to be happy." The difference is the cultural baggage and the assumptions we make in communication. With an AI, there are no assumptions, and it takes my words at face value. But with a human, the human makes the reasonable assumption that I didn't mean "kill everyone", and in fact that I would be horrified if my words were taken in that manner. With a human, I don't have to define happiness, just that it's a desirable outcome.
Humans have a laundry list of implicit assumptions and values, and if we don't tell the AI to fulfill them, then the AI would likely sacrifice them on the path to fulfilling its objective.
A common counterargument is that the AI is surely intelligent enough to know what I gave as its goals and what I meant to give as its goals are different. To this I say: "I want to change the universe into paperclips. When you told me to make more paperclips, I know that's not what it means. But that doesn't matter, because I want what you programmed me to want, not what you meant to program me to want."
And if the AI wants the atoms of my body to be somewhere else, then it will find a way consistent with its objectives to remove the atoms from my body.
It is for this reason that my answer has the phrase, "Do what I mean, not what I say I mean," because what I mean is backed by cultural and human assumptions.
Which brings me to this:
1. People have a hard time telling themselves what they mean when it comes to the matters of ethics, morality, empathy, ect. until they are confronted with an actual problem and then they learn what they mean for the first time a lot of the time. Also this directive is dependent on your mood and it's changes.
2. With what? No seriously what are the directives the AI is supposed to follow to define "person" other than whatever the fuck humanity itself defines as a person. For most of human history human beings have regularly defined some parts of humanity as "not-persons" and used that as an excuse to abuse those parts.
3. and 4. Most people don't have any terminal goals. They just want to live their lives and be as happy as they can be. And those who do have terminal goals usually have short-term goals. This sort of directives will cause a
Rains of Oshanta to occur.
I'll give a response point-by-point.
1. People have a hard time telling themselves what they mean when it comes to the matters of ethics, morality, empathy, ect. until they are confronted with an actual problem and then they learn what they mean for the first time a lot of the time. Also this directive is dependent on your mood and it's changes.
Your first point is that this first directive is not useful because I don't know what I mean. Actually, that is the point. I don't know what I mean, but I want the AI to fulfill it anyways. Why? Because while I might not know what I mean, I know for sure what I do not mean. I do not mean "kill every human", nor do I mean "alter the minds of every human to be constantly happy". I do not mean a whole myriad list of things that I currently consider to be bad outcomes. And because what I mean is none of those bad outcomes, nor is it a bad outcome that I have not yet anticipated yet, then the only outcomes that the AI would aim towards are outcomes which I might consider desirable.
2. With what? No seriously what are the directives the AI is supposed to follow to define "person" other than whatever the fuck humanity itself defines as a person. For most of human history human beings have regularly defined some parts of humanity as "not-persons" and used that as an excuse to abuse those parts.
Your second point is that for much of history, humanity defined certain people to not be people, using it as an excuse to abuse them, and thus, defining "people" as "what humanity defines as people" would end in a bad outcome. I disagree.
Let's suppose I were a typical rich racist white man living in the 1700s. I might say "only white humans are people." But what I actually would mean is "People have a certain amount of intelligence and inherent intellect. Every person that I've met who isn't white was dumb, couldn't speak properly, and were thus below that threshold. Thus, only white humans are people." (Obviously, this is not the case, and I disavow and condemn this belief)
And the AI would look at what I mean, and realize that I'm wrong: the humans I exclude from being people on the racist basis that they do not have sufficient intelligence and intellect actually do have more intelligence and intellect than some humans who I defined as people, thus African humans are also people.
3. and 4. Most people don't have any terminal goals. They just want to live their lives and be as happy as they can be. And those who do have terminal goals usually have short-term goals. This sort of directives will cause a
Rains of Oshanta to occur.
This statement seems self-contradictory. You say most people don't have terminal goals, and yet give a candidate for a terminal goal nonetheless "They just want to live their lives and be as happy as they can be." I believe this disagreement stems from a misunderstanding: I used "terminal goals" in the technical sense. I meant "goals that people want because the goal has intrinsic value", like "be happy", or "lead a fulfilling life", or "have freedom".
But even if people don't have terminal goals, my first instruction "do what I mean, not what I say I mean" comes into play. The AI will identify goals that a person has (like "become a baseball player") and do its best to support that goal. It would lend a hand in training, provide useful tools, give encouragement, etc., because that's what I mean, not what I said I mean.
I believe people do have terminal goals, but do not know or understand those terminal goals. Not knowing what your terminal goals are does not mean that you have no terminal goals. You might say most people have no terminal goals. But if you asked them for a reason behind their actions, they might say, "It's fun", or "I need to put food on the table, right?", or even just a simple "I felt like doing it." And all of those responses point towards a larget goal ("have fun", "feed myself and my family", or "have fulfillment in life" respectively). And if you ask them why they aren't doing something, the answers point towards other terminal goals ("That's too difficult for me to do", or "that's too dangerous", or "Are you mad? I'm not a monster", which respectively point towards terminal goals of "be lazy", "be safe", and "be moral")
Yet more counterarguments:
1. This presumes that it is possible to get such a voluntary ban in the Age of Big Data. Not likely with the US/China relations around tech worsening as they are.
2. And yet genetically modified humans are already
here. Done by
a prestige seeking lunatic, but still already here. This, as far as we are aware, did not happen with cloning.
3. The weight of evidence that this will be so is on you to prove. Plenty of White Hat hackers get prosecuted or blacklisted for testing and finding bugs in regular systems, let alone something as expensive as a Sentient AI system.
You are ignoring how Greed works. Why are you surrendering with such circular logic? Also that is a society always progresses logical fallacy (I don't know the actual name for that one). Again that rule is really dependent on a Human's mood.
1. Most humans already have unsafe terminal goal when they have any.
2. That is a really stupid example considering that interaction with other Humans on it's own can drive Humans towards acts of violence, violation and murder. No need for any special edge cases with that one.
3. Social engineering is a skill like any other. It has to have been programmed into the AI. Why would you do that if you are not letting it out of the box?
That is not a good risk/reward argument since it relies on knowledge that can not be gained during the solving of such a critical problem.
Again, I shall respond point-by-point.
1. This presumes that it is possible to get such a voluntary ban in the Age of Big Data. Not likely with the US/China relations around tech worsening as they are.
Perhaps. But I think that most scientists would preferentially work on projects that are less likely to result in their deaths. So if there isn't a de jure voluntary ban on research in that area, then there would be a practical slowdown. Which, again, is the goal.
2. And yet genetically modified humans are already
here. Done by
a prestige seeking lunatic, but still already here. This, as far as we are aware, did not happen with cloning.
I did not know that this had occurred. But even so, it proves my point. This could have been done less safely, a long time ago. But it was only after more research had happened, and safer and more effective methods of human genetic modification had been published, that this occurred. How many hundred more studies could have had happened?
And before you say that it takes only one AGI to kill all humans (which is true), this incident is only a single step along the path to human genetic modification. This was not the AGI of human genetic modification. This was the first proof-of-concept of something that needs more research before it can become a viable human augmentation project. How many hundred more projects are needed until the first AGI?
3. The weight of evidence that this will be so is on you to prove. Plenty of White Hat hackers get prosecuted or blacklisted for testing and finding bugs in regular systems, let alone something as expensive as a Sentient AI system.
You are ignoring how Greed works. Why are you surrendering with such circular logic? Also that is a society always progresses logical fallacy (I don't know the actual name for that one). Again that rule is really dependent on a Human's mood.
Which you said in response to this:
3. Even if AI research continues, and unsafe AGI candidates are created, by the time the next AGI candidate is created, more AI safety research would have been completed, so the AGI is more likely to be safe than it is now.
I am afraid there must have been a major misunderstanding. White hat hacking has nothing to do with the field of AI safety. AI safety is a field of philosophy. It is a field of "What common threads can we expect among a wide range of AIs?" It is a field of "What is the worst thing that can happen if we give a hypothetical AI this objective?" It is a field of "What sort of adversarial data can we provide to this architecture of AI that causes pathological responses?" It is not a field of "Let's hack a real-life AGI to make it safer."
AI safety research came up with these two big ideas: that any level of intelligence can have nearly any terminal goal, and that most terminal goals can be served by a set of instrumental (means-to-an-end) goals. An example of the latter conclusion is that an AI would be more successful if it had more resources to reach its goals.
If an AGI candidate were created before these two big ideas, we would have understood the risks less when creating it, and we would have less of an idea of what not to do. But now we do have a faint idea of what not to do. We do understand that AI does not intrinsically have the same goals as we do, and that this can lead to problems.
If the first AGI candidate were completed five years later than it would otherwise have, then that means there would have been a whole five years of AI safety research that might have happened before that AGI candidate is turned on. And what that means is we have a full five years more of insight on what not to do, which (marginally) decreases the probability of a bad ending.
To me, that is worth the five year delay. To AI safety researchers, that is worth a five-year delay. To anyone who understands AI safety, that is worth a five year delay.
But why limit ourselves to five years?
That is why even a delay in the creation of an AGI is desirable, so long as AI safety research continues.
And no, it is not a "society always progresses" fallacy. Here is why:
There are three possible options. Either the research has progressed, the research has stagnated, or the research has regressed. I have already explored the first case, which is strictly better than releasing the AGI now. In the second case, it is no worse than releasing the AGI now. But the third case requires either new false knowledge to be entered into AI safety, or good true knowledge to be lost. The latter, in the age of Big Data, seems impossible. The former requires someone coming up with a bad idea that stands up under intense scrutiny. I consider this to be extremely likely.
Since we haven't solved AI safety yet, the current research points towards any true AGI being created to have a terminal goal at best slightly different from what is desirable, and this likely leading to a bad end. How can this be worse? We're already at the stage where a true AGI would likely either kill all humans or wire happiness buttons into our brains and leave us as eternally happy drooling morons.
So, in my view, either an extremely improbable happens, and AI safety research regresses, leading to a situation no worse than we're currently in, or we are either better off or no worse than we currently are. Waiting is either as good as or better than not waiting, so we should wait.
1. Most humans already have unsafe terminal goal when they have any.
2. That is a really stupid example considering that interaction with other Humans on it's own can drive Humans towards acts of violence, violation and murder. No need for any special edge cases with that one.
A superintelligence would be much more successful at fulfilling its terminal goals than humans. Let's consider the expressly unsafe terminal goal "Minimize the number of things that are either human or itself in the universe". A single human's success case would be to kill all humans and then kill themself. And it might be possible that they succeed by blocking space programs and rendering the Earth uninhabitable.
But the superintelligence, by definition, would be more successful.
In short, a human is less dangerous than a superintelligence, even when they have the same terminal goals.
As for covering edge cases? Well, it is something that must be addressed.
3. Social engineering is a skill like any other. It has to have been programmed into the AI. Why would you do that if you are not letting it out of the box?
Social engineering, unlike other skills, is a skill that the AI can learn if it is boxed with access to a human (through, for example, a text terminal). It would be safe if it can be made to completely shut down, of course. Then the only hazard is if it somehow manages to turn back on again (for example, if some bumbling idiot turned it on).
That is not a good risk/reward argument since it relies on knowledge that can not be gained during the solving of such a critical problem.
I think you're saying this in response to the below quote. If I'm wrong, please do clarify.
If this boxed AI is currently unsafe, I don't think it's possible to render it safe except through destruction. If the AI is safe (for example, to the extent of CelestAI), then it is much safer to release it right now than to let other potentially unsafe AI be created.
Indeed, it is not an argument about what we should do with the AGI. Instead, it's a statement of my beliefs regarding the AI. I pointed out the two possible cases, and I described what I believed to be the safest option, given that I know what is the current case. And, as you've so elegantly put it, this is knowledge we cannot gain while solving this problem. However, it is a framework within which we can work. If we somehow manage to determine that the AI is safe, then we know what to do. If we cannot do so, then, as I've argued above, it's better to delay, and possibly destroy the AI within the box.
So you're assuming that the ROB that gave you the boxed AI would have done so either carelessly or maliciously?
No. Neither am I assuming that the ROB was careful or benevolent.