which is barely applicable to image generators
Well, that statement lasted all of half a day. Actually, it was already incorrect when I wrote it.
GPT-4o (released yesterday) is a large-world-model AI with the native modalities of text (obviously), speech and
images. In other words, it doesn't* need DALL-E to create or recognise images; it makes pictures the same way it makes text. What this means, and what they've demonstrated, is that it can...
- Go from a long-winded text description of a character to a picture of that character. "Long-winded description" could also mean "ten chapters of a book."
- Use other pictures as reference inputs.
- Style transfer. "Like this picture, but cartoon."
- Character reference inputs, of course. "This character, but sitting at a table talking with this other character."
- Generalised edits. "Make the vase blue."
- Maintain perfect consistency, up to and including generating GIFs that orbit some imagined object.
Basically everything I thought GPT-4-Turbo might have, but didn't. This time the capabilities are demonstrated; you can see them in their demo.
All of this is in-context learning, and its context window is at least 128k tokens. At somewhere between a few hundred and a few thousand tokens per image, you can't fit an entire manga in the context window... but you certainly can fit references for all the characters, locations, and any other elements that need to remain consistent, and with a little programming it would be almost trivial to build a manga-generator on this.
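To make that arithmetic concrete, here's a rough sketch of the token budget. The per-image token costs are assumptions taken from the ranges above, not published figures, and the amount reserved for text is an arbitrary illustration:

```python
# Back-of-the-envelope budget for in-context image references.
# Assumptions: a 128k-token context window, and a few hundred to a
# few thousand tokens per image (both figures from the post, not
# from any official spec).

CONTEXT_WINDOW = 128_000

def references_that_fit(tokens_per_image: int, reserved_for_text: int = 28_000) -> int:
    """How many reference images fit after reserving some budget for text."""
    return (CONTEXT_WINDOW - reserved_for_text) // tokens_per_image

# Even at the pessimistic end, dozens of character/location references fit.
for cost in (300, 1_000, 3_000):
    print(f"{cost} tokens/image -> {references_that_fit(cost)} references")
```

Even at 3,000 tokens per image, roughly 33 references fit alongside a sizeable text prompt; a full manga's worth of pages would not, which matches the conclusion above.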
I... would have liked for this to be something that could run locally. But as it stands, it looks like I'll be getting much better illustrations for my stories going forwards.
Oh and it's also about 5 times cheaper to run than GPT-4.
*: Though the image-output modality isn't available to anyone yet; they say it'll be rolled out over the next 2-3 weeks. If you ask 4o for an image right now, it'll still use DALL-E, which is high-quality but does almost none of this.