378. Three forms of drawing AI

Style: Romance · Author: CloseAI · Words: 1936 · Update Time: 24/01/11 09:49:09
[Chapter 377 is premium content and will not be released. The first half of 378 was censored and cannot be posted, so I am putting the second half here as a free chapter.]

No matter how outrageous everyone’s views are, it is an indisputable fact that interest in the field of AI is rising with the election.

This popularity reached a new peak after Meng Fanqi announced that he would soon release a real artificial intelligence capable of drawing based on text.

Because nearly half a year earlier, the trial version of Clip that Meng Fanqi released had already demonstrated excellent drawing and multimodal understanding capabilities.

It was so good that everyone assumed the thing had been developed specifically for AI drawing.

Unexpectedly, just by adding the correspondence between images and text, the model quickly and spontaneously acquired such strong image generation capabilities.

And if it was already that amazing half a year ago, what could it do now?

As for the much-anticipated AI drawing, internal research and development was actually not going smoothly, as the slipping release date showed.

Meng Fanqi also hesitated for quite some time as to which route he should choose.

The most famous AI image generators of the previous life were Stable Diffusion, Midjourney, and DALL-E.

Stable Diffusion is a Clip-guided text-to-image model. Its method is to start from pure noise and gradually refine the image, removing noise step by step until the result matches the provided text description.

Its training method had also been refined through many studies: first sample an image and add noise to it step by step over time until the data is unrecognizable, then have the model attempt to restore the image to its original form, learning how to generate images (or other data) in the process.
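The noising-then-restoring loop described above can be sketched in a few lines. This is a toy illustration with DDPM-style schedule values, not CloseAI's actual model: noise is added to a clean sample in closed form, and a denoising network would then be trained to predict that noise.

```python
import numpy as np

# Toy sketch of the diffusion training setup described above (illustrative
# only). Forward process: corrupt a clean sample x0 over T steps until it
# is essentially pure noise; a denoiser would be trained to predict `eps`.

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise schedule (DDPM-style values)
alphas_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

def forward_noise(x0, t, rng):
    """Closed-form jump to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = np.ones(8)                          # stand-in for a "clean image"
xt, eps = forward_noise(x0, T - 1, rng)

# By the final step almost no signal remains: x_t is nearly pure noise.
signal_fraction = float(np.sqrt(alphas_bar[T - 1]))
```

Training would repeat this for random steps `t`, minimizing the error between the model's noise prediction and `eps`; generation then runs the process in reverse, from noise back toward a clean image.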

This route is, as its name suggests, very stable, but generating very high-quality images carries a large computational cost.

Technically it had been achieved, but in terms of cost it did not yet seem suitable to put on the market.

In the previous life, Midjourney was better at various artistic styles, and its generated images often looked strikingly beautiful.

The "Space Opera" that won the gold medal in the painting competition incognito was Midjourney's work.

Logically speaking, this more aesthetic route could not only create a sensational publicity effect but also attract a large number of users, so it should have been the best choice.

However, unlike the open-source diffusion model approach, Midjourney handled user requests through a bot on a public platform.

Because of its closed, for-profit model, Meng Fanqi knew very little about this AI's specific technical details. Not knowing what its core technical secret was, he had to abandon this route.

"If you look at the popularity and popularity of previous lives, the diffusion model and Midjourney will be more stable. However, DALLE has been combined with ChatGPT before I was reborn, and it has great potential. Considering the future development, I need to The two routes are integrated.”

It was precisely this need to combine the strengths of the two that made Meng Fanqi's diffusion drawing AI several months later than expected.

In the end, a relatively mature three-stage system took shape: compression, diffusion, and re-diffusion in latent space.

The experimentation, discussion, and finalization of this overall approach took even longer than the formal training.

"I don't know when something like a quantum computer with an order-of-magnitude improvement in computing performance will be available. If the computing power is fast enough, it can actually save a lot of trouble." Meng Fanqi still felt tired when he thought about this.

The biggest reason why so many modules need to be split is the consumption of computing resources.

Compute grows with the square of image resolution, and the attention operation in the Transformer likewise scales quadratically. To users, 256- and 512-resolution images feel similar, but in total computation the difference is often an order of magnitude.

For this reason, the learning steps of the diffusion model had to be moved into a low-dimensional space for sampling.

To put it bluntly, the resolution is lowered first, greatly reducing the computation in the steps before and after diffusion.
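The resolution arithmetic behind this decision can be checked directly. These are rough illustrative numbers, not the production model's measured costs: a convolution-like denoiser's per-step work grows with the pixel count, and self-attention grows with its square.

```python
# Rough cost arithmetic for diffusing in a downsampled latent space
# (illustrative only; real costs depend on the architecture).

def positions(side: int, downsample: int = 1) -> int:
    """Number of spatial positions at a given side length and downsample factor."""
    return (side // downsample) ** 2

full = positions(512)          # denoise directly at 512x512
latent = positions(512, 8)     # denoise in an 8x-downsampled latent space

conv_speedup = full // latent          # conv-like cost: ~64x fewer positions
attn_speedup = (full // latent) ** 2   # attention cost scales quadratically
```

With a hypothetical 8x downsampling factor, every diffusion step touches 64 times fewer positions, which is exactly the "order of magnitude" gap the text describes between resolutions that look similar to a user.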

"Will this hurt the performance? Will the generated images not be good enough?" CloseAI also raised such concerns internally when it decided to release this version of the diffusion model, which is somewhat emasculated in terms of computing power.

After all, the algorithm can actually do better, although the cost will be higher.

"It's not just a matter of computing time, it's also a matter of video memory. Without this kind of splitting and image resolution, the same card will not only be an order of magnitude slower in computing speed, but also can perform fewer tasks at the same time. Several times." Meng Fanqi insisted on solving the problem of the number of users first, and the performance and effects can be slowly optimized.

It was like an enormously fat man coming in to eat: not only does his meal take several times longer than everyone else's, he also takes up four seats by himself.

In Meng Fanqi's view, the drawing AIs first released before ControlNet was proposed were just toys.

It hardly mattered whether their performance fluctuated up or down, because the success rate of early high-quality generations was low; it often took a great many attempts to pick out one presentable image.

This was mainly because, whether for text-to-image or image-to-image, the early stage lacked a particularly good method of control.

"The specific usage of the diffusion model we are launching now is to use a large amount of text input to control the output of the image. But it is very difficult for text to clearly describe a specific image, even if a large number of attempts are made and a large number of generated, you may not be able to get the results you want.”

"This generation model also needs to use a combination of graphics and text. We also need to find specific ways to control the behavior of the diffusion model by adding additional conditions and tell it what to adjust and what not to adjust. To generate image content Being as controllable as possible is far more important and has a higher priority than making the image look more beautiful and beautiful.”

Meng Fanqi was very aware of the biggest problem with early AI drawing: generating images was like chanting black magic.

To get a satisfactory picture, you might need to chant a hundred keywords.

At that time, many people joked that playing AI drawing was like a cyber cult, mumbling a lot of things that others didn't understand.

There were even people who bundled large collections of high-quality prompt keywords into packages and sold them outright.