"In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. We analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image to each word in the prompt.
With this observation, we propose to control the attention maps of the edited image by injecting the attention maps of the original image throughout the diffusion process."
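The injection idea can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the function name `inject_attention`, the step threshold `tau`, and the list-based "attention maps" are all hypothetical stand-ins for the real cross-attention tensors inside a diffusion model's denoising loop.

```python
def inject_attention(attn_original, attn_edited, step, tau):
    # For the first `tau` denoising steps, override the edited branch's
    # cross-attention with the maps saved from the original generation;
    # after that, let the edited attention evolve on its own.
    return attn_original if step < tau else attn_edited

# Toy 2x2 spatial maps over two prompt tokens (values are arbitrary).
a_orig = [[0.9, 0.1], [0.2, 0.8]]
a_edit = [[0.5, 0.5], [0.6, 0.4]]

early = inject_attention(a_orig, a_edit, step=3, tau=10)   # original maps injected
late = inject_attention(a_orig, a_edit, step=12, tau=10)   # edited maps kept
```

In practice the swap would happen inside each cross-attention layer at every denoising step, so that the edited image inherits the original's spatial layout while the new prompt tokens change its content.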