Yeah, I think the language-description problem is that text-conditioned training (CLIP) is pretty rudimentary when it comes to describing images. If, for example, you describe what you want to a designer, they have a rich contextual understanding of the task that goes far beyond simple verbal descriptions of the content of the image. They understand abstractions like composition, patterns of visual scanning, visual-social cues, and all the rest. They also generally have some degree of explicit instruction in their training to fall back on when making decisions or solving design problems; that is, their knowledge isn't entirely dependent on making inferences from observed data. So sure, you also use language to "describe" what you want to a designer, but the designer has a far richer and more complex world model through which to interpret your description. No doubt this stuff will get better as annotations get stronger and/or as semi-supervised approaches to training improve.
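
To make the "rudimentary" point concrete, here's a minimal sketch of the kind of text-image matching CLIP does, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint (the sample image URL and the prompts are just illustrative, not from anyone's actual pipeline). Every prompt gets scored with the same shallow embedding similarity, whether it describes literal content or a design abstraction like composition or visual flow:

```python
# Illustrative sketch only: assumes `transformers`, `torch`, `Pillow`, and
# `requests` are installed; the image URL and prompts are placeholders.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    stream=True).raw)

prompts = [
    "two cats lying on a couch",                   # literal content description
    "a balanced rule-of-thirds composition",       # design abstraction
    "an image that guides the eye left to right",  # visual-scanning cue
]

# CLIP embeds the image and each prompt, then compares them with a single
# similarity score; it has no separate notion of composition, layout, or intent.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

for prompt, score in zip(prompts, outputs.logits_per_image[0].tolist()):
    print(f"{score:6.2f}  {prompt}")
```

The content prompt tends to score clearly higher, while the composition and scanning prompts are judged by the same flat similarity measure, which is roughly what I mean by the model lacking the richer interpretive layer a designer brings.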