• fiat_lux@kbin.social
    1 year ago

    it utilizes the power of attention mechanisms to weigh the relevance of input data
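    For anyone wondering what "weigh the relevance of input data" means mechanically: it's (roughly) scaled dot-product attention, the standard transformer formulation. A minimal numpy sketch of the general idea, not CM3leon's actual code:

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Weigh each value vector by how relevant its key is to the query."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # relevance score per input position
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights sum to 1
        return weights @ V                              # relevance-weighted mix of the inputs

    # toy example: 2 queries attending over 3 input positions, dimension 4
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
    out = scaled_dot_product_attention(Q, K, V)
    print(out.shape)  # (2, 4)
    ```

    The point is that "relevance" here is just learned dot-product similarity over whatever data was in the training set, which is exactly why the dataset choice below matters so much.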

    By applying a technique called supervised fine-tuning across modalities, Meta was able to significantly boost CM3leon’s performance at image captioning, visual QA, and text-based editing. Despite being trained on just 3 billion text tokens, CM3leon matches or exceeds the results of other models trained on up to 100 billion tokens.

    That’s a very fancy way of saying they deliberately focused it on a small, hand-picked set of information, and that they also heavily tuned the implementation. Isn’t it?

    This sounds like hard-wiring in bias by accident, and I look forward to seeing comparisons with other models on that…

    From the Meta paper:

    “The ethical implications of image data sourcing in the domain of text-to-image generation have been a topic of considerable debate. In this study, we use only licensed images from Shutterstock.
    As a result, we can avoid concerns related to image ownership and attribution, without sacrificing performance.”

    Oh no. That… that was the only ethical concern they considered? They didn’t even do a language accuracy comparison? Data ethics got a whole 3 sentences?

    For all the paper’s self-praise about state-of-the-art accuracy from minimal input, and its insistence that I pronounce “CM3Leon” as “Chameleon” (no), it would have been interesting to see how well it describes people, not just streetlights and pretzels. And how it evaluates text/images generated outside the cultural context of its pre-defined dataset.