Obviously there’s not a lot of love for OpenAI and other corporate API generative AI here, but how does the community feel about self hosted models? Especially stuff like the Linux Foundation’s Open Model Initiative?

I feel like a lot of people just don’t know there are Apache/CC-BY-NC licensed “AI” they can run on sane desktops, right now, that are incredible. I’m thinking of the most recent Command-R, specifically. I can run it on one GPU, and it blows expensive API models away, and it’s mine to use.

And there are efforts to kill the power cost of inference and training with stuff like matrix-multiplication free models, open source and legally licensed datasets, cheap training… and OpenAI and such want to shut down all of this because it breaks their monopoly, where they can just outspend everyone scaling , stealiing data and destroying the planet. And it’s actually a threat to them.

Again, I feel like corporate social media vs fediverse is a good anology, where one is kinda destroying the planet and the other, while still niche, problematic and a WIP, kills a lot of the downsides.

  • brucethemoose@lemmy.worldOP
    link
    fedilink
    arrow-up
    16
    arrow-down
    1
    ·
    edit-2
    15 days ago

    Oh, and if your hardware is AMD or Nvidia, you should really give exllama a shot.

    If it’s Apple, you should investigate kobold.cpp and more “nitty gritty” llama.cpp backends.

    I have largely negative feelings towards ollama for a lot of reasons, but one of them is that it hides a lot of the knobs to get the absolute best out of LLMs, and understand how they work.

    • tkw8
      link
      fedilink
      English
      arrow-up
      6
      ·
      15 days ago

      I’m running Nvidia on Ubuntu. I’ll give exllama a shot.

      • brucethemoose@lemmy.worldOP
        link
        fedilink
        arrow-up
        7
        ·
        edit-2
        15 days ago

        I’d recommend TabbyAPI with your favorite frontend, anything that works with OpenAI.

        Or exui (which is what I tend to use) but is a bit more manual. text-gen-web-ui has better samplers, but its IMO more clanky and crufty, and really slow at long context.

        Also, uh, you’ll have to be careful about picking a model, you have to fit it to your GPU instead of letting ollama do it for you. I view this as a positive, as it forces you to search more a more optimal fit.

        • tkw8
          link
          fedilink
          English
          arrow-up
          5
          ·
          15 days ago

          I manually specify what models to pull. I’m not running anything too crazy. My largest model is gemma27B. But I’ve worked with dolphin-mistral which was fun.

          • brucethemoose@lemmy.worldOP
            link
            fedilink
            arrow-up
            6
            ·
            15 days ago

            If you have a 24GB card, just go straight to the most recent Command R, a 3.75bpw-4bpw quantization. It’s incredible, and you can do the full 131K context on a 24GB GPU easy.

            Gemma 27B Is actually quite good, but “narrow.” Its super low context and seems to be hyper optimized for short chatbot-arena style questions.

            • tkw8
              link
              fedilink
              English
              arrow-up
              4
              ·
              edit-2
              15 days ago

              Gemma 27B Is actually quite good, but “narrow.” Its super low context and seems to be hyper optimized for short chatbot-arena style questions.

              This is the stuff I love to know so thanks for sharing. I will be pulling Command R tomorrow.

              • brucethemoose@lemmy.worldOP
                link
                fedilink
                arrow-up
                3
                ·
                15 days ago

                Good! So Command-R excels at “RAG” style tasks like asking questions about a huge document, continuing a long story or so on. You should also read up on its super intricate system prompt format, which can steer it quite well.

                I dunno about code, I tend to use Mistral Code 22B (or deepseek v2 API) for that.

                I am happy to ramble on about this stuff, just ask.