One I used before is runpod.io, but it is a pay-per-time platform, not an API.
3090s might be faster or around the same speed, as they have NVLink.
I think at that point it becomes faster to run on CPU.
Pretty much not at all. The main bottleneck is memory speed.
I barely see a difference between 4 and 12 cores on a 5900X when running on CPU.
When running multi-GPU, the PCIe lanes are the biggest bottleneck.
On a single GPU, the CPU does not matter.
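As a rough back-of-envelope to show why memory speed dominates: each generated token has to stream essentially all of the model's weights through memory once, so bandwidth divided by model size is an upper bound on tokens per second. The concrete numbers in the sketch below (a ~7.5 GB quantized 13B model, ~50 GB/s of dual-channel DDR4) are my own illustrative assumptions, not measurements.

    # Back-of-envelope: memory bandwidth caps token generation speed on CPU.
    def max_tokens_per_second(model_size_gb: float, mem_bandwidth_gbps: float) -> float:
        # Every token streams (roughly) all weights through memory once,
        # so bandwidth / model size is an upper bound on tokens/s.
        return mem_bandwidth_gbps / model_size_gb

    # Example: a 13B model at ~4-bit is about 7.5 GB of weights;
    # dual-channel DDR4-3200 gives roughly 50 GB/s in practice.
    print(max_tokens_per_second(7.5, 50))  # ~6-7 tokens/s ceiling, regardless of core count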
I think it means no display in.
While the benchmarks tend to be gamed, especially by small models, I honestly think something is wrong with how you are running it.
Yi-34B trades blows with Llama 2 70B in my personal tests, where I make it do novel tasks I invented myself, not the gamed benchmarks.
ALL 7B models are like putting a 7-year-old against a renowned professor when compared to 34B and 70B.
Why the hell would you get a two-generation-old 16 GB GPU for 7.7K when you can get 3-4 4090s? Each one will roflstomp it in ANY use case, let alone running three of them.
Get either an A6000 (Ampere 48 GB card), an A6000 Ada, or 3x 4090s paired with an AMD Threadripper system, or something like that. Any of those will still run laps around the V100 and be cheaper.
https://github.com/oobabooga/text-generation-webui
How much RAM do you have? It matters a lot.
For a BIG simplification, think of the largest model you can run (in billions of parameters, for example 13B means 13 billion) as 50-60% of your RAM in GB.
If you have 16 GB, you can run a 7B model, for example.
If you have 128 GB, you can run a 70B.
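To make that rule of thumb concrete, here is a tiny Python sketch. The 0.55 ratio and the helper name are my own placeholders for the rough 50-60% figure above, not an exact formula.

    # Rough sketch of the rule of thumb: the largest model you can run,
    # in billions of parameters, is roughly 50-60% of your RAM in GB.
    def largest_model_b(ram_gb: float, ratio: float = 0.55) -> float:
        # Approximate parameter count (in billions) that fits in ram_gb of RAM.
        return ram_gb * ratio

    for ram in (16, 32, 64, 128):
        print(f"{ram} GB RAM -> roughly a {largest_model_b(ram):.0f}B model")
    # 16 GB -> ~9B (a 7B fits comfortably), 128 GB -> ~70B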
closed-source model
You gave your own answer:
Not monitored
Not controlled
Uncensored
Private
Anonymous
Flexible
The whole AI ecosystem was pretty much designed for Python from the ground up.
I am guessing you can run C# as the front end and Python as the back end.
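If it helps, here is a minimal sketch of what the Python back end could look like, so any front end (C# included) can just talk to it over HTTP. Flask, the /generate route, and the fake_generate() placeholder are my own assumptions, not tied to any specific project.

    # Minimal Python back end a C# (or any other) front end could call over HTTP.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def fake_generate(prompt: str) -> str:
        # Placeholder: swap in your real inference call (llama.cpp bindings, ExLlama, etc.)
        return f"echo: {prompt}"

    @app.route("/generate", methods=["POST"])
    def generate():
        data = request.get_json(silent=True) or {}
        prompt = data.get("prompt", "")
        return jsonify({"completion": fake_generate(prompt)})

    if __name__ == "__main__":
        app.run(port=5000)

The C# side would then just POST JSON to http://localhost:5000/generate and read the completion field back.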
I don't know if ExLlama 2 supports Mac, but if it does, 70B.
Nothing, sadly.
Models are trained on the benchmark questions to improve their scores, making the tests moot.