Here's my LLM benchmark for Eudaimonia. The general idea is to evaluate how well different LLMs (especially dedicated ERP models) work with EstimAI: to what extent they are able to add multiple, well-formatted, JSON-structured stims to their output.
This list does NOT evaluate the model's intelligence or writing skills, only its “compatibility” with Eudaimonia.
Hopefully it helps you pick between the dozens of existing models.
The main problem I encountered when developing EstimAI was that models forgot or refused to add stims to their first responses, in which case the session was effectively over.
That's why I decided to calculate the probability of a model generating stims in one of its first two responses.
I also track two other quality indices: the number of stims per response and their validity (whether they follow the JSON format).
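To make the validity check concrete, here is a minimal sketch of how stims could be pulled out of a response and validated. It assumes stims appear as standalone, flat JSON objects in the text; the exact format EstimAI expects may well differ.

```python
import json
import re

def extract_stims(response: str):
    """Split a model response into valid and invalid stim candidates.

    Assumes stims are emitted as standalone, flat JSON objects;
    the real EstimAI format may differ.
    """
    candidates = re.findall(r"\{[^{}]*\}", response)  # flat (non-nested) objects only
    valid, invalid = [], []
    for c in candidates:
        try:
            valid.append(json.loads(c))
        except json.JSONDecodeError:
            invalid.append(c)
    return valid, invalid
```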
The first two inputs are sent to the model 20 times, always using the same scenario, inputs, and parameters. This lets me compute the average number of stims generated per response.
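A rough outline of that protocol, assuming the `extract_stims()` helper above and a `generate()` call like the one sketched under the settings paragraph below. `INPUT_1` and `INPUT_2` are placeholders for the two fixed opening inputs (in a real session the second prompt would also carry the first exchange in its context):

```python
INPUT_1 = "..."  # first user input from the test scenario (placeholder)
INPUT_2 = "..."  # second user input (placeholder)
RUNS = 20        # each run replays the same two opening inputs

first_hits = combined_hits = 0   # runs with stims in response 1 / in either response
total_stims = total_valid = responses = 0

for _ in range(RUNS):
    counts = []
    for user_input in (INPUT_1, INPUT_2):
        reply = generate(user_input)          # KoboldCPP call, see sketch below
        valid, invalid = extract_stims(reply)
        counts.append(len(valid) + len(invalid))
        total_stims += len(valid) + len(invalid)
        total_valid += len(valid)
        responses += 1
    first_hits += counts[0] > 0
    combined_hits += (counts[0] > 0) or (counts[1] > 0)

print(f"First rate:    {first_hits / RUNS:.0%}")      # stims in the 1st response
print(f"Combined rate: {combined_hits / RUNS:.0%}")   # stims in one of the first two
print(f"Valid rate:    {total_valid / max(total_stims, 1):.0%}")
print(f"Avg stims:     {total_stims / responses:.2f}")  # per response
```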
All tests were done with the same settings and in the same environment: a RunPod A40 running KoboldCPP. Settings: min_p preset, temperature 1, DRY enabled, 12k context.
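For reference, those settings map roughly onto a KoboldCPP `/api/v1/generate` payload like the one below. The DRY field names follow recent KoboldCPP builds, and the exact min_p and DRY values are assumptions, since only the preset names are given above:

```python
import requests

def generate(prompt: str) -> str:
    """One completion from a local KoboldCPP instance (default port 5001)."""
    payload = {
        "prompt": prompt,
        "max_context_length": 12288,  # 12k context
        "max_length": 512,
        "temperature": 1.0,           # T = 1
        "min_p": 0.1,                 # min_p preset (typical value, assumed)
        "top_p": 1.0,                 # other samplers left neutral
        "top_k": 0,
        "dry_multiplier": 0.8,        # DRY on (typical defaults, assumed)
        "dry_base": 1.75,
        "dry_allowed_length": 2,
    }
    r = requests.post("http://localhost:5001/api/v1/generate",
                      json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["results"][0]["text"]
```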
The scenario used is a standard character card from chub.ai with one stim example in its first messages. The total context size (card + system prompt + first messages) is ~2.7k tokens.
The system prompt used is the standard one (emulating interactions happening in the scene).
Name | Template | Quant | First rate | Combined rate | Valid rate | Avg Stims |
---|---|---|---|---|---|---|