main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 14, n_threads_batch = 14
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 500 | 1500 | 1 | 2000 | 10.114 | 49.43 | 20.584 | 72.87 | 30.698 | 65.15 |
| 500 | 1500 | 2 | 4000 | 0.275 | 3632.59 | 23.171 | 129.47 | 23.447 | 170.60 |
| 500 | 1500 | 4 | 8000 | 0.565 | 3537.18 | 26.563 | 225.88 | 27.128 | 294.90 |
| 500 | 1500 | 8 | 16000 | 1.150 | 3478.43 | 32.223 | 372.40 | 33.373 | 479.43 |
| 500 | 1500 | 16 | 32000 | 2.397 | 3337.18 | 53.382 | 449.59 | 55.779 | 573.69 |
| 500 | 1500 | 32 | 64000 | 5.422 | 2950.76 | 73.411 | 653.85 | 78.834 | 811.84 |
llama_perf_context_print: load time = 38099.04 ms
llama_perf_context_print: prompt eval time = 263440.83 ms / 124516 tokens ( 2.12 ms per token, 472.65 tokens per second)
llama_perf_context_print: eval time = 20580.05 ms / 1500 runs ( 13.72 ms per token, 72.89 tokens per second)
llama_perf_context_print: total time = 287359.37 ms / 126016 tokens
main: n_kv_max = 98304, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 14, n_threads_batch = 14
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 500 | 2500 | 1 | 3000 | 0.170 | 2932.69 | 24.793 | 100.84 | 24.963 | 120.18 |
| 500 | 2500 | 2 | 6000 | 0.286 | 3494.29 | 38.868 | 128.64 | 39.154 | 153.24 |
| 500 | 2500 | 4 | 12000 | 0.567 | 3525.79 | 46.865 | 213.38 | 47.432 | 252.99 |
| 500 | 2500 | 8 | 24000 | 1.139 | 3510.81 | 58.651 | 341.00 | 59.791 | 401.40 |
| 500 | 2500 | 16 | 48000 | 2.415 | 3312.20 | 77.793 | 514.19 | 80.208 | 598.44 |
| 500 | 2500 | 32 | 96000 | 5.410 | 2957.52 | 145.522 | 549.74 | 150.932 | 636.05 |
llama_perf_context_print: load time = 2742.79 ms
llama_perf_context_print: prompt eval time = 377652.87 ms / 186516 tokens ( 2.02 ms per token, 493.88 tokens per second)
llama_perf_context_print: eval time = 24784.80 ms / 2500 runs ( 9.91 ms per token, 100.87 tokens per second)
llama_perf_context_print: total time = 405224.93 ms / 189016 tokens
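
For readers unfamiliar with the column abbreviations: the derived columns in these tables are consistent with the usual batched-bench relationships, where PP/TG are the per-sequence prompt and generated token counts, B is the number of parallel sequences, N_KV = B × (PP + TG), and the throughput columns are tokens divided by the corresponding times. The following is a minimal sketch under that assumption (not taken from the tool's docs), checked against the B = 4 row of the first table; small deviations from the printed values come from the timings being rounded to three decimals.

```python
# Sketch (assumption): how the derived batched-bench columns appear to be
# computed from the raw ones, using the B = 4 row of the first table.
pp, tg, b = 500, 1500, 4          # prompt tokens, generated tokens (per sequence), batch size
t_pp, t_tg = 0.565, 26.563        # prompt-processing and token-generation times, seconds

n_kv = b * (pp + tg)              # total KV-cache tokens        -> 8000
s_pp = b * pp / t_pp              # prompt throughput, t/s       -> ~3539.8 (table: 3537.18)
s_tg = b * tg / t_tg              # generation throughput, t/s   -> ~225.9  (table: 225.88)
t_total = t_pp + t_tg             # total time, seconds          -> 27.128
s_total = n_kv / t_total          # overall throughput, t/s      -> ~294.9  (table: 294.90)

print(n_kv, round(s_pp, 2), round(s_tg, 2), round(t_total, 3), round(s_total, 2))
```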