OP | Posted on 2025-08-14 21:04
Changing the CUDA version helped: FP8 now works.
But FP4 still errors out with this command: vllm serve /home/Qwen3-235B-A22B-Thinking-2507-FP4 --served-model-name Qwen3-235B-A22B-Thinking-2507-FP4 --max-model-len 201000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9
(APIServer pid=15764) INFO 08-14 20:58:03 [api_server.py:1805] vLLM API server version 0.10.1.dev628+g00e3f9da4.d20250814
(APIServer pid=15764) Value error, Unknown quantization method: . Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'ptpc_fp8', 'fbgemm_fp8', 'modelopt', 'modelopt_fp4', 'marlin', 'bitblas', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'gptq_bitblas', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'hqq', 'experts_int8', 'neuron_quant', 'ipex', 'quark', 'moe_wna16', 'torchao', 'auto-round', 'rtn', 'inc', 'mxfp4']. [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
(APIServer pid=15764) For further information visit https://errors.pydantic.dev/2.11/v/value_error