Local LLM Experiments: llama-bench Results on an RTX 3080 Ti
Using an RTX 3080 Ti, I benchmarked the following LLM models with llama-bench:
- llama-3-korean-bllossom-8B
- llama-3.1-korean-reasoning-8B
- UNIVA-Deepseek-llama3.1-Bllossom-8B
- Deepseek-r1-distill-llama-8B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B
The benchmark results come out as a table like the one below.
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | pp512 | 3730.08 ± 65.93 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg1000 | 91.75 ± 1.07 |
The columns mean the following:
- Prompt processing (pp): processing a prompt in batches (-p)
- Text generation (tg): generating new tokens (-n)
- n-gpu-layers (ngl): number of layers offloaded to the GPU (-ngl)
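As a rough sanity check on what ngl can buy, the relationship between model file size, layer count, and VRAM can be sketched in a few lines. This is a hypothetical helper, not part of llama-bench: it assumes every layer takes an equal share of the file size and reserves a guessed amount of VRAM for the KV cache and CUDA overhead (the 3080 Ti has 12 GiB).

```python
def max_offload_layers(model_gib, n_layers, vram_gib, reserve_gib=1.5):
    """Estimate how many layers fit on the GPU, assuming each layer
    takes an equal share of the model file size and reserving some
    VRAM for the KV cache and CUDA overhead (rough heuristic)."""
    per_layer = model_gib / n_layers
    usable = vram_gib - reserve_gib
    return min(n_layers, int(usable // per_layer))

# llama-3 8B Q4_K_M: 4.58 GiB over 32 layers, on a 12 GiB RTX 3080 Ti
print(max_offload_layers(4.58, 32, 12.0))   # all 32 layers fit

# DeepSeek-R1-Distill-Qwen-14B Q8_0: 14.62 GiB over 48 layers
print(max_offload_layers(14.62, 48, 12.0))  # only partial offload possible
```

By this estimate the 4.58 GiB Q4_K_M file fits entirely on the GPU, which matches the throughput plateau seen once ngl covers all layers in the sweeps below.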
llama-3-Korean-Bllossom-8B-Q4_K_M.gguf
A Llama 3 model fine-tuned at the 8B-parameter size.
- MLP-KTLim/llama-3-Korean-Bllossom-8B-Q4_K_M.gguf
Benchmarking while varying ngl:
llama-bench -m llama-3-Korean-Bllossom-8B-Q4_K_M.gguf -ngl 10,20,30,40,50 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | pp512 | 1303.36 ± 16.36 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | tg1000 | 10.85 ± 0.02 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | pp512 | 1719.75 ± 69.73 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | tg1000 | 16.87 ± 0.04 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | pp512 | 2906.49 ± 23.43 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | tg1000 | 39.91 ± 0.16 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | pp512 | 3419.22 ± 348.84 |
llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | tg1000 | 89.79 ± 0.37 |
Summary:
- On the RTX 3080 Ti, the model responds quite usably (latency-wise) at around ngl=40.
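Using the tg1000 numbers from the table above, the payoff of fuller offload is easy to quantify (values copied from the run; a back-of-envelope comparison only):

```python
# tg1000 tokens/sec at each ngl setting (from the table above)
tg = {10: 10.85, 20: 16.87, 30: 39.91, 40: 89.85, 50: 89.79}

speedup = tg[40] / tg[10]
print(f"ngl=40 is {speedup:.1f}x faster than ngl=10")  # about 8.3x
```

Note that throughput is essentially flat between ngl=40 and ngl=50: once every layer is on the GPU, raising ngl further changes nothing.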
lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct-Q8
llama3.1-8B is a model with 32 layers.
- https://huggingface.co/lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct
- https://huggingface.co/lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct-Q8_0-GGUF
Here I used the lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf model.
llama-bench -m Bllossom/lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf -ngl 25,30,35,40,45
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1784.23 ± 93.34 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.80 ± 0.06 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2786.34 ± 31.32 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.87 ± 0.30 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3797.38 ± 166.76 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 74.09 ± 3.33 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | pp512 | 3791.58 ± 82.35 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | tg1000 | 74.12 ± 3.20 |
Summary
- The RTX 3080 Ti runs 8B models reasonably well.
- As with the Bllossom 8B, ngl=40 is about right.
UNIVA-Deepseek-llama3.1-Bllossom-8B
The DeepSeek-Bllossom series consists of models further trained to fix the language-mixing and degraded multilingual performance of the original DeepSeek-R1-Distill series. DeepSeek-llama3.1-Bllossom-8B is built on the DeepSeek-R1-distill-Llama-8B base model and was developed to improve reasoning performance in Korean.
6-bit
llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 20,23,25,27,30 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | pp512 | 1543.16 ± 24.32 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | tg1000 | 13.13 ± 0.11 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | pp512 | 1765.23 ± 58.73 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | tg1000 | 16.08 ± 0.07 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2027.43 ± 43.47 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 19.04 ± 0.30 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | pp512 | 2249.32 ± 57.11 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | tg1000 | 23.01 ± 0.82 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3001.55 ± 29.89 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.67 ± 0.20 |
(Deepseek_R1) qkboo:~$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 30,33,35,37,40 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3011.60 ± 50.04 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 34.08 ± 1.11 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | pp512 | 3895.08 ± 25.09 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | tg1000 | 76.81 ± 4.94 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | pp512 | 3933.71 ± 32.81 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | tg1000 | 77.27 ± 6.96 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3883.86 ± 20.62 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.30 ± 4.44 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3909.77 ± 14.13 |
8-bit
$ llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 17,23,27,30,33 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | pp512 | 1152.58 ± 20.30 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | tg1000 | 8.79 ± 0.06 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | pp512 | 1653.79 ± 44.44 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | tg1000 | 12.79 ± 0.08 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | pp512 | 2170.69 ± 66.22 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | tg1000 | 18.02 ± 0.10 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2997.54 ± 36.25 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.93 ± 0.28 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | pp512 | 4311.76 ± 17.63 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | tg1000 | 80.54 ± 2.72 |
$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 47,53,57,60,65 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | pp512 | 4252.55 ± 170.94 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | tg1000 | 79.03 ± 8.48 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | pp512 | 4341.45 ± 181.79 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | tg1000 | 80.21 ± 8.60 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | pp512 | 4470.11 ± 27.91 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | tg1000 | 80.12 ± 6.18 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | pp512 | 4542.52 ± 23.46 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | tg1000 | 80.92 ± 9.37 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | pp512 | 4502.80 ± 57.29 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | tg1000 | 81.02 ± 10.89 |
DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf
This is the famous DeepSeek R1; I used unsloth's distilled version.
- unsloth.ai/blog/deepseek-r1
- https://unsloth.ai/blog/deepseekr1-dynamic
- https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
DeepSeek R1 itself uses 61 layers (the distilled Llama-8B base has 32).
$ llama-bench -m DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 10,20,30
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | pp512 | 849.57 ± 12.77 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | tg1000 | 6.34 ± 0.06 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | pp512 | 1279.56 ± 22.85 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | tg1000 | 10.41 ± 0.08 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2712.69 ± 96.48 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.45 ± 0.42 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3581.72 ± 261.82 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 72.33 ± 1.53 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | pp512 | 3653.35 ± 292.75 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | tg1000 | 73.69 ± 2.39 |
Summary
- On the RTX 3080 Ti it responds well at ngl=40.
- As expected for 8B parameters, it behaves much like the earlier llama-3 Bllossom and llama-3.1 8B models.
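A simple way to pick the ngl "knee" from a sweep like the one above is to take the smallest ngl that gets within a few percent of the best observed throughput. A sketch, using the tg1000 values from the DeepSeek-R1-Distill-Llama-8B table (the helper and the 95% threshold are my own choices, not anything llama-bench provides):

```python
def smallest_good_ngl(tg, frac=0.95):
    """Smallest ngl whose throughput is within `frac` of the best run."""
    best = max(tg.values())
    return min(n for n, v in tg.items() if v >= frac * best)

# tg1000 tokens/sec per ngl, from the table above
tg = {10: 6.34, 20: 10.41, 30: 26.45, 40: 72.33, 50: 73.69}
print(smallest_good_ngl(tg))  # 40
```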
DeepSeek-R1-Distill-Llama-8B_korean_reasoning
https://huggingface.co/mradermacher/DeepSeek-R1-Distill-Llama-8B_korean_reasoning-GGUF
6-bit
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q6_K.gguf -ngl 17,25,30,35,40,45 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | pp512 | 1420.67 ± 56.23 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | tg1000 | 10.87 ± 0.45 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2126.29 ± 80.18 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 18.29 ± 0.83 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3136.95 ± 97.13 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.18 ± 1.54 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3670.82 ± 41.77 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.20 ± 1.17 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3711.66 ± 33.40 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | tg1000 | 77.59 ± 1.12 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | pp512 | 3725.29 ± 18.83 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | tg1000 | 77.39 ± 1.52 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | pp512 | 3690.92 ± 26.38 |
llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | tg1000 | 77.49 ± 1.37 |
On the RTX 3080 Ti:
- pp seems noticeably faster than the Bllossom version.
- tg is similar.
- ngl around 40 again looks appropriate.
8-bit
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q8_0.gguf -ngl 25,29,35,39,42 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1811.10 ± 51.70 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.25 ± 0.66 |
(run interrupted with Ctrl-C after the ngl=25 rows)
DeepSeek-R1-Distill-Qwen-14B
8-bit quantization
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf -ngl 25,28,30,33,35 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | pp512 | 649.52 ± 7.97 |
qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | tg1000 | 4.73 ± 0.03 |
qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 28 | pp512 | 593.29 ± 188.35 |
6-bit quantization
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf -ngl 15,18,20,25,30 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | pp512 | 490.09 ± 191.57 |
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | tg1000 | 4.52 ± 0.04 |
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | pp512 | 629.45 ± 14.33 |
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | tg1000 | 4.93 ± 0.03 |
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 20 | pp512 | 685.08 ± 14.48 |
qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 25 | pp512 | 787.79 ± 18.55 |
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q5_K_M.gguf -ngl 20,25,30,35 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | pp512 | 735.40 ± 7.36 |
qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | tg1000 | 5.74 ± 0.12 |
qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | pp512 | 829.91 ± 7.98 |
qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | tg1000 | 6.77 ± 0.16 |
(run interrupted with Ctrl-C)
DeepSeek-R1-Distill-Qwen-32B
This is a distilled version of DeepSeek based on Qwen-32B, with 32 billion parameters.
- unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3
- unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2
DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf
$ llama-bench -m unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf -ngl 27,30,33,35 -n 1000
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | pp512 | 392.53 ± 2.94 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | tg1000 | 3.76 ± 0.02 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | pp512 | 411.41 ± 4.29 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | tg1000 | 4.04 ± 0.02 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | pp512 | 362.17 ± 93.15 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | tg1000 | 4.11 ± 0.01 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | pp512 | 427.65 ± 24.95 |
qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | tg1000 | 4.44 ± 0.05 |
Summary
- Compared to the 8B models, response is an order of magnitude (or more) slower, which makes interactive prompt testing painful.
- Each ngl setting takes a very long time to benchmark; I didn't time it exactly, but it seemed to take over 20 minutes.
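The slowdown can be checked directly against the earlier tables, comparing tg1000 at a good ngl setting for each model (values copied from the runs above; a back-of-envelope comparison only):

```python
tg_8b_q4 = 89.85   # llama-3 Bllossom 8B Q4_K_M at ngl=40 (tg1000, t/s)
tg_32b_q3 = 4.44   # DeepSeek-R1-Distill-Qwen-32B Q3_K_M at ngl=35

print(f"{tg_8b_q4 / tg_32b_q3:.0f}x slower")  # about 20x
```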
DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf
1 | $ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf -ngl 25,28,30,33,35 -n 1000 |
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | pp512 | 360.50 ± 104.74 |
qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | tg1000 | 4.49 ± 0.07 |
qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | pp512 | 422.67 ± 6.60 |
qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | tg1000 | 4.83 ± 0.03 |
qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 30 | pp512 | 466.25 ± 3.99 |
Summary
- A 32B-parameter model is too much for the RTX 3080 Ti.
- Models in the 8B-parameter class run quite acceptably.
Table of reasonable ngl settings on the RTX 3080 Ti:
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 |
llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 |
llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 |
llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 |
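To collect results like the ones in this post into one summary, the pipe-delimited rows that llama-bench prints can be parsed with a short helper. A sketch only; it assumes the seven-column layout used throughout this post (model | size | params | backend | ngl | test | t/s):

```python
def parse_bench_row(line):
    """Parse one pipe-delimited llama-bench result row into a dict,
    keeping the mean t/s and dropping the ± deviation."""
    cells = [c.strip() for c in line.strip().strip('|').split('|')]
    model, size, params, backend, ngl, test, ts = cells
    mean = float(ts.split('±')[0])
    return {'model': model, 'ngl': int(ngl), 'test': test, 't/s': mean}

row = parse_bench_row(
    "llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 |")
print(row['ngl'], row['t/s'])  # 40 89.85
```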