Local LLM experiments: benchmark results on an RTX 3080 Ti
I benchmarked the following LLM models with llama-bench on an RTX 3080 Ti:
- llama-3-korean-bllossom-8B
- llama-3.1-korean-reasoning-8B
- UNIVA-Deepseek-llama3.1-Bllossom-8B
- Deepseek-r1-distill-llama-8B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B
 
llama-bench prints its results as a table like the one below:
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | pp512 | 3730.08 ± 65.93 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg1000 | 91.75 ± 1.07 | 
The columns mean the following:
- Prompt processing (pp): processing a prompt in batches (-p)
- Text generation (tg): generating text token by token (-n)
- n-gpu-layers (ngl): number of layers offloaded to the GPU (-ngl)
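For post-processing, the markdown rows that llama-bench prints can be parsed back into numbers. Below is a minimal sketch assuming the default 7-column table shown above; `parse_bench_row` is a hypothetical helper of my own, not part of llama.cpp:

```python
def parse_bench_row(row: str) -> dict:
    """Parse one markdown result row printed by llama-bench.

    Assumes the 7-column layout: model, size, params, backend,
    ngl, test, t/s (where t/s is "mean ± std").
    """
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    model, size, params, backend, ngl, test, tps = cells
    mean, std = (float(x) for x in tps.split("±"))
    return {
        "model": model, "size": size, "params": params,
        "backend": backend, "ngl": int(ngl), "test": test,
        "tps_mean": mean, "tps_std": std,
    }

row = "| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | pp512 | 3730.08 ± 65.93 |"
print(parse_bench_row(row)["tps_mean"])  # 3730.08
```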
 
llama-3-Korean-Bllossom-8B-Q4_K_M.gguf
A fine-tuned Llama 3 model with 8B parameters.
- MLP-KTLim/llama-3-Korean-Bllossom-8B-Q4_K_M.gguf
 
Benchmarking while varying ngl:

```shell
llama-bench -m llama-3-Korean-Bllossom-8B-Q4_K_M.gguf -ngl 10,20,30,40,50 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | pp512 | 1303.36 ± 16.36 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | tg1000 | 10.85 ± 0.02 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | pp512 | 1719.75 ± 69.73 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | tg1000 | 16.87 ± 0.04 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | pp512 | 2906.49 ± 23.43 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | tg1000 | 39.91 ± 0.16 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | pp512 | 3419.22 ± 348.84 | 
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | tg1000 | 89.79 ± 0.37 | 
Summary:
- On the RTX 3080 Ti, it responds at a quite usable speed from around ngl=40.
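One way to read the flat numbers from ngl=40 upward: Llama 3 8B has only 32 repeating transformer layers, so any ngl at or above 32 already offloads the whole model. A back-of-the-envelope VRAM estimate for partial offload (a heuristic of my own, assuming the file size is spread evenly over the layers and ignoring the KV cache and CUDA buffers):

```python
def offload_vram_gib(model_gib: float, n_layers: int, ngl: int) -> float:
    """Rough VRAM needed to offload `ngl` of `n_layers` transformer
    layers, assuming weights are spread evenly across the layers."""
    return model_gib * min(ngl, n_layers) / n_layers

# Bllossom 8B Q4_K_M: 4.58 GiB file, 32 layers.
for ngl in (10, 20, 30, 40):
    print(ngl, round(offload_vram_gib(4.58, 32, ngl), 2))
```

By this estimate, ngl=40 and ngl=50 request exactly the same amount of offload as ngl=32, which matches the nearly identical pp/tg numbers above.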
 
lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct-Q8
Llama 3.1 8B is a model with 32 layers.
- https://huggingface.co/lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct
- https://huggingface.co/lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct-Q8_0-GGUF
 
Here I used the lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf model.

```shell
llama-bench -m Bllossom/lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf -ngl 25,30,35,40,45
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1784.23 ± 93.34 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.80 ± 0.06 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2786.34 ± 31.32 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.87 ± 0.30 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3797.38 ± 166.76 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 74.09 ± 3.33 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | pp512 | 3791.58 ± 82.35 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | tg1000 | 74.12 ± 3.20 | 
Summary
- The RTX 3080 Ti handles 8B models reasonably well.
- As with Bllossom 8B, ngl=40 is about right.
 
UNIVA-Deepseek-llama3.1-Bllossom-8B
The DeepSeek-Bllossom series consists of models further trained to fix the language-mixing and degraded multilingual performance of the original DeepSeek-R1-Distill series. DeepSeek-llama3.1-Bllossom-8B is built on the DeepSeek-R1-distill-Llama-8B model and was developed to improve reasoning performance in Korean.
6-bit

```shell
llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 20,23,25,27,30 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | pp512 | 1543.16 ± 24.32 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | tg1000 | 13.13 ± 0.11 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | pp512 | 1765.23 ± 58.73 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | tg1000 | 16.08 ± 0.07 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2027.43 ± 43.47 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 19.04 ± 0.30 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | pp512 | 2249.32 ± 57.11 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | tg1000 | 23.01 ± 0.82 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3001.55 ± 29.89 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.67 ± 0.20 | 
```shell
(Deepseek_R1) qkboo:~$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 30,33,35,37,40 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3011.60 ± 50.04 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 34.08 ± 1.11 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | pp512 | 3895.08 ± 25.09 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | tg1000 | 76.81 ± 4.94 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | pp512 | 3933.71 ± 32.81 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | tg1000 | 77.27 ± 6.96 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3883.86 ± 20.62 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.30 ± 4.44 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3909.77 ± 14.13 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | tg1000 | |
8-bit

```shell
$ llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 17,23,27,30,33 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | pp512 | 1152.58 ± 20.30 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | tg1000 | 8.79 ± 0.06 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | pp512 | 1653.79 ± 44.44 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | tg1000 | 12.79 ± 0.08 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | pp512 | 2170.69 ± 66.22 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | tg1000 | 18.02 ± 0.10 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2997.54 ± 36.25 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.93 ± 0.28 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | pp512 | 4311.76 ± 17.63 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | tg1000 | 80.54 ± 2.72 | 
```shell
$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 47,53,57,60,65 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | pp512 | 4252.55 ± 170.94 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | tg1000 | 79.03 ± 8.48 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | pp512 | 4341.45 ± 181.79 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | tg1000 | 80.21 ± 8.60 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | pp512 | 4470.11 ± 27.91 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | tg1000 | 80.12 ± 6.18 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | pp512 | 4542.52 ± 23.46 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | tg1000 | 80.92 ± 9.37 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | pp512 | 4502.80 ± 57.29 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | tg1000 | 81.02 ± 10.89 | 
DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf
This is the famous DeepSeek R1; I used unsloth's distilled version.
- unsloth.ai/blog/deepseek-r1
- https://unsloth.ai/blog/deepseekr1-dynamic
- https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
 
DeepSeek R1 uses 61 layers.

```shell
$ llama-bench -m DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 10,20,30
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | pp512 | 849.57 ± 12.77 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | tg1000 | 6.34 ± 0.06 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | pp512 | 1279.56 ± 22.85 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | tg1000 | 10.41 ± 0.08 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2712.69 ± 96.48 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.45 ± 0.42 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3581.72 ± 261.82 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 72.33 ± 1.53 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | pp512 | 3653.35 ± 292.75 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | tg1000 | 73.69 ± 2.39 | 
Summary
- On the RTX 3080 Ti it responds well at ngl=40.
- As expected for 8B parameters, it behaves much like the earlier Llama 3 Bllossom and Llama 3.1 8B models.
 
DeepSeek-R1-Distill-Llama-8B_korean_reasoning
https://huggingface.co/mradermacher/DeepSeek-R1-Distill-Llama-8B_korean_reasoning-GGUF
6-bit

```shell
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q6_K.gguf -ngl 17,25,30,35,40,45 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | pp512 | 1420.67 ± 56.23 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | tg1000 | 10.87 ± 0.45 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2126.29 ± 80.18 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 18.29 ± 0.83 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3136.95 ± 97.13 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.18 ± 1.54 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3670.82 ± 41.77 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.20 ± 1.17 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3711.66 ± 33.40 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | tg1000 | 77.59 ± 1.12 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | pp512 | 3725.29 ± 18.83 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | tg1000 | 77.39 ± 1.52 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | pp512 | 3690.92 ± 26.38 | 
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | tg1000 | 77.49 ± 1.37 | 
On the RTX 3080 Ti:
- pp seems noticeably faster than the Bllossom version.
- tg is similar.
- ngl around 40 seems about right.
 
8-bit

```shell
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q8_0.gguf -ngl 25,29,35,39,42 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
```
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1811.10 ± 51.70 | 
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.25 ± 0.66 | 
(Interrupted with Ctrl-C after the ngl=25 results.)
DeepSeek-R1-Distill-Qwen-14B
8-bit quantization

```shell
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf -ngl 25,28,30,33,35 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | pp512 | 649.52 ± 7.97 | 
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | tg1000 | 4.73 ± 0.03 | 
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 28 | pp512 | 593.29 ± 188.35 | 
6-bit quantization

```shell
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf -ngl 15,18,20,25,30 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | pp512 | 490.09 ± 191.57 | 
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | tg1000 | 4.52 ± 0.04 | 
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | pp512 | 629.45 ± 14.33 | 
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | tg1000 | 4.93 ± 0.03 | 
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 20 | pp512 | 685.08 ± 14.48 | 
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 25 | pp512 | 787.79 ± 18.55 | 
```shell
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q5_K_M.gguf -ngl 20,25,30,35 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
```
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | pp512 | 735.40 ± 7.36 | 
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | tg1000 | 5.74 ± 0.12 | 
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | pp512 | 829.91 ± 7.98 | 
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | tg1000 | 6.77 ± 0.16 | 
(Interrupted with Ctrl-C after the ngl=25 results.)
DeepSeek-R1-Distill-Qwen-32B
A 32-billion-parameter version distilled from DeepSeek R1 into Qwen-32B.
- unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3
- unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2
 
DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf
```shell
$ llama-bench -m unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf -ngl 27,30,33,35 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | pp512 | 392.53 ± 2.94 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | tg1000 | 3.76 ± 0.02 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | pp512 | 411.41 ± 4.29 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | tg1000 | 4.04 ± 0.02 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | pp512 | 362.17 ± 93.15 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | tg1000 | 4.11 ± 0.01 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | pp512 | 427.65 ± 24.95 | 
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | tg1000 | 4.44 ± 0.05 | 
Summary
- Throughput clearly drops to roughly a tenth of the 8B models', so it is hard to use for interactive prompt testing.
- Benchmarking each ngl setting also takes far too long; I didn't time it, but it seems to take over 20 minutes.
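As a quick check on that slowdown, using the tg1000 numbers from the tables above (the 8B korean_reasoning Q6_K at ngl=40 versus this 32B Q3_K_M at ngl=35), the generation slowdown works out to somewhat more than 10x:

```python
# tg1000 tokens/s copied from the benchmark tables above.
tg_8b_q6 = 77.59    # DeepSeek-R1-Distill-Llama-8B_korean_reasoning Q6_K, ngl=40
tg_32b_q3 = 4.44    # DeepSeek-R1-Distill-Qwen-32B Q3_K_M, ngl=35
print(round(tg_8b_q6 / tg_32b_q3, 1))  # 17.5
```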
 
DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf
```shell
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf -ngl 25,28,30,33,35 -n 1000
```
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | pp512 | 360.50 ± 104.74 | 
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | tg1000 | 4.49 ± 0.07 | 
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | pp512 | 422.67 ± 6.60 | 
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | tg1000 | 4.83 ± 0.03 | 
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 30 | pp512 | 466.25 ± 3.99 | 
Summary
- 32B parameters is too much for the RTX 3080 Ti.
- Models with 8B parameters are quite workable.
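A sanity check on this conclusion: the RTX 3080 Ti has 12 GiB of VRAM, so comparing that against the GGUF file sizes from the tables above (a rough rule of thumb of my own, assuming full offload needs at least the file size in VRAM, before KV cache) shows which quantizations can fit entirely on the GPU:

```python
VRAM_GIB = 12.0  # RTX 3080 Ti

# GGUF file sizes copied from the benchmark tables above.
sizes_gib = {
    "llama 8B Q4_K_M": 4.58,
    "llama 8B Q8_0": 7.95,
    "qwen2 14B Q8_0": 14.62,
    "qwen2 32B Q3_K_M": 14.84,
}
for name, gib in sizes_gib.items():
    print(name, "fits" if gib < VRAM_GIB else "does not fit")
```

The 14B Q8_0 and 32B files simply do not fit, so part of the model stays on the CPU no matter what ngl is, which lines up with their single-digit tg throughput.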
 
A table of suitable ngl values on the RTX 3080 Ti:
| model | size | params | backend | ngl | test | t/s | 
|---|---|---|---|---|---|---|
| llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 | 
| llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 | 
| llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 | 
| llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 | 