Local LLM Experiments: llama-bench Results on an RTX 3080 Ti

I benchmarked the following LLM models on an RTX 3080 Ti using llama-bench:

  1. llama-3-korean-bllossom-8B
  2. llama-3.1-korean-reasoning-8B
  3. UNIVA-Deepseek-llama3.1-Bllossom-8B
  4. Deepseek-r1-distill-llama-8B
  5. DeepSeek-R1-Distill-Qwen-14B
  6. DeepSeek-R1-Distill-Qwen-32B

llama-bench reports its results in a table like the following:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | pp512 | 3730.08 ± 65.93 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg1000 | 91.75 ± 1.07 |

The columns mean the following:

  • Prompt processing (pp): processing a prompt in batches (-p)
  • Text generation (tg): generating a fixed number of tokens (-n)
  • n-gpu-layers (ngl): number of model layers offloaded to the GPU (-ngl); a minimal example follows
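
As a minimal sketch of how these map to llama-bench flags (the model file name here is just a placeholder):

```sh
# -p sets the prompt size for the pp test, -n the generated-token count for
# the tg test, and -ngl the number of layers offloaded to the GPU.
llama-bench -m model.gguf -p 512 -n 1000 -ngl 99
```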

llama-3-Korean-Bllossom-8B-Q4_K_M.gguf

A fine-tuned Llama 3 model with 8B parameters.

  • MLP-KTLim/llama-3-Korean-Bllossom-8B-Q4_K_M.gguf

Benchmarking while varying ngl:

```sh
llama-bench -m llama-3-Korean-Bllossom-8B-Q4_K_M.gguf -ngl 10,20,30,40,50 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | pp512 | 1303.36 ± 16.36 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 10 | tg1000 | 10.85 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | pp512 | 1719.75 ± 69.73 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 20 | tg1000 | 16.87 ± 0.04 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | pp512 | 2906.49 ± 23.43 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 30 | tg1000 | 39.91 ± 0.16 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | pp512 | 3419.22 ± 348.84 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 50 | tg1000 | 89.79 ± 0.37 |

Summary:

  • On the RTX 3080 Ti, the model responds quite usably, speed-wise, at around ngl=40 (see the serving sketch below).
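
A minimal sketch of actually serving the model at that setting (the context size and port are arbitrary choices, not from the benchmark):

```sh
# Serve the model over HTTP with 40 layers offloaded to the GPU.
llama-server -m llama-3-Korean-Bllossom-8B-Q4_K_M.gguf -ngl 40 -c 4096 --port 8080
```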

lemon-mint/LLaMa-3.1-Korean-Reasoning-8B-Instruct-Q8

Llama 3.1 8B is a model with 32 layers.

Here I used the lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf model.
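
The layer count can be double-checked from llama.cpp's model-load log; a quick sketch (the exact log key varies between llama.cpp versions):

```sh
# Load the model for a single token and grep the reported layer count.
llama-cli -m llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf -ngl 0 -n 1 -p "hi" 2>&1 | grep -i n_layer
```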

```sh
llama-bench -m Bllossom/lemon-mint/llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf -ngl 25,30,35,40,45
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1784.23 ± 93.34 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.80 ± 0.06 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2786.34 ± 31.32 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.87 ± 0.30 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3797.38 ± 166.76 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 74.09 ± 3.33 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | pp512 | 3791.58 ± 82.35 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 45 | tg1000 | 74.12 ± 3.20 |

Summary

  • 8B models run reasonably well on the RTX 3080 Ti.
  • As with the Bllossom 8B, ngl=40 is about right (VRAM usage per ngl can be watched as sketched below).
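
To see how much of the card's 12 GiB a given ngl actually uses, GPU memory can be logged while the benchmark runs; a rough sketch, assuming nvidia-smi is available:

```sh
# Log GPU memory usage once per second in the background, then benchmark.
nvidia-smi --query-gpu=memory.used --format=csv -l 1 > vram.log &
SMI_PID=$!
llama-bench -m llama-3.1-korean-reasoning-8b-instruct-q8_0.gguf -ngl 40 -n 1000
kill $SMI_PID
```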


UNIVA-Deepseek-llama3.1-Bllossom-8B

The DeepSeek-Bllossom series is a set of models further trained to fix the language-mixing and degraded multilingual performance of the original DeepSeek-R1-Distill series.

DeepSeek-llama3.1-Bllossom-8B is built on the DeepSeek-R1-Distill-Llama-8B base model and was developed to improve reasoning performance in Korean.

6-bit quantization

```sh
llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 20,23,25,27,30 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | pp512 | 1543.16 ± 24.32 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 20 | tg1000 | 13.13 ± 0.11 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | pp512 | 1765.23 ± 58.73 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 23 | tg1000 | 16.08 ± 0.07 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2027.43 ± 43.47 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 19.04 ± 0.30 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | pp512 | 2249.32 ± 57.11 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 27 | tg1000 | 23.01 ± 0.82 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3001.55 ± 29.89 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.67 ± 0.20 |

```sh
(Deepseek_R1) qkboo:~$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q6_K.gguf -ngl 30,33,35,37,40 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3011.60 ± 50.04 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 34.08 ± 1.11 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | pp512 | 3895.08 ± 25.09 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 33 | tg1000 | 76.81 ± 4.94 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | pp512 | 3933.71 ± 32.81 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 35 | tg1000 | 77.27 ± 6.96 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3883.86 ± 20.62 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.30 ± 4.44 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3909.77 ± 14.13 |

(The ngl=40 tg1000 run was cut off, so no value was recorded.)

8-bit quantization

```sh
$ llama-bench -m UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 17,23,27,30,33 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | pp512 | 1152.58 ± 20.30 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 17 | tg1000 | 8.79 ± 0.06 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | pp512 | 1653.79 ± 44.44 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 23 | tg1000 | 12.79 ± 0.08 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | pp512 | 2170.69 ± 66.22 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 27 | tg1000 | 18.02 ± 0.10 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2997.54 ± 36.25 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.93 ± 0.28 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | pp512 | 4311.76 ± 17.63 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 33 | tg1000 | 80.54 ± 2.72 |

```sh
$ llama-bench -m /mnt/e/LLM_Run/UNIVA-DeepSeek-llama3.1-Bllossom-8B-Q8_0.gguf -ngl 47,53,57,60,65 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | pp512 | 4252.55 ± 170.94 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 47 | tg1000 | 79.03 ± 8.48 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | pp512 | 4341.45 ± 181.79 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 53 | tg1000 | 80.21 ± 8.60 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | pp512 | 4470.11 ± 27.91 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 57 | tg1000 | 80.12 ± 6.18 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | pp512 | 4542.52 ± 23.46 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 60 | tg1000 | 80.92 ± 9.37 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | pp512 | 4502.80 ± 57.29 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 65 | tg1000 | 81.02 ± 10.89 |

Note that an 8B Llama model has only 32 layers (plus the output layer), so every ngl value in this sweep amounts to a full offload, which is why the numbers plateau.


DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf

This is the famous DeepSeek R1; I used unsloth's distilled version.

The full DeepSeek R1 has 61 layers (this Llama-8B distill itself has 32).

```sh
$ llama-bench -m DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 10,20,30,40,50 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | pp512 | 849.57 ± 12.77 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 10 | tg1000 | 6.34 ± 0.06 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | pp512 | 1279.56 ± 22.85 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 20 | tg1000 | 10.41 ± 0.08 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | pp512 | 2712.69 ± 96.48 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 30 | tg1000 | 26.45 ± 0.42 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | pp512 | 3581.72 ± 261.82 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 40 | tg1000 | 72.33 ± 1.53 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | pp512 | 3653.35 ± 292.75 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 50 | tg1000 | 73.69 ± 2.39 |

Summary

  • It responds well at ngl=40 on the RTX 3080 Ti.
  • As expected for 8B parameters, it behaves much like the earlier Llama 3 Bllossom and Llama 3.1 8B models. (A way to export these tables is sketched below.)
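
Tables like the ones in this post can be exported directly instead of copy-pasted from the terminal; a sketch:

```sh
# -o md writes the result table as markdown (csv, json, jsonl, and sql also work).
llama-bench -m DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 40,50 -n 1000 -o md > results.md
```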



DeepSeek-R1-Distill-Llama-8B_korean_reasoning

https://huggingface.co/mradermacher/DeepSeek-R1-Distill-Llama-8B_korean_reasoning-GGUF

6-bit quantization

```sh
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q6_K.gguf -ngl 17,25,30,35,40,45 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | pp512 | 1420.67 ± 56.23 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 17 | tg1000 | 10.87 ± 0.45 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | pp512 | 2126.29 ± 80.18 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 25 | tg1000 | 18.29 ± 0.83 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | pp512 | 3136.95 ± 97.13 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 30 | tg1000 | 33.18 ± 1.54 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | pp512 | 3670.82 ± 41.77 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 37 | tg1000 | 77.20 ± 1.17 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | pp512 | 3711.66 ± 33.40 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 40 | tg1000 | 77.59 ± 1.12 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | pp512 | 3725.29 ± 18.83 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 42 | tg1000 | 77.39 ± 1.52 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | pp512 | 3690.92 ± 26.38 |
| llama 8B Q6_K | 6.14 GiB | 8.03 B | CUDA | 45 | tg1000 | 77.49 ± 1.37 |

On the RTX 3080 Ti:

  • pp seems noticeably faster than the Bllossom version.
  • tg is similar.
  • ngl around 40 looks about right.

8-bit quantization

```sh
$ llama-bench -m DeepSeek_R1_Distill/Llama-8B/DeepSeek-R1-Distill-Llama-8B_korean_reasoning.Q8_0.gguf -ngl 25,29,35,39,42 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | pp512 | 1811.10 ± 51.70 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 25 | tg1000 | 14.25 ± 0.66 |

The run was interrupted with ^C, so there is no summary for the 8-bit variant.


DeepSeek-R1-Distill-Qwen-14B

8-bit quantization

```sh
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf -ngl 25,28,30,33,35 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | pp512 | 649.52 ± 7.97 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 25 | tg1000 | 4.73 ± 0.03 |
| qwen2 14B Q8_0 | 14.62 GiB | 14.77 B | CUDA | 28 | pp512 | 593.29 ± 188.35 |

6-bit quantization

```sh
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf -ngl 15,18,20,25,30 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | pp512 | 490.09 ± 191.57 |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 15 | tg1000 | 4.52 ± 0.04 |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | pp512 | 629.45 ± 14.33 |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 18 | tg1000 | 4.93 ± 0.03 |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 20 | pp512 | 685.08 ± 14.48 |
| qwen2 14B Q6_K | 11.29 GiB | 14.77 B | CUDA | 25 | pp512 | 787.79 ± 18.55 |

5-bit quantization

```sh
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-14B-Q5_K_M.gguf -ngl 20,25,30,35 -n 1000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | pp512 | 735.40 ± 7.36 |
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 20 | tg1000 | 5.74 ± 0.12 |
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | pp512 | 829.91 ± 7.98 |
| qwen2 14B Q5_K - Medium | 9.78 GiB | 14.77 B | CUDA | 25 | tg1000 | 6.77 ± 0.16 |

The run was interrupted with ^C before the higher ngl values completed.

DeepSeek-R1-Distill-Qwen-32B

This is a version of DeepSeek R1 distilled into Qwen-32B, a model with roughly 32B parameters. I tested two quantizations, which can be downloaded as sketched after the list:

  • unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3
  • unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2
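
A sketch of fetching one of these files with huggingface-cli (the exact repo and file names are my assumption of what was used):

```sh
# Download a single quantized GGUF file from the unsloth repo (names assumed).
huggingface-cli download unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF \
  DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf --local-dir .
```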

DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf

```sh
$ llama-bench -m unsloth/DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf -ngl 27,30,33,35 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | pp512 | 392.53 ± 2.94 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 27 | tg1000 | 3.76 ± 0.02 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | pp512 | 411.41 ± 4.29 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 30 | tg1000 | 4.04 ± 0.02 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | pp512 | 362.17 ± 93.15 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 33 | tg1000 | 4.11 ± 0.01 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | pp512 | 427.65 ± 24.95 |
| qwen2 32B Q3_K - Medium | 14.84 GiB | 32.76 B | CUDA | 35 | tg1000 | 4.44 ± 0.05 |

Summary

  • Compared with the 8B models, generation speed drops by more than an order of magnitude (about 4 t/s versus 70-90 t/s), so interactive prompt testing is painful.
  • Benchmarking each ngl value also takes very long; I did not time it exactly, but it seemed to take more than 20 minutes per run (one mitigation is sketched below).
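
One way to shorten these sweeps is to reduce the repetition count; llama-bench runs each test 5 times by default. A sketch:

```sh
# -r 2 runs each test only twice, trading measurement precision for wall-clock time.
llama-bench -m DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf -ngl 30,35 -n 1000 -r 2
```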

DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf

```sh
$ llama-bench -m DeepSeek_R1_Distill/unsloth/DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf -ngl 25,28,30,33,35 -n 1000
```

Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | pp512 | 360.50 ± 104.74 |
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 25 | tg1000 | 4.49 ± 0.07 |
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | pp512 | 422.67 ± 6.60 |
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 28 | tg1000 | 4.83 ± 0.03 |
| qwen2 32B Q2_K - Medium | 11.46 GiB | 32.76 B | CUDA | 30 | pp512 | 466.25 ± 3.99 |

Summary

  1. A 32B-parameter model is too much for the RTX 3080 Ti.
  2. Models in the 8B-parameter range are quite runnable.

Suitable ngl values on the RTX 3080 Ti:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | pp512 | 3483.66 ± 259.95 |
| llama3-Korean-Bllossom-8B-Q4_K | 4.58 GiB | 8.03 B | CUDA | 40 | tg1000 | 89.85 ± 2.06 |
| llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | pp512 | 3733.38 ± 187.10 |
| llama-3.1-korean-reasoning-8b-instruct-q8_0 | 7.95 GiB | 8.03 B | CUDA | 35 | tg1000 | 73.87 ± 3.13 |
