
Model | Google - Gemma 2 (Gemma Scope)

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-07-31

Gemma Scope: helping the safety community shed light on the inner workings of language models
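
Gemma Scope is an open suite of sparse autoencoders (SAEs) trained on the activations of Gemma 2 models, released as an interpretability tool for the safety community. A minimal sketch of the JumpReLU SAE it ships (not the official tutorial code; the repo id, file path, and parameter names below are assumptions based on the public release and should be verified on the Hub):

import numpy as np
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # Encoder/decoder weights plus a learned per-feature JumpReLU threshold.
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.threshold = nn.Parameter(torch.zeros(d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # Sparse features: ReLU pre-activations gated by the JumpReLU threshold.
        pre = acts @ self.W_enc + self.b_enc
        return torch.relu(pre) * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        # Reconstruct the residual-stream activation from the sparse features.
        return feats @ self.W_dec + self.b_dec

# Assumed repo id and file layout of the released SAE weights; verify on the Hub.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)
sae = JumpReLUSAE(params["W_enc"].shape[0], params["W_enc"].shape[1])
sae.load_state_dict({k: torch.from_numpy(v) for k, v in params.items()})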

The Gemma 2 2B IT model is reported to beat GPT-3.5 on some benchmarks, and SOLAR-10.7B-IT is also said to have raised its benchmark scores above other 30B models through additional training (I have not yet found an official post or paper with the details; see the Upstage post). I should verify this qualitatively against the leaderboards.

Meta recently announced that its training effort focused on the 405B model; I should also compare and check the 2B model's performance against the 70B one.


Inference Code of Gemma2-2b-it

Using a Single or Multiple GPUs

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned 2B checkpoint; device_map="auto" places it on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Tokenize the prompt, generate up to 32 new tokens, and decode the result.
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
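
Since gemma-2-2b-it is an instruction-tuned checkpoint, the prompt can also be wrapped in Gemma's chat format via the tokenizer's chat template. A minimal sketch, reusing the tokenizer and model loaded above:

# Apply the Gemma chat template and generate a response (sketch reusing the objects above).
messages = [{"role": "user", "content": "Write me a poem about Machine Learning."}]
chat_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_ids, max_new_tokens=64)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))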

Using vLLM


!pip install vllm==0.5.3
!pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/

# Import necessary libraries
import os
import random
import torch
import vllm
from vllm import LLM, SamplingParams


print(f"vLLM version: {vllm.__version__}")  
print(f"PyTorch version: {torch.__version__}")  
print(f"CUDA version: {torch.version.cuda}") 

# Select the FlashInfer attention backend for vLLM
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Initialize the vLLM engine with the Gemma 2 2B IT checkpoint: https://huggingface.co/google/gemma-2-2b-it
llm = LLM(model="google/gemma-2-2b-it", trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,  # top_k=1 makes decoding effectively greedy despite temperature/top_p
)

prompt = "Explain the architecture of the LLaMA large language model."


outputs = llm.generate(
    [prompt],
    sampling_params
)
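
llm.generate returns RequestOutput objects; the generated text can be read from them, for example:

# Each RequestOutput carries the prompt and its completions; print the first completion.
for output in outputs:
    print(output.outputs[0].text)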
