
Model | Replit-3b

  • Related Project: private
  • Category: Paper Review
  • Date: 2023-07-06

Replit-3b and FlashAttention

Model Description

replit-code-v1-3b is a 2.7B Causal Language Model focused on Code Completion. The model has been trained on a subset of the Stack Dedup v1.2 dataset.

The training mixture includes 20 different languages, listed here in descending order of number of tokens:
Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell
In total, the training dataset contains 175B tokens, which were repeated over 3 epochs – in total, replit-code-v1-3b has been trained on 525B tokens (~195 tokens per parameter).

The model has been trained on the MosaicML platform with 256 x A100-40GB GPUs, leveraging their latest LLM examples repo.
replit-code-v1-3b is powered by state-of-the-art LLM techniques such as Flash Attention for fast training and inference, ALiBi positional embeddings to support variable context lengths at inference time, and the LionW optimizer.

How to Use

First of all, you need to install the latest versions of the following dependencies:

einops
sentencepiece
torch
transformers
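
These can be installed with pip, for example (exact versions depend on your environment):

pip install einops sentencepiece torch transformers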

You can then load the model as follows:

from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:

flash-attn==0.2.8
triton==2.0.0.dev20221202
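
For example (the pinned versions above are the ones the model card lists and may need adjusting to your CUDA/PyTorch setup):

pip install flash-attn==0.2.8 triton==2.0.0.dev20221202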

Then, move the model to bfloat16 and use it as follows:

import torch
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained(
    "replit/replit-code-v1-3b",
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'triton'

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)
model.to(device='cuda:0', dtype=torch.bfloat16)

# forward pass
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)

Note that trust_remote_code=True is passed to the from_pretrained method because ReplitLM is not a class in the Transformers library.
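
The forward pass returns the usual causal-LM output object. As a minimal sketch (assuming the custom model class exposes the standard logits field), the most likely next token can be read off greedily:

# greedy next-token prediction from the forward-pass logits (illustrative sketch)
with torch.no_grad():
    logits = model(x).logits                    # assumed shape: (batch, seq_len, vocab_size)
next_token_id = logits[0, -1].argmax().item()   # most likely continuation of the last position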

Tokenizer

We have trained a custom SentencePiece Unigram tokenizer with a 32,768-token vocabulary optimized specifically for code.

Note that using this requires the sentencepiece library to be installed.

The tokenizer can be used as follows:

from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# single input encoding + generation
x = tokenizer.encode('def hello():\n  print("hello world")\n', return_tensors='pt')
y = model.generate(x)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
  • trust_remote_code=True is passed to the from_pretrained method because ReplitLM is not a class in the Transformers library.
  • clean_up_tokenization_spaces=False is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.

Generation

You can generate code using the transformers library as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

Experiment with different decoding methods and parameters to get the best results for your use case.
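
For instance, a greedy or beam-search run can be compared against the sampling configuration above (parameter values here are illustrative only):

# greedy decoding: deterministic, often adequate for short completions
y_greedy = model.generate(x, max_length=100, do_sample=False, eos_token_id=tokenizer.eos_token_id)

# beam search: keeps several candidate continuations and returns the best-scoring one
y_beam = model.generate(x, max_length=100, num_beams=4, early_stopping=True, eos_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(y_greedy[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))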

Loading with 8-bit and 4-bit quantization

Loading in 8-bit

You can also load the model in 8-bit with the load_in_8bit=True kwarg, which uses bitsandbytes under the hood.

First, you need to install the following additional dependencies:

accelerate
bitsandbytes

Then you can load the model in 8-bit as follows:

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                             trust_remote_code=True, 
                                             device_map="auto",
                                             load_in_8bit=True)

The additional kwargs that make this possible are device_map='auto' and load_in_8bit=True.
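
If you want to sanity-check the savings, recent transformers versions expose the model's memory footprint (assuming your installed version provides this helper):

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")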

Loading in 4-bit

For loading in 4-bit, at the time of writing, support for load_in_4bit has not been merged into the latest releases of transformers and accelerate. However, you can use it if you install these dependencies from the main branches of the published repositories:

pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git

Then load in 4-bit with:

model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b", 
                                             trust_remote_code=True, 
                                             device_map="auto",
                                             load_in_4bit=True)

Post Processing

Note that, as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended (a minimal sketch follows the list):

  • stop generation when the EOS token is encountered
  • remove trailing whitespaces
  • set max_tokens to a reasonable value based on your completion use case
  • truncate generation at stop words such as return, def, "```", "\n\n\n" to avoid generating incomplete code when max_tokens is larger than the length of the expected generated code.
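
A minimal sketch of these steps, assuming the decoded string generated_code from the examples above (the stop-word list is illustrative only):

def postprocess(generated_code, prompt):
    # keep only the newly generated completion
    completion = generated_code[len(prompt):]
    # truncate at stop words so trailing incomplete code is dropped
    for stop in ["\ndef ", "\nreturn", "```", "\n\n\n"]:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    # remove trailing whitespace
    return completion.rstrip()

print(postprocess(generated_code, 'def fibonacci(n): '))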

Source


FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  • url: https://arxiv.org/abs/2205.14135
  • pdf: https://arxiv.org/pdf/2205.14135
  • abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
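
To make the IO-aware idea concrete, here is a small, purely illustrative PyTorch sketch of the tiling and online-softmax trick the paper builds on: it computes exact attention one key/value block at a time, so only small tiles ever need to live in fast memory (the real FlashAttention is a fused GPU kernel, not Python):

import torch

def tiled_attention(q, k, v, block_size=64):
    # exact softmax(q @ k^T / sqrt(d)) @ v, computed one key/value tile at a time
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                            # running (unnormalized) output
    row_max = torch.full((seq_len, 1), float("-inf"))    # running max score per query row
    row_sum = torch.zeros(seq_len, 1)                    # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]              # "load" one key tile
        v_blk = v[start:start + block_size]              # "load" one value tile
        scores = (q @ k_blk.T) * scale                   # (seq_len, block_size)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)        # rescale what was accumulated so far
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# sanity check against naive attention
q, k, v = (torch.randn(128, 32) for _ in range(3))
naive = torch.softmax((q @ k.T) / 32 ** 0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), naive, atol=1e-5))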

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

  • url: https://arxiv.org/abs/2108.12409
  • pdf: https://arxiv.org/pdf/2108.12409
  • abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi’s inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
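
As a small illustration of the mechanism (not the authors' reference code), the per-head linear bias ALiBi adds to the attention scores can be sketched as follows, assuming a power-of-two head count:

import torch

def alibi_bias(seq_len, num_heads):
    # head-specific slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance of each key from the query; non-positive for keys at or before the query
    rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()     # (seq_len, seq_len)
    # bias of shape (num_heads, seq_len, seq_len), added to raw q·k scores before the softmax
    return slopes[:, None, None] * rel

# usage: the bias is simply added to the attention scores (a causal mask is still applied separately)
scores = torch.randn(8, 128, 128)        # (heads, queries, keys)
attn = torch.softmax(scores + alibi_bias(128, 8), dim=-1)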

