replit-code-v1-3b is a 2.7B Causal Language Model focused on Code Completion. The model has been trained on a subset of the Stack Dedup v1.2 dataset.
The training mixture includes 20 different languages, listed here in descending order of number of tokens: Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell
In total, the training dataset contains 175B tokens, which were repeated over 3 epochs; replit-code-v1-3b has therefore been trained on 525B tokens (~195 tokens per parameter).
The model has been trained on the MosaicML platform with 256 x A100-40GB GPUs, leveraging their latest LLM examples repo.
replit-code-v1-3b is powered by state-of-the-art LLM techniques, such as Flash Attention for fast training and inference, ALiBi positional embeddings to support variable context length at inference time, and the LionW optimizer.
First, install the latest versions of the following dependencies (an example install command follows the list):
einops
sentencepiece
torch
transformers
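For example, with pip (pin versions as needed for your environment):
pip install -U einops sentencepiece torch transformers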
You can then load the model as follows:
from transformers import AutoModelForCausalLM
# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:
flash-attn==0.2.8
triton==2.0.0.dev20221202
Then, move the model to bfloat16
and use it as follows:
import torch
from transformers import AutoModelForCausalLM, AutoConfig
config = AutoConfig.from_pretrained(
"replit/replit-code-v1-3b",
trust_remote_code=True
)
config.attn_config['attn_impl'] = 'triton'
# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)
model.to(device='cuda:0', dtype=torch.bfloat16)
# forward pass
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)
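The forward pass returns the model output for the example input. As a small follow-up sketch (assuming the output exposes a .logits tensor, as standard Hugging Face causal LM outputs do), you can read off a greedy next-token prediction:
# greedy next-token prediction from the last position
# (assumes `y.logits` has shape [batch, sequence_length, vocab_size])
next_token_id = y.logits[0, -1].argmax().item()
print(next_token_id)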
Note that trust_remote_code=True
is passed to the from_pretrained
method because ReplitLM is not a class in the
Transformers library.
We have trained a custom SentencePiece Unigram tokenizer with a 32,768-token vocabulary optimized specifically for code.
Note that using this requires the sentencepiece
library to be installed.
The tokenizer can be used as follows:
from transformers import AutoTokenizer
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
# single input encoding + generation (assumes `model` loaded as shown above)
x = tokenizer.encode('def hello():\n print("hello world")\n', return_tensors='pt')
y = model.generate(x)
# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
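To see how the code-oriented vocabulary splits source text, you can also inspect the tokens directly. This is an illustrative sketch that only uses standard tokenizer methods:
# subword tokens produced for a small code snippet
tokens = tokenizer.tokenize('def hello():\n    print("hello world")\n')
print(tokens)
# vocabulary size (32768 tokens)
print(len(tokenizer))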
trust_remote_code=True is passed to the from_pretrained method because ReplitLM is not a class in the Transformers library.

clean_up_tokenization_spaces=False is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.

You can generate code using the transformers library as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
Experiment with different decoding methods and parameters to get the best results for your use case.
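For instance, here is a sketch of two alternative configurations, reusing model, tokenizer, and x from the snippet above; the parameter values are illustrative, not recommendations:
# greedy decoding: deterministic, often adequate for short, unambiguous completions
y_greedy = model.generate(x, max_length=100, do_sample=False, eos_token_id=tokenizer.eos_token_id)

# nucleus sampling with a low temperature: more varied completions
y_sampled = model.generate(x, max_length=100, do_sample=True, top_p=0.9, temperature=0.3,
                           eos_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(y_greedy[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))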
You can also load the model in 8-bit with the load_in_8bit=True
kwarg that uses bitsandbytes
under the hood.
First, you need to install the following additional dependencies:
accelerate
bitsandbytes
Then you can load the model in 8-bit as follows:
model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b",
trust_remote_code=True,
device_map="auto",
load_in_8bit=True)
The additional kwargs that make this possible are device_map='auto'
and load_in_8bit=True
.
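The 8-bit model is then used like the full-precision one. A brief sketch, assuming the tokenizer loaded earlier and a single-GPU setup where the device map places the model on cuda:0:
# encode a prompt and move it to the GPU used by the device map (assumed cuda:0)
x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt').to('cuda:0')
y = model.generate(x, max_length=100)
print(tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))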
For loading in 4-bit, at the time of writing, support for load_in_4bit
has not been merged into the latest releases for
transformers
and accelerate
. However, you can use it if you install the dependencies from the main branches of the published repos:
pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git
Then load in 4-bit with:
model = AutoModelForCausalLM.from_pretrained("replit/replit-code-v1-3b",
trust_remote_code=True,
device_map="auto",
load_in_4bit=True)
Note that as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended:
set max_tokens to a reasonable value based on your completion use case
use stop words such as return, def, "```", or "\n\n\n" to avoid generating incomplete code when max_tokens is larger than the length of the expected generated code (see the sketch below)
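As an illustration, here is a minimal sketch of applying such stop words as a truncation step after generation; the helper name and the exact stop-word list are placeholders to adapt to your use case:
# truncate a completion at the earliest stop sequence, if any
STOP_SEQUENCES = ["return", "def", "```", "\n\n\n"]

def truncate_at_stop(completion, stops=STOP_SEQUENCES):
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

print(truncate_at_stop('print("hello world")\n\n\ndef unrelated():\n    pass'))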