Run an LLM locally in Python

the living tribunal

🚀 Step 1: Install Dependencies
Open a terminal and run:
Code:
pip install transformers torch sentencepiece
(This installs the necessary libraries to run an LLM.)
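To confirm the install worked, you can run a quick check (just a sanity script, nothing model-specific):
Code:
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
(If this prints version numbers without errors, the libraries are ready.)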

📥 Step 2: Download the Model Without Git
1. Go to Hugging Face and search for a model like Llama 3, GPT-J, or Mistral.
2. Click on the model you want.
3. Look for the “Files and Versions” tab.
4. Download the weight file (.bin, .pt, or .safetensors) manually, along with config.json and the tokenizer files (e.g. tokenizer.json, tokenizer_config.json), since `from_pretrained` needs all of them.
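If you'd rather script the download than click through the site, the huggingface_hub library (installed alongside transformers) can fetch the whole repo. A minimal sketch, with the repo id as a placeholder for whichever model you picked (gated models like Llama 3 also need huggingface-cli login first):
Code:
from huggingface_hub import snapshot_download

# Downloads every file in the repo (weights, config.json, tokenizer files) into ./models/.
snapshot_download(repo_id="mistralai/Mistral-7B-v0.1", local_dir="./models/")
(Swap the repo_id for the model you chose on Hugging Face.)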

📂 Step 3: Organize Your Project
Once downloaded, place the model files inside a folder in your project:
Code:
my_project/
├── models/
│   ├── config.json
│   ├── tokenizer.json
│   └── pytorch_model.bin
├── main.py
└── requirements.txt
(Put all the downloaded model files inside models/.)
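For completeness, requirements.txt can simply list the packages from Step 1 (plus Flask for the chatbot in Step 7); treat this as a suggestion:
Code:
transformers
torch
sentencepiece
flask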

🏗 Step 4: Load the Model in Python
Use `transformers` to load the model:
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Point at the folder holding the weights, config.json, and tokenizer files.
model_path = "./models/"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(This loads the model for inference.)
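If you have a GPU, you can lower memory use and speed things up by loading in half precision and moving the model onto the device. A sketch assuming a CUDA card is available:
Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Half precision roughly halves memory compared to the default float32 load.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(If you load on a GPU, also move the tokenized inputs to the same device with inputs = inputs.to(device) before calling generate().)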

📝 Step 5: Run the Model Locally
Try generating text:
Code:
text = "Explain quantum physics simply."

# Tokenize the prompt, generate a continuation, and decode it back to text.
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
(This makes the model generate text based on your input.)
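By default, generate() decodes greedily and stops after a short default length. Passing a few decoding parameters usually gives longer, more natural output; the values below are only illustrative:
Code:
inputs = tokenizer("Explain quantum physics simply.", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,   # how much text to produce beyond the prompt
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # lower = more focused, higher = more varied
    top_p=0.9,            # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))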

⚡ Step 6: Optimize for Speed
- Use Ollama for faster local execution.
- Try 4-bit quantization to reduce memory usage (see the sketch below).
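As a sketch of the 4-bit option: transformers can quantize at load time through bitsandbytes (needs pip install bitsandbytes accelerate, plus a CUDA GPU); the settings here are illustrative, not tuned:
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "./models/"

# Quantize the weights to 4-bit as they are loaded to shrink memory use.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",   # place layers on available GPU memory automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(Generation then works exactly as in Step 5.)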

💬 Step 7: Build a Simple Chatbot
Code:
from flask import Flask, request
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model once at startup, not on every request.
model_path = "./models/"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

app = Flask(__name__)

@app.route("/", methods=["POST"])
def chat():
    user_input = request.form["query"]
    inputs = tokenizer(user_input, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    app.run(port=5000)
(This sets up a basic chatbot interface.)
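Once the Flask app is running, you can test it from another terminal with the requests library (pip install requests); the port matches the one set above:
Code:
import requests

# Send a prompt to the local chatbot endpoint and print the reply.
response = requests.post("http://127.0.0.1:5000/", data={"query": "Tell me a joke."})
print(response.text)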

🛠 Troubleshooting
- Model too slow? Try a smaller one like GPT-J.
- GPU acceleration? Install a CUDA-enabled build of PyTorch for faster inference (quick check below).
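To confirm PyTorch actually sees your GPU after installing the CUDA build, a quick check:
Code:
import torch

# True means PyTorch was built with CUDA support and found a usable GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))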

@fukurou
 