Run an LLM locally in Python

the living tribunal

🚀 Step 1: Install Dependencies
Open a terminal and run:
Code:
pip install transformers torch sentencepiece
(This installs the necessary libraries to run an LLM.)
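To confirm the install worked, you can run a quick check (just a sanity script, nothing model-specific):
Code:
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
(If this prints version numbers without errors, the libraries are ready.)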

📥 Step 2: Download the Model Without Git
1. Go to Hugging Face and search for a model like Llama 3, GPT-J, or Mistral.
2. Click on the model you want.
3. Look for the “Files and Versions” tab.
4. Download the weight file (.bin, .pt, or .safetensors) manually, along with config.json and the tokenizer files (e.g. tokenizer.json, tokenizer_config.json), since `from_pretrained` needs all of them.
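If you'd rather script the download than click through the site, the huggingface_hub library (installed alongside transformers) can fetch the whole repo. A minimal sketch, with the repo id as a placeholder for whichever model you picked (gated models like Llama 3 also need huggingface-cli login first):
Code:
from huggingface_hub import snapshot_download

# Downloads every file in the repo (weights, config.json, tokenizer files) into ./models/.
snapshot_download(repo_id="mistralai/Mistral-7B-v0.1", local_dir="./models/")
(Swap the repo_id for the model you chose on Hugging Face.)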

📂 Step 3: Organize Your Project
Once downloaded, place the model files inside a folder in your project:
Code:
my_project/
├── models/
│   ├── config.json
│   ├── tokenizer.json
│   └── pytorch_model.bin
├── main.py
└── requirements.txt
(Put all the downloaded model files inside models/.)
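For completeness, requirements.txt can simply list the packages from Step 1 (plus Flask for the chatbot in Step 7); treat this as a suggestion:
Code:
transformers
torch
sentencepiece
flask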

🏗 Step 4: Load the Model in Python
Use `transformers` to load the model:
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Point at the folder holding the weights, config.json, and tokenizer files.
model_path = "./models/"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(This loads the model for inference.)
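If you have a GPU, you can lower memory use and speed things up by loading in half precision and moving the model onto the device. A sketch assuming a CUDA card is available:
Code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Half precision roughly halves memory compared to the default float32 load.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(If you load on a GPU, also move the tokenized inputs to the same device with inputs = inputs.to(device) before calling generate().)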

📝 Step 5: Run the Model Locally
Try generating text:
Code:
text = "Explain quantum physics simply."

# Tokenize the prompt, generate a continuation, and decode it back to text.
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
(This makes the model generate text based on your input.)
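By default, generate() decodes greedily and stops after a short default length. Passing a few decoding parameters usually gives longer, more natural output; the values below are only illustrative:
Code:
inputs = tokenizer("Explain quantum physics simply.", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,   # how much text to produce beyond the prompt
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # lower = more focused, higher = more varied
    top_p=0.9,            # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))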

⚡ Step 6: Optimize for Speed
- Use Ollama for faster local execution.
- Try 4-bit quantization to reduce memory usage (see the sketch below).
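As a sketch of the 4-bit option: transformers can quantize at load time through bitsandbytes (needs pip install bitsandbytes accelerate, plus a CUDA GPU); the settings here are illustrative, not tuned:
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "./models/"

# Quantize the weights to 4-bit as they are loaded to shrink memory use.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",   # place layers on available GPU memory automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
(Generation then works exactly as in Step 5.)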

💬 Step 7: Build a Simple Chatbot
Code:
from flask import Flask, request
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model once at startup, not on every request.
model_path = "./models/"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

app = Flask(__name__)

@app.route("/", methods=["POST"])
def chat():
    user_input = request.form["query"]
    inputs = tokenizer(user_input, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    app.run(port=5000)
(This sets up a basic chatbot interface.)
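Once the Flask app is running, you can test it from another terminal with the requests library (pip install requests); the port matches the one set above:
Code:
import requests

# Send a prompt to the local chatbot endpoint and print the reply.
response = requests.post("http://127.0.0.1:5000/", data={"query": "Tell me a joke."})
print(response.text)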

🛠 Troubleshooting
- Model too slow? Try a smaller one like GPT-J.
- GPU acceleration? Install a CUDA-enabled build of PyTorch for faster inference (quick check below).
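To confirm PyTorch actually sees your GPU after installing the CUDA build, a quick check:
Code:
import torch

# True means PyTorch was built with CUDA support and found a usable GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))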

@fukurou
 