Sam Aydlette

Author and Cybersecurity Practitioner

DIY AI: Running an LLM On Any Standard Laptop For Free

The views and opinions expressed in this article are those of the author and do not reflect the views of any organization or employer.

Subscriptions to cloud-based LLMs like ChatGPT and Claude are great, but a locally run model is free, keeps your data private, and works offline. In this guide, I'll walk you through setting up TinyLlama on a standard consumer laptop.

Why TinyLlama?

While larger models like Llama 2 (13B) and Mistral (7B) offer impressive capabilities, they demand significant computing resources. TinyLlama, at just 1.1B parameters, offers a compelling compromise:

  • Runs on consumer-grade hardware with no expensive GPU required (CPU-only is fine)
  • Fits in the memory of an ordinary laptop (roughly 5GB while generating at full precision)
  • Is a comparatively small download (~2.2GB of weights, versus 13GB+ for a 7B-class model at the same precision)
  • Is fully open source, with no authentication or license gate required to download the weights

TinyLlama uses the same architecture and tokenizer as Meta's Llama 2 and was developed by the StatNLP research group at the Singapore University of Technology and Design. Because the project is open source, its code and weights are publicly available, so anyone can inspect exactly what they are deploying.
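As a quick example of that transparency, once the libraries from Step 2 are installed, you can pull the model's configuration straight from the Hugging Face Hub and see exactly which architecture and layer sizes you're about to run:

                        python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0'))"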

Prerequisites

Before we begin, you'll need:

  • A laptop with at least 8GB of RAM (4GB machines can work if you add swap space, but expect heavy slowdown)
  • About 5GB of free disk space (the model weights alone are ~2.2GB)
  • Python 3.8 or newer

The code examples below are written for Linux and macOS. Windows users will need to adjust the commands slightly; a PowerShell sketch of the same steps follows the commands in Step 1.

Step 1: Setting Up the Environment

First, let's create a dedicated Python virtual environment to keep our dependencies organized:


                        # Create a project directory in your home folder
                        mkdir -p ~/llm-project
                        cd ~/llm-project

                        # Create a Python virtual environment
                        python3 -m venv llm-env

                        # Activate the environment
                        source llm-env/bin/activate
                        

You'll know the environment is active when your command prompt shows (llm-env) at the beginning.
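For Windows users, the same setup in PowerShell looks roughly like this (a sketch; the main difference is how the virtual environment is activated):

                        # Create a project directory under your user profile
                        mkdir $HOME\llm-project
                        cd $HOME\llm-project

                        # Create and activate a Python virtual environment
                        # (if script execution is blocked, run: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned)
                        python -m venv llm-env
                        .\llm-env\Scripts\Activate.ps1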

Step 2: Installing Dependencies

With our environment ready, let's install the necessary packages:


                        # Upgrade pip
                        pip install --upgrade pip

                        # Install PyTorch (CPU version to save space)
                        pip install torch --index-url https://download.pytorch.org/whl/cpu

                        # Install Transformers and related libraries
                        # (accelerate is required for the low_cpu_mem_usage/device_map options used in the script)
                        pip install transformers accelerate sentencepiece protobuf
                        

Installing the CPU version of PyTorch significantly reduces download size and memory requirements, making this setup more accessible for laptops with limited resources.
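To confirm that the CPU-only build is what actually got installed, a quick check like the one below should print a version tag that includes +cpu and report that CUDA is unavailable:

                        python -c "import torch; print(torch.__version__, '| CUDA available:', torch.cuda.is_available())"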

Step 3: Creating the TinyLlama Script

Now, let's create a Python script to load and interact with TinyLlama. Create a new file called run_tinyllama.py:


                        import torch
                        from transformers import AutoModelForCausalLM, AutoTokenizer
                        import gc
                        import time

                        def generate_response(prompt, max_length=256):
                            start_time = time.time()
                            print("Loading tokenizer...")
                            
                            # TinyLlama - open source, no auth required
                            model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
                            
                            tokenizer = AutoTokenizer.from_pretrained(model_name)
                            
                            print(f"Loading model: {model_name}...")
                            # CPU-only mode for compatibility
                            model = AutoModelForCausalLM.from_pretrained(
                                model_name,
                                low_cpu_mem_usage=True,
                                device_map="cpu"
                            )
                            
                            print(f"Model loaded in {time.time() - start_time:.2f} seconds")
                            
                            # Format prompt correctly for chat format
                            messages = [
                                {"role": "user", "content": prompt}
                            ]
                            
                            encoded_input = tokenizer.apply_chat_template(
                                messages,
                                return_tensors="pt"
                            )
                            
                            print("Generating response...")
                            # Generate text
                            outputs = model.generate(
                                encoded_input,
                                max_new_tokens=max_length,
                                temperature=0.7,
                                top_p=0.9,
                                do_sample=True
                            )
                            
                            # Decode only the newly generated tokens (drop the echoed prompt)
                            generated_tokens = outputs[0][encoded_input.shape[-1]:]
                            full_output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
                            
                            # Clean up to save memory
                            del model, tokenizer, outputs, encoded_input
                            gc.collect()
                            
                            print(f"Generation completed in {time.time() - start_time:.2f} seconds")
                            return full_output

                        if __name__ == "__main__":
                            print("TinyLlama 1.1B Chat")
                            print("-------------------")
                            print("Note: First run will download the model (~600MB)")
                            print("This will be slow on CPU, please be patient")
                            
                            while True:
                                prompt = input("\nEnter your prompt (or 'quit' to exit):\n")
                                if prompt.lower() == 'quit':
                                    break
                                    
                                try:
                                    print("\nProcessing...")
                                    response = generate_response(prompt)
                                    print("\nResponse:")
                                    print(response)
                                except Exception as e:
                                    print(f"Error: {e}")
                                    
                                # Force cleanup
                                gc.collect()
                        

Step 4: Creating a Convenient Launcher

For ease of use, let's create a simple bash script that activates our environment and runs our Python script. Create a file called run_llm.sh:


                        #!/bin/bash
                        cd ~/llm-project
                        source llm-env/bin/activate
                        python run_tinyllama.py
                        

Make it executable:


                        chmod +x run_llm.sh
                        

Step 5: Running TinyLlama

Now, let's launch our local LLM:


                        ./run_llm.sh
                        

The first time you run this, it will download the TinyLlama model weights, which are about 2.2GB. This may take a few minutes depending on your internet connection.
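If you'd rather get the download out of the way before starting an interactive session, you can optionally pre-fetch the model into the Hugging Face cache from inside the activated environment, for example:

                        python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; m = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'; AutoTokenizer.from_pretrained(m); AutoModelForCausalLM.from_pretrained(m)"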

Once loaded, you'll see a prompt asking for input. Type your question or request, and TinyLlama will generate a response. Keep in mind that generation on a CPU is slower than commercial cloud-based services; expect 20-60 seconds per response depending on your hardware. Also note that the script reloads the model for each prompt to keep memory usage down, so the load time is included in every response.

Optimization Tips

If you're experiencing slow performance or memory issues, try these optimizations:

  1. Reduce max_length: Change max_length=256 to a smaller value like 128 or 64 (see the snippet after this list)
  2. Close other applications: Free up memory by closing unnecessary programs
  3. Add swap space: If your system supports it, adding swap space can prevent out-of-memory errors
  4. Try overnight: Run complex or creative generations when you don't need immediate responses
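For tip 1, the only change needed is in how generate_response is called near the bottom of run_tinyllama.py; shorter outputs finish noticeably faster on a CPU:

                        # Cap responses at 64 new tokens for faster generation
                        response = generate_response(prompt, max_length=64)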

Conclusion

Running TinyLlama locally gives you a private, offline AI that, while not as powerful as commercial offerings, provides remarkable value considering its modest resource requirements. This setup demonstrates that AI is becoming increasingly accessible. You don't need expensive subscriptions or specialized hardware to start experimenting with artificial intelligence!

The future of AI isn't just about the most powerful models that require an expensive subscription to use. It's also about personal, private models running right on our own devices. If you followed the instructions in this article and would like to share your experience, please reach out via the Contact Me page!