Llama-3-70B-GGUF on an NVIDIA 3090 in 2-Bit Quantisation

The news last week that Meta had released Llama-3 meant that everyone around me was talking about it. I immediately downloaded and tried the 8B variant and was impressed by the results. Then I quickly began to see benchmarks in which the 70B version challenged GPT-3.5.

I initially tried to run Llama-3 70B GPTQ with 4-bit quantisation, since GPTQ normally gives me the best performance on my local hardware. I knew this would almost certainly overflow my 24GB of VRAM, and I was right. A few days later I discovered a 2-bit 70B GGUF released by the LM Studio community. I ran it with an 8192-token context window using Oobabooga and JUST fit it in :')
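
Oobabooga's llama.cpp loader is built on llama-cpp-python, so the same GGUF can also be loaded from a plain Python script. A minimal sketch, assuming the llama-cpp-python package and a placeholder model path:

    # Minimal sketch: load a 2-bit Llama-3-70B GGUF with llama-cpp-python.
    # The model path is a placeholder; n_gpu_layers=-1 offloads every layer to the GPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-3-70b-q2.gguf",  # placeholder filename
        n_ctx=8192,       # same 8192-token context window
        n_gpu_layers=-1,  # offload all layers to the 3090
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])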

VRAM usage of Llama3-70B-GGUF-Q2

Token generation was fairly slow, but I am not complaining, as the quality of the output is leagues above anything I have seen in the open-source space. I am SO excited by the possibilities this has created.

Token generation speed of Llama3-70B-GGUF-Q2

2.91 t/s. I'll take it for an LLM that can finally produce zero-shot Python scripts that actually work!

Literally the first prompt and output, with zero changes:


        You
        
        Please write a python script that calls back to a nats server and listens on a subject called email.json and extracts data like name, cc, bcc, title, message, to, from from the content published to that channel
        AI
        
        Here is a Python script that connects to a NATS server, subscribes to a subject called "email.json", and extracts data like name, CC, BCC, title, message, to, and from from the published content:
        
        import nats
        import json
        
        # Define the NATS server URL and credentials
        nats_server_url = "nats://demo.nats.io:4222"
        nats_username = "your_username"
        nats_password = "your_password"
        
        # Connect to the NATS server
        nc = nats.connect(url=nats_server_url, user=nats_username, password=nats_password)
        
        # Subscribe to the "email.json" subject
        sub = nc.subscribe("email.json")
        
        # Define a callback function to process incoming messages
        def callback(msg):
            # Parse the message content as JSON
            data = json.loads(msg.payload.decode("utf-8"))
        
            # Extract the desired data from the JSON
            name = data.get("name")
            cc = data.get("cc")
            bcc = data.get("bcc")
            title = data.get("title")
            message = data.get("message")
            to = data.get("to")
            from_ = data.get("from")
        
            # Print the extracted data
            print(f"Name: {name}")
            print(f"CC: {cc}")
            print(f"BCC: {bcc}")
            print(f"Title: {title}")
            print(f"Message: {message}")
            print(f"To: {to}")
            print(f"From: {from_}")
        
        # Start the subscription and wait for messages
        sub.max_opt_wait = 10
        while True:
            msg = sub.next()
            callback(msg)
        
        # Close the NATS connection
        nc.close()
        
        Please note that you'll need to replace the nats_server_url, nats_username, and nats_password variables with your actual NATS server URL, username, and password.
        
        Also, the script assumes that the message payload is a JSON object with the specified fields. You may need to adjust the script to fit your specific use case.
        
        Let me know if you have any questions or need further assistance!
        

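One note if you want to actually run the generated script: the nats-py client that `import nats` resolves to these days is asyncio-based, so the synchronous calls above need a small adaptation. A sketch of the same subscriber, assuming the nats-py package and NATS's public demo server at demo.nats.io:

    # Sketch: the same "email.json" subscriber with the asyncio-based nats-py client.
    # Assumes `pip install nats-py`; pass user/password to connect() if your server needs them.
    import asyncio
    import json

    import nats


    async def main():
        # Connect to the NATS server
        nc = await nats.connect(servers=["nats://demo.nats.io:4222"])

        async def handle_email(msg):
            # Parse the payload as JSON and print the email fields
            data = json.loads(msg.data.decode("utf-8"))
            for field in ("name", "cc", "bcc", "title", "message", "to", "from"):
                print(f"{field}: {data.get(field)}")

        # Subscribe to the "email.json" subject with the handler as callback
        await nc.subscribe("email.json", cb=handle_email)

        # Keep listening until interrupted, then drain and close cleanly
        try:
            await asyncio.Event().wait()
        finally:
            await nc.drain()


    if __name__ == "__main__":
        asyncio.run(main())
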
ONWARD!!!

