LLMs running on your laptops and smartphones.
Building the Nolano community - Our vision for the future!
Large language models (LLMs) are powerful tools that can assist us with tasks like email completion, code explanation, and code generation. However, running LLMs typically requires hardware accelerators beyond the capabilities of our personal laptops and smartphones. The current trend is to query LLM APIs with our potentially sensitive data to get these models' assistance. Additionally, centralising the APIs limits accessibility and diversity. To address this, a growing community of developers is coming together around the common goal of running LLM inference locally and building AI-powered apps on top of it.
Moreover, on-device LLMs give greater control over data, making them viable for individuals and organisations dealing with sensitive information, such as those in healthcare, law, or finance. Running LLMs locally also makes it possible to build personalised models that cater to individual preferences, and offers faster response times. And sometimes, engaging with an LLM offline is a more enjoyable experience than being limited to the Google Chrome Dinosaur Game.
To complement these LLMs, a series of APIs will be made available on personal devices alongside the vanilla chatbot. For now, we are calling these AICalls, analogous to the SysCalls that an operating system offers. The purpose of these APIs is to empower creators, developers, and researchers to build their projects on top of these models.
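To make the idea concrete, here is a minimal sketch of what AICalls might look like to an app developer. The call names and signatures below are illustrative assumptions on our part, not a final API:

```python
# A sketch of an AICalls interface; the names complete/summarize/embed are
# hypothetical placeholders, not a committed API surface.
from typing import Protocol

class AICalls(Protocol):
    """High-level calls an on-device LLM runtime could expose, analogous to SysCalls."""
    def complete(self, prompt: str, max_tokens: int = 128) -> str: ...
    def summarize(self, text: str) -> str: ...
    def embed(self, text: str) -> list[float]: ...

def draft_reply(runtime: AICalls, email: str) -> str:
    # An email-completion app consumes AICalls without touching model internals,
    # just as ordinary programs consume SysCalls without touching the kernel.
    return runtime.complete(f"Write a brief reply to this email:\n{email}")
```

An app written against such an interface would work with any local runtime that implements it, regardless of which compressed model sits underneath.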
Tentative Goals and Strategies.
Recently, we have seen that LLMs can be quantized and sparsified (e.g., LLM.int8 and SparseGPT) without performance degradation. We have also observed steady improvement in the processors on personal devices (like Apple Silicon), along with efficient C/C++ libraries (like ggml). Moreover, it has been empirically observed that smaller models like UL2 and LLaMa can be competitive with 175B GPT-3 when trained on more data or with different objectives. These models are small enough to run on personal hardware. For example, as a proof of concept we implemented fast inference of int-4 quantized LLMs as large as LLaMa 13B on personal hardware, at more than 12 tokens/second.
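To illustrate why int-4 quantization shrinks models so much, here is a minimal sketch of blockwise absmax 4-bit quantization, in the spirit of ggml-style schemes. The block size, rounding, and scale format are illustrative assumptions; production schemes differ in the details:

```python
import numpy as np

def quantize_int4(weights: np.ndarray, block_size: int = 32):
    """Blockwise absmax int-4 quantization (illustrative, not our exact scheme)."""
    flat = weights.astype(np.float32).ravel()
    assert flat.size % block_size == 0, "pad weights to a multiple of block_size"
    blocks = flat.reshape(-1, block_size)
    # One scale per block: map the largest magnitude in the block to the int-4 range.
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights from int-4 codes and per-block scales."""
    return (q.astype(np.float32) * scales).ravel()

# Example: a random weight matrix stored at ~4-5 bits/weight
# (4-bit code per weight plus one scale per 32 weights) instead of 32.
w = np.random.randn(4096 * 32).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

Roughly an 8x reduction over fp32 weights is what brings a 13B-parameter model within reach of laptop RAM.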
Our strategy is to select one of the open-source models with permissive licensing (no, LLaMa, not you!); for now we are eyeing the Flan models. We will compress these models while maintaining their performance with minimal additional training. We will also create packages that support inference at low latency, and develop AICall interfaces that allow fast inference from dynamically typed languages like Python and JS.
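One plausible shape for such an interface is a thin binding from Python into a compiled C/C++ inference core. The library name, path, and function signature below are hypothetical placeholders, sketched only to show the pattern:

```python
# A sketch of binding a compiled inference core into Python via ctypes.
# "libnolano.so" and nolano_complete are hypothetical names, not a shipped library.
import ctypes

lib = ctypes.CDLL("./libnolano.so")  # hypothetical C/C++ inference library
lib.nolano_complete.argtypes = [ctypes.c_char_p, ctypes.c_int]
lib.nolano_complete.restype = ctypes.c_char_p

def complete(prompt: str, max_tokens: int = 128) -> str:
    """Run local inference through the compiled core and return the completion."""
    out = lib.nolano_complete(prompt.encode("utf-8"), max_tokens)
    return out.decode("utf-8")

print(complete("Dear team, following up on"))
```

Keeping the hot loop in C/C++ and exposing only a few calls is what lets dynamically typed languages stay fast: Python or JS pays one function-call overhead per request, not per token.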
We have open-sourced a toy, experimental codebase for LLaMa inference. We will soon share benchmarks of quantized 7B-LLaMa inference on Android and iOS phones. By early next week, we will also share more insights on fast-inference possibilities for sparsified LLMs on MacBooks.
If you're excited about the potential of LLMs and interested in contributing to the development of LLMs and the surrounding APIs, we'd love to hear from you! Join our Discord community and let us know how we can make this project even better. Your feedback and input are invaluable as we work to create tools that empower developers to build more effectively. We can't wait to hear from you!
Twitter: https://twitter.com/nolanoorg
Int-4 quant LLaMa: https://github.com/NolanoOrg/llama-int4-quant
Fun fact: Nolano is an acronymified version of "No LANguage Obstacles".