May 8, 2023 • MLC Community
In this post, we introduce MLC LLM for Android – a solution that allows large language models to be deployed natively on Android devices, plus a productive framework for everyone to further optimize model performance for their use cases. Everything runs locally and is accelerated by the phone's native GPU.
We have witnessed significant progress in the field of generative AI and large language models. Thanks to the open-source movement, we are seeing a blossoming of open-source foundational models. While it is useful to run those models on server platforms, there is also a great deal of potential in enabling large language models on consumer devices.
Empowering LLMs on mobile devices is important because these are the devices we interact with daily. Android is the operating system on 2.5 billion active devices, and we would love to bring LLM support to them. This post shares our experience bringing LLMs to Android devices.
If you have followed our latest posts, you might know that we brought support for LLMs to iPhones. We would like to do the same for Android. However, the iOS and Android ecosystems use different programming models for both host app development and GPU programming, which creates challenges for our goal: for example, Android apps are typically written in Java or Kotlin rather than Swift, and GPU programs go through OpenCL or Vulkan rather than Metal.
Thanks to MLC LLM's universal deployment solution, we can overcome these challenges and productively deploy a Vicuna-7B model onto a Samsung Galaxy S23, powered by the latest Snapdragon 8 Gen 2 Mobile Platform.
The cornerstone of our solution is machine learning compilation (MLC), which we leverage to deploy AI models efficiently.
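To make the idea of machine learning compilation more concrete, here is a minimal TVM sketch (not the actual MLC LLM build pipeline) that compiles a toy kernel for an Android device with an OpenCL-capable mobile GPU. The workload, schedule, and target strings are illustrative assumptions; the real LLM pipeline compiles whole transformer models through TVM Unity.

```python
import tvm
from tvm import te
from tvm.contrib import ndk  # uses the Android NDK toolchain (TVM_NDK_CC must be set)

# A toy workload: element-wise vector addition.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule the loop onto GPU thread blocks so it can run on the phone's GPU.
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

# Cross-compile: OpenCL kernels for the mobile GPU, aarch64 host code for Android.
target = tvm.target.Target("opencl", host="llvm -mtriple=aarch64-linux-android")
mod = tvm.build(s, [A, B, C], target=target)

# Package the result as a shared library that an Android app can load.
mod.export_library("vecadd.so", fcompile=ndk.create_shared)
```

The same compile-once, run-anywhere flow is what lets us retarget the LLM runtime to Metal on iOS and OpenCL/Vulkan on Android without rewriting the kernels by hand.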
Our solution provides a good harness for further optimizing more models on Android hardware backends. We believe there are still many opportunities, but it is remarkable how far we could go with one week of effort. We would love to work with the open-source community to bring further optimizations via ML compilation.
Because Android does not impose the 4 GB app RAM limit that iOS enforces, we can leverage more RAM than in our iOS deployment. We therefore chose to enable a 4-bit quantized Vicuna model, which preserves more of the model's capabilities, especially in languages other than English. We also look forward to supporting other storage-efficient models in the future.
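To give a rough sense of what 4-bit quantization does, here is a minimal NumPy sketch of symmetric group quantization; the group size and scaling scheme here are illustrative assumptions rather than MLC LLM's exact recipe.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit group quantization: every `group_size` weights share
    one fp16 scale, and each weight is stored as an integer in [-7, 7]."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights for use at inference time."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

# Roughly 4x smaller than fp16 once two 4-bit values are packed per byte,
# at the cost of a small, bounded reconstruction error.
w = np.random.randn(4096 * 32).astype(np.float32)
q, s = quantize_4bit(w)
print("max abs error:", np.abs(dequantize_4bit(q, s) - w).max())
```

Shrinking the weights this way is what makes a 7B-parameter model fit comfortably in a phone's memory budget.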
You can check out the demo instructions to try it out and our GitHub repo for the source code. MLC LLM enables deployment to a variety of devices, including Windows, Linux, macOS, iPhone, and now Android. You are also welcome to check out the demo instructions page for more information on running on other devices.
The MLC LLM project is initiated by members from CMU Catalyst, UW SAMPL, SJTU, OctoML, and the MLC community.
We would love to continue developing and supporting the open-source ML community. The overall MLC projects are only possible thanks to the shoulders of the open-source ecosystems that we stand on. We want to thank the Apache TVM community and the developers of the TVM Unity effort. We thank the open-source ML community members who make these models publicly available, and the PyTorch and Hugging Face communities that make them accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, and Alpaca. We also would like to thank the OpenCL, Vulkan, Android, C++, Python, and Rust communities that enable this project.