
Advances in private training for production on-device language models (blog.research.google)

research.google · 2 years ago
Posted by Zheng Xu, Research Scientist, and Yanxiang Zhang, Software Engineer, Google

Language models (LMs) trained to predict the next word given input text are the key technology for many applications [1, 2]. In Gboard, LMs are used to improve users' typing experience by supporting features like next word prediction (NWP), Smart Compose, smart completion and suggestion, slide to type, and proofread. Deploying models on users' devices rather than on remote servers has advantages like lower latency and better privacy for model usage. While training on-device models directly from user data effectively improves utility for applications such as NWP and smart text selection, protecting the privacy of the user data used for training is important.

Figure: Gboard features powered by on-device language models.

In this blog we discuss how years of research advances now power the private training of Gboard LMs, from the proof-of-concept development of federated learning (FL) in 2017 to formal differential privacy (DP) guarantees in 2022. FL enables mobile phones to collaboratively learn a model while keeping all the training data on device, and DP provides a quantifiable measure of data anonymization. Formally, DP is often characterized by (ε, δ), with smaller values representing stronger guarantees. Machine learning (ML) models are considered to have reasonable DP guarantees at ε = 10 and strong DP guarantees at ε = 1 when δ is small.

As of today, all NWP neural-network LMs in Gboard are trained with FL under formal DP guarantees, and all future launches of Gboard LMs trained on user data require DP. These 30+ Gboard on-device LMs are launched in 7+ languages and 15+ countries, and satisfy (ε, δ)-DP guarantees with a small δ of 10⁻¹⁰ and ε between 0.994 and 13.69. To the best of our knowledge, this is the largest known deployment of user-level DP in production at Google or anywhere, and the first time a strong DP guarantee of ε < 1 is announced for models trained directly on user data.

Privacy principles and practices in Gboard

In "Private Federated Learning in Gboard", we discussed how different privacy principles are currently reflected in production models, including:

- Transparency and user control: We disclose what data is used, what purpose it is used for, how it is processed in various channels, and how Gboard users can easily configure the data usage in learning models.
- Data minimization: FL immediately aggregates only focused updates that improve a specific model. Secure aggregation (SecAgg) is an encryption method that further guarantees only aggregated results of the ephemeral updates can be accessed (a toy sketch follows the post).
- Data anonymization: DP is applied by the server to prevent models from memorizing the unique information in individual users' training data (the clip-and-noise recipe is sketched after the post).
- Auditability and verifiability: We have made public the key algorithmic approaches and privacy accounting in open-sourced code (TFF aggregator, TFP DPQuery, DP accounting).
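
The training pipeline described above combines federated averaging with per-user clipping and server-side Gaussian noise (the DP-FedAvg family of algorithms). Here is a minimal, self-contained NumPy sketch of that general recipe on a toy least-squares model; the model, data, and every name and hyperparameter (clip norm, noise multiplier, client count) are illustrative placeholders of mine, not Gboard's actual configuration.

```python
import numpy as np

# Minimal sketch of DP federated averaging (the DP-FedAvg recipe): clients
# compute model deltas on local data, the server clips each delta to a fixed
# L2 norm, sums them, and adds Gaussian noise calibrated to that clip norm.
# All names, data, and hyperparameters below are illustrative placeholders.

rng = np.random.default_rng(0)

DIM = 8                 # toy model size
NUM_CLIENTS = 100
CLIP_NORM = 1.0         # per-user L2 bound on each model delta (sensitivity)
NOISE_MULTIPLIER = 1.1  # sigma = NOISE_MULTIPLIER * CLIP_NORM

def local_update(model, X, y, lr=0.05, steps=5):
    """A few steps of local SGD on a toy least-squares task; returns the delta."""
    w = model.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - model

def clip_by_l2(delta, bound):
    """Scale the update down so its L2 norm is at most `bound`."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, bound / max(norm, 1e-12))

# Synthetic per-client datasets standing in for on-device typing data.
true_w = rng.normal(size=DIM)
clients = []
for _ in range(NUM_CLIENTS):
    X = rng.normal(size=(20, DIM))
    y = X @ true_w + 0.1 * rng.normal(size=20)
    clients.append((X, y))

model = np.zeros(DIM)
for _ in range(10):  # federated rounds
    # Clients train locally; raw examples never leave the "device".
    deltas = [clip_by_l2(local_update(model, X, y), CLIP_NORM) for X, y in clients]
    # The server aggregates and adds noise scaled to the clip bound, so no
    # single user's contribution can dominate the aggregate or be singled out.
    sigma = NOISE_MULTIPLIER * CLIP_NORM
    noisy_sum = np.sum(deltas, axis=0) + rng.normal(scale=sigma, size=DIM)
    model += noisy_sum / NUM_CLIENTS

print("distance to true weights:", np.linalg.norm(model - true_w))
```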
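To make the (ε, δ) numbers concrete, the snippet below applies the classical closed-form bound for a single release of the Gaussian mechanism. This is a deliberately simplified one-shot calculation: production training composes many rounds and uses much tighter accountants (e.g., RDP-based ones like those in the open-sourced code cited above), so the helper name and the numbers here are a back-of-the-envelope illustration only.

```python
import math

def gaussian_mechanism_epsilon(l2_sensitivity, sigma, delta):
    """Classical closed-form bound (Dwork & Roth) for ONE release of the
    Gaussian mechanism: adding N(0, sigma^2) noise to a quantity with the
    given L2 sensitivity satisfies (eps, delta)-DP with
        eps = sensitivity * sqrt(2 * ln(1.25 / delta)) / sigma,
    valid only when the resulting eps < 1."""
    eps = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / sigma
    if eps >= 1:
        raise ValueError("closed-form bound needs eps < 1; use a tighter accountant")
    return eps

# Illustrative numbers only: per-user clipping at 1.0, noise sigma = 10.0,
# and the blog's delta of 1e-10 give a single-release eps of about 0.68,
# i.e., in the "strong guarantee" (eps < 1) regime described above.
print(gaussian_mechanism_epsilon(1.0, 10.0, 1e-10))
```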
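The SecAgg bullet relies on pairwise masking: every pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the server's sum and only the aggregate is revealed. The toy sketch below shows just that cancellation; the real protocol additionally uses key agreement, secret sharing for dropout recovery, and cryptographic PRGs, none of which are modeled here.

```python
import numpy as np

MOD = 2**32  # work in a finite group so the masks cancel exactly

def pairwise_masks(num_clients, dim, seed=42):
    """Each unordered client pair (i, j) derives a shared random mask; the
    lower-numbered client adds it, the higher-numbered one subtracts it.
    (The real protocol agrees on seeds via key exchange and secret-shares
    them to survive dropouts; a common seed stands in for all that here.)"""
    masks = [np.zeros(dim, dtype=np.uint64) for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            pair_rng = np.random.default_rng([seed, i, j])
            m = pair_rng.integers(0, MOD, size=dim, dtype=np.uint64)
            masks[i] = (masks[i] + m) % MOD
            masks[j] = (masks[j] - m) % MOD  # uint64 wraparound, then reduced mod 2**32
    return masks

num_clients, dim = 5, 4
rng = np.random.default_rng(0)
updates = [rng.integers(0, 1000, size=dim, dtype=np.uint64) for _ in range(num_clients)]

# Each client uploads only its masked update, which on its own looks uniform.
masks = pairwise_masks(num_clients, dim)
masked = [(u + mk) % MOD for u, mk in zip(updates, masks)]

# Summing the uploads cancels every pairwise mask, so the server recovers
# the aggregate without ever seeing any individual client's update.
server_sum = np.zeros(dim, dtype=np.uint64)
for mu in masked:
    server_sum = (server_sum + mu) % MOD

assert np.array_equal(server_sum, sum(updates) % MOD)
print("aggregate:", server_sum)
```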
