Dec 31, 2023
It seems to me that the division-of-labour concept under investigation here could extend beyond efficient flash/DRAM utilization. It could allow entire parts of the network to run in the cloud while the time-critical elements stay on-device. Only tensor data would travel over the network, and it would be meaningful only to the rest of the on-device model, so user data would remain secure. This could be a hugely scalable approach to running LLMs on mobile devices.
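
To make the idea concrete, here is a minimal sketch of what such a split might look like. Everything in it is an assumption made for illustration: the toy layers, the weights, the names `DEVICE_HEAD`, `CLOUD_BODY`, `DEVICE_TAIL`, and the `cloud_endpoint` function, which is a plain function call standing in for a real network round trip. It is not how any existing system does this, and it says nothing about latency or about how much an intermediate tensor might reveal.

```python
import struct

# Hypothetical toy weights; a real model would have far larger layers.
DEVICE_HEAD = [[1.0, -1.0], [0.5, 0.5]]   # first layer, kept on the device
CLOUD_BODY = [[2.0, 0.0], [1.0, 1.0]]     # middle layer, hosted remotely
DEVICE_TAIL = [[1.0, 1.0]]                # last layer, kept on the device


def linear_relu(x, weights):
    """One toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def pack_tensor(x):
    """Serialize a vector as a length prefix plus little-endian doubles."""
    return struct.pack(f"<I{len(x)}d", len(x), *x)


def unpack_tensor(payload):
    """Inverse of pack_tensor."""
    (n,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{n}d", payload, 4))


def cloud_endpoint(payload):
    """Stand-in for the remote side: it sees only tensor bytes."""
    hidden = unpack_tensor(payload)
    return pack_tensor(linear_relu(hidden, CLOUD_BODY))


def split_forward(x):
    """Device runs the head and tail; only activations cross the 'network'."""
    hidden = linear_relu(x, DEVICE_HEAD)
    reply = cloud_endpoint(pack_tensor(hidden))
    return linear_relu(unpack_tensor(reply), DEVICE_TAIL)


def local_forward(x):
    """Reference: the same three layers run entirely in one place."""
    h = linear_relu(x, DEVICE_HEAD)
    h = linear_relu(h, CLOUD_BODY)
    return linear_relu(h, DEVICE_TAIL)
```

Calling `split_forward([3.0, 1.0])` gives the same result as `local_forward([3.0, 1.0])`, and the only thing that crosses the device/cloud boundary is the packed activation vector, never the original input.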