What are the main performance challenges when deploying SLMs from development to production devices?
When porting SLMs from a development environment such as a Jupyter notebook to actual devices, performance can drop 15-20% due to quantization error and operator gaps in the device's NPU, where unsupported operators fall back to slower CPU execution. In addition, on-device inference latency can vary unpredictably because of thermal throttling and power management, unlike the relatively consistent latency of a cloud API.
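One way to surface the thermal-throttling issue described above is to profile repeated inferences and compare tail latency to the median. The sketch below is a minimal, hedged example: `run_inference` is a hypothetical stand-in (a CPU-bound loop) for your actual on-device SLM forward pass, and the percentile thresholds are illustrative, not standard values.

```python
import statistics
import time

def run_inference():
    # Hypothetical placeholder for an on-device SLM forward pass;
    # replace with a call into your runtime (e.g. an NPU-delegated session).
    total = 0
    for i in range(50_000):
        total += i * i
    return total

def profile_latency(n_runs=30):
    """Time repeated inferences and summarize variability.

    A large gap between p95 and p50 over a sustained run is a common
    symptom of thermal throttling or power-management interference.
    """
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000)  # milliseconds
    samples.sort()
    p50 = samples[len(samples) // 2]
    p95 = samples[int(len(samples) * 0.95) - 1]
    return {"p50_ms": p50, "p95_ms": p95, "stdev_ms": statistics.stdev(samples)}

stats = profile_latency()
print(stats)
```

Running a profile like this both in the development environment and on the target device makes the gap between cloud-style and on-device latency behavior concrete, rather than discovering it after deployment.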