Natural language is really messy. People can phrase the same thing in many different ways. Then you get speech-to-text issues due to audio quality and accents. So you need an engine that can make a "best guess / best match" based on what it has, or ask for clarification.
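To make the "best guess or ask" idea concrete, here is a minimal sketch in Python. It assumes a small fixed vocabulary and uses plain string similarity from the standard library's `difflib`. The `MENU` list, the `best_match` name, and the two thresholds are made up for illustration. A real engine would more likely work from recognizer confidence scores or phonetic matching.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary; a real system would load its own.
MENU = ["cheeseburger", "chicken sandwich", "french fries", "milkshake"]

def best_match(heard, choices=MENU, accept=0.8, ask=0.5):
    """Return ("match", item), ("clarify", question) or ("unknown", None).

    The accept/ask thresholds are illustrative guesses, not tuned values.
    """
    # Score every candidate by similarity to what was heard, best first.
    scored = sorted(
        ((SequenceMatcher(None, heard.lower(), c).ratio(), c) for c in choices),
        reverse=True,
    )
    score, top = scored[0]
    if score >= accept:
        return ("match", top)                        # confident best guess
    if score >= ask:
        return ("clarify", f"Did you mean {top}?")   # plausible, so ask
    return ("unknown", None)                         # nothing close enough

print(best_match("cheesburger"))  # small slip, close to a known item
print(best_match("burger"))       # partial phrase, worth confirming
print(best_match("xyz"))          # no usable candidate
```

The middle band is the point: between "sure" and "no idea", the engine asks instead of guessing.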
Similarly, you can ask for TWO of a complex thing: "I would like two... meals, with... XXXX"
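A rough sketch of pulling a quantity, an item and its extras out of a hesitant request like that one, again in Python. Everything here is an assumption for illustration: the `parse_order` name, the tiny `NUMBER_WORDS` table, and the idea that extras follow the word "with". It only handles this one sentence shape, and "fries and a drink" stands in as a made-up example of extras.

```python
import re

# Hypothetical, deliberately tiny number vocabulary.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
NUMBER_PATTERN = "|".join(sorted(NUMBER_WORDS, key=len, reverse=True))
ORDER_RE = re.compile(rf"\b(\d+|{NUMBER_PATTERN})\b\s+(.+?)(?:\s+with\s+(.+))?$")

def parse_order(utterance):
    """Return {"quantity", "item", "extras"} or None if nothing matched."""
    # Treat runs like "...." or ",,," as pauses and collapse whitespace.
    text = re.sub(r"[.,]{2,}", " ", utterance.lower())
    text = re.sub(r"\s+", " ", text).strip()
    m = ORDER_RE.search(text)
    if m is None:
        return None  # caller should ask for clarification
    qty, item, extras = m.groups()
    quantity = int(qty) if qty.isdigit() else NUMBER_WORDS[qty]
    # Split the extras on commas and the word "and".
    parts = re.split(r"\s*(?:,|\band\b)\s*", extras or "")
    return {
        "quantity": quantity,
        "item": item.strip(" ,."),
        "extras": [p.strip(" ,.") for p in parts if p.strip(" ,.")],
    }

print(parse_order("I would like Two.... meals, with,,, fries and a drink"))
```

When `parse_order` returns `None`, that is the cue to fall back to asking the user what they meant rather than guessing.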