To become more versed in working with LLMs, I added a natural language query (NLQ) endpoint to the API of my NPS hikes project — with Claude’s help. As covered in the API tutorial, the endpoint lets users submit natural language questions about their trail and park data.
NLQ workflow
Here’s an overview of the workflow.
1. Build the payload
When a user submits a `POST /query` request, the code in `api/main.py` builds a JSON payload to send to Ollama:
```python
payload = {
    "model": config.OLLAMA_MODEL,
    "messages": messages,
    "tools": tools,
    "stream": False,
}
```

That payload includes:
- A user query, such as “Show me long trails in Zion that I’ve hiked”
- A park lookup table to resolve park names to their four-letter NPS codes
- A system prompt:
  > You are a trail finder assistant for US National Parks. Your ONLY job is to call the appropriate function with the correct parameters based on the user’s question…
- Rules inside the system prompt, like:
  > If the user asks about trails or hikes, use `search_trails`.
- OpenAI-compatible tool definitions mirroring API calls, such as:
```python
{
    "type": "function",
    "function": {
        "name": "search_parks",
        "description": (
            "Search for National Parks. "
            "Use this for questions about parks, which parks exist, "
            "park visit status, or when parks were visited."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "visited": {
                    "type": "boolean",
                    "description": (
                        "true = only visited parks, false = only unvisited parks. "
                        "Omit to return all parks."
                    ),
                },
                ...
            },
            "required": [],
        },
    },
}
```
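The system prompt, park lookup table, and user question all end up in the `messages` list. A minimal sketch of that assembly — the helper name and lookup table below are illustrative, not the project’s actual code:

```python
# Hypothetical sketch of building the messages list for the payload.
SYSTEM_PROMPT = (
    "You are a trail finder assistant for US National Parks. "
    "Your ONLY job is to call the appropriate function with the "
    "correct parameters based on the user's question.\n"
    "If the user asks about trails or hikes, use search_trails."
)

# Illustrative subset of the park lookup table (name -> NPS code).
PARK_LOOKUP = {"Zion": "zion", "Bryce Canyon": "brca"}

def build_messages(user_query: str) -> list:
    """Combine the system prompt, park lookup table, and user question."""
    park_table = "\n".join(f"{name}: {code}" for name, code in PARK_LOOKUP.items())
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nPark codes:\n{park_table}"},
        {"role": "user", "content": user_query},
    ]
```

Baking the lookup table into the system message keeps the LLM from inventing park codes it hasn’t seen.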
2. Send the payload to Ollama
The next step is to submit the JSON payload to a local LLM using the httpx library. Following the instructions in the payload, the LLM (hopefully) generates a tool call response like:
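Before looking at that response, here is a minimal sketch of the send itself, assuming Ollama’s default `/api/chat` endpoint on `localhost:11434` (the helper names are mine, not the project’s):

```python
def ask_ollama(payload: dict, timeout: float = 60.0) -> dict:
    """POST the chat payload to a local Ollama server and return parsed JSON."""
    import httpx  # local import so the parsing helper below stays dependency-free

    response = httpx.post(
        "http://localhost:11434/api/chat",  # assumption: default local Ollama
        json=payload,
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()

def extract_tool_calls(response: dict) -> list:
    """Pull the tool_calls list out of the assistant message, if present."""
    return response.get("message", {}).get("tool_calls", []) or []
```

With `"stream": False` in the payload, the call returns a single JSON object; the tool call shown below comes out of its `message` field.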
```json
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "search_trails",
          "arguments": {
            "park_code": "zion",
            "min_length": 5,
            "hiked": true
          }
        }
      }
    ]
  }
}
```

3. Parse and normalize the LLM response
By design, the content of the response should be empty: in the best case, the LLM acts only as a parameter extractor, not as a chatbot. If the model replies with text anyway, a fallback attempts to parse JSON out of that text response.
In either case, a layer parses and normalizes the arguments in the tool calls. For example, there are tables to map state names to two-letter state codes, park names to four-letter park codes, and variations on month names and seasons to database-compatible values. It also corrects for negation (see the evaluation section).
This layer also infers visitation status. For a query like “What parks did I visit in October?”, the LLM would correctly extract `visit_month` but omit `visited` equal to true. This step makes a few corrections of this nature.
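A minimal sketch of this layer, with tiny illustrative lookup tables and assumed parameter names (the project’s real tables are larger):

```python
# Sketch of the normalization layer; tables and parameter names are assumptions.
STATE_CODES = {"colorado": "CO", "utah": "UT"}
MONTH_NUMBERS = {"october": 10, "oct": 10}

def normalize_params(params: dict) -> dict:
    """Map free-form argument values to database-compatible ones."""
    out = dict(params)
    if isinstance(out.get("state"), str):
        out["state"] = STATE_CODES.get(out["state"].lower(), out["state"])
    if isinstance(out.get("visit_month"), str):
        out["visit_month"] = MONTH_NUMBERS.get(
            out["visit_month"].lower(), out["visit_month"]
        )
    # Asking about a visit month implies the user means visited parks.
    if "visit_month" in out:
        out.setdefault("visited", True)
    return out
```

Values that don’t match a table pass through untouched, so the downstream query functions can still reject anything genuinely invalid.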
4. Dispatch the clean response to the existing API
Next, the workflow routes the normalized function name and parameters to the same query functions that power the REST endpoints. For example:
- The tool `search_trails` maps to the query function `fetch_trails()` — the same function that powers the `GET /trails` endpoint.
- The tool `search_parks` maps to `fetch_all_parks()`, which powers `GET /parks`.
The parameters are unpacked as keyword arguments: `fetch_trails(park_code="zion", min_length=5, hiked=True)` produces the same SQL query as `GET /trails?park_code=zion&min_length=5&hiked=true`.
Important: The NLQ endpoint has no new database logic of its own. The LLM only translates natural language into function parameters. The rest is a direct API call.
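That routing can be sketched as a small dispatch table (the stub query functions below stand in for the project’s real ones):

```python
# Stubs standing in for the project's real query functions.
def fetch_trails(**filters):
    return {"count": 0, "trails": [], "filters": filters}

def fetch_all_parks(**filters):
    return {"count": 0, "parks": [], "filters": filters}

# Tool name -> query function. Adding a new tool means adding one entry here.
DISPATCH = {
    "search_trails": fetch_trails,
    "search_parks": fetch_all_parks,
}

def dispatch(function_name: str, params: dict):
    """Route a normalized tool call to the matching query function."""
    try:
        handler = DISPATCH[function_name]
    except KeyError:
        raise ValueError(f"Unknown tool: {function_name}")
    return handler(**params)
```

The explicit table also acts as an allowlist: a hallucinated function name fails loudly instead of reaching the database.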
5. Return the results from the API
In addition to the normal API query results, the response includes metadata showing how the question was interpreted.
```json
{
  "original_query": "Show me long trails in Zion that I've hiked",
  "interpreted_as": {
    "park_code": "zion",
    "min_length": 5.0,
    "hiked": true
  },
  "function_called": "search_trails",
  "results": {
    "count": 3,
    "trails": [...]
  }
}
```

The `interpreted_as` field is the key transparency mechanism. If a query returns unexpected results, you can see exactly which parameters the LLM extracted and whether the normalization layer changed anything.
Evaluating accuracy
To test how well the NLQ endpoint works, I built a golden dataset of queries, each paired with an expected tool call and parameters. Using the same example:
```json
{
  "query": "Show me long trails in Zion that I've hiked",
  "expected_function": "search_trails",
  "expected_params": {"park_code": "zion", "min_length": 5, "hiked": true},
  "category": "search_trails"
}
```

An evaluation script executes each golden query through the full pipeline and delivers an accuracy report. Initially, the LLM consistently struggled with the following issues:
- Min and max. The LLM would interpret “hikes under 2 miles” as a `min_length` of 2 instead of a `max_length`.
- States vs. parks in states. A query like “trails in Colorado” would resolve to a single park in that state instead of a filter on the state itself.
- Negation. Any variation of “parks/trails I haven’t done” resulted in `hiked` or `visited` equal to true instead of false.
Prompt engineering helped to some extent. Ultimately, though, I used a rather blunt post-normalization regex-based approach for the negation problem. It probably wouldn’t be necessary with a better model!
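The negation fix might look something like this (a sketch; the actual patterns in the project may differ):

```python
import re

# Blunt heuristic: any negation word in the query flips True boolean flags.
NEGATION_RE = re.compile(r"\b(not|never|haven't|hasn't|didn't)\b", re.IGNORECASE)

def fix_negation(query: str, params: dict) -> dict:
    """Flip hiked/visited from True to False when the query is negated."""
    if not NEGATION_RE.search(query):
        return params
    out = dict(params)
    for flag in ("hiked", "visited"):
        if out.get(flag) is True:
            out[flag] = False
    return out
```

It is deliberately crude — it would misfire on a query like “trails, not parks” — but it fixed every negation case in the golden dataset.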
Next steps
While there’s a small degree of variation between runs, quite often all 45+ golden queries pass. Despite these positive results, there’s a large class of questions that the NLQ endpoint can’t answer:
- Multi-entity queries: “Compare trails in Zion and Bryce Canyon.”
- Ambiguous routing: “What’s in Yellowstone?” Would that be `GET /trails` or `GET /parks`?
- Spatial queries: trails above 3,000 feet or within 100 miles of Denver.
- Negation misfires: “Show me trails, not park info.”
I find this to be one of the main challenges of an NLQ endpoint in general: to anticipate what the endpoint can answer, you almost need to know the API well. At that point, you might as well read the API docs and use the API directly!