EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
Abstract: Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of the small and medium-sized enterprises (SMEs) that make up the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues challenging business viability. Notably, with a median cost of $0.04 per interaction and a latency of 5.7s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.
Comments: Just accepted version
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
MSC classes: 68T50
ACM classes: I.2.7; H.5.2
Cite as: arXiv:2407.04472 [cs.IR]
(or arXiv:2407.04472v4 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2407.04472
arXiv-issued DOI via DataCite
Related DOI: https://doi.org/10.1145/3803546
Submission history
From: Joseph Ollier Dr
[v1] Fri, 5 Jul 2024 12:42:31 UTC (691 KB)
[v2] Mon, 8 Jul 2024 14:50:49 UTC (693 KB)
[v3] Tue, 9 Jul 2024 13:31:00 UTC (699 KB)
[v4] Tue, 31 Mar 2026 08:47:21 UTC (1,169 KB)