GPT-4o mini consistently fails to select the correct item from an enum
When tasked with classifying Japanese conversations about real-estate management issues, the GPT-4o mini model had difficulty consistently selecting the correct classification from a predefined list. Results were acceptable overall, but one specific conversation repeatedly produced incorrect outputs.
Identifying the Issue
One particular conversation highlighted the challenges faced by the model:
Owner: Hello. I have something I’d like to discuss with you today.
Staff: Hello. This is Marilyn Tanaka from Real Estate Management Support. What would you like to discuss with me?
Owner: Apparently the tenants in my apartment are making a lot of noise in the middle of the night, and other residents are complaining.
Staff: That’s a problem. How often does this happen?
Owner: Not every day, but it seems to happen a lot on weekends and holidays.
Staff: What time does it start specifically?
Owner: It starts around 11pm, and sometimes it continues until the next morning.
Staff: That’s a problem. Has it caused any specific trouble?
Owner: Some residents are saying they’re going to call the police.
Staff: I’d like to avoid that. Should I put up a notice on the bulletin board first to warn them?
Owner: Yes, I’ll see how it goes. Thank you.
Once translated, it is evident that the conversation revolves around a noise problem. Despite this clarity, the model struggled to consistently identify the correct topic when using the structured-output response format with an enumerated classification list.
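The setup described above can be sketched with the OpenAI structured-outputs `response_format`. The topic list and field names below are illustrative assumptions, not the exact schema from the original tests:

```python
import json

# Hypothetical topic list; the original tests used a longer Japanese list.
TOPICS = [
    "tenant_noise_issue",
    "water_leak",
    "equipment_breakdown",
    "rent_arrears",
]

def build_response_format(topics):
    """Build a strict structured-outputs response_format whose single
    'topic' field is constrained to the given enum values."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "topic_classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "topic": {"type": "string", "enum": topics},
                },
                "required": ["topic"],
                "additionalProperties": False,
            },
        },
    }

response_format = build_response_format(TOPICS)
print(json.dumps(response_format, ensure_ascii=False, indent=2))
```

This dict would be passed as the `response_format` argument to `client.chat.completions.create(...)` in the `openai` SDK, alongside the conversation text in the messages; the enum guarantees a syntactically valid value, but not a semantically correct one, which is exactly the failure mode discussed here.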
Testing and Results
Various tests were conducted with different models and response schemas. For this conversation, outputs ranged from the correct "tenant noise issue" to unrelated topics such as "water leak" or "equipment breakdown."
Even after the list and its descriptions were adjusted, the model continued to pick the wrong classification. Adding English descriptions briefly improved the outcomes, but the problem returned whenever enums were used.
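One way to attach English descriptions while keeping the enum constraint is to surface the definitions in the field's `description`, since plain JSON Schema `enum` entries carry no per-value descriptions. This is a sketch of that workaround; the Japanese labels and English glosses are assumed examples:

```python
# Hypothetical category labels with English definitions.
TOPIC_DEFINITIONS = {
    "騒音問題": "tenant noise complaints (late-night noise, neighbor disputes)",
    "水漏れ": "water leaking from plumbing or the unit above",
    "設備故障": "broken or malfunctioning building equipment",
}

def topic_property(definitions):
    """Build the 'topic' property: the enum keeps the original Japanese
    values, while the description lists an English gloss for each."""
    lines = [f"- {value}: {meaning}" for value, meaning in definitions.items()]
    return {
        "type": "string",
        "enum": list(definitions),
        "description": "Main topic of the conversation. Definitions:\n"
        + "\n".join(lines),
    }

print(topic_property(TOPIC_DEFINITIONS)["description"])
```

The model still outputs only the enum value, but sees the bilingual definitions at schema level rather than relying on the opaque labels alone.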
Despite the model's struggles with Japanese text and enum usage, the impact on internal tools was judged manageable. Fine-tuning the model for this classification task was suggested as a way to improve reliability and accuracy.
Providing definitions for each classification category may also help, but a lengthy topic list makes those definitions hard to manage. Fine-tuning experiments aimed at better classification outcomes are currently underway.
Conclusion
In conclusion, the consistent failure to select the correct item from an enum points to a limitation in the model's handling of Japanese text within structured outputs. Efforts to improve classification accuracy, including fine-tuning, are ongoing.