LLM Data Annotation in Practice

Lukas Frannek

Introduction

In this article, we explore how we tag tours on NEWT, our travel app, with descriptors such as "Ocean View Room". These tags make it much easier for users to find specific content among the thousands of tours in the catalog. Tagging is essentially the same as annotating input data with labels, a fundamental task in machine learning, and Large Language Models (LLMs) have proven particularly effective at automating it. We will examine how and why we integrated LLMs into our workflow, the challenges they present, and the role of expert oversight in the development cycle.

LLM Data Annotation in Practice

Let’s start by explaining what data annotation is. It is the process of assigning labels or tags to data, for instance, to train a machine learning model that classifies whether a tour package contains an optional tour activity. That would be binary classification: the tour either includes optional tour content (label “1”) or it does not (label “0”). You annotate a relatively small set of tours and train a model on it, hoping that it generalizes well enough to predict the labels of all other tours correctly.

Now, anyone who has ever done data annotation on a mildly complex project knows how time-consuming it is to obtain good labels. If you have a wall of text of tour information and you want to decide if a tour contains an optional tour activity, you have to look at various parts of the tour information. If you have 800 tours to annotate, this takes a long time.

Luckily, sometimes it is as easy as looking at the tour title.

Figure 1: A tour detail page with the tour’s title

Nice, this clearly indicates that optional tour activities are included. Job done. However, at other times you have to think a bit more.

Figure 2: Another tour detail page with the tour’s title

What is a trolley? Does it count as an optional tour activity?

Figure 3: Details of the tour shown in Figure 2

If you now say “Of course it counts!”, you would be wrong. Trolleys are just transportation according to our definition.

It seems we need annotation guidelines written by experts, or we need those experts to annotate the tours themselves. The catch is that experts are expensive. Maybe they can write us a guideline! But even that does not guarantee that non-experts will produce good labels.

Using LLMs for Data Annotation

As you might have suspected from the title, we can ask an LLM to do the annotation for us. You do not need to train the LLM; you only have to tell it in the prompt what you want it to do, which is nice. If a non-expert needs guidelines anyway, we can just as well give those guidelines to the LLM and let it do the work.

Generally, you pre-process the tour information and add it to the prompt as context, ask the LLM to generate a response, and then post-process the output. In the figure below, pre-processing simply means selecting the parts of the tour information that serve as context in the input prompt and adding the general instructions and specific guidelines for the LLM. The LLM then produces an output that follows the structured output format, such as optional_sightseeing_activities: "0". Post-processing could include further rule-based decision-making.

Figure 4: The LLM call in a nutshell
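The three steps can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the field names, the guideline text, and the fake_llm stand-in are all assumptions made so that the example runs without a real model.

```python
import json
from typing import Callable

# Hypothetical guideline text; real guidelines would come from domain experts.
GUIDELINES = (
    "Label 1 if the tour includes an optional tour activity, otherwise 0. "
    "Transportation such as a trolley does not count as an activity."
)

# Hypothetical field names; the actual tour schema is not shown in this article.
RELEVANT_FIELDS = ("title", "itinerary", "inclusions")


def build_prompt(tour: dict) -> str:
    """Pre-processing: keep only the relevant fields and add the instructions."""
    context = "\n".join(f"{k}: {tour[k]}" for k in RELEVANT_FIELDS if k in tour)
    return (
        "You annotate tour packages.\n"
        f"Guidelines: {GUIDELINES}\n"
        f"Tour information:\n{context}\n"
        'Answer with JSON only, e.g. {"optional_sightseeing_activities": "0"}.'
    )


def post_process(raw: str) -> int:
    """Post-processing: parse the structured output into an integer label."""
    return int(json.loads(raw)["optional_sightseeing_activities"])


def annotate(tour: dict, call_llm: Callable[[str], str]) -> int:
    """Run pre-processing, the LLM call, and post-processing for one tour."""
    return post_process(call_llm(build_prompt(tour)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"optional_sightseeing_activities": "0"}'


# The "price" field is not in RELEVANT_FIELDS, so it never reaches the prompt.
label = annotate({"title": "Trolley ride included", "price": 1000}, fake_llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular provider; a real client would be wrapped in a function with the same signature.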

Selecting the most suitable LLM and configuring its parameters is a topic of its own, and we will not go into detail this time. Suffice it to say that asking the LLM to do the annotation could save a great deal of time, if it weren't for some important caveats.

Common Pitfalls in LLM Data Annotation

Invalid Output

Some LLMs do not always honor the prescribed output format. This is rare, but it happens, especially with smaller self-hosted models or unclear prompts. In that case, output validation should catch the problem and trigger another call to the LLM API.
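A validate-and-retry loop could look like the following sketch. The JSON format, the key name, the set of valid labels, and the retry limit of three are assumptions for illustration; the stand-in model deliberately answers in free text first and in the correct format second.

```python
import json
from typing import Callable, Optional

KEY = "optional_sightseeing_activities"
VALID_LABELS = {"0", "1"}


def parse_label(raw: str) -> Optional[str]:
    """Return the label if the output matches the expected format, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get(KEY)
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return None


def annotate_with_retry(
    prompt: str, call_llm: Callable[[str], str], max_attempts: int = 3
) -> str:
    """Call the model again until the output validates or attempts run out."""
    for _ in range(max_attempts):
        label = parse_label(call_llm(prompt))
        if label is not None:
            return label
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Stand-in model: the first answer ignores the format, the second one is valid.
responses = iter(
    ["Sure! The label is 1.", '{"optional_sightseeing_activities": "1"}']
)
result = annotate_with_retry("(prompt text)", lambda prompt: next(responses))
```

Capping the number of attempts matters because every retry costs another API call; tours that still fail after the last attempt are good candidates for human review.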

Randomness

A much bigger issue is the randomness of some LLMs. Even with the temperature set to 0, which nominally makes the response deterministic, the output can still vary between calls and can still contain hallucinations. Larger models, some level of reasoning, and prompt engineering reduce both, but for some annotation tasks it is hard to guarantee that they never occur.
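One way to make this variability visible is to ask the same question several times and compare the answers. This is a technique we are adding here for illustration (often called majority voting or self-consistency), not something described elsewhere in this article, and the number of repetitions is an arbitrary assumption. The stand-in model simply replays a fixed list of answers.

```python
from collections import Counter
from typing import Callable, Tuple


def vote(prompt: str, call_llm: Callable[[str], str], n: int = 5) -> Tuple[str, float]:
    """Ask n times; return the most frequent label and the share of calls that gave it."""
    labels = [call_llm(prompt) for _ in range(n)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / n


# Stand-in model that disagrees with itself once in five calls.
answers = iter(["1", "1", "0", "1", "1"])
label, agreement = vote("(prompt text)", lambda prompt: next(answers))
```

A low agreement value does not tell you which answer is right, but it is a cheap signal for which tours deserve a closer look. The price is n times the API cost per tour.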

Context Window Size

Depending on how much data has to be searched through, the maximum context window size of some LLMs is reached quickly. Chunking the input data and adding steps to the LLM workflow, in which summaries of the chunks are chained together, can mitigate this.
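The chunk-then-summarize idea can be sketched as follows. Character counts stand in for tokens here, which is a simplification, and the summarize callable is a placeholder for an LLM call; both are assumptions of this example.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def condense(text: str, summarize: Callable[[str], str], max_chars: int) -> str:
    """Summarize each chunk separately and chain the summaries together."""
    return "\n".join(summarize(chunk) for chunk in chunk_text(text, max_chars))


def first_word(chunk: str) -> str:
    """Toy 'summarizer' so the sketch runs offline: keep the first word only."""
    return chunk.split()[0]


chunks = chunk_text("alpha beta gamma delta epsilon", max_chars=11)
condensed = condense("alpha beta gamma delta epsilon", first_word, max_chars=11)
```

The condensed text then replaces the full tour information in the annotation prompt. Each summarization step can lose details, so this trades context size against the risk of dropping exactly the sentence that decides the label.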

Uncommon Tasks

If you have a database full of tours that follow a given schema, LLMs can do annotation tasks reasonably well when proper guidelines are given. However, it is not possible to write guidelines that cover every tour in a large database and get them right the first time. There are always special cases that the LLM misses: it may struggle with contexts or nuances that require socio-cultural understanding, or with domains that are not well covered in its pre-training data. Going back to the optional tour activities: normally, a tour either includes them or it does not. However, a few tours describe optional tour activities without actually including them.

Figure 5: Example of an uncommon tour

For an LLM, it is not easy to understand that this content is not included when a large wall of text is provided as input context. When a tour is newly added to the database, no guideline may cover it yet, and the annotation could be incorrect.

Cost

Irrespective of whether a self-hosted LLM or one of the big LLM APIs is used, cost is an issue. Depending on the number of input tokens, i.e., tour information and guidelines, the cost can become very high for bigger models. This can be alleviated by writing more concise prompts or by using cheaper, i.e., less resource-intensive, models. Depending on the accuracy expected of the tags, there is a tradeoff between cost and the points mentioned above, such as randomness, output validation, and context window size.
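A back-of-the-envelope estimate shows how quickly input tokens dominate. All numbers below are made up for illustration: the token counts per tour and the per-million-token prices are hypothetical and do not refer to any real model or provider. Only the 800 tours echo the example from earlier in this article.

```python
def estimate_cost(
    n_tours: int,
    input_tokens_per_tour: int,
    output_tokens_per_tour: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimated cost in USD of annotating n_tours with one call per tour."""
    input_cost = n_tours * input_tokens_per_tour * usd_per_million_input / 1_000_000
    output_cost = n_tours * output_tokens_per_tour * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Hypothetical prices for a larger and a smaller model.
large_model = estimate_cost(800, 5_000, 20, usd_per_million_input=10.0, usd_per_million_output=30.0)
small_model = estimate_cost(800, 5_000, 20, usd_per_million_input=0.5, usd_per_million_output=1.5)
```

Because the structured output is only a few tokens long, nearly all of the cost comes from the tour information and guidelines in the prompt. Retries, repeated calls, and a second judging model multiply this figure.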

Human-in-the-Loop

As LLMs have become better and better at question answering, we definitely want to use them for annotation, since human-only annotation is too expensive. However, as we saw above, there are caveats. Engineering the best prompts and developing a robust annotation system with LLMs is an iterative process. The prompts may need several revisions to handle all kinds of tours, and the guidelines need regular updates. This means that humans (i.e., experts) have to be part of the development process.

Figure 6: Data annotation can be done by humans, LLMs, or a combination of both.

How can we incorporate humans? By asking them to assess some of the labels that the LLM produced for a subset of tours. The question is, how do we select that subset of tours?

Stratified Sampling

Figure 7: Stratified sampling

At Reiwa Travel, we offer tour packages all over the world. Some labels for tours are hard for LLMs to assess, while others are easy. Some require guidelines, and some do not. Therefore, it is important to select a stratified sample of tours that represents a good cross-section of all the tours in the database, small enough that humans can reasonably assess them quickly. This way, we can find out which types of tours are frequently mislabeled by the LLM and improve the guidelines and prompts iteratively. Doing this a few times will result in a fairly robust system.
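A stratified sample can be drawn with the standard library alone. In this sketch the strata are destination regions and every stratum contributes the same fixed number of tours; both choices, as well as the "region" field name, are assumptions for illustration. Proportional allocation would be an equally valid alternative.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List


def stratified_sample(
    tours: List[dict], key: Callable[[dict], str], per_stratum: int, seed: int = 0
) -> List[dict]:
    """Pick up to per_stratum tours at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    strata: Dict[str, List[dict]] = defaultdict(list)
    for tour in tours:
        strata[key(tour)].append(tour)
    sample: List[dict] = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Toy catalog: one region dominates, another has a single tour.
tours = (
    [{"id": f"hawaii-{i}", "region": "Hawaii"} for i in range(10)]
    + [{"id": f"europe-{i}", "region": "Europe"} for i in range(3)]
    + [{"id": "asia-0", "region": "Asia"}]
)
review_set = stratified_sample(tours, key=lambda t: t["region"], per_stratum=2)
```

With plain random sampling, the rare stratum could easily be missing from a small review set; here it is always represented, which is the point when looking for tour types the LLM tends to mislabel.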

LLM-as-a-Judge

We can also ask another LLM to assess the labels provided by the first LLM. Depending on the confidence level of the second LLM, we can ask humans to re-confirm.
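The routing logic could be sketched like this. How the judging LLM reports its confidence is an assumption here (a self-reported score between 0 and 1), as is the threshold of 0.8; the judge itself is replaced by a lookup table so that the example runs offline.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[bool, float]  # (judge agrees with the label, judge confidence)


def route(
    items: List[Tuple[str, str]],
    judge: Callable[[str, str], Verdict],
    threshold: float = 0.8,
) -> Tuple[List[str], List[str]]:
    """Split labeled tours into auto-accepted ones and ones needing human review."""
    accepted: List[str] = []
    needs_review: List[str] = []
    for tour_id, label in items:
        agrees, confidence = judge(tour_id, label)
        if agrees and confidence >= threshold:
            accepted.append(tour_id)
        else:
            # Either the judge disagrees or it is not confident enough.
            needs_review.append(tour_id)
    return accepted, needs_review


# Stand-in verdicts for three tours.
verdicts = {
    "tour-1": (True, 0.95),
    "tour-2": (True, 0.55),
    "tour-3": (False, 0.90),
}


def fake_judge(tour_id: str, label: str) -> Verdict:
    return verdicts[tour_id]


accepted, needs_review = route(
    [("tour-1", "1"), ("tour-2", "0"), ("tour-3", "1")], fake_judge
)
```

A confident disagreement is sent to humans as well, since it means the two models contradict each other and one of them must be wrong.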

Building an LLM Data Annotation API

It is rather straightforward to create an app that takes in tour information, pre-processes it, calls the LLM, and outputs labels. The question is how to implement the human-in-the-loop. One approach is to use LangSmith.

Figure 8: LLM data annotation development cycle

As we already use LangChain at Reiwa Travel, logging LLM runs is fairly easy. Simply augment the call to the LLM, and all or a subset of the inputs and outputs are logged in LangSmith, ready to be put into annotation pipelines where humans can record their assessment of the labels the LLM added. This can be done ad hoc or regularly during normal operation to check whether new labels or new tours are annotated accurately.

Conclusion

In this article, we explored how LLMs can be used in data annotation tasks. We discussed a conceptual view of how that works and common pitfalls that we encountered at Reiwa Travel, including some methods that can be used to overcome those pitfalls.

We also discussed the importance of the human-in-the-loop that makes the whole idea of a high-quality data annotation system plausible.

Next time, we’ll discuss the idea at the system implementation level.

