AI, Natural Language

Building something RAG-like with Llama2, Qdrant, and Langchain (Part 2)

See Part 1 first. It covers the basic operations of Qdrant.

Embedding a single file and storing it

I use langchain's LlamaCppEmbeddings to create the embeddings.
langchain also has a wrapper for operating Qdrant, so I use that for storage.

pip install langchain

I created a simple text file of about 9 KB and tried turning it into an embedding.

from langchain.vectorstores import Qdrant
from langchain.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(
    model_path="./llama/llama-2-7b-chat.ggmlv3.q5_0.bin"
)

# Note: this is the path string itself; the file is never opened here.
text = "./test.txt"

query_result = embeddings.embed_documents([text])

print(query_result)

# Result
llama_print_timings:        load time =   879.36 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   879.28 ms /     5 tokens (  175.86 ms per token,     5.69 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   880.45 ms

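The embedding completed without errors, but the log shows only 5 tokens.
Most likely the file's contents were never read: embed_documents takes the texts themselves, so the string "./test.txt" was embedded as-is.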
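A minimal sketch of embedding a file's contents rather than its path string. `embed_file` and `FakeEmbeddings` are hypothetical names used only for illustration; the stand-in class mimics just the `embed_documents` method so the sketch runs without a model.

```python
from pathlib import Path


def embed_file(embeddings, path):
    """Read the file at `path` and embed its contents (not the path string)."""
    text = Path(path).read_text(encoding="utf-8")
    return embeddings.embed_documents([text])


class FakeEmbeddings:
    """Hypothetical stand-in for LlamaCppEmbeddings; records what it receives."""

    def __init__(self):
        self.seen = []

    def embed_documents(self, texts):
        self.seen.extend(texts)
        # One dummy vector per input text; a real model returns float vectors.
        return [[float(len(t))] for t in texts]
```

With the real model, passing the `LlamaCppEmbeddings` object in place of `FakeEmbeddings()` should embed the whole file, subject to the model's context length.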

Let's do this a bit more properly.
I'll process the following text file.

When a person catches a common cold (cold syndrome), in addition to nose and throat symptoms, a variety of other symptoms throughout the body may be noticed. While some of these symptoms may resolve with the alleviation of cold symptoms, if the systemic symptoms are strong, it is important to note that it may not be cold syndrome, but other infectious diseases such as influenza or viral gastroenteritis.

・Headache, muscle pain
A substance called prostaglandin, which is secreted by the body to actively fight the virus, can cause fever and intensify headache, muscle aches, and joint pain. If the pain and fatigue are severe, there is a possibility of influenza.
・Mouth ulcers.
Mouth ulcers may be caused by summer colds, such as hand-foot-and-mouth disease and herpangina.
・Diarrhea and vomiting
When diarrhea and vomiting occur along with fever, it may be due to viral gastroenteritis, also known as a tummy cold.
・Constipation
Constipation may occur depending on the ingredients of cold remedies.

Following the langchain documentation, I load the file and split it before embedding.

from langchain.vectorstores import Qdrant
from langchain.embeddings import LlamaCppEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

# Embedding model (llama.cpp-format Llama 2 weights)
embeddings = LlamaCppEmbeddings(
    model_path="./llama/llama-2-7b-chat.ggmlv3.q8_0.bin"
)

# Load the file contents and split them into chunks of up to 1000 characters
loader = TextLoader("./text/cold.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Embed the chunks and store them in Qdrant, recreating the collection
url = "172.17.0.2"
qdrant = Qdrant.from_documents(
    docs,
    embeddings,
    url=url,
    collection_name="my_documents",
    force_recreate=True,
)

docs holds the text file above cut into chunks.
The chunks look like this. (The text describes the symptoms of a cold.)

[Document(page_content='When a person catches a common cold (cold syndrome), in addition to nose and throat symptoms, 
a variety of other symptoms throughout the body may be noticed. While some of these symptoms may resolve with the alleviation of cold symptoms, 
if the systemic symptoms are strong, it is important to note that it may not be cold syndrome, but other infectious diseases such as influenza or viral gastroenteritis.', 
metadata={'source': './text/cold.txt'}), 

Document(page_content='・Headache, muscle pain\nA substance called prostaglandin, which is secreted by the body to actively fight the virus, 
can cause fever and intensify headache, muscle aches, and joint pain. If the pain and fatigue are severe, there is a possibility of influenza.\n
・Mouth ulcers.\nMouth ulcers may be caused by summer colds, such as hand-foot-and-mouth disease and herpangina.\n
・Diarrhea and vomiting\nWhen diarrhea and vomiting occur along with fever, it may be due to viral gastroenteritis, also known as a tummy cold.\n
・Constipation\nConstipation may occur depending on the ingredients of cold remedies.', 
metadata={'source': './text/cold.txt'})]
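Why two chunks? CharacterTextSplitter splits on a separator (a blank line, "\n\n", by default) and then merges neighbouring pieces while they fit within chunk_size. The sketch below is a simplified approximation of that idea in plain Python, not langchain's actual implementation; `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the first piece of a chunk (kept even if oversized) or it still fits.
            current = candidate
        else:
            # Adding this piece would exceed the limit, so close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Under this approximation, the cold-symptom text splits at its single blank line, and the two parts together exceed 1000 characters, so they stay as separate chunks.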

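Result. This time a plausible number of tokens is processed.
In the first attempt I passed the path string straight to embed_documents, so only the path was embedded; here TextLoader reads the file contents first.
And why was this processed three times...? I only noticed when looking back. One likely explanation is that Qdrant.from_documents embeds the first chunk once up front to determine the vector size and then embeds every chunk, which would match the 96, 96, and 171 token counts below. I have not verified this.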

llama_print_timings:        load time =  1242.70 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 14828.89 ms /    96 tokens (  154.47 ms per token,     6.47 tokens per second)
llama_print_timings:        eval time =   183.88 ms /     1 runs   (  183.88 ms per token,     5.44 tokens per second)
llama_print_timings:       total time = 15038.92 ms

llama_print_timings:        load time =  1242.70 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 14901.77 ms /    96 tokens (  155.23 ms per token,     6.44 tokens per second)
llama_print_timings:        eval time =   190.72 ms /     1 runs   (  190.72 ms per token,     5.24 tokens per second)
llama_print_timings:       total time = 15115.98 ms

llama_print_timings:        load time =  1242.70 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 27076.05 ms /   171 tokens (  158.34 ms per token,     6.32 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 27115.19 ms
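To check how many embedding calls actually happen, one option is a thin wrapper that records each call before delegating to the real embeddings object. This is only a sketch: `CallLoggingEmbeddings` is a hypothetical name, and to pass such a wrapper to a langchain vector store it would probably need to subclass langchain's `Embeddings` base class, which is not shown here.

```python
class CallLoggingEmbeddings:
    """Wrap an embeddings object and record every embedding call made through it."""

    def __init__(self, inner):
        self.inner = inner
        self.calls = []  # list of (method name, number of texts)

    def embed_documents(self, texts):
        self.calls.append(("embed_documents", len(texts)))
        return self.inner.embed_documents(texts)

    def embed_query(self, text):
        self.calls.append(("embed_query", 1))
        return self.inner.embed_query(text)
```

After a run, `calls` shows whether the first chunk was embedded on its own before the full batch.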

Incidentally, this is what Qdrant contains.

[ScoredPoint(id='24ec942a-9ebb-414a-8925-dbf8620b3a54', version=0, score=0.16597658, 
payload={
'metadata': {'source': './text/cold.txt'}, 
'page_content': 
'When a person catches a common cold (cold syndrome), 
in addition to nose and throat symptoms, a variety of other symptoms throughout the body may be noticed. 
While some of these symptoms may resolve with the alleviation of cold symptoms, if the systemic symptoms are strong, 
it is important to note that it may not be cold syndrome, but other infectious diseases such as influenza or viral gastroenteritis.'}, 
vector=None), 

ScoredPoint(id='5b2932d1-f6ab-4a21-a9f0-79565b2d4d7d', version=0, score=-0.030298123, 
payload={'metadata': {'source': './text/cold.txt'}, 
'page_content': 
'・Headache, muscle pain\nA substance called prostaglandin, which is secreted by the body to actively fight the virus, 
can cause fever and intensify headache, muscle aches, and joint pain. If the pain and fatigue are severe, 
there is a possibility of influenza.\n・Mouth ulcers.\nMouth ulcers may be caused by summer colds, such as hand-foot-and-mouth disease and herpangina.\n
・Diarrhea and vomiting\nWhen diarrhea and vomiting occur along with fever, it may be due to viral gastroenteritis, also known as a tummy cold.
\n・Constipation\nConstipation may occur depending on the ingredients of cold remedies.'}, 
vector=None)]

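As far as I can tell from the langchain documentation, only the names of the "page_content" and "metadata" payload keys can be changed (at least nothing else is documented), so to build a custom payload you may have to use Qdrant's own upsert and similar calls.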
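A rough sketch of what writing a custom payload through qdrant_client directly could look like. The payload keys ("text", "source", "chunk_index") and the helper names `build_points` and `upsert_points` are assumptions made up for illustration; only `client.upsert` and `models.PointStruct` come from qdrant_client.

```python
import uuid


def build_points(chunks, vectors, source, extra=None):
    """Pair each chunk with its vector and a custom payload (plain dicts)."""
    points = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        # Hypothetical payload layout; choose whatever keys the application needs.
        payload = {"text": chunk, "source": source, "chunk_index": index}
        payload.update(extra or {})
        points.append(
            {"id": str(uuid.uuid4()), "vector": list(vector), "payload": payload}
        )
    return points


def upsert_points(client, collection_name, points):
    """Send the points to Qdrant; `client` is a qdrant_client.QdrantClient."""
    # Imported lazily so build_points can be used without qdrant_client installed.
    from qdrant_client.http import models

    client.upsert(
        collection_name=collection_name,
        points=[models.PointStruct(**p) for p in points],
    )
```

Note that points written this way do not use the "page_content" key, so the langchain wrapper would need its content key pointed at "text" to read them back.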

Processing multiple files and searching with llama

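I register in Qdrant the embeddings of text files that "loosely" describe the symptoms of a cold, COVID-19, and influenza, then send a query embedded with llama to check whether the desired result comes back.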

Reading the text files is easy to handle with a for loop.

from langchain.vectorstores import Qdrant
from langchain.embeddings import LlamaCppEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

embeddings = LlamaCppEmbeddings(
    model_path="./llama/llama-2-7b-chat.ggmlv3.q8_0.bin"
)

# Files to add; cold.txt is presumably still in the collection from the previous step
files = ["./text/colona.txt", "./text/influenza.txt"]
url = "172.17.0.2"

# The splitter settings are the same for every file
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

for file in files:
    # Load one file and split it into chunks
    documents = TextLoader(file).load()
    docs = text_splitter.split_documents(documents)

    # No force_recreate here, so the existing collection is kept
    qdrant = Qdrant.from_documents(
        docs,
        embeddings,
        url=url,
        collection_name="my_documents",
    )
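An alternative to calling from_documents once per file is to gather the chunks from every file first and store them in a single call. A small sketch of the gathering step, where `load_and_split` is a hypothetical helper and `load` and `split` are callables supplied by the caller:

```python
def load_and_split(paths, load, split):
    """Load every path, split each result, and return one flat list of chunks."""
    all_chunks = []
    for path in paths:
        all_chunks.extend(split(load(path)))
    return all_chunks


# Intended use with the objects from the script above (not executed here):
# docs = load_and_split(
#     files,
#     lambda p: TextLoader(p).load(),
#     text_splitter.split_documents,
# )
# Qdrant.from_documents(docs, embeddings, url=url, collection_name="my_documents")
```

Whether this is noticeably faster depends on how the vector store batches its embedding calls, which I have not measured.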

Next, the query "Tell me the symptoms of corona." is embedded with llama and sent to Qdrant (see the `query` line near the end of the script).

from langchain.vectorstores import Qdrant
from langchain.embeddings import LlamaCppEmbeddings

from qdrant_client import QdrantClient

client = QdrantClient("172.17.0.2", port=6333)
embeddings = LlamaCppEmbeddings(
    model_path="./llama/llama-2-7b-chat.ggmlv3.q8_0.bin"
)

db = Qdrant(
    client = client,
    collection_name = "my_documents",
    embeddings=embeddings
    )

query = "Tell me the symptoms of corona."
result,score = db.similarity_search_with_score(query)[0]

print(result.page_content)
print(f'score:{score}')

Result.

When a person catches a common cold (cold syndrome), in addition to nose and throat symptoms, a variety of other symptoms throughout the body may be noticed. 
While some of these symptoms may resolve with the alleviation of cold symptoms, if the systemic symptoms are strong, it is important to note that it may not be cold syndrome, 
but other infectious diseases such as influenza or viral gastroenteritis.
score:0.6049131

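This is the text from the file describing cold symptoms, so this experiment did not work very well, probably because the text files I fed in were too sloppy. Even so, the experiment made it clear that a lot should be possible once the chat-side responses are properly tied to a vector DB.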
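Looking only at the top hit hides how close the other chunks were. A small sketch for listing the top k hits with their scores and source files, assuming `db` is the langchain Qdrant object from the search script; `top_hits` is a hypothetical helper name.

```python
def top_hits(db, query, k=3):
    """Return (score, source file, first 80 characters) for the k best matches."""
    hits = db.similarity_search_with_score(query, k=k)
    return [
        (score, doc.metadata.get("source"), doc.page_content[:80])
        for doc, score in hits
    ]


# Intended use (not executed here):
# for score, source, preview in top_hits(db, "Tell me the symptoms of corona."):
#     print(f"{score:.4f} {source} {preview}")
```

Seeing which file each hit came from makes it easier to judge whether the corona text was a near miss or ranked far below the cold text.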
