Generating an embedding for the search query
Sentence Transformers provides local embedding models that capture the semantic meaning of sentences and paragraphs, and it is easy to use.
This HackerNews dataset contains vector embeddings generated with the all-MiniLM-L6-v2 model.
The example Python script below demonstrates how to programmatically generate embedding vectors using the sentence_transformers Python package. The search embedding vector is passed as an argument to the cosineDistance() function in the SELECT query.
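For intuition, ClickHouse's cosineDistance(a, b) returns 1 minus the cosine similarity of the two vectors, so smaller values mean more semantically similar text. A minimal pure-Python equivalent (an illustrative sketch, not the server's implementation):

```python
import math

def cosine_distance(a, b):
    # 1 - (a.b) / (|a| * |b|): the quantity the ORDER BY
    # cosineDistance(...) clause sorts by (smaller = more similar)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # ~0.0: same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal vectors
```

Ordering by this distance and taking the top N rows, as the script below does with LIMIT, yields the N most semantically similar posts.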
from sentence_transformers import SentenceTransformer
import sys
import clickhouse_connect
print("Initializing...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chclient = clickhouse_connect.get_client() # ClickHouse credentials here
while True:
    # Take the search query from the user
    print("Enter a search query :")
    input_query = sys.stdin.readline()
    texts = [input_query]

    # Run the model and obtain the search vector
    print("Generating the embedding for ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1': list(embeddings[0]), 'v2': 20}
    result = chclient.query(
        "SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s",
        parameters=params)

    print("Results :")
    for row in result.result_rows:
        print(row[0], row[2][:100])
        print("---------")
An example run of the above Python script and its similarity search results are shown below
(only the first 100 characters of each of the top 20 posts are printed):
Initializing...
Enter a search query :
Are OLAP cubes useful
Generating the embedding for "Are OLAP cubes useful"
Querying ClickHouse...
Results :
27742647 smartmic:
slt2021: OLAP Cube is not dead, as long as you use some form of:<p>1. GROUP BY multiple fi
---------
27744260 georgewfraser:A data mart is a logical organization of data to help humans understand the schema. Wh
---------
27761434 mwexler:"We model data according to rigorous frameworks like Kimball or Inmon because we must r
---------
28401230 chotmat:
erosenbe0: OLAP database is just a copy, replica, or archive of data with a schema designe
---------
22198879 Merick:+1 for Apache Kylin, it's a great project and awesome open source community. If anyone i
---------
27741776 crazydoggers:I always felt the value of an OLAP cube was uncovering questions you may not know to as
---------
22189480 shadowsun7:
_Codemonkeyism: After maintaining an OLAP cube system for some years, I'm not that
---------
27742029 smartmic:
gengstrand: My first exposure to OLAP was on a team developing a front end to Essbase that
---------
22364133 irfansharif:
simo7: I'm wondering how this technology could work for OLAP cubes.<p>An OLAP cube
---------
23292746 scoresmoke:When I was developing my pet project for Web analytics (<a href="https://github
---------
22198891 js8:It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by co
---------
28421602 chotmat:
7thaccount: Is there any advantage to OLAP cube over plain SQL (large historical database r
---------
22195444 shadowsun7:
lkcubing: Thanks for sharing. Interesting write up.<p>While this article accurately capt
---------
22198040 lkcubing:Thanks for sharing. Interesting write up.<p>While this article accurately captures the issu
---------
3973185 stefanu:
sgt: Interesting idea. Ofcourse, OLAP isn't just about the underlying cubes and dimensions,
---------
22190903 shadowsun7:
js8: It seems that the article makes a categorical error, arguing that OLAP cubes were r
---------
28422241 sradman:OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of
---------
28421480 chotmat:
sradman: OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the
---------
27742515 BadInformatics:
quantified: OP posts with inverted condition: “OLAP != OLAP Cube” is the actual titl
---------
28422935 chotmat:
rstuart4133: I remember hearing about OLAP cubes donkey's years ago (probably not far
---------
Summarization demo application
The examples above demonstrated semantic search and document retrieval using ClickHouse.
Next, a very simple but high-potential generative AI example application is presented.
The application performs the following steps:
1. Accepts a topic entered by the user as input
2. Generates an embedding vector for the topic using the all-MiniLM-L6-v2 model from SentenceTransformers
3. Retrieves the posts and comments most relevant to the topic from the hackernews table via vector similarity search
4. Summarizes the content retrieved in step 3 using LangChain and the OpenAI gpt-3.5-turbo Chat API

The posts/comments retrieved in step 3 are passed to the Chat API as context; they are the key link in this generative AI pipeline.
An example run of the summarization application is listed first below, followed by the application code. Running the application requires the environment variable OPENAI_API_KEY to be set to an OpenAI API key, which can be obtained by registering at https://platform.openai.com.
This application demonstrates a generative AI use case applicable to many enterprise domains: customer sentiment analysis, technical support automation, user conversation analysis, legal documents, medical records, meeting transcripts, financial statements, and more.
$ python3 summarize.py
Enter a search topic :
ClickHouse performance experiences
Generating the embedding for ----> ClickHouse performance experiences
Querying ClickHouse to retrieve relevant articles...
Initializing chatgpt-3.5-turbo model...
Summarizing search results retrieved from ClickHouse...
Summary from chatgpt-3.5:
The discussion focuses on comparing ClickHouse with various databases like TimescaleDB, Apache Spark,
AWS Redshift, and QuestDB, highlighting ClickHouse's cost-efficient high performance and suitability
for analytical applications. Users praise ClickHouse for its simplicity, speed, and resource efficiency
in handling large-scale analytics workloads, although some challenges like DMLs and difficulty in backups
are mentioned. ClickHouse is recognized for its real-time aggregate computation capabilities and solid
engineering, with comparisons made to other databases like Druid and MemSQL. Overall, ClickHouse is seen
as a powerful tool for real-time data processing, analytics, and handling large volumes of data
efficiently, gaining popularity for its impressive performance and cost-effectiveness.
The code for the above application:
print("Initializing...")
import sys
import json
import time
from sentence_transformers import SentenceTransformer
import clickhouse_connect
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
import textwrap
import tiktoken
def num_tokens_from_string(string: str, model_name: str) -> int:
    # Count tokens with the tokenizer that matches the target chat model
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chclient = clickhouse_connect.get_client(compress=False) # ClickHouse credentials here

while True:
    # Take the search topic from the user
    print("Enter a search topic :")
    input_query = sys.stdin.readline()
    texts = [input_query]

    # Run the model and obtain the search (reference) vector
    print("Generating the embedding for ----> ", input_query)
    embeddings = model.encode(texts)

    print("Querying ClickHouse to retrieve relevant articles...")
    params = {'v1': list(embeddings[0]), 'v2': 100}
    result = chclient.query(
        "SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s",
        parameters=params)

    # Just join all the search results into one text blob
    doc_results = ""
    for row in result.result_rows:
        doc_results = doc_results + "\n" + row[2]

    print("Initializing chatgpt-3.5-turbo model...")
    model_name = "gpt-3.5-turbo"
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        model_name=model_name
    )
    texts = text_splitter.split_text(doc_results)
    docs = [Document(page_content=t) for t in texts]
    llm = ChatOpenAI(temperature=0, model_name=model_name)

    prompt_template = """
Write a concise summary of the following in not more than 10 sentences:
{text}
CONCISE SUMMARY :
"""
    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
    num_tokens = num_tokens_from_string(doc_results, model_name)
    gpt_35_turbo_max_tokens = 4096
    verbose = False

    print("Summarizing search results retrieved from ClickHouse...")
    # "stuff" fits all text into one prompt; "map_reduce" summarizes
    # chunk-by-chunk when the text exceeds the model's context window
    if num_tokens <= gpt_35_turbo_max_tokens:
        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
    else:
        chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt, verbose=verbose)
    summary = chain.run(docs)
    print(f"Summary from chatgpt-3.5: {summary}")
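The choice between the "stuff" and "map_reduce" chain types near the end of the script reduces to a simple token-budget check; a standalone sketch of that decision (the 4096-token default reflects the original gpt-3.5-turbo context window used here):

```python
def choose_chain_type(num_tokens: int, context_limit: int = 4096) -> str:
    # "stuff": all retrieved text fits into a single prompt.
    # "map_reduce": summarize chunks separately, then merge the summaries.
    return "stuff" if num_tokens <= context_limit else "map_reduce"

print(choose_chain_type(1200))   # stuff
print(choose_chain_type(15000))  # map_reduce
```

"stuff" gives the model the full context in one call, so it is preferable whenever the retrieved text fits; "map_reduce" trades some coherence for the ability to handle arbitrarily large result sets.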