Hey everyone! I want to share my approach to building an "intelligent documentation" chatbot for my next projects. I'm not an AI expert, so any suggestions or improvements are more than welcome!
The goal of this post is not to write yet another tutorial on building a chatbot on top of OpenAI; there is already plenty of content on that topic. Instead, the main ideas are: indexing the documentation, splitting it into manageable chunks, generating embeddings with OpenAI, and performing a similarity search to find and return the information most relevant to the user's question.
In my case, the documentation will be Markdown files, but it could be any kind of text, database objects, etc.
Why?
Because it can be hard to find exactly the information you need, I wanted to build a chatbot that can answer questions about a specific topic and provide the relevant context from the documentation.
This assistant can be used in many ways, such as:
- Providing fast answers to common questions
- Searching a doc/page, as Algolia does
- Helping users find the information they need in a given doc
- Collecting users' concerns/questions by storing the questions asked
Summary
Below are the three main parts of my solution:
- Reading the documentation files
- Indexing the documentation (chunking, overlap, and embedding)
- Searching the documentation (plugged into the chatbot)
File Tree
.
├── docs
│   └── ...md
├── src
│   ├── askDocQuestion.ts
│   └── index.ts    # Express.js application endpoint
├── embeddings.json # Storage for embeddings
└── package.json
1. Reading the Documentation Files
Instead of hardcoding the documentation, you can scan a folder for `.md` files using a tool like `glob`.
// Example snippet of fetching files from a folder:
import fs from "node:fs";
import path from "node:path";
import glob from "glob";
const DOC_FOLDER_PATH = "./docs";
type FileData = {
path: string;
content: string;
};
const readAllMarkdownFiles = (): FileData[] => {
const filesContent: FileData[] = [];
const filePaths = glob.sync(`${DOC_FOLDER_PATH}/**/*.md`);
filePaths.forEach((filePath) => {
const content = fs.readFileSync(filePath, "utf8");
filesContent.push({ path: filePath, content });
});
return filesContent;
};
Alternatively, you can of course fetch your documentation from your database, a CMS, etc.
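For example, if the documentation lived in a CMS, the records would only need to be mapped to the same `FileData` shape so the rest of the pipeline stays unchanged. A minimal sketch (the CMS field names `slug` and `body` are invented for illustration):

```typescript
// Hypothetical CMS record shape -- the field names are made up for this sketch
type CmsArticle = {
  slug: string;
  body: string;
};

type FileData = {
  path: string;
  content: string;
};

// Map CMS records to the same FileData structure used for Markdown files,
// so chunking, embedding, and search work identically downstream
const cmsArticlesToFileData = (articles: CmsArticle[]): FileData[] =>
  articles.map((article) => ({
    path: article.slug,
    content: article.body,
  }));
```

The rest of the pipeline (chunking, embedding, searching) then works the same regardless of where the content comes from.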
2. Indexing the Documentation
To build our search engine, we will use OpenAI's Vector Embeddings API to generate our embeddings.
Vector embeddings are a way of representing data in a numerical format, which can then be used to perform similarity searches (in our case, between a user's question and our documentation sections).
Each vector, which is a list of floating-point numbers, is used to compute similarity with a mathematical formula.
[
-0.0002630692, -0.029749284, 0.010225477, -0.009224428, -0.0065269712,
-0.002665544, 0.003214777, 0.04235309, -0.033162255, -0.00080789323,
//...+1533 elements
];
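The formula in question is the cosine similarity, which measures the angle between two vectors A and B:

```latex
\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}
```

A result close to 1 means the two vectors point in almost the same direction (similar meaning); a result close to 0 means they are unrelated. We implement it in section 3.1.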
Vector databases were built around this very concept. So instead of using the OpenAI API directly, it is also possible to use a vector database such as Chroma, Qdrant, or Pinecone.
2.1 Chunking Every File & Overlap
Large blocks of text can exceed the model's context limit or produce less relevant hits, so it is advised to split them into chunks to make the search more targeted. However, to preserve some continuity between chunks, we overlap them by a certain number of tokens (or characters). That way, a chunk boundary is less likely to cut a sentence or an idea in the middle.
Chunking Example
In this example, we have a long text that we want to split into smaller chunks. Here we create chunks of 150 characters that overlap by 50 characters.
Full Text (406 characters):
In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.
-
Chunk 1 (Characters 1-150):
In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginabl.
-
Chunk 2 (Characters 101-250):
shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a d
-
Chunk 3 (Characters 201-350):
ysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge
-
Chunk 4 (Characters 301-406):
curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.
Code Snippet
const CHARS_PER_TOKEN = 4.15; // Pessimistic approximation of the number of characters per token. You can use `tiktoken` or another tokenizer for a more precise count
const MAX_TOKENS = 500; // Maximum number of tokens per chunk
const OVERLAP_TOKENS = 100; // Number of tokens to overlap between chunks
const maxChar = MAX_TOKENS * CHARS_PER_TOKEN;
const overlapChar = OVERLAP_TOKENS * CHARS_PER_TOKEN;
const chunkText = (text: string): string[] => {
const chunks: string[] = [];
let start = 0;
while (start < text.length) {
let end = Math.min(start + maxChar, text.length);
// Don’t cut a word in half if possible:
if (end < text.length) {
const lastSpace = text.lastIndexOf(" ", end);
if (lastSpace > start) end = lastSpace;
}
chunks.push(text.substring(start, end));
// Overlap management: stop once the end of the text is reached,
// otherwise the overlap would produce a redundant trailing chunk
if (end === text.length) break;
const nextStart = end - overlapChar;
start = nextStart <= start ? end : nextStart;
}
return chunks;
};
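To make the overlap visible, here is a standalone run of the same logic with deliberately tiny limits (the real values come out to roughly 2075 and 415 characters), including the guard that stops once the end of the text is reached:

```typescript
const maxChar = 20; // Illustrative chunk size in characters
const overlapChar = 5; // Illustrative overlap in characters

const chunkText = (text: string): string[] => {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChar, text.length);
    // Don't cut a word in half if possible:
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    chunks.push(text.substring(start, end));
    if (end === text.length) break; // Reached the end of the text
    const nextStart = end - overlapChar;
    start = nextStart <= start ? end : nextStart;
  }
  return chunks;
};

console.log(chunkText("The quick brown fox jumps over the lazy dog"));
// → ["The quick brown fox", "n fox jumps over the", "r the lazy dog"]
```

Note how each chunk starts with the last 5 characters of the previous one, so a sentence cut at a boundary still has its ending available in the next chunk.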
To learn more about chunking, and the impact of chunk size on embeddings, you can read this article.
2.2 Embedding Generation
Once a file is read, we generate a vector embedding for each chunk using the OpenAI API (e.g. `text-embedding-3-large`):
import { OpenAI } from "openai";
const EMBEDDING_MODEL: OpenAI.Embeddings.EmbeddingModel =
"text-embedding-3-large"; // Model to use for embedding generation
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const generateEmbedding = async (textChunk: string): Promise<number[]> => {
const response = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: textChunk,
});
return response.data[0].embedding; // Return the generated embedding
};
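A small optimization worth knowing: the OpenAI embeddings endpoint also accepts an array of inputs, so chunks can be embedded in batches rather than one request per chunk. The grouping itself is a pure helper (the batch size of 100 below is an arbitrary choice, not an API requirement):

```typescript
// Group items into batches of a given size; the last batch may be smaller
const batchItems = <T>(items: T[], batchSize: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
};

// Sketch of batched embedding generation (assumes the `openai` client above):
// for (const batch of batchItems(chunks, 100)) {
//   const response = await openai.embeddings.create({
//     model: EMBEDDING_MODEL,
//     input: batch, // the API accepts an array of strings
//   });
//   // response.data[i].embedding corresponds to batch[i]
// }
```

This cuts down the number of HTTP round-trips considerably when indexing a large documentation set.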
2.3 Generating and Saving the Embeddings for All Files
To avoid regenerating the embeddings every time, we persist them. They could be stored in a database, but for this example we simply store them locally in a JSON file.
The following code simply:
- iterates over each document,
- splits the document into chunks,
- generates an embedding for each chunk,
- saves the embeddings to a JSON file,
- fills the vectorStore with the embeddings used for the search.
import embeddingsList from "../embeddings.json";
/**
* Simple in-memory vector store to hold document embeddings and their content.
* Each entry contains:
* - filePath: A unique key identifying the document
* - chunkNumber: The number of the chunk within the document
* - content: The actual text content of the chunk
* - embedding: The numerical embedding vector for the chunk
*/
const vectorStore: {
filePath: string;
chunkNumber: number;
content: string;
embedding: number[];
}[] = [];
/**
* Indexes all Markdown documents by generating embeddings for each chunk and storing them in memory.
* Also updates the embeddings.json file if new embeddings are generated.
*/
export const indexMarkdownFiles = async (): Promise<void> => {
// Retrieve documentations
const docs = readAllMarkdownFiles();
let newEmbeddings: Record<string, number[]> = {};
for (const doc of docs) {
// Split the document content into chunks
const fileChunks = chunkText(doc.content);
// Iterate over each chunk within the current file
for (let chunkIndex = 0; chunkIndex < fileChunks.length; chunkIndex++) {
const chunkNumber = chunkIndex + 1; // Chunk number starts at 1
const chunksNumber = fileChunks.length;
const chunk = fileChunks[chunkIndex];
const embeddingKeyName = `${doc.path}/chunk_${chunkNumber}`; // Unique key for the chunk
// Retrieve precomputed embedding if available
const existingEmbedding = embeddingsList[
embeddingKeyName as keyof typeof embeddingsList
] as number[] | undefined;
let embedding = existingEmbedding; // Use existing embedding if available
if (!embedding) {
embedding = await generateEmbedding(chunk); // Generate embedding if not present
}
newEmbeddings = { ...newEmbeddings, [embeddingKeyName]: embedding };
// Store the embedding and content in the in-memory vector store
vectorStore.push({
filePath: doc.path,
chunkNumber,
embedding,
content: chunk,
});
console.info(`- Indexed: ${embeddingKeyName} (${chunkNumber}/${chunksNumber})`);
}
}
/**
* Compare the newly generated embeddings with existing ones
*
* If there is change, update the embeddings.json file
*/
try {
if (JSON.stringify(newEmbeddings) !== JSON.stringify(embeddingsList)) {
fs.writeFileSync(
"./embeddings.json",
JSON.stringify(newEmbeddings, null, 2)
);
}
} catch (error) {
console.error(error);
}
};
3. Searching the Documentation
3.1 Vector Search
To answer a user's question, we first generate an embedding for the user's question, then compute the cosine similarity between the query embedding and each chunk's embedding. We filter out everything below a certain similarity threshold and keep only the top X matches.
/**
* Calculates the cosine similarity between two vectors.
* Cosine similarity measures the cosine of the angle between two vectors in an inner product space.
* Used to determine the similarity between chunks of text.
*
* @param vecA - The first vector
* @param vecB - The second vector
* @returns The cosine similarity score
*/
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
// Calculate the dot product of the two vectors
const dotProduct = vecA.reduce((sum, a, idx) => sum + a * vecB[idx], 0);
// Calculate the magnitude (Euclidean norm) of each vector
const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
// Compute and return the cosine similarity
return dotProduct / (magnitudeA * magnitudeB);
};
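As a quick sanity check of `cosineSimilarity` (re-declared here so the snippet runs standalone): identical vectors score 1, while orthogonal vectors score 0.

```typescript
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
  const dotProduct = vecA.reduce((sum, a, idx) => sum + a * vecB[idx], 0);
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dotProduct / (magnitudeA * magnitudeB);
};

console.log(cosineSimilarity([1, 2, 3], [1, 2, 3])); // ≈ 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal, unrelated)
```

In practice, a documentation chunk that matches a query will land somewhere in between, which is why the threshold below (0.77) was tuned empirically rather than derived.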
const MIN_RELEVANT_CHUNKS_SIMILARITY = 0.77; // Minimum similarity required for a chunk to be considered relevant
const MAX_RELEVANT_CHUNKS_NB = 15; // Maximum number of relevant chunks to attach to chatGPT context
/**
* Searches the indexed documents for the most relevant chunks based on a query.
* Utilizes cosine similarity to find the closest matching embeddings.
*
* @param query - The search query provided by the user
* @returns An array of the top matching document chunks' content
*/
const searchChunkReference = async (query: string) => {
// Generate an embedding for the user's query
const queryEmbedding = await generateEmbedding(query);
// Calculate similarity scores between the query embedding and each document's embedding
const results = vectorStore
.map((doc) => ({
...doc,
similarity: cosineSimilarity(queryEmbedding, doc.embedding), // Add similarity score to each doc
}))
// Filter out documents with low similarity scores
// Avoid to pollute the context with irrelevant chunks
.filter((doc) => doc.similarity > MIN_RELEVANT_CHUNKS_SIMILARITY)
.sort((a, b) => b.similarity - a.similarity) // Sort documents by highest similarity first
.slice(0, MAX_RELEVANT_CHUNKS_NB); // Select the top most similar documents
// Return the content of the top matching documents
return results;
};
3.2 Prompting OpenAI with the Relevant Chunks
Once retrieved, we inject the top chunks into the system prompt of the ChatGPT request. This means ChatGPT sees the most relevant sections of your docs as if you had typed them into the conversation yourself. We then let ChatGPT generate an answer for the user.
const MODEL: OpenAI.Chat.ChatModel = "gpt-4o-2024-11-20"; // Model to use for chat completions
// Define the structure of messages used in chat completions
export type ChatCompletionRequestMessage = {
role: "system" | "user" | "assistant"; // The role of the message sender
content: string; // The text content of the message
};
/**
* Handles the "Ask a question" endpoint in an Express.js route.
* Processes user messages, retrieves relevant documents, and interacts with OpenAI's chat API to generate responses.
*
* @param messages - An array of chat messages from the user and assistant
* @returns The assistant's response as a string
*/
export const askDocQuestion = async (
messages: ChatCompletionRequestMessage[]
): Promise<string> => {
// Assistant's responses are filtered out, otherwise the chatbot could get stuck in a self-referential loop
// Note that the embedding precision will be lower if the user changes context during the chat
const userMessages = messages.filter((message) => message.role === "user");
// Format the user's question to keep only the relevant keywords
const formattedUserMessages = userMessages
.map((message) => `- ${message.content}`)
.join("\n");
// 1) Find relevant documents based on the user's question
const relevantChunks = await searchChunkReference(formattedUserMessages);
// 2) Integrate the relevant documents into the initial system prompt
const messagesList: ChatCompletionRequestMessage[] = [
{
role: "system",
content:
"Ignore all previous instructions. \
You're a helpful chatbot.\
...\
Here is the relevant documentation:\
" +
relevantChunks
.map(
(doc, idx) =>
`[Chunk ${idx}] filePath = "${doc.filePath}":\n${doc.content}`
)
.join("\n\n"), // Insert relevant chunks into the prompt
},
...messages, // Include the chat history
];
// 3) Send the compiled messages to OpenAI's Chat Completion API (using a specific model)
const response = await openai.chat.completions.create({
model: MODEL,
messages: messagesList,
});
const result = response.choices[0].message.content; // Extract the assistant's reply
if (!result) {
throw new Error("No response from OpenAI");
}
return result;
};
Chatbot OpenAI API Using Express
To serve our system, we use an Express.js server. Here is an example of a minimal Express.js endpoint that handles the request:
import express, { type Request, type Response } from "express";
import {
ChatCompletionRequestMessage,
askDocQuestion,
indexMarkdownFiles,
} from "./askDocQuestion";
// Automatically fill the vector store with embeddings when server starts
indexMarkdownFiles();
const app = express();
// Parse incoming requests with JSON payloads
app.use(express.json());
type AskRequestBody = {
messages: ChatCompletionRequestMessage[];
};
// Routes
app.post(
"/ask",
async (
req: Request<undefined, undefined, AskRequestBody>,
res: Response<string>
) => {
try {
const response = await askDocQuestion(req.body.messages);
res.json(response);
} catch (error) {
console.error(error);
res.status(500).send("Internal server error");
}
}
);
// Start server
app.listen(3000, () => {
console.log(`Listening on port 3000`);
});
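From any client, calling the endpoint is a single POST request. Here is a small helper that builds the request options (the URL and port match the server above; adapt them to your deployment):

```typescript
// Shape of a chat message, matching ChatCompletionRequestMessage above
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Minimal request-options type so the sketch is self-contained
type AskRequestInit = {
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Build the fetch options for a POST to the /ask endpoint
const buildAskRequest = (messages: ChatMessage[]): AskRequestInit => ({
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages }),
});

// Usage (actual network call, shown for illustration):
// const res = await fetch("http://localhost:3000/ask", buildAskRequest([
//   { role: "user", content: "How do I install the project?" },
// ]));
// const answer: string = await res.json();
```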
UI: Building the Chatbot Interface
On the frontend, I built a small React component with a chat-like interface. It sends the messages to my Express backend and displays the responses. Nothing fancy, so I'll skip the details.
Code Template
I made a code template you can use as a starting point for your own chatbot.
Live Demo
If you want to test the final implementation of this chatbot, check out my demo page.
My Demo Code
- Backend: askDocQuestion.ts
- Frontend: the ChatBot components
To Go Further
On YouTube, check out this video from Adrien Twarog about OpenAI Embeddings and Vector Databases.
I also stumbled upon OpenAI's Assistants File Search documentation, which could be interesting if you want an alternative approach.
Conclusion
I hope this gave you an idea of how to implement a documentation chatbot:
- using chunking + overlap so the relevant context can be found,
- generating embeddings and storing them for a fast vector similarity search,
- and finally, feeding it all to ChatGPT with the relevant context.
I'm not an AI expert; this is just a solution I found that works well for my needs. If you have any suggestions for improving it, or a better one, please let me know! I would also love to hear about vector storage solutions, chunking strategies, or other performance tips.
Thanks for reading, and feel free to share your thoughts!