Looking to move data from SharePoint to a vector database using OpenAI

I am currently trying to move data from MS SharePoint to a vector database using OpenAI. I have designed the workflow to pull the data pages, but I'm not sure how the data should be broken into chunks and embedded into the Pinecone database.

We have PDF, Word, and Excel files in the SharePoint site, but we are looking to focus on PDF at the moment.

Hi Thomas,

Thanks for reaching out. It’s great to hear you’ve already set up the workflow to extract content from SharePoint. Regarding your question about chunking, here is a very common approach: a recursive character text splitter. It tries the coarsest separator first (paragraph breaks) and falls back to finer ones only when a piece is still too long, so it works reasonably well for most plain-text content.

You can create a preSavePage hook and add the following code to split the data into chunks of a suitable length.

In the following code, I have used ["\n\n", "\n", " ", ""] as the separators. You can change the separators, chunk size, and overlap to suit your needs.

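//preSavePage script
// Splits text on progressively finer separators until each chunk fits within chunkSize
class RecursiveCharacterTextSplitter {
  constructor(options = {}) {
    this.chunkSize = options.chunkSize || 1000;
    // Allow an explicit overlap of 0 instead of silently replacing it with the default
    this.chunkOverlap = options.chunkOverlap === undefined ? 200 : options.chunkOverlap;
    this.separators = options.separators || ["\n\n", "\n", " ", ""]; // Default separators
    // An overlap >= chunkSize would make the fallback split in splitText recurse forever
    if (this.chunkOverlap >= this.chunkSize) {
      throw new Error("chunkOverlap must be smaller than chunkSize");
    }
  }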
  splitText(text) {
    const chunks = [];
    let remainingText = text;
    const _recursiveSplit = (currentText, currentSeparators) => {
      if (currentText.length <= this.chunkSize) {
        chunks.push(currentText);
        return;
      }
      if (currentSeparators.length === 0) {
        // Fallback: no separators are left, so hard-cut at chunkSize.
        // This is the only place chunkOverlap is applied. When the separator list ends in "",
        // the text is split down to single characters first and this branch is never reached.
        chunks.push(currentText.substring(0, this.chunkSize));
        // The remainder is never empty here because currentText.length > chunkSize
        _recursiveSplit(currentText.substring(this.chunkSize - this.chunkOverlap), []);
        return;
      }
      const separator = currentSeparators[0];
      const parts = currentText.split(separator);
      const nextSeparators = currentSeparators.slice(1);
      let currentChunk = "";
      for (let i = 0; i < parts.length; i++) {
        const part = parts[i];
        if ((currentChunk + part + (i < parts.length - 1 ? separator : "")).length <= this.chunkSize) {
          currentChunk += part + (i < parts.length - 1 ? separator : "");
        } else {
          if (currentChunk.length > 0) {
            _recursiveSplit(currentChunk, nextSeparators);
          }
          currentChunk = part + (i < parts.length - 1 ? separator : "");
        }
      }
      if (currentChunk.length > 0) {
        _recursiveSplit(currentChunk, nextSeparators);
      }
    };
    _recursiveSplit(remainingText, this.separators);
    return chunks;
  }
}
function preSavePage (options) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 70,
    separators: ["\n\n", "\n", " ", ""]
  });
  // Only the first record on the page is chunked, and its text is expected in the _raw field
  const chunks = splitter.splitText(options.data[0]._raw);
  //console.log(chunks.length)
  return {
    data: chunks,
    errors: options.errors,
    abort: false,
    newErrorsAndRetryData: []
  }
}
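
One thing to be aware of: in the script above, chunkOverlap is only used in the hard-cut fallback, and with separators that end in "" that fallback is not reached, so the chunks come out without any overlap. If you do want overlap, one option is to add it as a separate step after splitting. The snippet below is a minimal sketch of that idea, not part of the original script; addOverlap is a hypothetical helper name, and it assumes the chunks are in document order.

```javascript
// Hypothetical helper, not part of the script above.
// Prepends the tail of the previous chunk to every chunk after the first,
// so neighbouring chunks share up to `overlap` characters of context.
function addOverlap(chunks, overlap) {
  if (overlap <= 0) return chunks.slice();
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const previous = chunks[i - 1];
    // slice(-overlap) returns the whole string when it is shorter than overlap
    return previous.slice(-overlap) + chunk;
  });
}

// Small illustration with hand-written chunks
const chunks = ["The quick brown ", "fox jumps over ", "the lazy dog"];
const overlapped = addOverlap(chunks, 5);
console.log(overlapped);
```

Chunks produced this way can be up to `overlap` characters longer than chunkSize, so you may want to lower chunkSize by the same amount if the size limit is strict.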

Let me know if this works for you. Please feel free to reach out if you need any further assistance.
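
For the embedding part of the question, here is a rough sketch of the data shapes involved once you have the chunks. It makes no network calls; it only builds the request body for OpenAI's embeddings endpoint (which accepts an array of strings as `input`) and reshapes a response into Pinecone upsert vectors (`id`, `values`, `metadata`). The helper names, the id format, the metadata fields, and the sample values are all assumptions for illustration, so adjust them to your flow.

```javascript
// Hypothetical helpers; `source`, `chunkIndex` and the "docId#index" id format are illustrative.

// Request body for OpenAI's POST /v1/embeddings endpoint
function buildEmbeddingRequest(chunks, model) {
  return { model: model, input: chunks };
}

// Pairs each chunk with its embedding and shapes it as a Pinecone upsert vector.
// `embeddingResponse` is the parsed JSON returned by the embeddings endpoint.
function toPineconeVectors(docId, chunks, embeddingResponse) {
  return embeddingResponse.data.map((item) => ({
    id: docId + "#" + item.index,
    values: item.embedding,
    metadata: { source: docId, chunkIndex: item.index, text: chunks[item.index] }
  }));
}

// Stand-in response with tiny made-up vectors, only to show the shapes
const sampleChunks = ["first chunk", "second chunk"];
const fakeResponse = {
  data: [
    { index: 0, embedding: [0.1, 0.2, 0.3] },
    { index: 1, embedding: [0.4, 0.5, 0.6] }
  ]
};

const request = buildEmbeddingRequest(sampleChunks, "text-embedding-3-small");
const vectors = toPineconeVectors("policy.pdf", sampleChunks, fakeResponse);
console.log(JSON.stringify({ vectors: vectors }, null, 2));
```

Keep in mind that the Pinecone index dimension has to match the length of the vectors returned by the embedding model you choose.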

Thanks

@thomasjohnnielsen926 …. how’d it go?

I did try this, but was not able to get it to work. We are working with @anirudhsundaram54. I have asked him to review these notes and see if he can get it to work.