Mark Bailey

Vector Search App MVP

mada

Published: 16 May 2024 › Updated: 16 May 2024

photo_2024-05-15_21-35-17.jpg

The above image was made by @amberjyang with Midjourney using the prompt 'a blue python slithering through computer coding numbers.'

Background

Last year, I wrote a news recommendation algorithm for WantToKnow.info. You can read about the project here and test out the recommendations by clicking on any article title in our archive. The recommendations are based on something called TF-IDF vector cosine similarity: each news story is turned into a vector of weighted word frequencies, and stories whose vectors point in similar directions are treated as related.

More recently, I was inspired to expand the underlying tech to vector search. WantToKnow has good search already, but it's keyword-based. My thinking is that vectorizing search queries and then comparing query vectors with news article vectors could surface good stories in situations where keywords alone aren't cutting it.
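To make the idea concrete, here is a minimal standard-library sketch of ranking a few texts against a query by TF-IDF cosine similarity. It is not the code used on the site: the function names, the toy article strings, and the query are all invented for illustration, and the weighting formula (smoothed IDF, L2 normalization, tokens of two or more word characters) is an attempt to mirror scikit-learn's `TfidfVectorizer` defaults rather than a guaranteed match.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(texts):
    """Return one L2-normalized {term: weight} dict per input text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed IDF, assumed here to match scikit-learn's default formula.
        vec = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


# Hypothetical toy data, not taken from the WantToKnow archive.
articles = [
    "Declassified documents reveal a secret surveillance program",
    "New study links diet and exercise to heart health",
    "Whistleblower describes surveillance of journalists",
]
query = "Which programs put journalists under surveillance?"

# The query is vectorized together with the articles, as row 0.
vectors = tfidf_vectors([query] + articles)
scores = [cosine(vectors[0], v) for v in vectors[1:]]
ranked = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Note that "program" and "programs" count as different terms here, which is one way a purely word-based vector search can still miss a match.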

Success

Today I got a vector search app to the minimum viable product stage. I made a web page that takes any detailed question or description about any conspiracy-related topic as input and outputs a list of the 20 most relevant news article summaries. All of the logic is Python, glued to the HTML with PyScript, with a CSV file stored on IPFS instead of a database.

<!DOCTYPE html>
<html lang="en">
<head>
    <title>WantToKnow Archive Vector Search</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link rel="stylesheet" href="https://pyscript.net/releases/2023.05.1/pyscript.css" />
    <script defer src="https://pyscript.net/releases/2023.05.1/pyscript.js"></script>
    <style>
        body {
            margin-left: 20%;
            margin-right: 20%;
        }

        #mainstory {
            color: white;
            background-color: black;
            padding: 10px
        }

        textarea {
            width: 100%;
            height: 150px;
            padding: 12px 20px;
            box-sizing: border-box;
            border: 5px solid black;
            background-color: #f8f8f8;
            font-size: 16px;
            resize: none;
        }

        button {
            width: 100%;
            color: white;
            background-color: black;
            font-size: 24px;
            text-align: center;
            padding: 12px;
        }

        button:hover {
            color: black;
            background-color: white;
        }
    </style>
</head>
<body>
<py-config>
    packages = [
        "pandas",
        "scikit-learn"
    ]
    terminal = false
</py-config>
    
    <h1>WantToKnow.info Archive Vector Search</h1>
    <p>Find news article recommendations based on term frequency-inverse document frequency (TF-IDF) vector cosine similarities. A search returns the 20 most closely related summaries.</p>
    <p><strong>Instructions:</strong> enter a question or statement. When it comes to conspiracies and cover-ups, what do you most want to know? Be as detailed as possible. Five or six sentences is optimal. Press the submit button only once and wait for the data to be crunched.</p>
    
    <textarea id="askit">What do you want to know?</textarea>
    <button id="submit-btn">Submit Query for Processing</button>
    <div id="mainstory"></div>
    <div id="relatedstories"></div>
    
<script type="pyscript">
import pandas as pd
import re
from js import console
from pyscript import when, display
from pyodide.http import open_url
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@when('click', '#submit-btn')
def query():
    question = Element('askit').element.value
    Element('mainstory').write(question)
    url = 'the IPFS url of my csv file'
    df = pd.read_csv(open_url(url), sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url'])
    
    # Deduplication and NaN cleanup
    df = df.drop_duplicates('Title')
    df = df[df['Priority'].notna()]

    # Substituting multiple spaces with single space
    df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))

    # Collapse doubled double quotes ("") into a single double quote
    df['Description'] = df['Description'].apply(lambda r: r.replace('\"\"', '\"'))

    # Remove paragraph styling
    # NOTE: the tag strings passed to the original four replace() calls, and the
    # line-break markup in the results loop, were lost when this page was rendered.
    # The regex below is a stand-in that assumes they were HTML paragraph tags,
    # and the <br> tags further down are likewise reconstructed.
    df['Description'] = df['Description'].apply(lambda r: re.sub(r'</?p(\s[^>]*)?>', '', r))

    # Insert the search text as row 0 so it is vectorized alongside the articles
    query_row = pd.DataFrame({'ArticleId': '54321','Title': 'Search Terms','PublicationDate': '','Publication': '','Links': '','Description': 'Variable','Priority': '','url': ''}, index=[0])
    df = pd.concat([query_row, df]).reset_index(drop=True)
    df.at[0, 'Description'] = question

    # Compute TF-IDF vectors and cosine similarities
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df['Description'])
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()

    # Find the 20 most similar articles (the last sorted index is the query itself)
    similar_indices = cosine_similarities.argsort()[-21:-1][::-1]
    similar_items = df.iloc[similar_indices]

    # Display the results in the specified format
    result_html = ""
    for index, row in similar_items.iterrows():
        for col in df.columns:
            result_html += f"{col}: {row[col]}<br>"
        result_html += "<br>"
    display(result_html, target="relatedstories")
</script>
</body>
</html>

As of now, the results display needs work, but the thing is basically operational. Calling the main function with an event-listening decorator still seems weird to me, but this was the only way I could get it to work. I ended up using GPT-4 to get the cosine similarities computed efficiently and was surprised by how much better GPT-4 is compared to GPT-3.5.
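The `argsort()[-21:-1][::-1]` slice in the listing is compact but easy to misread, so here is a small plain-Python sketch of what it does, using a hypothetical list of five scores and k = 3 instead of 20. Both helper names are made up for this example. The slice relies on the query row (row 0, similarity 1.0 with itself) sorting last; the second helper skips row 0 explicitly, which may be a safer choice if an article could ever tie with the query.

```python
def top_k_by_slice(scores, k):
    """Mimic argsort()[-(k+1):-1][::-1]: sort ascending, drop the last index, reverse."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-(k + 1):-1][::-1]


def top_k_excluding_query(scores, k):
    """Rank every row except row 0 (the query) from highest to lowest score."""
    order = sorted(range(1, len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical similarities; index 0 is the query compared with itself.
scores = [1.0, 0.10, 0.75, 0.00, 0.40]

print(top_k_by_slice(scores, 3))         # [2, 4, 1]
print(top_k_excluding_query(scores, 3))  # [2, 4, 1]
```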

When I first started this project, my plan was to pre-compute the vectors to conserve browser resources. But storing the vectors in the CSV made its size balloon from 27MB to 3.5GB. So I instead went with browser-computed vectors and it actually seems okay. A search takes well under a minute, and the relevance of the results is excellent.
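The post does not say how the vectors were written to the file, so the following is only one plausible explanation for the size jump. A TF-IDF vector has one slot per vocabulary term, and almost all of those slots are zero for any single article. Writing every slot out as text is far bulkier than writing only the non-zero ones. The vocabulary size, term positions, and both function names below are hypothetical.

```python
def dense_field(vector):
    """Serialize every component as text, zeros included."""
    return " ".join(f"{x:.6f}" for x in vector)


def sparse_field(vector):
    """Serialize only the non-zero components as index:value pairs."""
    return " ".join(f"{i}:{x:.6f}" for i, x in enumerate(vector) if x != 0.0)


vocab_size = 50_000                   # hypothetical vocabulary size
vector = [0.0] * vocab_size
for i in (12, 480, 7_301, 41_999):    # hypothetical positions of four terms
    vector[i] = 0.5

dense = dense_field(vector)
sparse = sparse_field(vector)
print(len(dense), len(sparse))        # characters needed for one article
```

Under these assumptions the dense form needs thousands of times more characters per article than the sparse form, which is the same kind of blow-up, though the real numbers depend on the actual vocabulary and storage format.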

As for next steps, after cleaning up the display, there are a few directions I could take the project in. I'd like to embed a Telegram group discussion in the page, but the available embed widget doesn't work, so I could try to do something with their API. I'm also looking at trying to send search results to GPT to generate a 500-word summary brief of the material. That might be pretty cool.


Read Free Mind Gazette on Substack

Read my novels:

See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.
