Technical Deep Dive into TabCrunch

We were excited to take on something that probably thousands of other small teams like ours want to do: build something cool that pushes the capabilities of LLMs. We had an idea that actually solves a problem we’ve often encountered over the years - going through hundreds of different browser tabs and bookmarks when doing research. While we think we got the job done, the project ended up pushing the boundaries of current LLMs more than we anticipated. Here is what we learned.


In short, we built TabCrunch - a browser extension that uses Large Language Models (LLMs) to provide bite-size summaries of browser tabs, and to categorise them into groups according to their content. Although this may seem like a straightforward task, it involves a complex process with numerous steps. In this post, we will focus on the steps related to Natural Language Processing (NLP), as well as the pitfalls we encountered and some alternative approaches we tried, as we believe this may be beneficial to other builders.


Here's an overview of all the steps performed for tab summarisation and categorisation. Detailed explanations for each step, including the technologies we chose and our reasoning for selecting them, will follow.

  • First, we retrieve the URLs of all the browser tabs that the user has given us permission to access and scrape. We then extract the main text from each page, which is later used for grouping the tabs.
  • All texts are translated into English before proceeding further. This ensures more consistent results. The texts are then summarised to capture the essence of each tab. This process standardises the length of all texts, ensuring that longer articles do not dominate the groups which may impact the group summary. It also eliminates non-essential information.
  • We employ a clustering algorithm to group all summarised texts into categories where each group represents a distinct topic.
  • Since clustering algorithms do not provide names, we use GPT-4 Turbo to generate the group names.
  • A summary is generated for each group.
  • A content-overlap score is calculated for each article to determine its overlap with the main topic of the group.
  • Key points and numerical data such as prices, dates, etc. are extracted and presented as bullet points.

Now let's dive deeper into each step, covering the technologies we used and the rationale behind our choices.


Scraping:

  • We began by using Beautiful Soup for scraping and parsing.
  • Eventually, we transitioned to using Playwright for scraping, combined with Beautiful Soup for NLP parsing.
  • Ultimately, we discontinued the use of NLP parsing since its application was limited to English only and it did not perform well.
  • We discovered that using JavaScript on the page for content extraction worked best, as sketched below.
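For illustration, here is a minimal sketch of this final approach: Playwright loads the page and a small JavaScript snippet evaluated in the page pulls out the main text. The selector heuristics are simplified assumptions, not our production extraction logic.

```python
from playwright.sync_api import sync_playwright

# Simplified JavaScript evaluated in the page context; the selector fallbacks
# (article -> main -> body) are illustrative assumptions, not our real heuristics.
EXTRACT_JS = """
() => {
    const node = document.querySelector('article')
        || document.querySelector('main')
        || document.body;
    return node.innerText;
}
"""

def scrape_tab(url: str) -> str:
    """Fetch a page with Playwright and return its main text content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        text = page.evaluate(EXTRACT_JS)
        browser.close()
    return text.strip()
```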

Translation:

  • Machine translation involves two primary steps: language detection and the translation process itself.
  • Translating texts from other languages into English is crucial, given that LLMs perform considerably better in English. Additionally, there can be instances where the LLMs mistakenly respond in a different language.
  • Initially, we employed langdetect for language detection but later transitioned to fastText. Although fastText was slower, it delivered significantly improved accuracy.
  • For the machine translation component, we used Google Translator through the deep_translator library (a minimal sketch of the detect-and-translate flow follows this list).
  • We could use DeepL in the future, but it supports fewer languages and it is also quite expensive.
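Below is a minimal sketch of the detect-then-translate step, assuming the pre-trained lid.176.bin fastText language-identification model has been downloaded locally; error handling and chunking of long texts are omitted.

```python
import fasttext
from deep_translator import GoogleTranslator

# Pre-trained fastText language identification model (lid.176.bin),
# downloaded separately from the fastText website.
LANG_MODEL = fasttext.load_model("lid.176.bin")

def translate_to_english(text: str) -> str:
    """Detect the language of the text and translate it to English if needed."""
    # fastText expects a single line of input for prediction.
    labels, _ = LANG_MODEL.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")

    if lang == "en":
        return text

    # Note: Google Translate limits request size, so very long texts
    # would need to be translated in chunks.
    return GoogleTranslator(source="auto", target="en").translate(text)
```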

Summarisation of articles:

  • We observed that clear and concise summaries limited to 100 words significantly enhanced the quality of the final clustering, surpassing the outcomes of using raw texts directly.
  • Currently, we use GPT-3.5 Turbo (a minimal sketch of this call follows the list below). While GPT-4 offers superior summarisation capabilities, the need to process hundreds of articles for each user (TabCrunch allows up to 500 at a time), coupled with the potential for extensive text lengths, renders it prohibitively expensive. The objective at this stage is not to achieve the utmost accuracy but to ensure the most efficient use of resources.
  • We experimented with various self-hosted LLMs; however, we found this approach to be unsustainable due to the slow performance and operational costs. Building our own custom hardware may allow us to explore this option in the future, but on consumer-grade hardware the results did not justify the expense.
  • Below are some of our observations on the self-hosted LLMs (results may vary for others as LLMs evolve almost on a weekly basis and we do not cross-compare them all the time):
    • GPT-NeoX - an old model with nothing remarkable, used mostly for proof of concept. Easy to set up and run, but prone to factual errors and hallucinations.
    • Llama 1 & 2 - demonstrated the best results so far. They required a little bit of prompt refining, but in the end the delivered summary was on par with GPT-3.5 Turbo. If not for licence restrictions and hardware limitations, we would probably use the Llama 2 13B model.
    • Alpaca and Vicuna - we observed a lot of hallucinations here. We stopped testing fine-tunes of other models after this as they offered little to no improvement in quality.
    • Falcon - very slow and resource intensive. Also, we ran into a strange issue where something would break during text generation and the model would just start generating the same letter over and over again until it ran out of memory. Due to the slow performance, high resource usage and lacklustre results, we did not try to fix this issue.
    • MPT 7B - we mainly tested the Storywriter version of this model due to its large context length. However, it was almost impossible to get it to follow a specific text length. The only way to roughly control the length was to use phrases like “in a few words”, “in a few sentences”, “in one paragraph”, etc. It would also abandon the original text at some point and just start writing its own story, roughly using the theme of the original text. Also, we observed the same issue as with the Falcon, although to a lesser extent.
    • BART (large-sized model) - fine-tuned on the CNN/Daily Mail dataset, it gave very good results overall. This model produced the clearest and most concise summaries relative to the response time of all of the models we tested.
    • FLAN-T5 large - it had the quickest response time and generated the shortest summaries of all of the models.
    • Falconsai Fine-Tuned T5 Small for Text Summarisation - overall good summaries and speed.
    • jordiclive/flan-t5-3b-summarize - produces very large summaries. Some hallucination was observed.
    • mLong-T5-large-sumstew - returns accurate but very long summaries.
  • Our trials with SBERT combined with K-Means clustering of sentences for summarisation were discontinued due to unsatisfactory outcomes. This approach simply selects the sentences closest to each cluster centroid, giving the impression that it merely repeats parts of the original text.
  • We explored machine learning-based summarisation techniques that do not rely on LLMs, such as TF-IDF and LSA, but abandoned these due to the poor quality of the summaries.
  • It is worth noting that we compared the OpenAI API with the Azure OpenAI API and chose to continue with the former. The Azure API's stricter content filtering policies rejected a significant portion of our texts, which often originate from news websites covering sensitive topics.
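For reference, here is a minimal sketch of the summarisation call described above, using the OpenAI Python client; the prompt wording and the temperature setting are illustrative assumptions rather than our production prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarise_article(text: str, max_words: int = 100) -> str:
    """Summarise an article to roughly `max_words` words with GPT-3.5 Turbo."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,  # keep summaries consistent between runs
        messages=[
            {
                "role": "system",
                "content": (
                    f"Summarise the user's text in at most {max_words} words. "
                    "Keep only the essential information and reply in English."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```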

Clustering (Part I):

  • We initially experimented with LDA and KMeans for clustering, but both methods yielded unsatisfactory results, producing meaningless clusters.
  • Consequently, we shifted our focus to employing pure GPT-4, which performed remarkably better than GPT-3.5 on this task. Despite this, we encountered challenges related to the token limit of the initial GPT-4 model. Even after the introduction of GPT-4 Turbo, which offered a higher token capacity (not available at the start of our project), maintaining the JSON structure in longer prompts proved difficult: eventually, the model would start producing incoherent responses. We also faced constraints regarding the number of tokens permitted per minute and an overall token limit.
  • Using the JSON output feature of GPT-4 was a disaster - it did not work at all for our use case.
  • To circumvent the token limit issue, we implemented a batching mechanism with GPT-4, achieving partial success.
  • Ultimately, we opted for a custom implementation of HDBSCAN for our clustering needs (a minimal sketch follows this list). HDBSCAN yielded satisfactory clustering results, albeit with a significant amount of noise, which, in this context, refers to articles that could not be grouped into any category. This characteristic is inherent to HDBSCAN's design.
  • We explored various dimensionality reduction techniques and extensively adjusted the parameters. However, this did not provide any notable benefits.
  • Our final approach involved using HDBSCAN for initial clustering and then passing the ungrouped "noise" articles to GPT-4 Turbo for further classification.
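Here is a minimal sketch of the HDBSCAN step over pre-computed summary embeddings. The parameter values are illustrative assumptions; articles labelled -1 are the "noise" that gets passed on to GPT-4 Turbo afterwards.

```python
import numpy as np
import hdbscan

def cluster_summaries(embeddings: np.ndarray) -> tuple[dict[int, list[int]], list[int]]:
    """Cluster summary embeddings; returns groups of article indices plus noise."""
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=3,  # illustrative: smallest group worth showing
        min_samples=1,       # illustrative: be lenient about density
        metric="euclidean",
    )
    labels = clusterer.fit_predict(embeddings)

    groups: dict[int, list[int]] = {}
    for idx, label in enumerate(labels):
        groups.setdefault(int(label), []).append(idx)

    noise = groups.pop(-1, [])  # articles HDBSCAN could not place
    return groups, noise
```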

Naming topics:

We have chosen OpenAI for generating names, as we were unable to identify any alternative solutions that offered meaningful outcomes.
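As a rough illustration, here is how a group name could be generated from a group's article summaries with the OpenAI API; the prompt wording and model identifier are assumptions on our part.

```python
from openai import OpenAI

client = OpenAI()

def name_group(summaries: list[str]) -> str:
    """Generate a short, human-readable name for a group of article summaries."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model identifier
        messages=[
            {
                "role": "system",
                "content": "Suggest a short (2-5 word) topic name for the "
                           "following article summaries. Reply with the name only.",
            },
            {"role": "user", "content": "\n\n".join(summaries)},
        ],
    )
    return response.choices[0].message.content.strip()
```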


Clustering (Part II):

As previously discussed, we encountered a significant volume of "noise," which consisted of articles that remained unclustered. To address this, we experimented with various methods to allocate these articles to existing groups.

  • Initially, we employed a similarity search based on cosine similarity, but found that only a small number of articles could be assigned to the existing groups (a sketch of this assignment step follows the list). This outcome indicated that HDBSCAN was effective in its initial clustering task.
  • We also explored the implementation of BERTopic, a technology that integrates various models for its operations. Specifically, we utilised an embedding model from OpenAI, dimensionality reduction via UMAP, the HDBSCAN clustering model, and representation models combining Maximal Marginal Relevance with OpenAI. The BERTopic approach enabled us to achieve clustering quality comparable to our custom solution.
  • Ultimately, we opted to use GPT-4 Turbo for this task. It's worth noting that GPT-4 sometimes struggles to adhere to a strictly predefined JSON structure in its responses. To mitigate this, we implemented numerous corrections for each instance of flawed text, drawing on the most common errors we encountered to enhance the success rate.
  • Additionally, we observed that including a few example outputs in the prompts, as opposed to merely providing instructions, yielded intriguing results. However, we ultimately decided that omitting example outputs allowed the model to be more adaptable across various topics.
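For completeness, here is a minimal sketch of the cosine-similarity assignment we tried first: each unclustered article is compared against the centroid of every existing group and only assigned if it clears a threshold. The threshold value is an illustrative assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def assign_noise(noise_embeddings: np.ndarray,
                 group_embeddings: dict[int, np.ndarray],
                 threshold: float = 0.75) -> dict[int, int | None]:
    """Assign each noise article to its most similar group, if similar enough.

    `group_embeddings` maps a group id to the embeddings of its articles;
    the 0.75 threshold is an illustrative assumption.
    """
    # Represent each group by the mean (centroid) of its article embeddings.
    group_ids = list(group_embeddings)
    centroids = np.vstack([group_embeddings[g].mean(axis=0) for g in group_ids])

    sims = cosine_similarity(noise_embeddings, centroids)
    assignments = {}
    for i, row in enumerate(sims):
        best = int(row.argmax())
        # Leave the article unassigned if it is not close enough to any group.
        assignments[i] = group_ids[best] if row[best] >= threshold else None
    return assignments
```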

Summarisation of topics:

  • The summarisation approach discussed previously applies here as well.
  • The key difference in this phase is that we are generating a summary of summaries.
  • Following this, we tokenise the generated summary for the subsequent step.

Similarity scoring:

  • The process of similarity scoring is carried out in two primary steps. Initially, embeddings are generated for each sentence to encapsulate their semantic essence. Subsequently, we compute the similarity between a given sentence and other sentences across different articles within the same group. This enables us to estimate the content overlap of the article in question with other articles addressing the same topic.
  • We initially utilised a BERT model to tokenise each text upon receipt and planned to store the generated embeddings in a database. However, this approach was abandoned due to stability concerns related to reconstructing the embeddings from the stored data.
  • In search of a more efficient solution, we experimented with a self-hosted embedding service. Nevertheless, this was eventually deemed impractical due to the prohibitive costs and suboptimal performance associated with maintaining a dedicated server equipped with consumer-grade GPUs.
  • Consequently, we transitioned to using the OpenAI Embeddings API, which offered a more cost-effective and faster alternative to our previous setup (a minimal sketch follows this list).
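A minimal sketch of this scoring step using the OpenAI Embeddings API; the embedding model name and the averaging of best-match similarities are assumptions on our part, not the exact production formula.

```python
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

def embed_sentences(sentences: list[str]) -> np.ndarray:
    """Embed a list of sentences with the OpenAI Embeddings API."""
    # "text-embedding-3-small" is an illustrative model choice.
    response = client.embeddings.create(model="text-embedding-3-small",
                                        input=sentences)
    return np.array([item.embedding for item in response.data])

def overlap_score(article_sentences: list[str], group_sentences: list[str]) -> float:
    """Score how much an article overlaps with the other articles in its group."""
    a = embed_sentences(article_sentences)
    b = embed_sentences(group_sentences)
    sims = cosine_similarity(a, b)
    # For each sentence, take its closest match in the group, then average.
    return float(sims.max(axis=1).mean())
```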

Keypoint extraction:

  • Here again we've chosen to utilise GPT-4, incorporating example expected outputs within the prompts to achieve a precise format and tone for the output, such as “Facts & Numbers” (a sketch follows this list).
  • This strategy aims to ensure that the extracted facts not only adhere to a consistent length but also maintain a uniform tone, facilitating a standardised presentation of information across various outputs.
  • This approach is crucial for meeting our specific requirements for consistency and coherence in the presentation of factual content.
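Finally, a minimal sketch of this few-shot style prompt for key point extraction; the example output and prompt wording are illustrative assumptions, not our actual prompt.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative example used to pin down the "Facts & Numbers" format and tone.
EXAMPLE_OUTPUT = (
    "Facts & Numbers:\n"
    "- Price: $499, launching 12 March 2024\n"
    "- Battery life: up to 18 hours"
)

def extract_key_points(text: str) -> str:
    """Extract key facts and numerical data from a text as short bullet points."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model identifier
        messages=[
            {
                "role": "system",
                "content": "Extract the key points and numerical data (prices, "
                           "dates, etc.) from the user's text as short bullet "
                           "points, matching the format and tone of this "
                           "example:\n" + EXAMPLE_OUTPUT,
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```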

This is what our R&D process over the past two to three months looked like. We would love to get thoughts and feedback. We are also happy to answer any questions, so do not hesitate to reach out to us at hello@tabcrunch.com.