Building a new embedding (bit vector) for sentences
Embeddings are cool, but costly. Is there an alternative?
One clear use case emerging from LLMs is document QnA. We have been experimenting with QnA for the last two years.
The simple pipeline of
document→sentences→embeddings→faiss index→kNN
works for most cases. You could add generative AI like ChatGPT at the end of the pipeline to generate a unique answer if you want.
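For reference, here is a minimal sketch of that pipeline (assuming the sentence-transformers and faiss-cpu packages; the model name and the toy sentences are illustrative, not what powers our products):

```python
# Minimal sketch of the document QnA retrieval pipeline described above.
# Assumes `sentence-transformers` and `faiss-cpu` are installed; the model
# and sentences are illustrative choices only.
import faiss
from sentence_transformers import SentenceTransformer

sentences = [
    "Our office is open from 9am to 6pm on weekdays.",
    "Refunds are processed within 5 business days.",
    "You can reach support at support@example.com.",
]

model = SentenceTransformer("bert-base-nli-mean-tokens")  # 768-dim BERT-based sentence embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)

query = model.encode(["When are you open?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)            # kNN over the sentence index
print([sentences[i] for i in ids[0]])
```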
But the heart of the solution is the embedding. As long as the embedding captures the meaning of the sentence and embeddings of similar sentences appear nearby in the multidimensional space, we are good to go.
Currently we have to use BERT or similar embeddings. Generating these embeddings is costly, and they are large: we need 768 float dimensions to represent any sentence, which at float32 precision is 768 × 4 ≈ 3 KB per sentence. That is a lot of memory once you are storing millions of sentences.
Embeddings are at the core of most new research in NLP. Everything starts from embeddings.
To attack the storage problem, we first tried creating an alternative embedding space on top of the BERT embeddings. We wanted to build a bit vector space from BERT embeddings, and we were able to do so using the KE Sieve system, thanks to help from the Alpes research team.
This experiment was a tremendous success, as described in our blog post: https://medium.com/ozonetel-ai/compressing-bert-sentence-embeddings-6120c84f5f4c
We achieved a compression of almost 45x over BERT embeddings.
Currently this is what we use to power the neural search in our products. The pipeline is the same as above, but with an extra step.
document→sentences→bert embeddings→ KE Sieve embeddings→faiss index→kNN
Once this was done, we wanted to attack the next big problem. We wanted to build our own embeddings. From scratch.
Why should we accept the 768-dimension BERT embeddings as the default? BERT embeddings are really good: they capture the context and meaning of sentences. They have some problems, but in most cases they work well. Even so, can we create better embeddings? Can we get an embedding using only 1s and 0s (bits) instead of floats? That has been the job of a team of researchers at Alpes and Ozonetel.
Today we will talk about some of the results we have achieved in this endeavour.
KE Sieve Embeddings From Scratch
The objective of the experiment is to find an embedding space that represents sentences in a way that preserves their meaning. More importantly, we have to create a high-dimensional space in which similar sentences come close together. There are many options for a good feature representation of a sentence: we could use parts of speech, or maybe a dependency parse, etc. But all of that would require further research. For now, we know that TF-IDF works decently on many sentence similarity benchmarks, and TF-IDF can be calculated using basic statistics, so we wouldn't need deep learning for this. Can we use TF-IDF as a base for our embeddings? Let's see.
Why not just TF-IDF?
The problem with TF-IDF is that the number of dimensions depends on the vocabulary. If our dataset has 5,000 words, it becomes a 5,000-dimension problem. And if we are building an embedding for the whole English language, the vocabulary is effectively unbounded, which makes the dimensionality unbounded too.
To solve this, we take tokens as the base unit instead of words. Thanks to experiments from OpenAI, Google and others, we know that text across the whole Internet can be represented using approximately 20,000-30,000 tokens. So we decided to use the BERT tokenizer, which has a vocabulary size of around 30,000. We used the Wiki DPR dataset to create our “text space”, assuming it has enough variety to capture the essence of the English language. The Wiki DPR dataset has around 105 million sentences; for this preliminary demonstration, however, we used only around one million of them.
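Roughly, “TF-IDF over BERT tokens” looks like the sketch below (using Hugging Face's bert-base-cased tokenizer and scikit-learn; the three-sentence corpus is a toy stand-in for the Wiki DPR sentences the real model is fit on):

```python
# Sketch: TF-IDF computed over BERT wordpiece tokens instead of raw words,
# so the feature dimension is bounded by BERT's ~30k-token vocabulary.
# Assumes the `transformers` and `scikit-learn` packages are installed.
from transformers import AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Paris is the capital of France.",
    "Bit vectors make nearest neighbour search cheap.",
]

def bert_tokens(text):
    # Wordpiece tokens, e.g. "Eiffel" -> ["E", "##iff", "##el"]
    return tokenizer.tokenize(text)

# TF-IDF treats each wordpiece returned by the BERT tokenizer as a term.
vectorizer = TfidfVectorizer(analyzer=bert_tokens)
X = vectorizer.fit_transform(corpus)   # sparse matrix: (n_sentences, n_tokens_seen)
print(X.shape, len(vectorizer.vocabulary_))
```

On a toy corpus the vocabulary only covers the tokens it has seen; fit on the full Wiki DPR token stream, it comes out at the 19,178 terms mentioned below.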
The features generated by the TF-IDF method are very high-dimensional and sparse. To handle this, we use the KE Sieve method to build a model on this huge, high-dimensional data.
The KE Sieve method (this is the secret sauce, patented by Alpes.ai) creates an embedding space using bits. Since the embeddings are stored as bits, storage stays small even at higher dimensions. The other advantage is that searching for neighbors can be done with bit operations (XOR), which are very fast.
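We cannot go into the KE Sieve construction itself, but the speed argument is easy to illustrate: once sentences are packed bit vectors, the distance between two of them is just an XOR followed by a popcount. A rough numpy sketch, with random bits standing in for real KE Sieve codes:

```python
# Why bit embeddings are cheap to compare: Hamming distance is XOR + popcount.
# Random bits stand in for KE Sieve codes; 438 bits pack into 55 bytes.
import numpy as np

rng = np.random.default_rng(0)
n_bits = 438

db = rng.integers(0, 2, size=(100_000, n_bits), dtype=np.uint8)  # toy "database"
db_packed = np.packbits(db, axis=1)                              # shape: (100000, 55)

query = rng.integers(0, 2, size=(1, n_bits), dtype=np.uint8)
q_packed = np.packbits(query, axis=1)

xor = np.bitwise_xor(db_packed, q_packed)         # differing bits, row by row
hamming = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount = Hamming distance

k = 5
nearest = np.argsort(hamming)[:k]                 # indices of the k nearest codes
print(nearest, hamming[nearest])
```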
The actual experimental procedure and results are explained in detail below.
Experiment Flow diagram:
Procedural steps:
1.1 TF-IDF Model Building:
Load the WIKI DPR (Dense Passage Retrieval) data from the source. This data contains 105 million sentences.
Preprocess the text and compute BERT tokens for all sentences using the pre-trained “bert-base-cased” tokenizer.
Use the tokens of all 105 million sentences to build the TF-IDF model, and store it. This TF-IDF model is later used to build the KE Sieve model. (Note: the vocabulary size of this TF-IDF (Term Frequency-Inverse Document Frequency) model is 19,178.)
1.2 KE Sieve Model Building:
The first step is loading the WIKI DPR data from the source. From that data, we took only 1 million sentences. Ideally, we would use all the sentences to build the KE Sieve space, but because of time and compute constraints (our code is not yet optimized for GPU, so we had to use a CPU), and to check our hypothesis quickly, we used 1 million sentences.
We preprocess the text and compute BERT tokens for these 1 million sentences using the pre-trained “bert-base-cased” tokenizer.
We load the TF-IDF model built on WIKI DPR and compute the TF-IDF features for the 1 million sentences, giving a feature dimension of 19,178.
This is our secret sauce. We use the TF-IDF features to build an alternate space that can then be used in place of the TF-IDF space. While building the KE Sieve model we used batch-wise training (the 1 million sentences were split into 5 batches of 200,000 × 19,178 features each) with 10 passes, and ended up with a total of 438 bit features. We used a 32-core, 32 GB RAM machine. We had to train in batches because the 19,178-dimensional TF-IDF features were too large to fit in RAM all at once. It took 4 days to create this model.
Store the KE Sieve model for other use cases. This model, which has 438 dimensions (438 bits), can now be used to represent sentences in place of BERT embeddings.
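The KE Sieve construction is the part we cannot publish, but for intuition on how a 19,178-dimensional real-valued vector can be squeezed into a few hundred bits at all, here is the classic random-hyperplane trick (the sign of random projections). To be clear, this is not KE Sieve, just a well-known baseline that shows the shape of the idea:

```python
# NOT the KE Sieve method (that is patented by Alpes.ai). This is plain
# random-hyperplane hashing, shown only to illustrate how high-dimensional
# real-valued features can be reduced to a fixed number of bits.
import numpy as np
from scipy.sparse import random as sparse_random

n_features = 19_178   # TF-IDF dimension from the text above
n_bits = 438          # bit budget, matching the KE Sieve model size

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((n_features, n_bits))

# Toy TF-IDF-like sparse features for 10 sentences.
X = sparse_random(10, n_features, density=0.001, random_state=0, format="csr")

# Each bit is the sign of the projection onto one random hyperplane.
bits = (np.asarray(X @ hyperplanes) > 0).astype(np.uint8)   # shape: (10, 438)
packed = np.packbits(bits, axis=1)                          # 55 bytes per sentence
print(bits.shape, packed.nbytes, "bytes in total")
```

The real KE Sieve space is built from the data rather than drawn at random, but the downstream benefits shown here (tiny storage, XOR-based search) are the same kind.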
1.3 Testing of Benchmark datasets:
Once we had the TF-IDF and KE Sieve models, we ran classification problems on different datasets. The steps below were followed, and the results are added in the results section.
The first step is loading the benchmark dataset, which contains the train and test sets.
Compute the tokens for the given data using the pre-trained “bert-base-cased” tokenizer.
Load the TF-IDF model and compute the features for the train and test sets of the benchmark dataset (the TF-IDF feature dimension is 19,178).
Load the KE Sieve model (438 features in total) and compute the bit features for the train and test sets.
Find the top k nearest neighbors of each test feature among the train features using the XOR bit operation (i.e. neighbor search on the bit difference), and from those neighbors get the predicted labels for the test set using kNN.
Compute the performance metric (accuracy) using the test and predicted labels.
These steps are replicated for all the public datasets we tested on. Basically the pipeline is
Training:
Train sentences→Tokenize→TF-IDF→Bit features (store these)
Testing:
Test sentence→Tokenize→TF-IDF→Bit features→kNN (against the stored train features)
The advantage is that we don't need GPUs for training or testing, and train and test times are an order of magnitude lower than with other deep learning methods. (In fact, there is no training step at all: all we are doing is calculating the embedding by placing each sentence into the KES space.)
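Put together, the test-time flow looks roughly like the sketch below. This is only a sketch: `encode_bits` is a placeholder for the proprietary KE Sieve step (here reusing the random-projection stand-in from earlier), `vectorizer` is the fitted TF-IDF-over-BERT-tokens model, and scikit-learn's kNN with the Hamming metric plays the role of the XOR search:

```python
# Sketch of the train/test flow, with `encode_bits` standing in for the
# proprietary KE Sieve step. kNN uses Hamming distance over the bit features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def encode_bits(texts, vectorizer, hyperplanes):
    """Placeholder bit encoder (not KE Sieve): TF-IDF features -> 0/1 vectors."""
    X = vectorizer.transform(texts)
    return (np.asarray(X @ hyperplanes) > 0).astype(np.uint8)

def evaluate(train_texts, train_labels, test_texts, test_labels,
             vectorizer, hyperplanes, k=5):
    train_bits = encode_bits(train_texts, vectorizer, hyperplanes)
    test_bits = encode_bits(test_texts, vectorizer, hyperplanes)

    # Hamming metric = fraction of differing bits, i.e. the XOR trick above.
    knn = KNeighborsClassifier(n_neighbors=k, metric="hamming")
    knn.fit(train_bits, train_labels)
    predicted = knn.predict(test_bits)
    return accuracy_score(test_labels, predicted)
```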
Results:
Note: our embeddings are almost 56 times smaller than sentence-transformers embeddings. With KES, the train set size for most datasets is less than 1 MB. That means we could include the dataset in the web page itself if we wanted to. This opens up whole new possibilities, because we can now ship the train sentence embeddings to the client side and do the nearest neighbour search on the client machine.
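To put the 56x in context: a 768-dimensional float32 sentence-transformer embedding is 768 × 32 = 24,576 bits (about 3 KB), while a 438-bit KES embedding is about 55 bytes, and 24,576 / 438 ≈ 56. At 55 bytes per sentence, a train set of 10,000 sentences comes to roughly 0.55 MB, which is why it can travel with the page.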
Conclusion:
As we can see, we have been able to create an alternate embedding for sentences. We built the embedding using 1 million sentences from the Wiki DPR dataset and then used that embedding space to solve classification problems on other datasets. These other datasets did not share any sentences with Wiki DPR, yet the results were comparable to other deep learning embeddings. To our surprise, our embeddings performed admirably even on multi-class problems with more than 10 classes. And we used just 1% of the available sentences to build our embedding space. So we can conclude that our new embedding captures the essence of a sentence much like other sentence embeddings do.
Future work:
We are now planning to build the embedding space for all the 105 million sentences of Wiki DPR. We are hypothesizing that this will lead to much better sentence embeddings at the cost of a few more bits.
We are sure we can provide a much better alternative to existing sentence embeddings which can drive newer upstream models.
We also plan to release an API so anyone can use these embeddings in place of other embeddings.
As the blog by a16z mentions, there are different layers in the generative AI stack. This is our attempt to create a foundation model not dependent on deep learning.