A Background on Laserlike Product and Tech

by Steven Baker, Chief Science Officer and Co-Founder

We live in a world of information abundance, where the main problem is sifting through the noise and discovering the stuff you actually care about. For instance, you might want to know when the next SpaceX launch livestream is because you like to watch it with your kids, whether the car you bought two years ago has had a recall, when a company you’re interested in opens a new office where you live, or when a music festival is coming to your town. You don’t know when to look for these things, and there’s no product that informs you automatically.

This is one of the things we want to fix on the Internet. Laserlike’s core mission is to deliver high quality information and diverse perspectives on any topic from the entire web. We are passionate about helping people follow their interests and engage with new perspectives.

We have a beautiful mobile app that delivers information on your interests:

Besides letting users follow particular interests, the Laserlike mobile app provides alternative views and commentary, as well as related content for whatever the user is reading or watching. The app is designed to update the user with different content types from the web, whether that’s video, news, blogs, or social commentary like Twitter and Reddit posts. Our team has therefore built algorithms that work on any type of content, and we are constantly improving the quality and range of what we surface. The Laserlike mobile app also personalizes in realtime based on your interactions in the app.

In-app examples of more coverage, commentary, and related content.

That’s a brief description of the core Laserlike mobile product. To power that app we have invested in a machine learning based search and discovery platform.

  1. We’ve built a content search engine that searches over billions of web pages such as blogs, news and video, as well as billions of conversations from sites like Reddit and Twitter.
  2. Because we are starting from a clean slate, we have built this search engine from the ground up using the latest research on unstructured document understanding.
  3. We’ve also invested significantly in realtime personalization. For a given user we can quickly learn their interests based on their content consumption. To do this we’ve built our own knowledge graph, as well as custom-built embeddings and a state-of-the-art embedding nearest neighbor search system.

We’ll talk more about embedding search and realtime personalization below, but first here is a figure showing how our systems are put together:

Embedding Search:

In realtime we ingest all new content on the web and index it. We index by keywords, but we also index in the embedding space. Indexing by keywords is a decades-old technique in which the search engine builds a mapping from each keyword to the list of documents that contain it.

Drawing of Traditional Keyword Indexing process. The keywords are on the left, and the document ids are the list on the right.

For a normal keyword search, this method works great: you simply look up the keyword and retrieve the list of matching documents. If there are multiple keywords, as in “adorable liger pictures”, the search engine simply intersects the lists.
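To make the keyword index concrete, here is a minimal sketch in Python. The documents, helper names, and the use of plain sets are illustrative assumptions; a production index stores sorted, compressed posting lists, but the lookup-and-intersect idea is the same.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to the set of document ids that contain it (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Look up each query keyword and intersect the resulting posting lists."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "adorable liger pictures from the zoo",
    2: "liger facts and more liger pictures",
    3: "adorable kitten pictures",
}
index = build_index(docs)
print(keyword_search(index, "adorable liger pictures"))  # {1}
```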

Embeddings are a relatively recent technique for turning human-understandable objects (like an image, a word, or a document) into a machine-learned list of N numbers. (See more about embeddings here.)

In most of our products both types of indexing are useful, but for related content the embedding index is critical. There is no obvious way to do keyword retrieval when the query is an entire document: a page has hundreds of words in it, so how do you pick out its essence? Instead, we simply treat the web page you are viewing as a “query” in the embedding space, and then we find other documents that are nearby by comparing the closeness of their N numbers. This lets us find content that doesn’t necessarily share the same words but is talking about roughly the same concepts. We’ve found that distance in the embedding space indicates how close the content is, all the way from near duplicates (very close) to alternative coverage (close) to just related (slightly close), to… random stuff on the internet (far away).

Graphical representation of a single document in an embedding space surrounded by other documents at various distances.
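As a minimal sketch of the “document as query” idea, the snippet below scores a page’s embedding against every document in a toy corpus by cosine similarity and returns the closest ones. The random vectors and corpus size are stand-ins for illustration; real embeddings come from trained models and the real corpus has billions of documents.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128                                       # embedding dimension: the "N numbers" per document
corpus = rng.normal(size=(10_000, N))         # one row per indexed document (random stand-ins)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def related(doc_embedding, k=5):
    """Return the k documents whose embeddings are closest to the page being viewed."""
    q = doc_embedding / np.linalg.norm(doc_embedding)
    scores = corpus @ q                       # cosine similarity to every document
    return np.argsort(-scores)[:k]            # closest (most related) first

page_being_read = rng.normal(size=N)          # embedding of the page the user is viewing
print(related(page_being_read))
```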

Because we are searching over billions of documents in our corpus to find close content in a continuous embedding space, this is a hard retrieval problem that’s quite different from standard keyword retrieval. For traditional keyword search, you just look up which documents are in the list for that keyword; here, the brute-force method would require us to measure the distance between the target document and every one of the billions of documents in our index. Therefore, we’ve developed custom algorithms for nearest neighbor vector search that do this efficiently, in under 100 milliseconds. After retrieval we rank the recommendations using many additional signals like authoritativeness/site quality, freshness, popularity, diversity of perspective from the original content, etc.
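The brute-force scan shown above is exactly what does not scale to billions of documents. Our own nearest neighbor algorithms are beyond the scope of this post, but as a generic illustration of how approximate search avoids the full scan, the sketch below partitions the corpus into coarse clusters and only scans the few clusters nearest to the query (an IVF-style index); the parameters are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Offline: partition the corpus into coarse clusters.
n_clusters = 64
kmeans = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(corpus)
centroids = kmeans.cluster_centers_
buckets = [np.where(kmeans.labels_ == c)[0] for c in range(n_clusters)]

def approx_nearest(query, k=5, probes=4):
    """At query time, scan only the `probes` clusters whose centroids are closest."""
    q = query / np.linalg.norm(query)
    nearest_clusters = np.argsort(-(centroids @ q))[:probes]
    candidates = np.concatenate([buckets[c] for c in nearest_clusters])
    scores = corpus[candidates] @ q
    return candidates[np.argsort(-scores)[:k]]

print(approx_nearest(rng.normal(size=128)))
```

Scanning a handful of clusters instead of the whole corpus trades a little recall for a large speedup, which is the kind of trade-off any sub-100-millisecond retrieval system has to make.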

Besides powering related content, we also feed our embeddings back into our search system, so they improve traditional keyword searches: we can measure the embedding distance between every document and the query representation. To do this we’ve created novel algorithms to automatically embed never-seen-before queries.
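The details of how we embed never-seen-before queries are our own, but a common baseline, shown here only for intuition, is to average the embeddings of the query’s terms weighted by inverse document frequency. The function and its inputs below are illustrative assumptions, not our novel algorithm.

```python
import numpy as np

def embed_query(query, word_vectors, idf, dim=128):
    """Baseline query embedding: IDF-weighted average of known term embeddings.

    word_vectors: dict mapping word -> np.ndarray; idf: dict mapping word -> float
    (both assumed to be precomputed elsewhere).
    """
    vecs, weights = [], []
    for word in query.lower().split():
        if word in word_vectors:
            vecs.append(word_vectors[word])
            weights.append(idf.get(word, 1.0))
    if not vecs:
        return np.zeros(dim)
    q = np.average(np.array(vecs), axis=0, weights=weights)
    return q / (np.linalg.norm(q) + 1e-9)     # unit length, ready for cosine comparison
```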

We’ve also created our own custom embedding process, which will be explained more later in this post.

For You / Personalization Model

Laserlike learns from the URLs you read in the app. Each of these public URLs has already been embedded via machine learning during our realtime document processing. For You personalizes very quickly by aggregating the embeddings of these public URLs and transforming them into a representation of the content you are interested in. The For You section is then generated by retrieving over our embedding index using your personal embeddings. Additionally, we increase freshness scoring (to show more new content), but still use our authoritativeness, site quality, and other features to rank the results.

Example figure of a user with four interests defined by embeddings of past documents they’ve visited in the browser (or in our Laserlike mobile app). The green dots represent nearby neighboring documents in the embedding space.
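To sketch the aggregation step in the figure above: one simple way to turn a reading history into interest embeddings is to cluster the embeddings of read pages into a few centroids and retrieve fresh documents near each one. The clustering choice, the `retrieve_nearest` callback, and the interest count are assumptions for illustration, not our production pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def interest_vectors(read_page_embeddings, n_interests=4):
    """Summarize a user's reading history as a handful of interest embeddings
    (assumes the user has read at least n_interests pages)."""
    km = KMeans(n_clusters=n_interests, n_init=4, random_state=0)
    km.fit(np.array(read_page_embeddings))
    centers = km.cluster_centers_
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)

def for_you(read_page_embeddings, retrieve_nearest, k_per_interest=10):
    """retrieve_nearest(vector, k) is assumed to query the embedding index;
    results are re-ranked downstream with freshness, site quality, etc."""
    results = []
    for interest in interest_vectors(read_page_embeddings):
        results.extend(retrieve_nearest(interest, k_per_interest))
    return results
```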

Embedding Model

We developed a novel technique for training our embedding model that has worked better for our recommendation and search tasks. Word2Vec trains embeddings by predicting words from their surrounding context. GloVe trains embeddings from global word-word co-occurrence statistics.

Our novel technique makes use of the fact that we have built a keyword search engine over the billions of pages in our corpus. More than just indexing each page, we also use modern heuristic search techniques to assign a probability of topical relevance to every keyword on that page. Our retrieval index is therefore not simply a list of documents for each keyword; it also contains each document’s probability of relevance.

Diagram of the index for the three keywords we used before, “adorable”, “liger”, and “pictures”, updated to show the probability of relevance for each document.

Therefore, rather than using adjacency of words in a document (as in Word2Vec) or simple co-occurrence (as in GloVe), we train our embeddings using these probabilities.

To put it more plainly, we use Laserlike’s modern heuristic search engine plus our copy of the web to “teach” the embeddings.
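As a heavily simplified sketch of what “teaching” the embeddings from the search engine could look like: rather than sampling word-context pairs from raw text, sample (keyword, document) pairs from the index and weight each positive update by the keyword’s probability of relevance, while pushing random negatives apart. The loss, the update rule, and all the toy data below are illustrative assumptions, not our actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, n_docs = 64, 0.05, 100
keyword_vecs = {k: rng.normal(scale=0.1, size=dim) for k in ["adorable", "liger", "pictures"]}
doc_vecs = {d: rng.normal(scale=0.1, size=dim) for d in range(n_docs)}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(keyword, doc_id, relevance, negatives=5):
    """One relevance-weighted positive update plus a few random negative updates."""
    for d, label in [(doc_id, 1.0)] + [(int(rng.integers(n_docs)), 0.0) for _ in range(negatives)]:
        kv, dv = keyword_vecs[keyword], doc_vecs[d]
        # Gradient of logistic loss on the dot product; the positive pair is
        # weighted by the search engine's probability of topical relevance.
        grad = (sigmoid(kv @ dv) - label) * (relevance if label else 1.0)
        keyword_vecs[keyword] = kv - lr * grad * dv
        doc_vecs[d] = dv - lr * grad * kv

# e.g. the index entry ("liger", document 7) with relevance probability 0.9:
train_pair("liger", 7, relevance=0.9)
```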

More than keywords

Another interesting aspect of our embeddings is that we use the search engine to co-embed normal keywords with knowledge graph entity annotations, sites, subreddits, and twitter handles. (The only similar work we know of is StarSpace.) We are able to do this because our Laserlike search engine has been designed from the ground up with entity annotations and the link structure of the web as inputs into relevance. Using links to pages and the text of those links as votes for ranking was a key insight of the original Google PageRank algorithm; here we generalize it by using links as training data for machine learning, and we differentiate by where the link comes from (the web, reddit, twitter, etc.).
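One simple way to picture co-embedding heterogeneous objects: give every object type a prefix so that keywords, entities, subreddits, sites, and twitter handles share a single vocabulary, then feed the resulting co-occurrence pairs to the same embedding trainer. The prefixes (other than r:) and the example pairs below are invented for illustration.

```python
def token(kind, value):
    """Place every object type into one shared embedding vocabulary via a type prefix."""
    prefixes = {"word": "", "entity": "e:", "subreddit": "r:", "site": "site:", "twitter": "t:"}
    return prefixes[kind] + value.lower()

# Example training pairs a page about the FloridaMan subreddit might yield:
# its keywords, the subreddit itself, and a site that links to it all co-occur.
pairs = [
    (token("word", "florida"), token("subreddit", "floridaman")),
    (token("word", "wacky"), token("subreddit", "floridaman")),
    (token("site", "wtfflorida.com"), token("subreddit", "floridaman")),
    (token("entity", "Florida"), token("subreddit", "floridaman")),
]
print(pairs)
```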

To show what we mean by this, we’ll give some examples from our debug interface for embeddings, which lists the objects that are nearest to the target object in the embedding space.

Cloud_Gate is our representation of the entity in our knowledge graph for Cloud Gate, which is an architectural feature in Chicago that looks like this:

The nearest neighbors in the embedding space are a bunch of other sculptures and architectural features in Chicago:

r:floridaman is our representation of the subreddit FloridaMan, which is a subreddit about wacky news from Florida. In this example, not only are the nearest neighbors of r:floridaman words you’d expect like “florida man” and “florida woman”, but close by are related subreddits about wacky news from Florida and elsewhere. Furthermore, because we co-embed sites, we find that wtfflorida.com is close to r:floridaman in meaning. This not only improves the quality of the embeddings for words, it also captures the meaning of subreddits, twitter handles, and sites.

We will probably do a subsequent post with more technical details on our embeddings, but we wanted to give a taste of how they differ from the standard techniques.

Conclusion

We hope this post has shed some light on how our Laserlike mobile app and other products work. If you’d like to check out our tech in action, try out our products on laserlike.com. Stay tuned for future blog posts on tech and product announcements!
