Real-time Search Updates with Experience Edge Webhooks: Part 2

In the last post, we covered configuring Experience Edge to fire a webhook whenever a publish completes. In this post, we’ll handle receiving that webhook event and push the published updates to our search index.

First, let’s review the high-level architecture. Our webhook fires after a publish to Experience Edge completes. We need to send it to a serverless function in our Next.js app. That function is responsible for parsing the JSON payload from the webhook and pushing any changes to our search index. This diagram illustrates the process:

Before we build our serverless function, let’s take a look at the JSON that gets sent with the webhook:

Note that this webhook contains data about everything in the publish operation. For search index purposes, we’re interested in updates to pages on the website, represented in the payload by "entity_definition": "LayoutData". Unfortunately, all we get is the ID of the item that was updated rather than the specific fields that changed. That means we’ll need to query for the page data before pushing it to the search index.
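As an abridged sketch of an update entry (only the fields the handler actually reads — updates, identifier, and entity_definition — are taken from the code; the GUID and suffix shown are made-up examples):

```json
{
  "updates": [
    {
      "identifier": "C5F1B9D15F0E4A30A9E7F7D166D97B72-layoutdata",
      "entity_definition": "LayoutData"
    }
  ]
}
```

The identifier is the item GUID followed by a suffix, which is why the handler splits on '-' to extract the GUID.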

Now that we understand the webhook data we’re dealing with, we need to create a function to handle it. If you’re using Vercel to host your web app, creating a serverless function is easy: add a TypeScript file to the /pages/api folder in your app. We’ll call this handler “onPublishEnd.ts”. The function needs to do the following:

  • Loop over all “LayoutData” entries
  • Query GraphQL for that item’s field data
  • Validate the item is part of the site we’re indexing
  • Push the aggregate content data to the search provider

Let’s look at a sample implementation that will accomplish these tasks:

// Import the Next.js API route handler
import { NextApiRequest, NextApiResponse } from 'next';
import { graphqlRequest, GraphQLRequest } from '@/util/GraphQLQuery';
import { GetDate } from '@/util/GetDate';

// Define the API route handler
export default async function onPublishEnd(req: NextApiRequest, res: NextApiResponse) {
  // Check if the api_key query parameter matches the WEBHOOK_API_KEY environment variable
  if (req.query.api_key !== process.env.WEBHOOK_API_KEY) {
    return res.status(401).json({ message: 'Unauthorized' });
  }

  // If the request method is not POST, return an error
  if (req.method !== 'POST') {
    return res.status(405).json({ message: 'Method not allowed' });
  }

  let data;
  try {
    // Try to parse the JSON data from the request body
    //console.log('Req body:\n' + JSON.stringify(req.body));
    data = req.body;
  } catch (error) {
    console.log('Bad Request: ', error);
    return res.status(400).json({ message: 'Bad Request. Check incoming data.' });
  }

  const items = [];

  // Loop over all the entries in updates
  for (const update of data.updates) {
    // Check if the entity_definition is LayoutData
    if (update.entity_definition === 'LayoutData') {
      // Extract the GUID portion of the identifier
      const guid = update.identifier.split('-')[0];

      try {
        // Create the GraphQL request
        const request: GraphQLRequest = {
          query: itemQuery,
          variables: { id: guid },
        };

        // Invoke the GraphQL query with the request
        //console.log(`Getting GQL Data for item ${guid}`);
        const result = await graphqlRequest(request);
        //console.log('Item Data:\n' + JSON.stringify(result));

        // Make sure we got some data from GQL in the result
        if (!result || !result.item) {
          console.log(`No data returned from GraphQL for item ${guid}`);
          continue;
        }

        // Check if it's in the right site by comparing the item.path
        if (!result.item.path.startsWith('/sitecore/content/Search Demo/Search Demo/')) {
          console.log(`Item ${guid} is not in the right site`);
          continue;
        }

        // Add the item to the items array
        items.push(result.item);
      } catch (error) {
        // If an error occurs while invoking the GraphQL query, return a 500 error
        return res.status(500).json({ message: 'Internal Server Error: GraphQL query failed' });
      }
    }
  }

  // Build the Yext Push API endpoint URL
  const pushApiEndpoint = `${process.env.YEXT_PUSH_API_ENDPOINT}?v=${GetDate()}&api_key=${process.env.YEXT_PUSH_API_KEY}`;
  console.log(`Pushing to ${pushApiEndpoint}\nData:\n${JSON.stringify(items)}`);

  // Send all the items to the Yext Push API endpoint
  const yextResponse = await fetch(pushApiEndpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(items),
  });

  if (!yextResponse.ok) {
    console.log(`Failed to push data to Yext: ${yextResponse.status} ${yextResponse.statusText}`);
  }

  // Send a response
  return res.status(200).json({ message: 'Webhook event received' });
}

const itemQuery = `
query ($id: String!) {
  item(path: $id, language: "en") {
    id
    name
    path
    url {
      path
      url
    }
    fields {
      name
      jsonValue
    }
  }
}
`;

https://github.com/csulham/nextjs-sandbox/blob/main/pages/api/onPublishEnd.ts

This function uses the Next.js API helpers to create a quick and easy API endpoint. After some validation (including an API key we define to ensure this endpoint isn’t invoked by unintended callers), the code walks through the JSON payload from the webhook and executes the tasks described above. In this case, we’re pushing to Yext as our search provider, and we’re sending all of the item’s field data. Sending everything is preferable here because it simplifies the query on the app side and lets us handle mappings and transformations in our search provider, making future changes easier to manage without deploying new code.

As the previous post stated, CREATE and DELETE are separate operations that will need to be handled with separate webhooks. There may still be other considerations you’ll need to handle as well, such as a very large publish and the need to batch the querying of item data and the pushes to the search provider. Still, this example is a useful POC that you can adapt to your project’s search provider and specific requirements.
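Batching those pushes could be sketched like this (the chunk helper and the batch size of 50 are hypothetical choices, not part of the original handler):

```typescript
// Split an array into fixed-size batches so a very large publish
// can be queried and pushed in manageable chunks.
function chunk<T>(arr: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < arr.length; i += size) {
    batches.push(arr.slice(i, i + size));
  }
  return batches;
}

// Hypothetical usage inside the handler: push 50 items per request.
// for (const batch of chunk(items, 50)) {
//   await fetch(pushApiEndpoint, { method: 'POST', body: JSON.stringify(batch) });
// }
```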

Publish Sitecore Media Items on Referenced Datasources

One of the great additions to Sitecore 8 is the ability to publish related items when executing a publish. Using this feature, you’ll be sure to publish out any necessary items that may be needed to render the page correctly, such as data sources, referenced taxonomy items, or images.

However, you may still have some gaps when using this feature. Consider a common scenario: you have a new page, and you add a component to the page that uses a separate item as a data source. On that data source is a field for an image. When publishing the page, the newly created data source item goes out, but the media item linked on that data source does not.

This is because of the way Sitecore processes referenced items. In essence, it only goes one level deep in the reference tree. So, items referenced by the item being published will be added to the queue, but items referenced by those referenced items will not.

Normally this is fine. If the publisher crawled references recursively, you’d probably wind up in an infinite publishing loop, or at least an unintentionally large publish. But it is common for data source items to reference new content, like media, so we need to include those in the publish too.

There’s a pipeline in Sitecore 8 we can use specifically for this purpose: the <getItemReferences> pipeline. Out of the box, it includes an AddItemLinkReferences step. This step is the one responsible for adding our referenced data source item, so we can override it with logic that also includes media referenced by that data source.

Like all great Sitecore developers, we customize Sitecore by reflecting on its code and replacing it with our own logic. I opened up Sitecore.Publishing.Pipelines.GetItemReferences.AddItemLinkReferences and added the following.

...
  foreach (Item obj in itemLinkArray.Select(link => link.GetTargetItem()).Where(relatedItem => relatedItem != null))
  {
    list.AddRange(PublishQueue.GetParents(obj));
    list.Add(obj);
    // This will look at the item's links looking for media items.
    list.AddRange(GetLinkedMediaItems(obj));
  }
  return list.Distinct(new ItemIdComparer());
}

Then we’ll add the GetLinkedMediaItems method:

protected virtual List<Item> GetLinkedMediaItems(Item item)
{
  List<Item> mediaList = new List<Item>();
  ItemLink[] itemLinkArray = item.Links.GetValidLinks()
    .Where(link => item.Database.Name.Equals(link.TargetDatabaseName, StringComparison.OrdinalIgnoreCase))
    .ToArray();
  foreach (ItemLink link in itemLinkArray)
  {
    try
    {
      Item target = link.GetTargetItem();       
      if (target == null || !target.Paths.IsMediaItem) 
        continue;
      // add parent media items or folders
      Item parent = target.Parent;
      while(parent != null && parent.ID != ItemIDs.MediaLibraryRoot)
      {
        mediaList.Insert(0, parent);
        parent = parent.Parent;
      }
      mediaList.Add(target);
    }
    catch (Exception ex)
    {
      Log.Error("Error publishing reference link related media items", ex, typeof(AddItemAndMediaLinkReferences));
    }
  }
  return mediaList;
}

We can wire in this new processor by patching out the original one we reflected on:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
 <sitecore>
  <pipelines>
   <getItemReferences>
    <processor type="Sitecore.SharedSource.Pipelines.Publish.AddItemAndMediaLinkReferences, Sitecore.SharedSource"
               patch:instead="processor[@type='Sitecore.Publishing.Pipelines.GetItemReferences.AddItemLinkReferences, Sitecore.Kernel']"/>
   </getItemReferences>
  </pipelines>
 </sitecore>
</configuration>

With this in place, media items referenced on any linked item will be published. You can further refine the logic to just consider data sources, perhaps by checking the path or template to ensure it’s a data source, to cut down on unintentional publishes.

Benchmarking Sitecore Publishing

Publishing has been a sore spot lately for some of our clients due to the large amount of content they have in their Sitecore environments. When you get into hundreds of thousands of content items, a full site publish is prohibitive. Any time a change requires a large publish, your deployment window goes from an hour to potentially an all-day affair. And if a user accidentally starts a large publish, subsequent content publishes get queued and backed up until that publish completes, or until someone logs into the server and restarts the application.

There are options available to speed up the publishing process. Parallel publishing was introduced in Sitecore 7.2, along with some experimental optimization settings. Sitecore 8.2 adds a new option: the Sitecore Publishing Service.

What benefits can we see from these options? I decided to run some tests of large content publishes using each technique. Each publishing option has its own caveats, of course, but this post concerns itself mainly with the publishing performance of each of the available options.

Methodology

I wanted to run these tests in as pure an environment as possible. I set up three Sitecore 8.2 environments using Sitecore Instance Manager on my local machine. Using the FillDB tool, I generated 100,000 content items nested in a folder under the site root. Each item uses the Sample Item template that ships with a clean Sitecore installation. A Full Publish of the entire site was used in each test, and each time the content was being published for the first time.

For benchmarking purposes, my local machine has the following specs:

  • Intel i7, 8 Core, 2.3 GHz CPU
  • 16 GB RAM
  • Seagate SSHD (not an SSD, but it claims to perform like an SSD!)
  • Windows 7 x64, SP1
  • SQL Server Express 2015
  • .NET 4.6 and .NET Core installed

Default Publishing

The first test was a full site publish of the 100,000 generated content items using the out-of-the-box publishing configuration. This is probably how most Sitecore sites are configured unless you’ve taken steps to optimize the publishing process. The results are, as expected, not great.

21620 12:19:30 INFO  Job started: Publish
21620 13:51:18 INFO  Job ended: Publish (units processed: 106669)

That’s over 90 minutes to publish these items, and the content items themselves only had 2 fields with any data.

Parallel Publishing

Next, I tested parallel publishing, introduced in Sitecore 7.2. To use it, you need to enable Sitecore.Publishing.Parallel.config. Since I have an 8-core CPU, I set the Publishing.MaxDegreeOfParallelism setting to 8.
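Rather than editing the stock file in place, the value can also be pinned with a patch file along these lines (a sketch; the setting name comes from the parallel config, but the patch file itself is our own addition):

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
 <sitecore>
  <settings>
   <setting name="Publishing.MaxDegreeOfParallelism">
    <patch:attribute name="value">8</patch:attribute>
   </setting>
  </settings>
 </sitecore>
</configuration>
```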

There is also Sitecore.Publishing.Optimizations.config, which contains, as the name implies, some optimization settings for publishing. The file comments state that the settings are experimental, and that you should evaluate them before using them in production. For purposes of this test, I ignored this file.

With parallel publishing enabled we see a much shorter publish time of around 25 minutes.

12164 14:27:10 INFO  Job started: Publish to 'web'
12164 14:52:58 INFO  Job ended: Publish to 'web' (units processed: 106669)

Publishing Optimizations

I reran the previous test with the Sitecore.Publishing.Optimizations.config enabled, along with the parallel publishing. This shortened the publish to around 15 minutes.

9836 15:52:34 INFO  Job started: Publish to 'web'
9836 16:07:20 INFO  Job ended: Publish to 'web' (units processed: 106669)

Sitecore Publishing Service

New in Sitecore 8.2 is the Publishing Service, a separate web application written in .NET Core that replaces the existing publishing mechanism in your Sitecore site. The documentation on setting up this service is thorough, so kudos to Sitecore for that; however, it can be a bit dense. I found this blog post quite helpful in clearing up my confusion. Using it in conjunction with the official documentation, I was able to set up the service in less than an hour.

I ran into a problem using this method, however. The Publishing Service uses new logic to gather the items it needs to publish, and one of the things it keys off of is the Revision field. The FillDB tool doesn’t explicitly write to the Revision field, so the service didn’t publish any of my generated items. I wound up running a Sitecore PowerShell script to make a simple edit to these items, forcing the Revision field to be written. After that, my items published as expected.

The results were amazing. The Publishing Service published the entire site, over 100,000 items, in just over 4 minutes. That’s over 20x faster than the default publish settings.

2016-10-19 16:34:17.027 -04:00 [Information] New Job queued : 980bee8e-a132-4041-82d8-155b8496b19f - Targets: "Internet"
2016-10-19 16:39:07.304 -04:00 [Information] Job Result: 95b88a85-64f4-465e-b33d-a7a901331488 - "Complete" - "OK". Duration: 00:04:05.2786436

Summary

Each of these optimizations comes with caveats. Parallel publishing can introduce concurrency issues if you’re firing events during publish. The optimization config settings need to be vetted before rolling out, as they disable or alter many features you may be using, even if you don’t realize you’re using them.

If you’re on Sitecore 8.2 I strongly recommend giving the Publishing Service a look. Like any change to your system, you’ll want to test the effects it has on your publishing events and other hooks before rolling it out.