
When you're building applications that process large amounts of data, you quickly run into a fundamental problem: trying to do everything at once leads to timeouts, crashes, and frustrated users. The solution isn't to buy bigger servers. It's to break big jobs into small, manageable pieces.
Supabase gives you three tools that work beautifully together for this: Edge Functions for serverless compute, Cron for scheduling, and database queues for reliable job processing.
Here's how to use them to build a system that can handle serious scale.
The three-layer pattern
The architecture is simple but powerful. Think of it like an assembly line:
- **Collection**: Cron jobs run Edge Functions that discover work and add tasks to queues
- **Distribution**: Other cron jobs route tasks from the main queues to specialized processing queues
- **Processing**: Specialized workers handle specific types of tasks from their assigned queues
This breaks apart the complexity. Instead of one giant function that scrapes websites, processes content with AI, and stores everything, you have focused functions that each do one thing well.
Real example: Building an NFL news aggregator
Let's say you want to build a dashboard that tracks NFL (American football) news from multiple sources, including NFL-related websites and YouTube videos, automatically tags articles by topic, and lets users search by player or team. When users spot an article they're interested in, they can click through to the site that hosts it. It's like a dedicated Twitter feed for the NFL without any of the toxicity.
This sounds straightforward, but at scale it becomes complex fast. You need to monitor dozens of news sites, process hundreds of articles daily, make API calls to OpenAI for content analysis, generate vector embeddings for search, and store everything efficiently. Do this wrong and a single broken webpage crashes your entire pipeline.
We need to build a more resilient approach. With Supabase Edge Functions, Cron, and Queues, we have the building blocks for a robust content extraction and categorization pipeline.
Setting up the foundation
Everything starts with the database, and Supabase is Postgres at its core. We know what we’re getting: scalable, dependable, and standard.
The database design for the application follows a clean pattern. You have content tables for storing articles and videos, queue tables for managing work, entity tables for NFL players and teams, and relationship tables linking everything together. For example:
```sql
create table articles (
  url text unique not null,
  headline text,
  content text,
  embedding vector(1536)
);
```
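The queue tables are plain Postgres tables too. Here's a minimal sketch of the NFL queue; the exact columns are an assumption, but the processors below rely on a `processed` flag and `created_at` ordering:

```sql
-- Hypothetical queue table: processors read the processed flag
-- and work through rows in created_at order
create table nfl_queue (
  id bigint generated always as identity primary key,
  url text not null,
  processed boolean not null default false,
  created_at timestamptz not null default now()
);
```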
Collection: Finding new content
The collection layer runs on a schedule to discover new NFL-related articles and videos. We create a collector for every site we want to search, and a cron job triggers every 30 minutes to begin collection:
```sql
SELECT cron.schedule(
  'nfl-collector',
  '*/30 * * * *',
  $$SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/collect-content'
  )$$
);
```
The Edge Function does the actual scraping. The trick is being selective about what you collect:
```typescript
function isRelevantArticle(url: string): boolean {
  return url.includes('/news/') && !url.includes('/video/')
}
```
This simple filter prevents collecting promotional content or videos. You only want actual news articles.
When parsing HTML, you need to handle relative URLs properly:
```typescript
if (href.startsWith('/')) {
  href = BASE_URL + href
}
```
And always deduplicate within a single scraping session:
```typescript
const seen = new Set<string>()
if (!seen.has(href)) {
  seen.add(href)
  articles.push({ url: href, site: 'nfl' })
}
```
For database insertion, let the database handle duplicates rather than checking in your application:
```typescript
const { error } = await supabase.from('articles').insert({ url, site })
if (error && !error.message.includes('duplicate')) {
  console.error(`Error inserting: ${url}`, error)
}
```
This approach is more reliable than complex application-level deduplication logic.
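If you'd rather not string-match on the error message, supabase-js can express the same intent as an upsert that ignores conflicts on the unique `url` column; a sketch:

```typescript
// Leans on the unique constraint on articles.url;
// ignoreDuplicates makes the insert behave like ON CONFLICT DO NOTHING
const { error } = await supabase
  .from('articles')
  .upsert({ url, site }, { onConflict: 'url', ignoreDuplicates: true })

if (error) {
  console.error(`Error inserting: ${url}`, error)
}
```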
Distribution: Smart routing
The distribution layer identifies articles that need processing and routes them to appropriate queues. The key insight is using separate queue tables for different content sources: NFL.com articles need different parsing than ESPN articles, so they get routed to specialized processors. Distribution runs more often than collection, every 5 minutes:
```sql
SELECT cron.schedule(
  'distributor',
  '*/5 * * * *',
  $$SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/distribute-work'
  )$$
);
```
The Edge Function finds unprocessed articles using a simple SQL query:
```typescript
const { data } = await supabase
  .from('articles')
  .select('url, site')
  .is('headline', null) // Missing headline means unprocessed
  .limit(50)
```
Then it routes based on the source site:
```typescript
if (article.site === "nfl") {
  await supabase.from("nfl_queue").insert({ url: article.url });
} else if (article.site === "espn") {
  await supabase.from("espn_queue").insert({ url: article.url });
}
```
This separation is crucial because each site has different HTML structures and parsing requirements.
Processing: The heavy lifting
Each content source gets its own processor running on its own schedule. NFL.com gets processed every 15 seconds because it's high priority:

```sql
SELECT cron.schedule(
  'nfl-processor',
  '15 seconds',
  $$SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/process-nfl'
  )$$
);
```
The processor handles one article at a time to stay within Edge Function timeout limits:
```typescript
const { data } = await supabase
  .from('nfl_queue')
  .select('id, url')
  .eq('processed', false)
  .order('created_at')
  .limit(1)
```
Content extraction requires site-specific CSS selectors:
```typescript
const headline = $('h1').first().text().trim()
const content = $('.article-body').text().trim() || $('article').text().trim()
```
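These snippets assume a cheerio-style `$` has already been created from the fetched page. A sketch of that setup, using the npm specifier Deno supports (one of several ways to load cheerio):

```typescript
// Hypothetical setup for the `$` used in the extraction snippets
import * as cheerio from 'npm:cheerio'

const response = await fetch(url)
const html = await response.text()
const $ = cheerio.load(html)
```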
Date parsing often needs custom logic for each site's format:
```typescript
const dateText = $('.publish-date').text()
const match = dateText.match(/(\w+ \d+, \d{4})/)
if (match) {
  publication_date = new Date(match[1])
}
```
After scraping, the article gets analyzed with AI to extract entities:
```typescript
const result = await classifyArticle(headline, content)
const playerIds = await upsertPlayers(supabase, result.players)
const teamIds = await upsertTeams(supabase, result.teams)
```
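`classifyArticle` and the upsert helpers are elided here. A minimal sketch of the classifier, assuming an already-configured `openai` client and a chat call that returns JSON lists of player and team names (the prompt shape and model choice are assumptions):

```typescript
// Hypothetical classifyArticle; prompt shape and model are assumed
async function classifyArticle(headline: string, content: string) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'user',
        content: `Extract NFL entities as JSON {"players": [], "teams": []}
Headline: ${headline}
Article: ${content.slice(0, 4000)}`,
      },
    ],
  })
  return JSON.parse(completion.choices[0].message.content ?? '{}') as {
    players: string[]
    teams: string[]
  }
}
```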
Finally, create the relationships and generate embeddings:
```typescript
await supabase.from("article_players").insert(
  playerIds.map(id => ({ article_url: url, player_id: id }))
);

const embedding = await generateEmbedding(`${headline}\n${content}`);
await supabase.from("articles").update({ embedding }).eq("url", url);
```
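`generateEmbedding` just needs to return a 1536-dimension vector to match the `embedding vector(1536)` column. A sketch against OpenAI's embeddings endpoint, assuming the same `openai` client; the model choice is an assumption:

```typescript
// Hypothetical generateEmbedding; the assumed model happens to
// produce 1536-dimension vectors, matching the column type
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text.slice(0, 8000), // stay under the model's input limit
  })
  return response.data[0].embedding
}
```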
The critical pattern is the finally block. We use it to always mark queue items as processed, preventing infinite loops when articles fail to process:
```typescript
try {
  // Process article
} finally {
  await supabase.from("nfl_queue")
    .update({ processed: true })
    .eq("id", item.id);
}
```
Monitoring with Sentry
While the finally block prevents infinite loops, you still need visibility into what's actually failing. Sentry integration gives you detailed error tracking for your Edge Functions.
First, set up Sentry in your Edge Function:
```typescript
import { captureException, init } from 'https://deno.land/x/sentry/index.js'

init({
  dsn: Deno.env.get('SENTRY_DSN'),
  environment: Deno.env.get('ENVIRONMENT') || 'production',
})
```
Then wrap your processing logic with proper error capture:
```typescript
try {
  const content = await scrapeArticle(url);
  const analysis = await classifyArticle(headline, content);
  await storeArticle(article, analysis);
} catch (error) {
  // Capture the full context for debugging
  captureException(error, {
    tags: {
      function: "nfl-processor",
      site: article.site
    },
    extra: {
      url: article.url,
      queueId: queueItem.id
    }
  });
  console.error(`Failed to process ${url}:`, error);
} finally {
  await supabase.from("nfl_queue")
    .update({ processed: true })
    .eq("id", queueItem.id);
}
```
This gives you real-time alerts when processors fail and detailed context for debugging production issues.
Processing user interactions through the pipeline
The same pipeline pattern works for user-generated events. When someone clicks, shares, or saves an article, you don't want to block their response while updating trending scores for every player and team mentioned in that article.
Instead, treat interactions like any other job to be processed:
```typescript
// Just record the interaction quickly
await supabase.from('interaction_queue').insert({
  article_url: url,
  user_id: userId,
  interaction_type: 'share',
})
```
Then let a separate cron job process the trending updates in batches:
```sql
SELECT cron.schedule(
  'process-interactions',
  '*/2 * * * *', -- Every 2 minutes
  $$SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/process-interactions'
  )$$
);
```
The processor can handle multiple interactions efficiently:
```typescript
const { data: interactions } = await supabase
  .from('interaction_queue')
  .select('*')
  .eq('processed', false)
  .limit(100)
```
This keeps your user interface snappy while ensuring trending scores get updated reliably. If the trending processor goes down, interactions are safely queued and will be processed when it recovers.
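The rest of the processor aggregates the batch and marks it done. A sketch, where `increment_trending_scores` is a hypothetical Postgres function rather than part of the original schema:

```typescript
// Hypothetical aggregation step: bump scores for every player and
// team linked to each interacted-with article
for (const interaction of interactions) {
  await supabase.rpc('increment_trending_scores', {
    target_url: interaction.article_url,
  })
}

// Mark the whole batch processed in one statement
await supabase
  .from('interaction_queue')
  .update({ processed: true })
  .in('id', interactions.map((i) => i.id))
```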
AI-powered content scoring
To surface the most important content automatically, use AI to analyze article context and assign importance scores.
Define scores for different news types:
```typescript
const CONTEXT_SCORES = {
  championship: 9,
  trade: 6,
  injury: 4,
  practice: 1,
}
```
Prompt OpenAI with structured output:
```typescript
const prompt = `Analyze this headline: "${headline}"
Return JSON: {"context": "trade|injury|etc", "score": 1-9}`;

const result = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  response_format: { type: "json_object" },
  messages: [{ role: "user", content: prompt }]
});
```
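The model's JSON still needs to be parsed and sanity-checked against the score table. One way to do that, with field names matching the prompt above:

```typescript
// Prefer the lookup table; fall back to the model's own score
const analysis = JSON.parse(result.choices[0].message.content ?? '{}')
const context = analysis.context as keyof typeof CONTEXT_SCORES
const score = CONTEXT_SCORES[context] ?? analysis.score ?? 1
```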
Process articles in batches to manage API costs:
```typescript
const unprocessed = articles.filter((a) => !processedUrls.has(a.url)).slice(0, 10)

for (const article of unprocessed) {
  const analysis = await analyzeArticle(article)
  await storeAnalysis(article.url, analysis)
}
```
Background tasks for expensive operations
Some operations are too expensive to run synchronously, even in your cron-triggered processors. Vector embedding generation and bulk AI analysis benefit from background task patterns.
Edge Functions support background tasks that continue processing after the main response completes:
```typescript
// In your article processor
const article = await scrapeAndStore(url);

// Start expensive operations in background
const backgroundTasks = Promise.all([
  generateEmbedding(article),
  analyzeWithAI(article),
  updateRelatedContent(article)
]).catch(error => {
  captureException(error, {
    tags: { operation: "background-tasks" },
    extra: { articleUrl: url }
  });
});

// waitUntil lets the tasks keep running after the response returns
EdgeRuntime.waitUntil(backgroundTasks);

// Main processing continues immediately
await markAsProcessed(queueItem.id);
```
For operations that might take longer than Edge Function limits, break them into smaller background chunks:
```typescript
async function generateEmbeddingInBackground(article: Article) {
  // Process content in chunks
  const chunks = splitIntoChunks(article.content, 1000);

  for (const chunk of chunks) {
    // Yield to the event loop between chunks so each one runs
    // as its own task instead of one long blocking operation
    await new Promise((resolve) => setTimeout(resolve, 0));

    const embedding = await generateEmbedding(chunk);
    await storeEmbedding(article.id, embedding);
  }
}
```
This pattern keeps your main processing pipeline fast while ensuring expensive operations complete reliably.
Why this works
This pattern succeeds because it embraces the constraints of serverless computing rather than fighting them. Edge Functions have time limits, so you process one item at a time. External APIs have rate limits, so you control timing with cron schedules. Failures happen, so you isolate them to individual tasks.
The result is a system that scales horizontally by adding more cron jobs and queues. Each component can fail independently without bringing down the whole pipeline. Users get fresh content as it becomes available rather than waiting for batch jobs to complete.
Most importantly, it's built entirely with Supabase primitives — no external queue systems or job schedulers required. You get enterprise-grade reliability with startup simplicity.