fresh take on pgvector. most "vector in postgres" content is about RAG / document search. i used it for a different thing: finding similar users based on their behavior, not their content.
context: our B2B SaaS has a feature where a user can ask "who else uses this similarly to me?" we wanted to surface other users whose interaction patterns matched theirs — for collaboration suggestions, template sharing, peer connections.
couldn't do this with traditional similarity (cosine on raw event counts). users have different volumes of activity, so "user with most events" was always the closest match. needed something that captured pattern, not magnitude.
what i shipped:
a vector embedding per user, computed from their last 30 days of behavior. the vector represents which actions they take in which proportions, normalized.
create extension if not exists vector;
create table user_behavior_vectors (
user_id uuid primary key references auth.users(id),
vector vector(64),
computed_at timestamptz default now()
);
create index on user_behavior_vectors using hnsw (vector vector_cosine_ops);
the vector dimensions correspond to behaviors. our 64 dimensions are 64 distinct event types we track. each dimension's value is the normalized frequency of that event in the user's last 30 days.
building the vectors (runs nightly via pg_cron):
create or replace function refresh_user_behavior_vectors()
returns void as $$
declare
e_types text[];
begin
-- get our 64 tracked event types
select array_agg(name order by name) into e_types
from event_type_registry where included_in_vector = true limit 64;
-- for each active user, compute their vector
insert into user_behavior_vectors (user_id, vector, computed_at)
select
user_id,
(select array_agg(coalesce(freq, 0)::real)::vector
from unnest(e_types) as et
left join lateral (
select count(*)::float
/ nullif((select count(*) from events e2
The user describes a feature implemented using pgvector to find similar users based on behavior in a B2B SaaS context. This involves creating vector embeddings from user actions over the last 30 days, stored in a PostgreSQL table with an HNSW index for fast querying. The feature aims to enhance collaboration by suggesting users with similar interaction patterns. The user seeks feedback on similar use cases and experiences with behavior-as-vector approaches.
Sounds complex what’s your background?
[ Removed by Reddit ]
where e2.user_id = events.user_id
and e2.occurred_at > now() - interval '30 days'), 0)
as freq
from events
where event_type = et
and user_id = events.user_id
and occurred_at > now() - interval '30 days'
) on true) as vector,
now()
from (select distinct user_id from events
where occurred_at > now() - interval '30 days') events
on conflict (user_id) do update
set vector = excluded.vector, computed_at = now();
end;
$$ language plpgsql;
(the actual production version is cleaner; simplified here for readability.)
querying it:
-- find users most similar to user X
select
ubv.user_id,
1 - (ubv.vector <=> (select vector from user_behavior_vectors where user_id = $1)) as similarity
from user_behavior_vectors ubv
where ubv.user_id != $1
order by ubv.vector <=> (select vector from user_behavior_vectors where user_id = $1)
limit 10;
this returns the 10 users whose 30-day behavior is most similar to user $1's. ~5ms with the HNSW index in place. ~50k users in our system.
what surprised me:
it actually works. the "similar users" surfaced by this query are noticeably more similar than what i'd get from "users in the same workspace" or "users who signed up the same week." real-world similarity, captured by behavior.
the dimensionality (64) is way smaller than text-embedding RAG vectors (typically 768-1536). HNSW on 64 dimensions is wildly fast. the index is ~30MB for 50k users.
what doesn't work:
new users have a sparse vector (few events, mostly zeros). their "similar users" are noisy until they have ~50+ events of history. we handle this in the product by not showing similar-user suggestions to users with <50 events.
if event types change (new events added, old ones removed), all vectors need to be rebuilt. we have a vector_schema_version field; when it changes, the nightly job rebuilds from scratch.
curious if anyone else is doing behavior-as-vector or other unusual pgvector use cases. specifically interested in the "embeddings of users for personalization" pattern at larger scale.