-
Notifications
You must be signed in to change notification settings - Fork 1.5k
JAVA-5988 #1862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JAVA-5988 #1862
Conversation
rozza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its looking good, add some unit tests to AggregatesTest, so that the shape of the produced Bson is as expected. So there is some test coverage for now.
Also don't forget to add the license headers to the new files!
hey @rozza thanks for your feedback
|
| @Beta(Reason.SERVER) | ||
| public static Bson vectorSearch( | ||
| final FieldSearchPath path, | ||
| final Bson query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this was considered already, but I wanted to propose we consider an alternative: model the auto-embedding query as a typed object instead of accepting raw Bson.
Why raw Bson is tricky here
Using raw Bson for the query is “stringly-typed”:
- typos / wrong types / wrong structure compile and fail at runtime .
- it’s not very discoverable (users would have to read the docs to learn the required document shape).
//What goes in `???`? User must read documentation. IDE provides no help.
vectorSearch(path, ???, index, limit, options)Also, adding model as a separate parameter feels easy to mix up since model and index are both String.
This matches existing patterns in the driver
The alternative proposed below follows patterns already in the driver's API:
- Static factory creates a variant:
public interface SearchPath extends Bson {
static FieldSearchPath fieldPath(final String path) {
return new SearchConstructibleBson(new BsonDocument(SEARCH_PATH_VALUE_KEY, new BsonString(path)));
}- Optional fluent config (e.g., FieldSearchPath.multi(...))
public interface FieldSearchPath extends SearchPath {
FieldSearchPath multi(String analyzerName); // optional
}
// usage:
fieldPath("plot")
fieldPath("plot").multi("english") // with optional analyzerThe equivalent here would be:
VectorSearchQuery.textQuery("...") (factory).model("...") (optional fluent config)
API surface:
a typed query abstraction:
@Beta(Reason.SERVER)
public interface VectorSearchQuery extends Bson {
static TextVectorSearchQuery textQuery(final String text) {
return new TextVectorSearchQueryImpl(notNull("text", text), null);
}
}
@Beta(Reason.SERVER)
public interface TextVectorSearchQuery extends VectorSearchQuery {
TextVectorSearchQuery model(String modelName);
}Implementation remains package-private and immutable: (expandable)
final class TextVectorSearchQueryImpl implements TextVectorSearchQuery {
private final String text;
private final @Nullable String model;
TextVectorSearchQueryImpl(String text, @Nullable String model) { ... }
@Override public TextVectorSearchQuery model(String modelName) {
return new TextVectorSearchQueryImpl(text, notNull("modelName", modelName));
}
@Override public @Nullable String getModel() { return model; }
@Override public <TDocument> BsonDocument toBsonDocument(Class<TDocument> dc, CodecRegistry cr) {
return new Document("text", text).toBsonDocument(dc, cr);
}
}Then Aggregate methods becomes:
@Beta(Reason.SERVER)
public static Bson vectorSearch(
FieldSearchPath path,
VectorSearchQuery query,
String index,
long limit,
VectorSearchOptions options
) { ... }The usage of the API is type safe:
vectorSearch(fieldPath("plot"),
textQuery("movies about cars").model("voyage-4-large"),
"my_index",
10,
options
)API Evolution Support (future modalities):
This approach can evolve without changing the Aggregates.vectorSearch(...) signature by adding new factories and subtypes later (e.g. imageQuery(...), audioQuery(...)) with modality-specific options.
- add VectorSearchQuery.imageQuery(...) returning
ImageVectorSearchQuery - add image-specific fluent methods on that subtype (resolution(...), etc.)
VectorSearchQuery (interface factory)
- textQuery(String) -> TextVectorSearchQuery
- (future) imageQuery(BsonBinary) -> ImageVectorSearchQuery
- (future) audioQuery(BsonBinary) -> AudioVectorSearchQuery
TextVectorSearchQuery (current)
- model(String)
ImageVectorSearchQuery (future)
- model(String)
- resolution(int, int)
AudioVectorSearchQuery (future)
- model(String)
- sampleRate(int)
This keeps the stage discoverable, compile-time safe, and provides a potential path for future query modalities. What do you think?
P.S : the server may ship modalities incrementally (e.g., text could reach GA first, with image/audio staying Beta).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to include fluent builders thanks Slav really good feedback
| * Creates a {@code $vectorSearch} pipeline stage supported by MongoDB Atlas. | ||
| * You may use the {@code $meta: "vectorSearchScore"} expression, e.g., via {@link Projections#metaVectorSearchScore(String)}, | ||
| * to extract the relevance score assigned to each found document. | ||
| * This overload is used for auto-embedding in atlas community edition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it also be used in Enterprise Edition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only in community edition right now, but I believe in the future it will be extended to atlas too
driver-core/src/test/functional/com/mongodb/client/model/AggregatesTest.java
Outdated
Show resolved
Hide resolved
| @Beta(Reason.SERVER) | ||
| public static Bson vectorSearch( | ||
| final FieldSearchPath path, | ||
| final TextVectorSearchQuery query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now I added this specific Text interface because with plain Vector we won't have an access to get model without casting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why TextVectorSearchQuery is used in the signature today as it’s the only way to access getModel() without a cast. However, I think using VectorSearchQuery as parameter helps discoverability. We could keep VectorSearchQuery as the parameter type and casting to AbstractVectorSearchQuery inside the stage builder.
If users see this signature:
public static Bson vectorSearch(
FieldSearchPath path,
TextVectorSearchQuery query,
String index, ...)
a natural question is: “How do I create a TextVectorSearchQuery?” If they type TextVectorSearchQuery in the IDE to look for clues (method factories in this case), none will show up (since the factory lives on VectorSearchQuery). Using VectorSearchQuery keeps construction discoverable via VectorSearchQuery.textQuery(...).
Internally, getModel() could be accessed via the internal abstract base, similar to the approach used in the Client Bulk Write internal models:
AbstractClientDeleteModel.java#L23-L39
Suggested change: commit
| /** | ||
| * Specifies the embedding model to use for generating embeddings from the query text. | ||
| * <p> | ||
| * If not specified, the model configured in the vector search index definition will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]
| * If not specified, the model configured in the vector search index definition will be used. | |
| * If not specified, the model configured in the vector {@link SearchIndexModel} definition will be used. |
Note: requires an import.
| @Beta(Reason.SERVER) | ||
| public static Bson vectorSearch( | ||
| final FieldSearchPath path, | ||
| final TextVectorSearchQuery query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why TextVectorSearchQuery is used in the signature today as it’s the only way to access getModel() without a cast. However, I think using VectorSearchQuery as parameter helps discoverability. We could keep VectorSearchQuery as the parameter type and casting to AbstractVectorSearchQuery inside the stage builder.
If users see this signature:
public static Bson vectorSearch(
FieldSearchPath path,
TextVectorSearchQuery query,
String index, ...)
a natural question is: “How do I create a TextVectorSearchQuery?” If they type TextVectorSearchQuery in the IDE to look for clues (method factories in this case), none will show up (since the factory lives on VectorSearchQuery). Using VectorSearchQuery keeps construction discoverable via VectorSearchQuery.textQuery(...).
Internally, getModel() could be accessed via the internal abstract base, similar to the approach used in the Client Bulk Write internal models:
AbstractClientDeleteModel.java#L23-L39
Suggested change: commit
|
Approved, given functional tests aren’t currently run in CI; we should enable them in a follow-up PR under JAVA-6059. |
@rozza is out, concerns should be addressed
* JAVA-5988 * JAVA-5988 address PR comments * JAVA-5988 add a unit test to AggregatesTest * JAVA-5988 embedding fluent API with builders * JAVA-5988 introduce AbstractVectorSearchQuery * JAVA-5988 fix checkstyle * JAVA05988 use VectorSearchQuery as param * JAVA-5988 add more unit tests * JAVA-5988 add autoembedding to Scala --------- Co-authored-by: Almas Abdrazak <[email protected]>
JAVA-5988