browserbase / stagehand
- Monday, December 23, 2024, 00:00:03
An AI web browsing framework focused on simplicity and extensibility.
Note
Stagehand is currently available as an early release, and we're actively seeking feedback from the community. Please join our Slack community to stay updated on the latest developments and provide feedback.
Stagehand is the AI-powered successor to Playwright, offering three simple APIs (act, extract, and observe) that provide the building blocks for natural language driven web automation.
The goal of Stagehand is to provide a lightweight, configurable framework, without overly complex abstractions, as well as modular support for different models and model providers. It's not going to order you a pizza, but it will help you reliably automate the web.
Each Stagehand function takes in an atomic instruction, such as act("click the login button") or extract("find the red shoes"), generates the appropriate Playwright code to accomplish that instruction, and executes it.
Instructions should be atomic to increase reliability, and step planning should be handled by the higher level agent. You can use observe() to get a suggested list of actions that can be taken on the current page, and then use those to ground your step planning prompts.
Stagehand is open source and maintained by the Browserbase team. We believe that by enabling more developers to build reliable web automations, we'll expand the market of developers who benefit from our headless browser infrastructure. This is the framework that we wished we had while tinkering on our own applications, and we're excited to share it with you.
We also install zod to power typed extraction
npm install @browserbasehq/stagehand zod
You'll need to provide your API Key for the model provider you'd like to use. The default model provider is OpenAI, but you can also use Anthropic or others. More information on supported models can be found in the API Reference.
Ensure that an OpenAI API Key or Anthropic API key is accessible in your local environment.
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-...
If you plan to run the browser locally, you'll also need to install Playwright's browser dependencies.
npm exec playwright install
Then you can create a Stagehand instance like so:
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
const stagehand = new Stagehand({
  env: "LOCAL",
});
If you plan to run the browser remotely, you'll need to set a Browserbase API Key and Project ID.
export BROWSERBASE_API_KEY=...
export BROWSERBASE_PROJECT_ID=...
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
const stagehand = new Stagehand({
  env: "BROWSERBASE",
  enableCaching: true,
});
await stagehand.init();
await stagehand.page.goto("https://github.com/browserbase/stagehand");
await stagehand.act({ action: "click on the contributors" });
const contributor = await stagehand.extract({
  instruction: "extract the top contributor",
  schema: z.object({
    username: z.string(),
    url: z.string(),
  }),
});
await stagehand.close();
console.log(`Our favorite contributor is ${contributor.username}`);
This simple snippet will open a browser, navigate to the Stagehand repo, and log the top contributor.
This constructor is used to create an instance of Stagehand.
Arguments:
env: 'LOCAL' or 'BROWSERBASE'. Defaults to 'BROWSERBASE'.
modelName: (optional) an AvailableModel string to specify the default model to use.
modelClientOptions: (optional) configuration options for the model client.
enableCaching: a boolean that enables caching of LLM responses. When set to true, the LLM requests will be cached on disk and reused for identical requests. Defaults to false.
headless: a boolean that determines if the browser runs in headless mode. Defaults to false. When the env is set to BROWSERBASE, this will be ignored.
domSettleTimeoutMs: an integer that specifies the timeout in milliseconds for waiting for the DOM to settle. Defaults to 30000 (30 seconds).
apiKey: (optional) your Browserbase API key. Defaults to the BROWSERBASE_API_KEY environment variable.
projectId: (optional) your Browserbase project ID. Defaults to the BROWSERBASE_PROJECT_ID environment variable.
browserbaseSessionCreateParams: configuration options for creating new Browserbase sessions.
browserbaseSessionID: ID of an existing live Browserbase session. Overrides browserbaseSessionCreateParams.
logger: a function that handles log messages. Useful for custom logging implementations.
verbose: an integer that enables several levels of logging during automation:
  0: limited to no logging
  1: SDK-level logging
  2: LLM-client level logging (most granular)
debugDom: a boolean that draws bounding boxes around elements presented to the LLM during automation.
Returns: an instance of the Stagehand class configured with the specified options.
Example:
// Basic usage
const stagehand = new Stagehand();
// Custom configuration
const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 1,
  headless: true,
  enableCaching: true,
  logger: (logLine) => {
    console.log(`[${logLine.category}] ${logLine.message}`);
  },
});
// Resume existing Browserbase session
const stagehand = new Stagehand({
  env: "BROWSERBASE",
  browserbaseSessionID: "existing-session-id",
});
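To make the enableCaching option more concrete, here is a minimal sketch of reusing a response for identical requests. It is an illustration only: the LLMRequest shape, the LLMCache class, and the callModel callback are hypothetical names, and the cache lives in memory, whereas the option description above says Stagehand caches on disk.

```typescript
// Hypothetical request shape; real LLM requests carry more fields.
type LLMRequest = { model: string; prompt: string };

// Hypothetical in-memory cache. Stagehand's own cache is described as on-disk.
class LLMCache {
  private readonly entries = new Map<string, string>();

  // Build the key from an ordered tuple so property order cannot change it.
  private keyFor(request: LLMRequest): string {
    return JSON.stringify([request.model, request.prompt]);
  }

  // Return the cached response for an identical request, else call the model once.
  getOrCompute(request: LLMRequest, callModel: (r: LLMRequest) => string): string {
    const key = this.keyFor(request);
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const fresh = callModel(request);
    this.entries.set(key, fresh);
    return fresh;
  }
}

// Stub model call that counts how often it is invoked.
let modelCalls = 0;
const stubModel = (r: LLMRequest): string => {
  modelCalls += 1;
  return `response to: ${r.prompt}`;
};

const cache = new LLMCache();
const firstAnswer = cache.getOrCompute({ model: "gpt-4o", prompt: "click login" }, stubModel);
const secondAnswer = cache.getOrCompute({ prompt: "click login", model: "gpt-4o" }, stubModel);
console.log(firstAnswer === secondAnswer, modelCalls);
```

The second call passes the same fields in a different order and still hits the cache, so the stub model runs only once.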
init() asynchronously initializes the Stagehand instance. It should be called before any other methods.
Warning
Passing parameters to init() is deprecated and will be removed in the next major version. Use the constructor options instead.
Arguments:
modelName: (deprecated, optional) an AvailableModel string to specify the model to use. This will be used for all other methods unless overridden.
modelClientOptions: (deprecated, optional) configuration options for the model client
domSettleTimeoutMs: (deprecated, optional) timeout in milliseconds for waiting for the DOM to settle
Returns: a Promise that resolves to an object containing:
  debugUrl: a string representing the URL for live debugging. This is only available when using a Browserbase browser.
  sessionUrl: a string representing the session URL. This is only available when using a Browserbase browser.
  sessionId: a string representing the session ID. This is only available when using a Browserbase browser.
Example:
await stagehand.init();
act() allows Stagehand to interact with a web page. Provide an action like "search for 'x'", or "select the cheapest flight presented" (small atomic goals perform the best).
Arguments:
action: a string describing the action to perform
modelName: (optional) an AvailableModel string to specify the model to use
modelClientOptions: (optional) configuration options for the model client
useVision: (optional) a boolean or "fallback" to determine if vision-based processing should be used. Defaults to "fallback"
variables: (optional) a Record<string, string> of variables to use in the action. Variables in the action string are referenced using %variable_name%
domSettleTimeoutMs: (optional) timeout in milliseconds for waiting for the DOM to settle
Returns: a Promise that resolves to an object containing:
  success: a boolean indicating if the action was completed successfully.
  message: a string providing details about the action's execution.
  action: a string describing the action performed.
Example:
// Basic usage
await stagehand.act({ action: "click on add to cart" });
// Using variables
await stagehand.act({
  action: "enter %username% into the username field",
  variables: {
    username: "john.doe@example.com",
  },
});
// Multiple variables
await stagehand.act({
  action: "fill in the form with %username% and %password%",
  variables: {
    username: "john.doe",
    password: "secretpass123",
  },
});
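As a rough mental model for the %variable_name% placeholders used above, the sketch below shows one plausible way such tokens could be resolved against the variables record. The substituteVariables helper is hypothetical and not part of Stagehand's API; when and how Stagehand actually performs the substitution is not stated here, so treat this as an assumption.

```typescript
// Hypothetical helper: illustrates placeholder resolution, not Stagehand's internals.
function substituteVariables(
  action: string,
  variables: Record<string, string>,
): string {
  return action.replace(/%(\w+)%/g, (token: string, name: string) => {
    // Only own keys count, so names like "constructor" are not resolved by accident.
    const value = Object.prototype.hasOwnProperty.call(variables, name)
      ? variables[name]
      : undefined;
    // Unknown placeholders are left as-is so a typo stays visible.
    return value ?? token;
  });
}

const filled = substituteVariables(
  "fill in the form with %username% and %password%",
  { username: "john.doe", password: "secretpass123" },
);
console.log(filled);
// prints "fill in the form with john.doe and secretpass123"
```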
extract() grabs structured text from the current page using zod. Given instructions and schema, you will receive structured data. Unlike some extraction libraries, Stagehand can extract any information on a page, not just the main article contents.
Arguments:
instruction: a string providing instructions for extraction
schema: a z.AnyZodObject defining the structure of the data to extract
modelName: (optional) an AvailableModel string to specify the model to use
modelClientOptions: (optional) configuration options for the model client
domSettleTimeoutMs: (optional) timeout in milliseconds for waiting for the DOM to settle
useTextExtract: (optional) a boolean to determine if text-based extraction should be used. Defaults to false
Returns: a Promise that resolves to the structured data as defined by the provided schema.
Example:
const price = await stagehand.extract({
  instruction: "extract the price of the item",
  schema: z.object({
    price: z.number(),
  }),
});
Note
observe() currently only evaluates the first chunk in the page.
observe() is used to get a list of actions that can be taken on the current page. It's useful for adding context to your planning step, or if you are unsure of what page you're on.
If you are looking for a specific element, you can also pass in an instruction to observe via: observe({ instruction: "{your instruction}"}).
Arguments:
instruction: (optional) a string providing instructions for the observation. Defaults to "Find actions that can be performed on this page."
modelName: (optional) an AvailableModel string to specify the model to use
modelClientOptions: (optional) configuration options for the model client
useVision: (optional) a boolean to determine if vision-based processing should be used. Defaults to false
domSettleTimeoutMs: (optional) timeout in milliseconds for waiting for the DOM to settle
Returns: a Promise that resolves to an array of objects containing:
  selector: a string representing the element selector
  description: a string describing the possible action
Example:
const actions = await stagehand.observe();
close() is a cleanup method to remove the temporary files created by Stagehand. It's highly recommended that you call this when you're done with your automation.
await stagehand.close();
page and context are instances of Playwright's Page and BrowserContext respectively. Use these methods to interact with the Playwright instance that Stagehand is using. Most commonly, you'll use page.goto() to navigate to a URL.
await stagehand.page.goto("https://github.com/browserbase/stagehand");
log() is used to print a message to the browser console. These messages will be persisted in the Browserbase session logs, and can be used to debug sessions after they've completed.
Make sure the log level is above the verbose level you set when initializing the Stagehand instance.
stagehand.log("Hello, world!");
Stagehand leverages a generic LLM client architecture to support various language models from different providers. This design allows for flexibility, enabling the integration of new models with minimal changes to the core system. Different models work better for different tasks, so you can choose the model that best suits your needs.
Stagehand currently supports the following models from OpenAI and Anthropic:
OpenAI Models:
gpt-4o
gpt-4o-mini
gpt-4o-2024-08-06
Anthropic Models:
claude-3-5-sonnet-latest
claude-3-5-sonnet-20240620
claude-3-5-sonnet-20241022
These models can be specified when initializing the Stagehand instance or when calling methods like act() and extract().
The SDK has two major phases:
Stagehand uses a combination of techniques to prepare the DOM.
The DOM Processing steps look as follows:
While LLMs will continue to increase context window length and reduce latency, giving any reasoning system less stuff to think about should make it more reliable. As a result, DOM processing is done in chunks in order to keep the context small per inference call. In order to chunk, the SDK considers a candidate element that starts in a section of the viewport to be a part of that chunk. In the future, padding will be added to ensure that an individual chunk does not lack relevant context. See this diagram for how it looks:
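Alongside the diagram, the chunking rule described above (a candidate element belongs to the viewport section in which it starts) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the Candidate type, the chunkByViewport function, and the use of a single top offset per element are hypothetical and may not match Stagehand's internals.

```typescript
// Hypothetical candidate shape: only the vertical start position matters here.
interface Candidate {
  id: number;
  top: number; // offset of the element's top edge from the top of the page, in pixels
}

// Group candidates by the viewport-sized section in which each one starts.
function chunkByViewport(
  candidates: Candidate[],
  viewportHeight: number,
): Map<number, Candidate[]> {
  if (viewportHeight <= 0) {
    throw new Error("viewportHeight must be positive");
  }
  const chunks = new Map<number, Candidate[]>();
  for (const candidate of candidates) {
    // An element that starts in a section counts toward that section,
    // even if it extends into the next one.
    const index = Math.floor(Math.max(0, candidate.top) / viewportHeight);
    const bucket = chunks.get(index) ?? [];
    bucket.push(candidate);
    chunks.set(index, bucket);
  }
  return chunks;
}

const chunks = chunkByViewport(
  [
    { id: 0, top: 10 },
    { id: 1, top: 790 },
    { id: 2, top: 800 },
    { id: 3, top: 2100 },
  ],
  800,
);
// Chunk 0 holds ids 0 and 1, chunk 1 holds id 2, chunk 2 holds id 3.
console.log(chunks);
```

Each inference call would then see only one chunk's candidates, which is what keeps the per-call context small.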
The act() and observe() methods can take a useVision flag. If this is set to true, the LLM will be provided with an annotated screenshot of the current page to identify which elements to act on. This is useful for complex DOMs that the LLM has a hard time reasoning about, even after processing and chunking. By default, this flag is set to "fallback", which means that if the LLM fails to successfully identify a single element, Stagehand will retry the attempt using vision.
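The three useVision settings can be read as a small retry policy. The sketch below is one hedged interpretation, simplified to synchronous calls so it runs standalone: findElement and the attempt callback are hypothetical stand-ins for an LLM call, not Stagehand functions, and a real implementation would be asynchronous.

```typescript
type UseVision = boolean | "fallback";

// Hypothetical helper. `attempt` stands in for one LLM call and returns
// the chosen element id, or null when no single element was identified.
function findElement(
  useVision: UseVision,
  attempt: (withVision: boolean) => string | null,
): { elementId: string | null; usedVision: boolean } {
  // true: use the screenshot from the start; false or "fallback": DOM only first.
  const firstWithVision = useVision === true;
  const first = attempt(firstWithVision);
  if (first !== null || useVision !== "fallback") {
    return { elementId: first, usedVision: firstWithVision };
  }
  // "fallback": the DOM-only attempt failed, so retry once with vision.
  return { elementId: attempt(true), usedVision: true };
}

// Stub in which only the vision-based attempt succeeds.
const visionOnly = (withVision: boolean): string | null =>
  withVision ? "el-42" : null;

const viaFallback = findElement("fallback", visionOnly); // retries with vision
const domOnly = findElement(false, visionOnly); // fails without retrying
console.log(viaFallback, domOnly);
```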
Now we have a list of candidate elements and a way to select them. We can present those elements with additional context to the LLM for extraction or action. While untested on a large scale, presenting a "numbered list of elements" guides the model to not treat the context as a full DOM, but as a list of related but independent elements to operate on.
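To illustrate what a "numbered list of elements" might look like when handed to the model, here is a small sketch. The CandidateElement shape and the exact line format are assumptions made for illustration; the prompt format Stagehand actually uses is not specified here.

```typescript
// Hypothetical, minimal description of a candidate element.
interface CandidateElement {
  tag: string;
  text: string;
}

// Render candidates as independent numbered lines rather than nested DOM markup.
function toNumberedList(elements: CandidateElement[]): string {
  return elements
    .map((el, index) => `${index}: <${el.tag}> ${el.text.trim()}`)
    .join("\n");
}

const listing = toNumberedList([
  { tag: "button", text: "Log in" },
  { tag: "a", text: " Pricing " },
]);
console.log(listing);
// prints:
// 0: <button> Log in
// 1: <a> Pricing
```

The model can then answer with an index, which maps straight back to one candidate element.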
In the case of action, we ask the LLM to write a Playwright method in order to do the correct thing. In our limited testing, Playwright syntax is much more effective than relying on built-in JavaScript APIs, possibly due to tokenization.
Lastly, we use the LLM to write future instructions to itself to help manage its progress and goals when operating across chunks.
Below is an example of how to extract a list of companies from the AI Grant website using both Stagehand and Playwright.
Prompting Stagehand is more literal and atomic than other higher level frameworks, including agentic frameworks. Here are some guidelines to help you craft effective prompts:
await stagehand.act({ action: "click the login button" });
const productInfo = await stagehand.extract({
  instruction: "find the red shoes",
  schema: z.object({
    productName: z.string(),
    price: z.number(),
  }),
});
Instead of combining actions:
// Avoid this
await stagehand.act({ action: "log in and purchase the first item" });
Split them into individual steps:
await stagehand.act({ action: "click the login button" });
// ...additional steps to log in...
await stagehand.act({ action: "click on the first item" });
await stagehand.act({ action: "click the purchase button" });
Use observe() to get actionable suggestions from the current page:
const actions = await stagehand.observe();
console.log("Possible actions:", actions);
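One way to use those suggestions for step planning is to fold them into the prompt of a higher-level agent. The sketch below assumes the selector and description fields documented for observe(); the buildPlanningPrompt function and the prompt wording are hypothetical, and the sample data is made up for illustration.

```typescript
// Shape of one observe() result, per the Returns description of observe().
interface ObservedAction {
  selector: string;
  description: string;
}

// Hypothetical helper: turn observed actions into grounding text for a planner.
function buildPlanningPrompt(goal: string, actions: ObservedAction[]): string {
  const options = actions
    .map((action, index) => `${index + 1}. ${action.description}`)
    .join("\n");
  return [
    `Goal: ${goal}`,
    "Actions available on the current page:",
    options || "(none observed)",
    "Reply with the single next atomic action to take.",
  ].join("\n");
}

// Made-up sample standing in for the result of stagehand.observe().
const sampleActions: ObservedAction[] = [
  { selector: "#login", description: "Click the login button" },
  { selector: "#search", description: "Type into the search box" },
];
const planningPrompt = buildPlanningPrompt("log in to the site", sampleActions);
console.log(planningPrompt);
```

The planner's reply can then be passed to act() as a single atomic instruction.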
// Too vague
await stagehand.act({ action: "find something interesting on the page" });
// Avoid combining actions
await stagehand.act({ action: "fill out the form and submit it" });
// Outside Stagehand's scope
await stagehand.act({ action: "book the cheapest flight available" });
By following these guidelines, you'll increase the reliability and effectiveness of your web automations with Stagehand. Remember, Stagehand excels at executing precise, well-defined actions, so keeping your instructions atomic will lead to the best outcomes.
We leave the agentic behaviour to higher-level agentic systems which can use Stagehand as a tool.
At a high level, we're focused on improving reliability, speed, and cost in that order of priority.
You can see the roadmap here. Looking to contribute? Read on!
Note
We highly value contributions to Stagehand! For support or code review, please join our Slack community.
First, clone the repo
git clone git@github.com:browserbase/stagehand.git
Then install dependencies
npm install
Ensure you have the .env file as documented above in the Getting Started section.
Then, run the example script npm run example.
A good development loop is:
You'll need a Braintrust API key to run evals
BRAINTRUST_API_KEY=""
After that, you can run all evals at once using npm run evals.
You can also run individual evals using npm run evals -- your_eval_name.
Running all evals can take some time. We have a convenience script example.ts where you can develop your new single eval before adding it to the set of all evals.
You can run npm run example to execute and iterate on the eval you are currently developing.
To add a new model to Stagehand, follow these steps:
1. Define the Model: Add the new model name to the AvailableModel type in the LLMProvider.ts file. This ensures that the model is recognized by the system.
2. Map the Model to a Provider: Update the modelToProviderMap in the LLMProvider class to associate the new model with its corresponding provider. This mapping is crucial for determining which client to use.
3. Implement the Client: If the new model requires a new client, implement a class that adheres to the LLMClient interface. This class should define all necessary methods, such as createChatCompletion.
4. Update the getClient Method: Modify the getClient method in the LLMProvider class to return an instance of the new client when the new model is requested.
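The steps above can be pictured with a self-contained sketch. The names AvailableModel, modelToProviderMap, LLMProvider, LLMClient, getClient, and createChatCompletion come from the steps; everything else is an assumption: the bodies, the ModelProvider type, the StubClient class, the placeholder model "my-new-model", and the simplified synchronous createChatCompletion signature do not reflect the real LLMProvider.ts.

```typescript
// Define the Model: "my-new-model" is a placeholder for the model being added.
type AvailableModel = "gpt-4o" | "claude-3-5-sonnet-latest" | "my-new-model";
// Hypothetical provider names for this sketch.
type ModelProvider = "openai" | "anthropic" | "my-provider";

interface LLMClient {
  // Simplified stand-in signature: string in, string out, synchronous.
  createChatCompletion(prompt: string): string;
}

// Implement the Client: a stub that only tags the prompt with its provider.
class StubClient implements LLMClient {
  private readonly provider: ModelProvider;

  constructor(provider: ModelProvider) {
    this.provider = provider;
  }

  createChatCompletion(prompt: string): string {
    return `[${this.provider}] ${prompt}`;
  }
}

class LLMProvider {
  // Map the Model to a Provider: the Record type forces an entry per model.
  private readonly modelToProviderMap: Record<AvailableModel, ModelProvider> = {
    "gpt-4o": "openai",
    "claude-3-5-sonnet-latest": "anthropic",
    "my-new-model": "my-provider",
  };

  // Update the getClient Method: pick the client from the model's provider.
  getClient(modelName: AvailableModel): LLMClient {
    return new StubClient(this.modelToProviderMap[modelName]);
  }
}

const client = new LLMProvider().getClient("my-new-model");
const reply = client.createChatCompletion("hello");
console.log(reply);
// prints "[my-provider] hello"
```

Typing the map as a Record keyed by AvailableModel means that adding a model name without a provider mapping fails at compile time, which is one reason to keep the two definitions side by side.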
Stagehand uses tsup to build the SDK and vanilla esbuild to build the scripts that run in the DOM.
Run npm run build, then npm pack to get a tarball for distribution.
This project relies heavily on Playwright as a resilient backbone to automate the web. It also would not be possible without the awesome techniques and discoveries made by tarsier and fuji-web.
Jeremy Press wrote the original MVP of Stagehand and continues to be a major ally to the project.
Licensed under the MIT License.
Copyright 2024 Browserbase, Inc.