OthersideAI / self-operating-computer
- Thursday, November 30, 2023, 00:00:01
A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs of a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
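To make the loop concrete, here is a minimal sketch of how such a framework can be wired up in Python, assuming the openai (>=1.0) and pyautogui packages. The prompt wording, the JSON action schema, and the percentage-based coordinates are illustrative assumptions, not the project's actual implementation.

    # Minimal sketch of the screenshot -> multimodal model -> action loop.
    # Assumes openai (>=1.0) and pyautogui; the schema below is illustrative.
    import base64
    import io
    import json

    import pyautogui
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    def capture_screen_b64() -> str:
        """Grab the current screen and return it as a base64-encoded PNG."""
        buffer = io.BytesIO()
        pyautogui.screenshot().save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()


    def decide_next_action(objective: str) -> dict:
        """Ask the vision model for the next mouse/keyboard action as JSON."""
        prompt = (
            "Objective: " + objective + ". Look at the screenshot and reply with "
            'one JSON object such as {"action": "click", "x_percent": 50, "y_percent": 20} '
            'or {"action": "type", "text": "hello"}.'
        )
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": "data:image/png;base64," + capture_screen_b64()}},
                ],
            }],
            max_tokens=300,
        )
        # Real code would parse defensively; the model may wrap JSON in prose.
        return json.loads(response.choices[0].message.content)


    def execute(action: dict) -> None:
        """Translate the model's decision into a pyautogui call."""
        width, height = pyautogui.size()
        if action["action"] == "click":
            # Percentages sidestep differences in screen resolution, but the model
            # still has to estimate them accurately, which is the accuracy
            # challenge described in the note below.
            pyautogui.click(int(width * action["x_percent"] / 100),
                            int(height * action["y_percent"] / 100))
        elif action["action"] == "type":
            pyautogui.write(action["text"], interval=0.05)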
Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
At HyperwriteAI, we are developing a multimodal model with more accurate click location predictions.
We recognize that some operating system functions may be executed more efficiently with hotkeys, such as entering the browser address bar with Command+L rather than simulating a mouse click at the correct XY location. We plan to make these improvements over time. However, it's important to note that many actions require the accurate selection of visual elements on the screen, necessitating precise XY mouse click locations. A primary focus of this project is to refine the accuracy of determining these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
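As an illustration of the difference, assuming pyautogui on macOS: the hotkey route needs no coordinates at all, while the click route only succeeds if the model's estimated XY location actually lands on the target.

    import pyautogui

    # Hotkey route: deterministic, no coordinates involved.
    pyautogui.hotkey("command", "l")

    # Click route: works only if (x, y) really is the address bar on this screen.
    x, y = 640, 80  # hypothetical coordinates the model would have to estimate
    pyautogui.click(x, y)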
Below are instructions to set up the Self-Operating Computer Framework locally on your computer.
git clone https://github.com/OthersideAI/self-operating-computer.git
cd self-operating-computer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install .
Rename the .example.env file to .env so that you can save your OpenAI key in it:
mv .example.env .env
Add your OpenAI key to your new .env file. If you don't have one, you can obtain an OpenAI key from the OpenAI platform:
OPENAI_API_KEY='your-key-here'
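Optionally, before running the framework, you can confirm the key is actually visible to Python with a quick sanity check. This sketch assumes the python-dotenv package and is not part of the project's own setup steps.

    from dotenv import load_dotenv  # assumes python-dotenv is installed
    import os

    load_dotenv()  # reads the .env file in the current directory
    assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"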
Run it:
operate
For any input on improving this project, feel free to reach out to me on Twitter.
Stay updated with the latest developments: