# Hexabot NLU
The [Hexabot](https://hexabot.ai/) NLU (Natural Language Understanding) engine is a Python-based project that provides tools for building, training, and evaluating machine learning models for natural language tasks such as intent detection and language recognition. It also includes a REST API for inference, built using FastAPI.
## Directory Structure
- **/run.py:** The CLI tool that provides commands for training, evaluating, and managing models.
- **/models:** Contains the different model definitions and logic for training, testing, and evaluation.
- **/data:** Placeholder for datasets used during training and evaluation (gitignore'd).
- **/experiments:** Placeholder for trained models generated during training (gitignore'd).
- **/data_loaders:** Classes that define how datasets are loaded for the different models.
- **/main.py:** The FastAPI-based REST API used for inference, exposing endpoints for real-time predictions.
## Setup
**No dependencies needed besides Python 3.11.6, virtualenv, and TensorFlow.** Start developing your new model on top of this workflow by cloning this repository:
```bash
# Set up a virtualenv
pip install virtualenv

python3.11 -m venv env

source env.sh

pip install -r requirements.txt
```
## Usage
**Check `models/mlp.py` and `data_loaders/mnist.py` for fully working examples.**
You should run `source env.sh` on each new shell session. This activates the virtualenv and creates a nice alias for `run.py`:
```bash
$ cat env.sh
source env/bin/activate
alias run='python run.py'
```
Most routines involve running a command like this:
```bash
# Usage: run [method] [save_dir] [model] [data_loader] [hparams...]
run fit myexperiment1 mlp mnist --batch_size=32 --learning_rate=0.1
```
Examples:
```bash
# Intent classification
run fit intent-classifier-en-30072024 jisf --intent_num_labels=88 --slot_num_labels=17 --language=en
run predict intent-classifier-fr-30072024 --intent_num_labels=7 --slot_num_labels=2 --language=fr

# Language classification
run fit language-classifier-26082023 tflc
run predict language-classifier-26082023
run evaluate language-classifier-26082023
```
where the `model` and `data_loader` args are the module names (i.e., the file names without the `.py` extension). The command above would run the Keras model's `fit` method, but it could be any custom method, as long as it accepts a data loader instance as an argument.
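For intuition, the dispatch in `run.py` boils down to something like the following sketch. The export attribute and argument names here are illustrative assumptions, not the actual `run.py` code; only the overall flow (resolve the modules by name, instantiate, then call the requested method with a data loader instance) comes from the description above.

```python
# Hypothetical sketch of what `run fit myexperiment1 mlp mnist` resolves to.
# See run.py for the real logic; `default_export` is an assumed attribute name.
import importlib

Model = importlib.import_module("models.mlp").default_export
DataLoader = importlib.import_module("data_loaders.mnist").default_export

model = Model(save_dir="experiments/myexperiment1")  # boilerplate handles instantiation
model.fit(DataLoader())  # any method accepting a data loader instance works
```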
**If `save_dir` already has a model**:
- Only the first two arguments are required, and the data loader may be changed, but respecifying the model is not allowed: the existing model will always be used.
- Hyperparameter values specified on the command line WILL override previously used ones (for this run only, not on disk); see the sketch below.
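A rough illustration of that override precedence (a sketch, not the actual `run.py` logic; the values are made up):

```python
# Sketch: hparams restored from save_dir are merged with CLI flags;
# the CLI wins for the current run, and nothing is rewritten on disk.
saved_hparams = {"batch_size": 32, "learning_rate": 0.1}  # restored from save_dir
cli_flags = {"learning_rate": 0.01}                       # parsed from the command line
effective_hparams = {**saved_hparams, **cli_flags}
# -> {"batch_size": 32, "learning_rate": 0.01}
```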
### `tfbp.Model`
Models pretty much follow the same rules as Keras models, with very slight differences: the constructor's arguments should not be overridden (since the boilerplate code handles instantiation), and the `save` and `restore` methods don't need any arguments.
```python
import tensorflow as tf

import boilerplate as tfbp


@tfbp.default_export
class MyModel(tfbp.Model):
    default_hparams = {
        "batch_size": 32,
        "hidden_size": 512,
        "learning_rate": 0.01,
    }

    # Don't mess with the args and keyword args, `run.py` handles that.
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self.dense1 = tf.keras.layers.Dense(self.hparams.hidden_size)
        ...

    def call(self, x):
        z = self.dense1(x)
        ...
```
You can also write your own training loops à la PyTorch by overriding the `fit` method, or by writing a custom method that you can invoke via `run.py` simply by adding the `@tfbp.runnable` decorator. Examples of both are available in `models/mlp.py`.
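For instance, a custom runnable method could look like the following minimal sketch (the method name and body are hypothetical; only the decorator usage follows the pattern described above):

```python
import boilerplate as tfbp


@tfbp.default_export
class MyModel(tfbp.Model):
    default_hparams = {"batch_size": 32}

    # Hypothetical custom routine, invokable as:
    #   run export_preds myexperiment1 mymodel mydata
    @tfbp.runnable
    def export_preds(self, data_loader):
        for batch in data_loader():
            ...  # write predictions wherever you need them
```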
### `tfbp.DataLoader`
Since model methods invoked by `run.py` receive a data loader instance, you may name your data loader methods whatever you wish and call them in your model code. A good practice is to make the data loader handle anything that is specific to a particular dataset, which allows the model to be as general as possible.
```python
import tensorflow as tf

import boilerplate as tfbp


@tfbp.default_export
class MyDataLoader(tfbp.DataLoader):
    default_hparams = {
        "batch_size": 32,
    }

    def __call__(self):
        if self.method == "fit":
            train_data = tf.data.TextLineDataset("data/train.txt").shuffle(10000)
            valid_data = tf.data.TextLineDataset("data/valid.txt").shuffle(10000)
            return self.prep_dataset(train_data), self.prep_dataset(valid_data)

        elif self.method == "eval":
            test_data = tf.data.TextLineDataset("data/test.txt")
            return self.prep_dataset(test_data)

    def prep_dataset(self, ds):
        return ds.batch(self.hparams.batch_size).prefetch(1)
```
### API
The API is built using FastAPI: https://fastapi.tiangolo.com/
Run the dev server standalone with:
```sh
ENVIRONMENT=dev uvicorn main:app --host 0.0.0.0 --port 5000 --reload
```
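Once the server is up, you can send it inference requests over HTTP. The route and payload below are placeholder assumptions (check `main.py` for the actual endpoints); this only shows the general shape of a call:

```python
# Hypothetical inference request; the "/parse" route and payload shape
# are assumptions, not taken from main.py.
import requests

response = requests.post(
    "http://localhost:5000/parse",
    json={"text": "I want to book a flight"},
)
print(response.json())
```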
Run the project with Docker:
```sh
docker compose -f "docker-compose.yml" up -d --build
```
## Pushing models to HuggingFace
Please refer to the official HF documentation on how to host models: https://huggingface.co/docs/hub/en/repositories-getting-started
Note that big files should be tracked with git-lfs, which you can initialize with:
```sh
git lfs install
```
and if your files are larger than 5GB you’ll also need to run:
```sh
huggingface-cli lfs-enable-largefiles .
```
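Alternatively, you can push a trained experiment programmatically with the `huggingface_hub` client. A minimal sketch, assuming the package is installed, you are logged in, and the target repo already exists (the `repo_id` below is a placeholder):

```python
# Sketch: upload a trained experiment directory to the Hugging Face Hub.
# Requires `pip install huggingface_hub` and `huggingface-cli login`;
# the repo_id is a placeholder and the repo must already exist.
from huggingface_hub import upload_folder

upload_folder(
    folder_path="experiments/language-classifier-26082023",
    repo_id="your-org/language-classifier",
)
```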
## Contributing
We welcome contributions from the community! Whether you want to report a bug, suggest new features, or submit a pull request, your input is valuable to us.
Feel free to join us on [Discord](https://discord.gg/xnpWDYQMAq).
## License
This software is licensed under the GNU Affero General Public License v3.0 (AGPLv3) with the following additional terms:
1. The name "Hexabot" is a trademark of Hexastack. You may not use this name in derivative works without express written permission.
2. All derivative works must include clear attribution to the original creator and software, Hexastack and Hexabot, in a prominent location (e.g., in the software's "About" section, documentation, and README file).
3. SaaS Restriction: This software, or any derivative of it, may not be used to offer a competing product or service (SaaS) without prior written consent from Hexastack. Offering the software as a service or using it in a commercial cloud environment without express permission is strictly prohibited.