Frequently Asked Questions
Yes and no! The whole premise of modern ML is that you almost never train from scratch, so all the models in our AutoML system are indeed pre-trained deep neural networks. The AutoML system then tunes these models to your data and picks the best one. So "yes", all our training starts from pre-trained models, but "no", we don't expose these models directly to customers.
We also maintain a set of pre-trained functions that can be used either as-is or as starting points for your own functions.
All machine learning suffers when so-called "data drift" occurs. If the data the model is deployed on differs in any significant way from the data it was trained on, model performance suffers, sometimes to disastrous levels. So at the very minimum, if you are planning to use an off-the-shelf model directly, test it on your own data: pick 100 or so random samples (image, text, or tabular data, depending on what you are classifying) and annotate them with the desired output. Then check how well the off-the-shelf model does.
Here is the thing: once you have 100 or so annotated samples, you might as well tune the model using that same data! And this is exactly what we do at Nyckel. Our AutoML system uses so-called cross-validation that allows training and testing, iteratively, on the same data, making it very data efficient. That is why we are seeing such strong benchmark performance, in particular for small amounts of training data.
So tl;dr:
- Never use an off-the-shelf model without testing it on your own data
- In order to test it, sit down and annotate some of your own data
- Once you have annotated data, use some of it to tune the off-the-shelf model instead of using it directly!
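The testing step above can be sketched in a few lines of Python. This is a minimal illustration, not Nyckel's tooling: `toy_model` is a hypothetical stand-in for whatever off-the-shelf model you are evaluating, and the annotated samples are invented.

```python
from typing import Callable, Sequence, Tuple


def accuracy(
    predict: Callable[[str], str],
    annotated: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of annotated samples where the model output matches the label."""
    if not annotated:
        raise ValueError("need at least one annotated sample")
    correct = sum(1 for sample, label in annotated if predict(sample) == label)
    return correct / len(annotated)


# Hypothetical stand-in for an off-the-shelf sentiment model.
def toy_model(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


# Hand-annotated samples drawn from your own data (invented here).
my_samples = [
    ("Good value for the price", "positive"),
    ("Arrived broken", "negative"),
    ("Not good at all", "negative"),
    ("Works great", "positive"),
]

print(f"accuracy on own data: {accuracy(toy_model, my_samples):.2f}")
```

Here the toy model gets only half of the samples right, which is the kind of gap that stays invisible until you measure on your own data.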
Iterating on data is the best way to improve any machine learning system. By making the training loop as fast as possible, we allow you to do just that. Iterating can mean changing the set of labels, pre-processing the inputs, selecting the right set of input columns, and much more.
Our infrastructure runs on AWS and all your data stays on AWS in standard AWS services (S3, RDS, ECS). Your training data is never shared with other customers and is not used to train functions for anyone else. We capture "informative" invoke data to add to your training set - for example, data where the model was not confident, data similar to ones where the model made mistakes, etc. This is intended to assist with manual review to make the model better and can be opted out of. This data, like other data, is not shared with anyone else.
Select Nyckel employees have access to your data for troubleshooting and to improve model accuracy.
Small variations in estimated accuracy are expected due to how the cross-validation splits are created. The deployed model performance should be identical.
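To see why the estimate moves, here is a small sketch of k-fold cross-validation in which only the shuffle seed changes between runs. It is an illustration under assumptions, not Nyckel's actual procedure: the data is synthetic, the classifier is a toy 1-nearest-neighbor rule, and the function names are made up for this example.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, str]  # (feature value, label)


def k_fold_indices(n: int, k: int, seed: int) -> List[List[int]]:
    """Shuffle indices 0..n-1 with the given seed and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train: Sequence[Sample], x: float) -> str:
    """Return the label of the training point closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def cv_accuracy(samples: Sequence[Sample], k: int, seed: int) -> float:
    """Estimate accuracy by testing on each fold while training on the rest."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in held]
        correct += sum(
            1 for i in held_out
            if predict_1nn(train, samples[i][0]) == samples[i][1]
        )
    return correct / len(samples)


# Synthetic, noisy 1-D data: the label depends on x plus some noise.
rng = random.Random(0)
data: List[Sample] = []
for _ in range(40):
    x = rng.random()
    data.append((x, "a" if x + rng.gauss(0, 0.15) < 0.5 else "b"))

# Same data, same model, different splits: the estimates can differ slightly.
for seed in range(3):
    print(f"split seed {seed}: estimated accuracy {cv_accuracy(data, 5, seed):.3f}")
```

The data and the model never change between runs. Only the assignment of samples to folds does, and that alone is enough to shift the estimate.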
The text itself can be up to 100,000 characters. However, Nyckel's machine learning system will (currently) only learn from the first 300-500 words of each sample. See this page for more context.
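If you want to check what portion of a long document is likely to influence training, a rough sketch follows. It assumes words are separated by whitespace, which is only an approximation because the exact tokenization is not specified here, and it uses the conservative end of the 300-500 range. The function and constant names are made up for this example.

```python
MAX_CHARS = 100_000   # stated upper limit on text length
LEARNED_WORDS = 300   # conservative end of the 300-500 word range


def learned_portion(text: str, max_words: int = LEARNED_WORDS) -> str:
    """Return approximately the leading part of the text the model learns from."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text has {len(text)} characters; limit is {MAX_CHARS}")
    # Whitespace splitting is an assumption, not the documented tokenizer.
    return " ".join(text.split()[:max_words])


document = "word " * 1000
preview = learned_portion(document)
print(len(preview.split()), "of", len(document.split()), "words fall in the learned range")
```

If the signal that decides the label sits near the end of your texts, consider moving or summarizing it toward the beginning before uploading.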
Nyckel automatically re-trains functions after any change to the training data, for example when you upload a new annotated sample or change the annotation of an existing sample.
To force a re-training, simply change the label of a sample and then change it right back again.
Nyckel and Vertex AI are both end-to-end AutoML platforms. At a high level, they both support the three key aspects of the machine learning lifecycle:
- Data Engine: tools to annotate, iterate and improve the training data.
- AutoML Training: machinery to find and train a model based on your data.
- Deployment: infrastructure to deploy and serve the trained model.
However, there are many differences.
With respect to AutoML Training, we found that Nyckel trains around 100x faster but provides more or less the same accuracy. This allows users to iterate much faster on different ways to label and pre-process the data. It also helps speed up data annotation, since Nyckel's models automatically suggest labels for new samples as you go along.
With respect to Deployment, the difference is mainly in pricing. Nyckel has an elastic pricing model, based on the number of invocations. Vertex AI, on the other hand, charges an hourly fee to keep the invoke node up. Unless you turn the node off when you are not using it, this results in a cost of $1000 per month (based on prices listed September 2023).
The largest difference, however, is in the Data Engine. Vertex AI does allow in-app annotation, but it is limited and time consuming. Nyckel, on the other hand, has a fast, interactive data annotation tool that, among other things, allows you to discover and fix data errors. Nyckel also implements Invoke Capture, an automated way to help you improve your model by capturing and annotating production data.
A 409 indicates a conflict with an existing resource. Most often, this means you are trying to create a sample that already exists in the function. Since Nyckel does not allow duplicate samples, the API call is rejected.
If you are trying to add an annotation to an existing sample, use the PUT samples endpoint instead.
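The create-then-fall-back pattern can be sketched as below. This is not Nyckel's client library: `post_sample` and `put_sample` are hypothetical wrappers you would write around your own HTTP client, each returning the response status code, and the payload shape is invented for the demonstration. An in-memory fake stands in for the real API so the example runs on its own.

```python
from typing import Callable, Dict, Optional

HTTP_CONFLICT = 409


def create_or_annotate(
    post_sample: Callable[[dict], int],
    put_sample: Callable[[dict], int],
    sample: dict,
) -> str:
    """Try to create the sample; on a 409, update the existing one instead."""
    status = post_sample(sample)
    action = "created"
    if status == HTTP_CONFLICT:
        # The sample already exists, so send the annotation via PUT.
        status = put_sample(sample)
        action = "updated"
    if not 200 <= status < 300:
        raise RuntimeError(f"request failed with HTTP {status}")
    return action


# In-memory fake of the API, for demonstration only.
store: Dict[str, Optional[str]] = {}


def fake_post(sample: dict) -> int:
    if sample["data"] in store:
        return HTTP_CONFLICT  # duplicates are rejected
    store[sample["data"]] = sample.get("annotation")
    return 200


def fake_put(sample: dict) -> int:
    store[sample["data"]] = sample.get("annotation")
    return 200


print(create_or_annotate(fake_post, fake_put, {"data": "hello", "annotation": None}))
print(create_or_annotate(fake_post, fake_put, {"data": "hello", "annotation": "greeting"}))
```

The first call creates the sample. The second hits the conflict and falls back to the update path, so the annotation is attached rather than lost.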
The maximum image size is 5 MB. For larger images, consider resizing or cropping the image before uploading. Using 1024 pixels on the longest side is a good rule of thumb.
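As a sketch of the rule of thumb, the helper below computes target dimensions that cap the longest side at 1024 pixels while keeping the aspect ratio. The function name is made up for this example, and the actual resizing is left to whatever imaging library you already use.

```python
from typing import Tuple


def fit_longest_side(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    # Multiply before dividing so exact ratios stay exact.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


print(fit_longest_side(4032, 3024))  # (1024, 768)
print(fit_longest_side(800, 600))    # (800, 600)
```

Resizing usually shrinks the file well below the limit, but the limit applies to the encoded file size, so it is still worth checking the byte count after saving.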
The maximum CSV import size is 12 MB. If you need a larger import, please divide your file into smaller parts.
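One way to divide a file is sketched below using only the Python standard library. It repeats the header row in every part and keeps each part within a byte budget. The assumptions are labeled in the code: "12 MB" is interpreted as 12 * 1024 * 1024 bytes, the file is UTF-8, and the first row is a header. The function name is made up for this example.

```python
import csv
import io
from typing import Iterator, List

# Assumption: "12 MB" means 12 MiB; use 12_000_000 to be stricter.
MAX_BYTES = 12 * 1024 * 1024


def split_csv(text: str, max_bytes: int = MAX_BYTES) -> Iterator[str]:
    """Yield CSV parts that each repeat the header and fit within max_bytes."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield

    def encode(row: List[str]) -> str:
        buf = io.StringIO()
        csv.writer(buf).writerow(row)
        return buf.getvalue()

    header_line = encode(header)
    header_size = len(header_line.encode("utf-8"))
    parts: List[str] = [header_line]
    size = header_size
    for row in reader:
        line = encode(row)
        line_size = len(line.encode("utf-8"))
        if header_size + line_size > max_bytes:
            raise ValueError("a single row does not fit within max_bytes")
        if size + line_size > max_bytes:
            # Current part is full: emit it and start a new one with the header.
            yield "".join(parts)
            parts, size = [header_line], header_size
        parts.append(line)
        size += line_size
    if len(parts) > 1:
        yield "".join(parts)


# Tiny demonstration with an artificially small budget.
example = "id,label\n1,a\n2,b\n3,c\n"
chunks = list(split_csv(example, max_bytes=20))
print(len(chunks), "parts")
```

Splitting on row boundaries through the `csv` module, rather than cutting the raw bytes, keeps quoted fields that contain newlines intact.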
ExternalId
512 characters.