{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Quick start\n", "\n", "This tutorial assumes you know how to program in Python and have a basic understanding of the [pandas library](https://pandas.pydata.org/docs/index.html).\n", "\n", "To use @voc@, follow these steps:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install @voc@\n", "\n", "You can install @voc@ using pip, either in your terminal or directly in a notebook cell:\n", "\n", "```bash\n", "pip install avoca\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "Internally, @voc@ uses pandas DataFrames.\n", "\n", "The data must follow these rules:\n", "\n", "* Columns are a [MultiIndex](https://pandas.pydata.org/docs/user_guide/advanced.html#multiindex-advanced-indexing) whose first level contains the name of the compound and whose second level contains the name of the variable.\n", "* If a variable is shared among all compounds, the compound is `-`.\n", "* One variable, called `flag`, is reserved for each compound. It is used to store the flags assigned to values."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "np.random.seed(31415)\n", "\n", "df = pd.DataFrame(\n", "    np.random.randn(100, 4),\n", "    columns=pd.MultiIndex.from_tuples(\n", "        [\n", "            (\"compA\", \"area\"),\n", "            (\"compA\", \"C\"),\n", "            (\"compB\", \"area\"),\n", "            (\"compB\", \"C\"),\n", "        ]\n", "    ),\n", ")\n", "# Create an outlier to ensure we will have a flagged value\n", "df.loc[0, (\"compA\", \"C\")] = 3.0\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the QA model\n", "\n", "Many models are available; see [models](Models).\n", "\n", "In this example we will use the simplest model: {py:class}`avoca.qa_class.zscore.ExtremeValues`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from avoca.qa_class.zscore import ExtremeValues\n", "\n", "model = ExtremeValues(\n", "    # Compounds and variable to which the model is applied\n", "    compounds=[\"compA\", \"compB\"],\n", "    variable=\"C\",\n", "    # Parameters of the model itself\n", "    threshold=2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the QA model\n", "\n", "In an approach similar to machine learning, we first fit the model to the data and then predict the bad values.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fitting calculates some statistics on the data\n", "model.fit(df)\n", "\n", "# Predict the outliers\n", "outliers = model.assign(df)\n", "outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows the indexes of the bad values,\n", "but the clearest way to inspect the results is to plot the data.\n", "\n", "For this purpose we can use the `plot` method of the model.\n", "\n", "It plots the training data and the outliers."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.style.use(\"default\")\n", "\n", "model.plot()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting the flags\n", "\n", "Now that we have seen how the assigner works, we would like to apply the flags\n", "to the data so that it can be exported.\n", "\n", "For this we can use the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from avoca.flagging import flag\n", "\n", "df_out = flag(df, model, outliers)\n", "df_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the value we set at the start has now received a flag value, as expected.\n", "\n", "The flag is 0 if no flag was set. Otherwise, each flag is a power of 2,\n", "so flags are combined by adding their values together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.flag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export the data\n", "\n", "Finally, we would like to share this data further.\n", "\n", "We can use the `to_csv` method from pandas to export the data to a CSV file.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_out.to_csv(\"flagged_data.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, many programs expect their own custom flag formats.\n", "\n", "For this, `avoca` provides bindings to other software.\n", "Have a look at the [bindings](https://avoca.readthedocs.io/en/latest/bindings/index.html) to see if your software is supported."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions\n", "\n", "Here we have shown, on a toy example, how to use @voc@ to detect bad values in a dataset.\n", "\n", "Note that we used the same dataset for training and prediction. In a real scenario, you could train the model on cleaned data and then apply it to a new dataset." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }