In my first post about KNIME, “How to Set Up an Oracle Connection”, I described how to set up the connection between KNIME and an Oracle database. Now, we will take a glance at a simple classification workflow in KNIME.
First, I would like to briefly talk about one of the advantages that KNIME provides.
Whichever analytical tool we use (Python, R, SQL, etc.), we need to write code to carry out the general analytics steps below. That means, no matter how good our theoretical knowledge is, it is quite difficult to do anything without coding skills. I have no intention of denigrating coding; on the contrary, it is crucial in the data world and undoubtedly will remain so. What I want to say is that coding can sometimes be an obstacle for someone with strong theoretical knowledge. In KNIME, our dependency on coding is reduced; that is the precise way to put it. Generally, we can build analytical processes without writing any code. That means, if we have good theoretical knowledge of analytics, we are the king of the jungle!
Let’s look at the simple workflow shown below to see that this dependency on coding really is reduced.
Before building such a workflow, we need to take a look at the following generalized steps:
- Data Collecting and Understanding
- Data Preprocessing
- Modelling
- Evaluating the Model
As we can see above, we can easily build a classification model. KNIME can perform all of these steps with its nodes; all we need to do is configure them. It is quite easy if we have good theoretical knowledge.
Let’s take a closer look at these four generalized analytics steps in KNIME.
- Data Collecting and Understanding
Each analytics process starts with collecting data. Our data can be stored in different sources such as a database, .csv, .xls, .xml, etc. To handle this, KNIME has a corresponding node for each data source. Some of these nodes are the Oracle/MySQL/PostgreSQL Connector, CSV Reader, Excel Reader, and XML Reader, respectively. As each node has its own configuration dialog, we can easily collect data by using these specialized configurations.
The image above shows the data collecting and understanding part of our sample workflow. As our data is stored in an Oracle database, we’ve used the Oracle Connector node to connect to the data source. The DB Query Reader node has been used for retrieving data with a SQL query.
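For comparison, here is roughly what this step would look like in Python with pandas and SQLAlchemy. The connection details and table name are hypothetical placeholders, just a sketch of what the two nodes do under the hood:

```python
import pandas as pd
from sqlalchemy import create_engine

# Oracle Connector node: open a connection to the database
# (user, password, host, and service name below are placeholders)
engine = create_engine(
    "oracle+oracledb://user:password@host:1521/?service_name=ORCLPDB1"
)

# DB Query Reader node: retrieve data with a SQL query
df = pd.read_sql("SELECT * FROM customer_data", con=engine)
```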
For data understanding, we’ve used the Data Explorer node. With this node, we can take a rapid glance at our data, i.e., we can see the types of the attributes, how many missing values the data contains, the distribution of each attribute, etc.
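As a rough comparison, the same quick overview can be produced with a few lines of pandas, continuing with the `df` loaded in the sketch above:

```python
# Rough equivalent of the Data Explorer node's overview
print(df.dtypes)                   # types of the attributes
print(df.isna().sum())             # missing values per attribute
print(df.describe(include="all"))  # distribution summary of each attribute
```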
- Data Preprocessing
In our sample workflow, we don’t need to perform many preprocessing steps. For this data, all we need to do is convert the double-typed attributes into integers and fill the missing values. So, we’ve used the Double to Int and Missing Value nodes. Then, we’ve used the Partitioning node to split the data into training and test sets.
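For comparison, here is a minimal pandas/scikit-learn sketch of these three nodes; mean imputation is just one of the strategies the Missing Value node offers:

```python
from sklearn.model_selection import train_test_split

# Missing Value node: fill numeric gaps (mean imputation as one example strategy)
df = df.fillna(df.mean(numeric_only=True))

# Double to Int node: cast the double-typed attributes to integer
double_cols = df.select_dtypes(include="float").columns
df[double_cols] = df[double_cols].astype(int)

# Partitioning node: split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
```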
KNIME has various nodes for preprocessing, such as One To Many, String Manipulation, String to Number, GroupBy, etc. There is a node for almost every preprocessing operation that might be needed. In parallel, KNIME also has another node for partitioning, called X-Partitioner, which comes together with the X-Aggregator node for cross-validation.
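A minimal sketch of what the X-Partitioner / X-Aggregator pair does, expressed as k-fold cross-validation in scikit-learn:

```python
from sklearn.model_selection import KFold

# X-Partitioner node: split the data into k folds
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(df):
    fold_train, fold_test = df.iloc[train_idx], df.iloc[test_idx]
    # ... train and evaluate on each fold; the X-Aggregator node
    # then collects the per-fold results into one summary
```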
- Modelling
In KNIME, each algorithm is represented by its own nodes, like any other operation. Algorithms under the predictive analytics roof (Logistic Regression, SVM, …) are represented by two nodes, a Learner and a Predictor. We train our model with the Learner node. Then, we test our model with the Predictor node, using the test data that comes from the output of the Partitioning node.
In our sample workflow, we’ve used the simple Decision Tree algorithm, and one can see that this predictive algorithm is represented by the two nodes shown above.
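The same Learner/Predictor split looks roughly like this in scikit-learn; the target column name "churn" is a hypothetical placeholder:

```python
from sklearn.tree import DecisionTreeClassifier

# "churn" stands in for whatever the actual target column is
X_train, y_train = train_df.drop(columns=["churn"]), train_df["churn"]
X_test, y_test = test_df.drop(columns=["churn"]), test_df["churn"]

# Decision Tree Learner node: fit the model on the training set
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Decision Tree Predictor node: apply the model to the test set
y_pred = model.predict(X_test)
```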
- Evaluating the Model
At the end, the built model needs to be evaluated. To evaluate a model, the Scorer nodes can be used. With these nodes, we can compute the basic performance metrics (accuracy, recall, precision, Cohen’s Kappa) of the corresponding model. The Scorer node comes in two variants, Scorer and Scorer (JavaScript). The difference between them is that Scorer (JavaScript) offers an interactive output, while Scorer offers a simpler one.
One thing about the Scorer nodes: they can only evaluate models with a categorical target. If we have a model with a numerical target, as in regression, we need to use the Numeric Scorer node instead.
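For comparison, the metrics reported by the Scorer node map onto scikit-learn functions like this; macro averaging is just one choice for multi-class targets:

```python
from sklearn.metrics import (
    accuracy_score, cohen_kappa_score, precision_score, recall_score,
)

# Scorer node: basic metrics for a categorical target
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("Kappa    :", cohen_kappa_score(y_test, y_pred))

# Numeric Scorer node: for a numerical target, use regression metrics instead,
# e.g. sklearn.metrics.mean_squared_error or r2_score
```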
So far, thanks to the advantages that KNIME provides, we’ve seen that we can easily build an analytical model without any coding. Although we are not dependent on coding here, sometimes it is necessary to use it. In that case, KNIME handles it quite easily too 🙂