Model provided by Asadullah Hamzah at XeTute Technologies.

Introducing HDM-V0.1

We introduce HANNA-MLP-HDM-V0.1, version 0.1 of our Heart-Disease-Model (HDM), which uses the Multi-Layer-Perceptron architecture implemented in our native C++ HANNA library.
Despite being only 840 bytes in size, the model achieves 72.8% accuracy on the Heart Disease / Cardiovascular Dataset, which is roughly 3 MB as CSV.
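
As a sanity check on the 840-byte figure: the layer layout used below ({ 11, 11, 5, 1 }) has 11·11 + 11·5 + 5·1 = 181 weights and, assuming one bias per non-input neuron, 11 + 5 + 1 = 17 biases, i.e. 198 parameters. At four bytes per float32 that is 792 bytes, leaving 48 bytes that presumably hold the file header and layer layout. This breakdown is our reading of the numbers, not a documented description of HANNA's file format.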

What it does

The tiny model classifies whether a patient has heart disease, with a 72.8% success rate, given the following features (in this exact order):

  • Age (in days; has to be normalized)
  • Gender (binary, 0 = female, 1 = male)
  • Height (in cm; has to be normalized)
  • Weight (in kg; has to be normalized)
  • Systolic Blood Pressure (in mmHg; has to be normalized)
  • Diastolic Blood Pressure (in mmHg; has to be normalized)
  • Cholesterol (1 = normal, 2 = above normal, 3 = well above normal; has to be normalized)
  • Glucose (1 = normal, 2 = above normal, 3 = well above normal; has to be normalized)
  • Smokes (binary, 0 = doesn't smoke, 1 = smokes)
  • Alcohol (binary, 0 = doesn't drink alcohol, 1 = does)
  • Physical Activity (binary, 0 = not physically active, 1 = physically active)

The normalisation used is min-max normalisation, implemented in C++ as follows:

struct normConf { float min, delta; }; // delta = max - min
void minmaxnorm(std::valarray<float>& a, normConf conf)
{
    if (conf.delta == 0.f) // constant column: map everything to 0 and avoid dividing by zero
    {
        a = 0.f;
        return;
    }

    std::size_t size = a.size();
    for (std::size_t i = 0; i < size; ++i)
        a[i] = (a[i] - conf.min) / conf.delta;
}

Normalisation is applied per column of the dataset (one normConf per feature), not per row. At inference time the same per-column statistics have to be reused, as sketched below.
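
A minimal sketch of that inference-time step. The helper normalizeRow and the stats vector are illustrative, not part of HANNA; the per-column min/delta values would have to be saved during training and reloaded here.

#include <cstddef>
#include <valarray>
#include <vector>

struct normConf { float min, delta; }; // delta = max - min

// Apply the training-set min/delta to one raw feature row.
// Columns that were never normalized during training (the binary ones)
// keep delta = 0 in stats and are passed through unchanged.
void normalizeRow(std::valarray<float>& row, const std::vector<normConf>& stats)
{
    for (std::size_t i = 0; i < row.size(); ++i)
        if (stats[i].delta != 0.f)
            row[i] = (row[i] - stats[i].min) / stats[i].delta;
}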

Training Details

Numbers and C++ say more than a thousand words, so here's the code:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <valarray>
#include <algorithm>
#include <chrono>
#include <cmath>

#include "HANNA/HANNA.hpp"

// Sigmoid activation and its derivative. sigmoidDV receives the already
// activated output x = sigmoid(z), so the derivative is simply x * (1 - x).
float sigmoid(const float& x) { return 1.f / (1.f + std::exp(-x)); }
float sigmoidDV(const float& x) { return x * (1.f - x); }

struct normConf { float min, delta; }; // delta = max - min
void minmaxnorm(std::valarray<float>& a, normConf conf)
{
    if (conf.delta == 0.f) // constant column: map everything to 0 and avoid dividing by zero
    {
        a = 0.f;
        return;
    }

    std::size_t size = a.size();
    for (std::size_t i = 0; i < size; ++i)
        a[i] = (a[i] - conf.min) / conf.delta;
}

std::vector<std::valarray<float>> readCSV(std::string path)
{
    std::ifstream r(path, std::ios::in);
    if (!r || !r.is_open() || !r.good()) return std::vector<std::valarray<float>>(0);

    std::vector<std::valarray<float>> data(0);
    std::size_t elems = 0;

    std::string buffer("");
    std::string elem("");

    std::getline(r, buffer); // First line is header
    {
        std::stringstream header(buffer);
        while (std::getline(header, elem, ','))
            ++elems;
    }
    --elems; // first is 'id'

    while (std::getline(r, buffer))
    {
        std::stringstream row(buffer);
        std::valarray<float> rowvec(elems);
        
        std::getline(row, elem, ','); // First elem is 'id'
        for (std::size_t i = 0; std::getline(row, elem, ','); ++i)
            rowvec[i] = std::stof(elem);
        data.push_back(rowvec);
    }
    return data;
}

int main()
{
    std::chrono::high_resolution_clock::time_point tp[2] = { std::chrono::high_resolution_clock::now() };
    
    std::vector<std::valarray<float>> data = readCSV("cardio-hdd.csv");
    std::size_t rows = data.size();
    std::size_t cols = data[0].size();
    
    std::vector<std::valarray<float>> input (rows, std::valarray<float>(cols - 1));
    std::vector<std::valarray<float>> output(rows, std::valarray<float>(1));
    
    // Age, Gender, Height, Weight, ap_hi, ap_lo, cholesterol, gluc, smoke, alco, active, cardio
    std::vector<bool> normalizecol({ true, false, true, true, true, true, true, true, false, false, false, false });
    
    // Column 1: gender. The raw CSV encodes it as 1 and 2 for no obvious reason;
    // shift it down to the 0/1 encoding listed above.
    for (std::size_t row = 0; row < rows; ++row)
        --data[row][1];
    
    for (std::size_t col = 0; col < cols; ++col)
    {
        if (normalizecol[col])
        {
            std::valarray<float> coldata;
            coldata.resize(rows);
            for (std::size_t row = 0; row < rows; ++row)
                coldata[row] = data[row][col];
            
            float min = *std::min_element(&coldata[0], &coldata[0] + rows);
            float max = *std::max_element(&coldata[0], &coldata[0] + rows);
            normConf conf = { min, max - min };
            minmaxnorm(coldata, conf);
    
            for (std::size_t row = 0; row < rows; ++row)
                data[row][col] = coldata[row];
        }
    }
    
    std::size_t inputsize = input[0].size();
    for (std::size_t row = 0; row < rows; ++row)
    {
        for (std::size_t col = 0; col < inputsize; ++col)
            input[row][col] = data[row][col];
        output[row][0] = data[row][inputsize];
    }
    data.clear();
    
    tp[1] = std::chrono::high_resolution_clock::now();
    std::cout << "Read, initialized and formatted dataset in " << std::chrono::duration_cast<std::chrono::milliseconds>(tp[1] - tp[0]).count() << "ms.\n";
    
    // Layer layout: 11 inputs -> 11 hidden -> 5 hidden -> 1 output neuron.
    MLP::MLP hdm({ inputsize, inputsize, inputsize / 2, 1 });
    std::size_t epochs = 10000;
    
    hdm.enableTraining();
    hdm.lr = 0.05f;
    tp[0] = std::chrono::high_resolution_clock::now();
    hdm.train(input, output, sigmoid, sigmoidDV, epochs);
    tp[1] = std::chrono::high_resolution_clock::now();
    hdm.disableTraining();
    
    std::cout << "Took " << std::chrono::duration_cast<std::chrono::seconds>(tp[1] - tp[0]).count() << "s +- 500ms.\n";
    hdm.save("HDM.MLP");
    std::cout << "--- Lazy Test ---\n";
    
    // "Lazy" because this evaluates on the training set itself; there is no held-out split.
    std::size_t testspassed = 0;
    for (std::size_t i = 0; i < rows; ++i)
    {
        float out = 0.f;
        if (hdm.out(input[i], sigmoid)[0] >= 0.5f) out = 1.f;
        if (out == output[i][0]) ++testspassed;
        // Note: accuracy is printed relative to ALL rows, not only those tested so far,
        // which is why the intermediate percentages below climb towards the final value.
        if ((i % 7000) == 0)
            std::cout << (float(i) / float(rows)) * 100.f << "% tested | " << (float(testspassed) / float(rows)) * 100.f << "% accuracy.\n";
    }
    std::cout << "100% tested | " << (float(testspassed) / float(rows)) * 100.f << "% accuracy.\n";
    
    hdm.suicide(); // HANNA's cleanup call; frees the network's memory before exit

    return 0;
}
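
For reference, a build command along these lines should work, assuming the program is saved as main.cpp (a placeholder name) and the HANNA headers sit next to it; if HANNA is not header-only, its sources need to be added to the command:

g++ -std=c++17 -O2 main.cpp -o hdm-train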

The output on a 3.8 GHz CPU, running the code above on a single thread (formatted for readability, no numbers changed):

Read, initialized and formatted dataset in 123ms.
Took 1690s +- 500ms.
--- Lazy Test ---
0% tested   | 0.00142857% accuracy.
10% tested  | 7.29714% accuracy.
20% tested  | 14.5371% accuracy.
30% tested  | 21.88% accuracy.
40% tested  | 29.2014% accuracy.
50% tested  | 36.47% accuracy.
60% tested  | 43.7514% accuracy.
70% tested  | 51.0114% accuracy.
80% tested  | 58.3529% accuracy.
90% tested  | 65.6286% accuracy.
100% tested | 72.8871% accuracy.
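
Once trained, a single prediction needs nothing more than a normalized feature vector and hdm.out; only save() and out() appear in the code above, so loading HDM.MLP back is left to whatever mechanism HANNA provides for that. A minimal sketch, assuming the hdm object from the program above is still in scope (i.e. before hdm.suicide() is called):

std::valarray<float> patient(11); // the eleven features, in the order listed above,
                                  // already normalized with the training-set statistics
float p = hdm.out(patient, sigmoid)[0];  // sigmoid output in [0, 1]
bool predictsHeartDisease = (p >= 0.5f); // same 0.5 threshold as the lazy test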

Production

This model is not meant for production. Do not deploy it in any real-world or clinical setting.


Long live the Islamic Federal Republic of Pakistan.
Long live our alliance with the People's Republic of China, and long live the People's Republic of China.
