Some thoughts about machine learning
Posted: Fri Nov 06, 2020 7:52 pm
Over the last few years I have read a lot about machine learning and have tried to find places where we could introduce low-level machine learning into our programs to improve the user experience:
- value proposals,
- validations,
- quality control of existing data
I must admit that it is much more difficult than I thought:
- Machine learning needs a lot of correctly classified data. As long as only little data is available, the algorithm stays "stupid".
- The data has to be available where the learning algorithm runs. If that is outside the company, getting permission can be difficult under the strict European data protection rules.
- The algorithm will not always produce the results the users think they need, so it needs help. For every place where ML is used, a "supervisor" is also needed who can adjust the results.
- Setting up an algorithm for a single task may take more work than designing decision rules or letting users create rules themselves.
- Machine learning may need resources that are not always available.
- Decision trees defined manually by experienced users are often much more precise than machine-learned rules.
Nevertheless, I think that some of the strategies machine learning requires are always worth thinking about:
- A standard machine learning algorithm maps several inputs to one output (the answer). Every task that requires a decision has to be broken down into such single tasks with several inputs and one output.
- The inputs and the output can be single numbers, strings, collections of numbers, images, ...
- Input information must be reduced as much as possible using domain knowledge.
- If the inputs are time series (for example user responses at different times), then features like "last result", "average over the last results", "maximum over the last results" or "probabilities of the different results" should be calculated (see the first sketch after this list).
- Manually defined rules should not be hard-coded into the program. Instead, macros or matrices with decision-tree information stored in a database should be used (second sketch).
- Often a variant of the "nearest neighbours algorithm" is sufficient to find good proposal values in database programs: the nearest value is the database record of the current user for the currently selected end customer from the last time. If that last time is too long ago, or the current user has never worked for this end customer, it can be the most recent record for this end customer (third sketch).
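First, a minimal sketch of the time-series features in Python; the function and parameter names are made up for illustration, and it assumes the results are numeric:
[code]
from collections import Counter

def time_series_features(results, window=10):
    # 'results' holds the user's past responses, oldest first; 'window'
    # limits the features to the most recent entries (names are assumptions).
    recent = results[-window:]
    counts = Counter(recent)
    return {
        "last": recent[-1],                    # last result
        "average": sum(recent) / len(recent),  # average over the last results
        "maximum": max(recent),                # maximum over the last results
        "probabilities": {r: c / len(recent)   # relative frequency of each result
                          for r, c in counts.items()},
    }

print(time_series_features([3, 5, 5, 2, 5]))
[/code]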
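Second, a sketch of keeping decision rules as data instead of hard-coding them. In a real program the RULES rows would be loaded from a database table; the field names and results here are invented:
[code]
# Each rule is (field, operator, value, result); first matching rule wins.
# In practice these rows would come from a database table, not a constant.
RULES = [
    ("country",  "==", "DE",  "vat_19"),
    ("country",  "==", "AT",  "vat_20"),
    ("quantity", ">",  100,   "bulk_discount"),
]

OPS = {"==": lambda a, b: a == b, ">": lambda a, b: a > b}

def apply_rules(record, rules=RULES, default=None):
    for field, op, value, result in rules:
        field_value = record.get(field)
        if field_value is not None and OPS[op](field_value, value):
            return result
    return default

print(apply_rules({"country": "AT", "quantity": 5}))  # -> vat_20
[/code]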
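Third, a sketch of the nearest-neighbour fallback for proposal values; the record layout and the one-year cutoff are assumptions of this sketch:
[code]
from datetime import date, timedelta

def propose_value(records, user, customer, max_age=timedelta(days=365)):
    # records: dicts with keys "user", "customer", "date", "value",
    # sorted newest first (layout and cutoff are assumed for this sketch).
    today = date.today()
    # 1st choice: newest record of this user for this customer, if recent enough
    for rec in records:
        if rec["user"] == user and rec["customer"] == customer:
            if today - rec["date"] <= max_age:
                return rec["value"]
            break  # found one, but too old -> fall through to the next rule
    # 2nd choice: newest record of anyone for this customer
    for rec in records:
        if rec["customer"] == customer:
            return rec["value"]
    return None  # no proposal possible
[/code]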
Apart from this, the user interface often needs visible changes. The user does not want one single result that is supposed to fit everything. Instead he wants an immediate overview of how many possibilities there are and with which probabilities. And it is here that all the nice theory is reduced to simple statistics.
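For example, such an overview can be nothing more than the relative frequencies of the values seen so far, most likely first (a sketch; the payment-term values are just placeholders):
[code]
from collections import Counter

def proposal_overview(past_values):
    # All observed values with their relative frequencies, most likely first.
    counts = Counter(past_values)
    total = len(past_values)
    return [(value, count / total) for value, count in counts.most_common()]

for value, prob in proposal_overview(["net30", "net30", "net14", "prepaid"]):
    print(f"{value}: {prob:.0%}")   # net30: 50%, net14: 25%, prepaid: 25%
[/code]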
Of course machine learning is an interesting and important topic. It just depends on whether you are doing something for thousands or millions of users, or whether you have to rely on data from fewer than a hundred or a few hundred users.