Hi Isaias,
I’m truly glad you found this article useful. The question you raise is actually a great question. And if attainable, will be quite useful. But I’m afraid, unless you can create a dataset and a CNN to reflect the problem you want to solve, this won’t be able to achieve. By the statement immediately before, I mean something like this.
Say you want to train a network to output the number of eyes, nose and mouth in an image. So your CNN takes the image as an input and output 3 values (one output node for each element you want to predict). And feed images and corresponding (number of eyes, number of noses, number of mouths) in the image as the output, to train the network.
Why it is difficult to achieve this with just a standard CNN ( a CNN predicting if a person present in the image is there or not), is because of two reasons (I can think of).
1. In a practical CNN there are many layers, and we have no idea which layer learns which (though I used ideal features as the features learnt by the CNN, that’s not the case in reality)
2. Even if we find out which layer learn these features, there are many redundant features, which makes it infeasible to find out which filter actually learn the useful features.
Hope this helps! :)