Neural Networks applications in valuation of banner ad creative efficiency

This work covers the technical aspects of building a solution to the problem of predicting advertisement banner efficiency: the Rectified Linear Unit activation function, a simple neural network architecture, model trustworthiness, and visualizing convolutional neural networks.



As was mentioned previously, small datasets may be deemed a problem for some deep learning models. More often than not, having more data is better than having less, so researchers usually try to get as much data as possible. However, this is not always easy and comes at a cost of time, money, and other limited resources. As was previously mentioned, this is the case in some areas of medicine and, in the case of this work, in advertisement banners. In the process of creating advertisements many samples are produced, but perhaps only two or three ever get a chance to become an advertisement that will be viewed or clicked. Medical image recognition uncovers some interesting challenges in gathering data, which will hopefully be overcome in the near future. These challenges include, but are not limited to, disease rarity, the privacy of patients, the need for expertise to label an image, and the effort and machinery required to actually acquire the data. These obstacles have led to many studies on using generative adversarial neural networks to augment medical image datasets.

As a small commentary on data augmentation: when we are dreaming or imagining things, we can still imagine a purple cat. This is outside the scope of this paper, but it can be argued that purple cats do not occur often in nature. Yet one can definitely be created in a person's brain, and this purple cat is still “labeled” as a cat. So even though purple cats are hard to come by, one can imagine one and still consider it a cat. Thus, with a little philosophical flavor, we can argue that we are trying to train our model to spot this “catness”, even if the data is augmented and does not appear in reality.

Moving forward to data augmentation techniques, some of the most popular will be discussed, beginning with geometric transformation techniques such as vertical and horizontal flipping, rotating, cropping, and translation. These are considered the easiest and safest image augmentation techniques because they preserve the label after transformation. Still, one should not be hasty and should run a sanity check that this holds for their particular problem and data. A cat, dog, bike, motorcycle, human, and many other things will remain themselves when rotated, meaning they will preserve their “label”. Numbers, on the other hand, have the special case of six and nine. If you flip a nine it becomes a six, and a six becomes a nine. If flipping is carelessly applied to the MNIST dataset, even an advanced convolutional neural network will begin having trouble differentiating between nines and sixes, as it will become confused over why there are sixes labeled as nines and vice versa. To overcome this, additional labels would have to be created, such as “rotated six”. This would come at a cost of time and manual labor, so this augmentation technique is better avoided for this particular problem. Unfortunately, even with the safest techniques, label distortion problems should be considered. There is no universal method to catch this problem, so some upfront planning and analysis should be conducted to be sure augmentation will not cause trouble. One technique that has improved results the most on CIFAR-10 and ImageNet is horizontal flipping; it is in fact much more common than vertical flipping. SVHN is another text recognition dataset where flipping does not preserve the label.
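As a minimal illustration of the flipping technique - assuming, as elsewhere in this work, that an image is stored as a NumPy array of shape height by width by channels - both flips are one-line operations:

```python
import numpy as np

def flip_augment(image: np.ndarray) -> dict:
    """Return horizontally and vertically flipped copies of an H x W x C image."""
    return {
        "horizontal": np.fliplr(image),  # mirror left-right
        "vertical": np.flipud(image),    # mirror top-bottom
    }

# Example on a dummy 224 x 224 RGB image
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
flipped = flip_augment(image)
print(flipped["horizontal"].shape)  # (224, 224, 3)
```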

Digital images are usually represented as a three-dimensional tensor, the dimensions being height, width, and color channels. The color channels are, the majority of the time, RGB - red, green, and blue. Color augmentation is yet another highly effective and very practical strategy. It is also very easy: augmentations may include color layer isolation, meaning setting two of the red, green, and blue color channels to zeros. In addition, since these layers are matrices, the brightness of the image can easily be increased or decreased through simple element-wise matrix operations. Such operations lie at the heart of most photo editing applications widely used today.
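A hedged sketch of channel isolation and brightness adjustment on such an RGB tensor might look as follows (the array sizes and the brightness delta are illustrative assumptions):

```python
import numpy as np

def isolate_channel(image: np.ndarray, keep: int) -> np.ndarray:
    """Zero out all color channels except `keep` (0 = R, 1 = G, 2 = B)."""
    out = np.zeros_like(image)
    out[..., keep] = image[..., keep]
    return out

def adjust_brightness(image: np.ndarray, delta: int) -> np.ndarray:
    """Add a constant to every pixel, clipping to the valid 0..255 range."""
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
red_only = isolate_channel(image, keep=0)
brighter = adjust_brightness(image, delta=40)
```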

Rotation does exactly what it sounds like - it rotates the image data left or right by 1 to 359 degrees. Contrary to the flipping technique, rotation can in fact be label preserving for textual image data. Referring to the MNIST dataset, if a six or a nine is rotated by 1 to 30 degrees, it can still be said to preserve its label. However, rotating by more degrees may in fact distort the label. If used conservatively, this can be deemed the safest label-preserving augmentation technique.
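A small sketch of such conservative rotation, assuming SciPy is available and the angle is kept within a label-preserving range:

```python
import numpy as np
from scipy import ndimage

def rotate_augment(image: np.ndarray, max_degrees: float = 15.0) -> np.ndarray:
    """Rotate the image by a random small angle, keeping the original size.

    Small angles (here up to +/-15 degrees) are assumed to preserve the label;
    larger rotations risk turning a six into a nine on digit data.
    """
    angle = np.random.uniform(-max_degrees, max_degrees)
    # reshape=False keeps H x W fixed; the vacated corners are filled with black
    return ndimage.rotate(image, angle, reshape=False, mode="constant", cval=0)

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
rotated = rotate_augment(image)
```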

Cropping is another technique used in data augmentation. Usually, the whole image is not required to preserve its label. An object of interest may be cropped out of an image and safely used as a separate data point. For example, a penguin in a 224 by 224 pixel image can be cropped down to a 200 by 200 pixel image, and this can safely be treated as a new data point on one condition - the cropping must not alter the label. An example of label-altering cropping would be the number eight with its upper circle cropped off: an eight with only one circle is just a zero, and in that case the label has been altered. The more the image is cropped proportionally, the more one should worry about the label.
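A minimal random-crop sketch; the 224-to-200 pixel figures mirror the penguin example above and are otherwise arbitrary:

```python
import numpy as np

def random_crop(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Cut a random out_h x out_w window out of an H x W x C image.

    The caller is responsible for checking that the crop still shows enough of
    the object of interest to keep its label.
    """
    h, w = image.shape[:2]
    top = np.random.randint(0, h - out_h + 1)
    left = np.random.randint(0, w - out_w + 1)
    return image[top:top + out_h, left:left + out_w]

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
crop = random_crop(image, 200, 200)  # the 224x224 "penguin" becomes a 200x200 sample
```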

It also serves well if the image is shifted positionally. If a model is only trained on perfectly centered images, as in MNIST, it may have trouble recognizing real-world examples where the object is located at the edge of the image or positioned in some other way. When the image is translated, the remaining space is filled with either random noise or some specific value from 0 to 255. It should also be noted that translation and cropping are similar in the sense that a cropped object can also end up dislocated relative to the pre-augmentation image.
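A sketch of translation with the vacated area filled by noise or a constant, under the same NumPy-array assumption:

```python
import numpy as np

def translate(image: np.ndarray, dy: int, dx: int, fill="noise") -> np.ndarray:
    """Shift the image by (dy, dx) pixels and fill the vacated area.

    fill="noise" uses random values in 0..255; any integer fills with that constant.
    """
    h, w = image.shape[:2]
    if fill == "noise":
        out = np.random.randint(0, 256, size=image.shape, dtype=image.dtype)
    else:
        out = np.full_like(image, fill)
    # Source and destination windows that stay inside the frame
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
shifted = translate(image, dy=10, dx=-5, fill=0)
```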

Noise injection is a technique that injects values drawn from a Gaussian distribution into the image matrices. When noise is added to the image, it makes the model focus on higher-level features rather than on pixel-level details, so the model generalizes better. This method is especially good at helping convolutional neural networks learn.
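A minimal noise-injection sketch; the standard deviation value is an arbitrary assumption:

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Inject zero-mean Gaussian noise with standard deviation `sigma`."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
noisy = add_gaussian_noise(image, sigma=15.0)
```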

So far, geometric transformations have been serving us really well in solving positional biases in the image training data. Even if we have a whole dataset of completely centered images, this can be overcome by geometric transformation methods that change the position of the object in the image. Thus the model will learn to recognize the object and its features even in uncommon positions. Other advantages of geometric transformations are that in most cases they do not alter the data label and that they are really easy to implement. Implementing these augmentations requires much less computing power than other methods. However, there are also disadvantages that should be noted. First of all, transformations will occupy additional memory, which is a trade-off that should be considered before implementing them. If one runs out of memory while creating augmentations for the model, other operations might become unavailable due to lack of resources. Second, geometric augmentations will incur computing costs. Even though these are small compared to heavier transformations, they should not be neglected, as computing power, as well as budgets paying for the cloud, can run out unexpectedly. Third, of course, if the model needs to process more data, the training time will also increase. Another disadvantage is that if random cropping is applied, one may need to manually look over many images in the dataset to make sure the label is not altered by the augmentation, which can be costly, time-consuming, opinionated, and prone to human error. Once again referring back to the medical imaging example - some scans may be quite hard to interpret as it is, and additional interference will only make it worse. For example, if one small piece of a tumor is cut out of the scan, a medical expert might no longer be able to say that the patient is ill. For that matter, it will also take the time of a highly paid expert, meaning money for the researchers. It should also be noted that if a group of tumors is displaced relative to one another, as could happen in translation augmentation, it could alter the diagnosis. While it seems a very elegant way to generate more data and help train the model to detect objects regardless of their position in the image, one should keep in mind that geometric transformations have their drawbacks and cannot be applied in all domains.

In this paper, only geometric transformations were used. The main motivation for their use was that they do not affect the label. However, other data augmentations were also considered. Only those that were considered will be discussed in detail, along with the motivation for why they were not used in the end. Others worth noting will only be mentioned, to draw a clear picture of what could have been done.

As was mentioned previously, images are encoded as three-dimensional tensors, one dimension being color - red, green, and blue. So this tensor consists of three stacked matrices, one for each corresponding color. Each value in these matrices is between 0 and 255, corresponding to a color intensity. Sometimes these values are biased by the lighting in which the image was produced. In most scenarios, such as a face recognition task, lighting should not be an issue; nevertheless, it is one of the most common biases in image recognition problems. To make the model more efficient in recognition tasks, it should be trained to overlook lighting biases and extract the important features which would actually determine the presence of an object in the image. Thus, by augmenting an image to appear in different shades, and given the label is still preserved, the model will be strengthened by more examples and will perform better on new data, which may come not as clean as the data provided in the training dataset. The transformation is also quite quick to achieve by adding or subtracting constant values to pixels in the matrices. There is also a way to manipulate color by splicing out the red, green, or blue matrices individually. Another interesting approach is to restrict the minimum and maximum values of colors. Image editing programs not only provide filters but can also change the color characteristics of an image. The idea of color transformation can lead to many interesting strategies, endless in their variations, so there is a lot of freedom for testing and creativity. This technique, however, was rejected because it is deemed to alter the label of our data. It is assumed that the same advertisement creative with a different color scheme is actually a different advertisement. It is also worth mentioning that in practice advertisements with different color schemes are usually tested against each other, as it is assumed from the start that coloration impacts click-through rate. Thus it was decided to leave this technique aside, because it cannot be safely assumed that a creative would still have an above-average click-through rate if its colors were altered. Unfortunately, data with different colors would have had to be present in the dataset from the beginning if one were ever to conclude that certain colors perform best over the whole set of creatives.

Another way to use color transformation is to convert an RGB image into a grayscale image. This means that it will no longer be a three-dimensional tensor but a two-dimensional one - a matrix whose dimensions are height and width. Roughly speaking, this gives the model three times less data to process and thus faster training. However, this comes at a cost of reduced accuracy (Chatfield, Simonyan, Vedaldi, Zisserman, 2014) on popular datasets like ImageNet and PASCAL VOC. Apart from RGB and grayscale, there are multiple other ways to represent color, like HSV, YUV, CMY, and others.
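A sketch of the grayscale conversion described above, using the common luminance weights (an assumption; other weightings exist):

```python
import numpy as np

def to_grayscale(image: np.ndarray) -> np.ndarray:
    """Collapse an H x W x 3 RGB image into an H x W matrix.

    Uses the common luminance weights (0.299 R + 0.587 G + 0.114 B).
    """
    weights = np.array([0.299, 0.587, 0.114])
    return (image.astype(np.float64) @ weights).astype(np.uint8)

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
gray = to_grayscale(image)
print(gray.shape)  # (64, 64): a third of the original data volume
```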

Color transformations share the same disadvantages as geometric transformations, like increased training time, memory requirements, and computation costs. In addition, color transformations may ignore the important role color plays in the label. For example, when classifying snakes, some species differ significantly by color rather than by form, so altering the color will corrupt the label. Simulating a dark environment in an image can also distort a label, because it will no longer be clear what is in the image. In medicine, once again, certain color palettes of blood may indicate different health conditions. So it should be noted that while color transformations remove color biases, color may still be an important feature, and removing it will only confuse the model further.

The kernel filter is a method that lies at the very heart of sharpening and blurring images. It is a small equidimensional matrix that slides across an image and multiplies the underlying color values by its own entries in order to achieve the desired effect. Kernel filters that blur images can make a model more resistant to missing a label in motion, while sharpening an image may reveal additional features of an object. In spite of having achieved an increase in accuracy on CIFAR-10 (Kang, Dong, Zheng, Yang, 2017), it is a relatively unexplored area of image augmentation. It also shares a weakness with the internal layers of a convolutional neural network: it is not clear which kernel filter values should be used to augment image data, whereas model parameters learn how to represent an image from layer to layer. Thus it can be argued that it is better to implement kernel filters as a layer of the convolutional neural network rather than as an augmentation of the training dataset. This technique was a candidate to be used on the dataset for this work; however, due to it being a rather unexplored area and thus carrying a risk of altering labels of advertisement banners, it was ruled out. One may argue that applying the wrong kernel filter may alter an image so much that it no longer belongs to its click-through rate category. Perhaps the sharpness or the blurriness of an image is a significant attribute affecting the click-through rate of the banner.
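A hedged sketch of kernel filtering as a dataset-level augmentation, using a box blur and a standard sharpening kernel (the specific kernel values are illustrative assumptions):

```python
import numpy as np
from scipy import ndimage

# A 3x3 box blur and a common sharpening kernel
BLUR = np.full((3, 3), 1.0 / 9.0)
SHARPEN = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float64)

def apply_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over each color channel independently."""
    channels = [
        ndimage.convolve(image[..., c].astype(np.float64), kernel, mode="reflect")
        for c in range(image.shape[-1])
    ]
    return np.clip(np.stack(channels, axis=-1), 0, 255).astype(np.uint8)

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
blurred = apply_kernel(image, BLUR)
sharpened = apply_kernel(image, SHARPEN)
```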

The last data augmentation technique that was considered for this work and did not make it in is random erasing. Random erasing can be considered analogous to the dropout regularization technique, but applied at the input data level rather than at the neural network architecture level. This technique helps a model battle occlusions - basically, anything which hides an object's features, like an obstacle between a camera and an object. For example, it can be a leaf covering the head of a fox in an image. Thus a model is trained to spot more general features of a label, making it more resilient to occlusions. Therefore, the model will be able to generalize better and avoid overfitting on certain image features. It is also worth mentioning that, given the label is preserved, random erasing makes a model pay attention to the image as a whole rather than to a part of it, because that part might get erased in the process and the model will have to find other ways to identify the label.
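A minimal random-erasing sketch; the maximum erased fraction and the noise fill are arbitrary assumptions:

```python
import numpy as np

def random_erase(image: np.ndarray, max_fraction: float = 0.25) -> np.ndarray:
    """Blank out a random rectangle covering up to `max_fraction` of each side."""
    out = image.copy()
    h, w = image.shape[:2]
    eh = np.random.randint(1, int(h * max_fraction) + 1)
    ew = np.random.randint(1, int(w * max_fraction) + 1)
    top = np.random.randint(0, h - eh + 1)
    left = np.random.randint(0, w - ew + 1)
    # Fill the erased patch with random noise (a constant gray also works)
    out[top:top + eh, left:left + ew] = np.random.randint(
        0, 256, size=(eh, ew) + image.shape[2:], dtype=image.dtype
    )
    return out

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
erased = random_erase(image)
```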

Of course, random erasing comes with disadvantages. They are similar to the disadvantages of cropping, because in both cases a part of the image is taken out. As with cropping, labels may not be preserved if a certain element in an image is erased. For example, if the tail of the number nine is erased, it can be interpreted as a zero, based on its remaining loop. The same goes for a six being interpreted as a zero, a nine interpreted as a three, and so on. Thus it becomes not so good a choice for data augmentation when one is dealing with textual imagery. In other tasks involving very small details whose presence changes the label, one must also approach random erasing with caution. As with cropping, when generating randomly erased imagery one may have to manually look through the generated data to identify whether there has been any label distortion, which can be costly in both time and money.

So far, all the augmentations that were considered have been listed; however, this is not the full list of data augmentation techniques. Only the simplest augmentations were listed and considered, for a few reasons. For starters, most techniques were not used out of fear of altering a label and thus not positively affecting the model's training. Other reasons included the inability to determine whether the label is preserved or not. Since the task is to identify features affecting click-through rate, it is not known upfront what these features are; because of that, if a feature is somehow removed but the label is not changed, it will just confuse the model, which may then fail to identify this feature in other images. Another factor is the absence of resources to inspect every single data point to make sure that the label is preserved in the process, because there are many data points which would have to be inspected. With flipping and rotation, one can be sure that the label is preserved in our case, because they do not alter the perception of the advertisement, and it can be argued that they serve their purpose best - helping to remove positional bias in the image.

It is also worth noting that augmentations can be combined. Nothing stops one from rotating an image and changing its color, or rotating it and flipping it vertically and horizontally. This provides nearly infinite possibilities to inflate a dataset. Of course, it should be approached with some caution. For starters, as was mentioned previously, the label may become altered. Another reason is that if the initial pre-augmentation dataset is relatively small, the model may overfit on features that otherwise would not signal a particular label to the model. In other words, data augmentation does not improve the quality of data, but rather helps the model look at it from different angles.
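A small sketch of how such combinations could be chained; the particular pair of transforms is only an example:

```python
import numpy as np

def compose(*transforms):
    """Chain several augmentation functions into one."""
    def apply(image: np.ndarray) -> np.ndarray:
        for t in transforms:
            image = t(image)
        return image
    return apply

# Rotate by 90 degrees, then flip horizontally; each combination yields yet another sample.
augment = compose(lambda img: np.rot90(img), np.fliplr)
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = augment(image)
```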

Now let's take a look at more complex data augmentation techniques. They will not be discussed as thoroughly, but are worth mentioning as they represent some very interesting approaches to image data augmentation and also point towards a very important notion: data can be augmented not only in the input space, that is, at the dataset level where the data is first augmented and then fed into the model, but also at the model level, or even outside the model entirely.

One such technique is feature space augmentation. As has been empirically shown, convolutional neural networks are really good at mapping inputs to low-dimensional representations. Thus, in sequential neural networks, these intermediate representations can be separated and isolated to refine the individual layers after model training is complete. A particular technique in feature space augmentation is called the Synthetic Minority Over-sampling Technique, abbreviated as SMOTE (the TE standing for TEchnique). It is used to deal with class imbalances so that the model does not overfit on the labels prevalent in the dataset. This technique has been shown to improve loss on popular datasets (Tomohiko, Michiaki, 2018). However, it is not a technique one should use if they are looking for interpretability of the augmented data. It is possible to extract augmented images, but it can take a lot of computing power, and it also requires a lot of computing power while training. On both of these accounts, it was ruled out in this work.
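A toy, NumPy-only sketch of the SMOTE idea - interpolating between a minority-class feature vector and one of its nearest neighbors - rather than the full published algorithm:

```python
import numpy as np

def smote_like_oversample(minority: np.ndarray, n_new: int, k: int = 5) -> np.ndarray:
    """SMOTE-style sketch: each synthetic point is an interpolation between a
    minority-class feature vector and one of its k nearest neighbors."""
    new_points = []
    for _ in range(n_new):
        i = np.random.randint(len(minority))
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = np.random.choice(neighbors)
        lam = np.random.rand()                  # interpolation factor in [0, 1)
        new_points.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(new_points)

# e.g. twenty 128-dimensional CNN feature vectors of the under-represented class
minority_features = np.random.rand(20, 128)
synthetic = smote_like_oversample(minority_features, n_new=40)
```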

And finally, one very fascinating technique is modeling using generative adversarial networks. Simply put, one uses another trained neural network to generate additional data for the original dataset. Currently, it is one of the leading techniques in both performance and speed. In the context of this work it was ruled out for several reasons, but one sufficient reason is the lack of input data, as well as the absence of a pre-trained neural network that could generate advertisement creatives.

Having so much to choose from is an opportunity to train a model to view the data from different angles and help it generalize better. However, one should not rush and apply every available data augmentation technique out there. Label distortion, discussed at length previously, is the first reason. But even if the augmentations do not alter the labels, there are other problems that may be encountered. Let's take a look at a particular example. Assume one has a dataset consisting of a hundred images of giraffes and a hundred images of deer, and it was decided to apply multiple color augmentations - say, a hundred color variants. Now the dataset has grown a hundred times, consisting of giraffes and deer in a lot of different colors. But now there appears to be another problem: since there is a very low variety of original photos, a model trained on the post-augmentation dataset may overfit on image-specific features. This is self-explanatory - the model has seen the same image with the same spatial features a hundred times per original image, and now it may have trouble classifying new images. Given this example, one may argue that too many color augmentations were generated and that fewer would have helped, and they could be right. This brings up a very important question of how one should know when to stop generating new data. Unfortunately, there is no single opinion on the best ratio of augmented data to original data. It is very important to consider the nature of the dataset and the problem at hand. If the dataset is imbalanced, further data generation will only amplify this imbalance, leading to the model overfitting on the prevalent labels. Augmentation cannot correct inherent problems which lie within the dataset itself. To use data augmentation techniques and be sure they will not make things worse, it is best to make sure that both testing and training data are drawn from the same distribution.

To conclude, data augmentation is a very useful technique for improving a model's performance, but it should not be used out of context and should be applied in moderation and with caution; otherwise the dataset may become cluttered with undesirable data points, which will be quite hard to catch, as doing so will most likely require manual labor and is prone to human error.

Visualizing convolutional neural networks

In this section of the work, explanations of deep learning models will be discussed, as well as a particular method called LIME. LIME stands for Local Interpretable Model-agnostic Explanations. In this work, model predictions will be explained using LIME through visualization. Explaining deep learning models is a very important topic that is unfortunately often overlooked, and it will be discussed why it should not be this way.

Deep convolutional neural networks have shown excellent performance in classifying image data. The first tasks to prove this were the famous hand-written digit classification problem - the classical MNIST dataset - and facial recognition problems. Of course, it did not stop there, as convolutional neural networks have shown outstanding results in many other image classification tasks; take the CIFAR-10 dataset as an early example. Since then, even more progress has been made.

Results based on CIFAR-10 were first published in 2010 in the paper “Convolutional Deep Belief Networks on CIFAR-10” by Alex Krizhevsky. At that time, an error rate of 21.1% was achieved by the model discussed in the paper. As of 2018, this error rate had been reduced to merely 1.0% (Huang, Cheng, Bapna, Firat, Chen, Chen, Lee, Ngiam, Le, Wu, Chen, 2019) using a new state-of-the-art architecture. However, convolutional neural networks for solving image recognition problems were introduced way back in 1989 (LeCun, Bottou, Bengio, Haffner, 1989). Nevertheless, it took twenty or even thirty years before convolutional neural networks began to be widely used. There are several reasons for this. One is that as time progresses, so does technology, and nowadays we have access to much more powerful GPUs, practically available on demand and without the need to buy one physically. Another reason is the availability of data, as well as opportunities to get a lot of labeled data of many kinds.

Now that convolutional neural networks have become available due to progress in technology and data gathering, and show very high performance on image recognition tasks, they are very widely used. However, in spite of being available for use and showing great results, it is still not very well understood how they actually achieve them. Unfortunately, deep convolutional neural networks are very complex models. Some simple architectures may account for hundreds of thousands of parameters, not to mention that really large neural network architectures can have far more than that. Taking this into account, one should not expect to understand the inner workings of a convolutional neural network as easily as they would grasp a linear regression model. Not in the image recognition problem, of course, but in continuous-value prediction problems this is why linear regression is sometimes chosen over a neural network - it is very easy to interpret the result of linear regression once the weights are determined. Managers working with data, if they need a quick prediction, would just need to add together the feature values multiplied by their respective weights in order to get the result. This will most likely take a person less time on a phone calculator than asking a neural network developer to run a prediction with a neural network model. Linear regression may not turn out as precise as a neural network model, but it is much easier to interpret a few weights in the regression model than thousands of parameters in a neural network.
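A tiny, purely hypothetical illustration of such a by-hand linear regression prediction (the weight and feature values are invented for the example):

```python
# Hypothetical fitted weights of a linear regression and one new observation
weights = {"budget": 0.8, "impressions": 0.002, "weekday": -1.5}
bias = 3.0
features = {"budget": 120.0, "impressions": 45_000.0, "weekday": 5.0}

# The prediction is simply the bias plus each feature value times its weight
prediction = bias + sum(weights[name] * features[name] for name in weights)
print(prediction)  # 3.0 + 96.0 + 90.0 - 7.5 = 181.5
```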

Scientifically speaking, if one is unable to interpret their results, they cannot be sure whether they are satisfied with them. In this scenario, the only way to understand if the model is performing as expected is through trial and error, which in some cases can be a very expensive procedure. For example, it is highly undesirable to test the model on patients when human life is involved. A developer creating the neural network engine should at least have a high-level idea of why and on what basis a model yields specific results; otherwise, the cost is too high. Apart from medical applications, there are practices of using machine learning in fighting crime (https://www.technologyreview.com/2019/02/13/137444/predictive-policing-algorithms-ai-crime-dirty-data/), terrorism (Singh, Chaudhary, Kaur, 2019), and many other areas where an oversight in reasoning can lead to huge losses.

Given all that, a very important concept arises: without the trust of a human, a machine learning model, no matter how good it is at the task, will simply not be used. Therefore a more philosophical question comes into play - how does one trust a model? Without answering this first, it would be hard to understand whether one would be satisfied with the model's results. It is important not only that a machine learning model provides a correct answer, but also on what basis it provides it. So a human has to acquire proof that the model came to a conclusion based on something they can agree upon, instead of just faithfully trusting the power of a black box.

Two intertwined reasons for the need for model interpretability have already been given: the cost of error can be high and even involve human lives, and a model should be trusted by the user. There is also another reason which should not be overlooked. While the model may show great results in the development phase on training and testing data, its ultimate goal is to actually perform well in production, meaning in the real world. Once a model is deployed, it is highly likely to receive less clean data than it did during training. Of course, as was previously discussed in this work, there are ways to condition our model, for example by creating data augmentations, adding noise, and so on. But no matter how creative the developers were in constructing the convolutional neural network architecture or how great the dataset was, in production the model will very likely encounter data that differs from the data it was trained on. Be it an unexpected angle or shape of an object, the image may still have features that would signal to a human what the object is. Thus it is important to reach consensus beforehand on how the model comes to a conclusion about the label.

For the purposes mentioned above, visualization techniques were introduced to understand which features drive a neural network to produce specific outputs. This goes as deep as finding the low-level features that motivate a model to produce an output at the individual layer level. These techniques also enable one to observe how these features evolve during training and help reveal problems in the model.

When speaking of interpreting how a model came to a certain prediction, it is meant that one is looking for a visual, textual, or even audio artifact which gives the user and the developer of the model a qualitative understanding of how the model's prediction relates to the features on the basis of which the prediction was made; if that explanation is sensible, then one can decide to trust the model. As was mentioned before, this is essential for building trust between models and the humans benefiting from their predictions.

Now it has to be pointed out that a human judging whether the model and its conclusions can be trusted should usually have prior experience and knowledge in the domain of application; otherwise they will not be able to understand the reasoning the model's visualization is providing and therefore will not be able to determine if the model is to be trusted. It is also worth noting that not only should the user as a service provider be able to determine whether the model's prediction is well justified, but the end user, a consumer, may also want to know on what basis the system determined its result. For example, users were more likely to accept a system's film recommendations if they were given the reasoning (Herlocker, Konstan, Riedl, 2000).

Since the notion of trust has been introduced in this work, a question now arises: how to measure it? At the development stage this trust is quantified by metrics. It can be safely said that 90% accuracy is better than 80%, other things being equal. So one has to introduce criteria for trust in order to evaluate it. In this section, there will be references to experiments shown in the article “Why Should I Trust You?” Explaining the Predictions of Any Classifier by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. In their article they show a model which predicted flu, and the model gave an explanation of the basis on which it made this prediction. The data considered for the decision were whether the patient had sneezing, a headache, or an absence of fatigue, as well as their age and weight. The model showed that its prediction was positively driven by the patient having a headache and sneezing, while the absence of fatigue drove the probability of such a prediction down. To a doctor this should make a lot of sense, because sneezing and headaches would indeed be considered symptoms of the flu, while the absence of fatigue may argue against the patient having the flu. The doctor might have been surprised to find out that the model took into consideration the age and weight of the patient, because in their experience flu is not an illness common to certain age and weight groups. Thus a doctor may conclude whether they would trust this specific model or not. This example shows that it is important to provide some context for determining the trustworthiness of a model, and thus no single metric can be created to assess trust, because the tasks and problems can be very different.

However, even when a model's trustworthiness is evaluated empirically, some things may go wrong in the process. First of all, there can be what is called “data leakage”. That is an injection of data which would otherwise not appear in the dataset or in the production phase. For example (Kaufman, Rosset, Perlich, 2011), patients' IDs were input into a model and happened to correlate heavily with a particular class in both the testing and training datasets. Normally, this should not happen, because an ID is assigned either arbitrarily or sequentially and should not correlate with any measurement, at least not unless it was intended from the beginning. Such a problem would not be easily tracked by observing input and output data alone, and, as with the example of a doctor and patients having the flu, it is highly desirable to have an explanation of why a model made a certain prediction. With that at hand, it would take only one inspection to identify the problem.

Previously it was discussed that training and testing data should be drawn from the same distribution in order for a model to be able to predict successfully with high accuracy. There exists a problem where the testing and training datasets are quite different from one another, and the model's accuracy may become quite low on the testing data. In that situation, the developer may start looking for a problem in the architecture of the model, and even doubt that the task can be completed, while the problem was actually in the data split. This is referred to as dataset shift.

All in all, the model can be performing very well, having very high accuracy on training and testing data, and also performing very well in the field - in production - with no inherent data problems present in the dataset. So the only thing left for the developer is to identify the best architecture for the model. Usually developers decide between two or three architectures. It would only make sense to choose the one with the best accuracy, but this is not what actually happens in a business environment. Most of the time a developer should consider not only model metrics but business metrics too. Let's look at an example provided in the article “Why Should I Trust You?” Explaining the Predictions of Any Classifier, where the authors consider a case when a high-accuracy model recommends the most “clickbaity” articles to users. Clickbait refers to a form of false advertising, motivating a user to click on an article with a misleading heading. While it is true that the click-through rate for clickbait article headlines is higher and it would increase article views in the short run, in the long run having more clickbait articles on a website would drive users away and user retention would fall. User retention is most likely a highly important metric for any type of business, and having a very accurate model that nonetheless drives users away defeats its purpose. Thus one can argue that a less accurate model would do better if it does not include a lot of clickbait in its recommendations. In short, a less accurate model with better explanations is better than a more accurate model with the wrong reasoning.

When one speaks of explanation methods, one should first develop a notion of what global characteristics to look for in an explanation. For starters, that can be the interpretability of an explanation, which is, of course, obvious, but that does not take it off the list. Interpretability could be defined as a qualitative, reasonable understanding of the relationship between input and output data. Thus, if an explanation is to be interpretable, it should be interpretable by its user - in the case of this work, a human being. Some explanation models, such as additive models, linear models, or a gradient vector, may therefore not be interpretable depending on the given task. There may be a case when quite a lot of features, say thousands or even more, contribute to a prediction; it will be quite challenging for a user to inspect all these features without getting lost, let alone interpret them as a whole. This leads to shaping the definition of interpretability to include the ease of understanding the explanation. Users' knowledge and experience should be taken into account as well.

After it is established that the explanation should be interpretable, one should also have faith in this explanation. This means that a user should believe the explanation was reached through the same kind of logical reasoning they would use themselves. This is called local fidelity. It is local because, unless a user is able to interpret how the model came to such a conclusion at every level of the model - that would be global fidelity - they only get a notion of it at one particular level. Global fidelity is considered a huge challenge in complex models.

Assuming that all models are inherently interpretable, there should be a particular method for each model that can explain it. However, there should also be a way to explain any model; such an explainer would be called model-agnostic. It is also worth mentioning that a global perspective on the model and its predictions should be provided. As was shown previously, model accuracy may not be the main metric one would want to optimize, which is why it is important to determine whether the metric optimized by the model correlates with the purpose of creating the classification (or any other) model. One such method is briefly discussed further below.

In this work, a method called LIME (Local Interpretable Model-agnostic Explanations) will be used to explain the results of our model. This method, like its peers, has a lot of mathematics behind it explaining how it actually works. This will be left out of this work, and anyone interested should refer to the previously mentioned article “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Nevertheless, the intuition for how this method works will be provided so that the reader understands what is going on before the results of this work are introduced using the LIME method.

In order to understand how the LIME method works, let's first imagine that there is no training data and no model, but only a black box which outputs a prediction when given input data. There is no limit to how many times one can do this; it can go on as much as desired. This is done in order to understand how the model produces a certain output. The LIME algorithm observes these predictions as certain data variations are fed into the model and then generates a dataset which basically consists of the black box model's predictions and the sampled input data. Then LIME trains a new model. This model is the interpretable model one is looking for. Such a model can be a decision tree or a lasso, and it is a good local approximation of the model's predictions; however, it may not be as good globally. This was previously discussed and referred to as local fidelity.
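A toy sketch of this local-surrogate idea for a single tabular point, not the actual LIME implementation: perturb the input, query the black box, and fit a proximity-weighted linear model whose coefficients serve as the explanation. The black box function here is an invented stand-in for a trained model.

```python
import numpy as np

def local_surrogate(black_box, x, n_samples=500, sigma=0.5):
    """Perturb x, query the black box, and fit a proximity-weighted linear model."""
    perturbed = x + np.random.normal(scale=sigma, size=(n_samples, x.size))
    predictions = np.array([black_box(p) for p in perturbed])
    # Weight each perturbed sample by its proximity to x (an RBF kernel)
    weights = np.exp(-np.sum((perturbed - x) ** 2, axis=1) / (2 * sigma ** 2))
    # Weighted least squares: the fitted coefficients are the local explanation
    X = np.hstack([perturbed, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(X * sw, predictions * sw.ravel(), rcond=None)
    return coef[:-1]  # one weight per input feature; the last entry is the bias

# Hypothetical black box standing in for a trained model's output
black_box = lambda p: float(np.tanh(p[0]) + 0.5 * p[1] ** 2)
explanation = local_surrogate(black_box, x=np.array([0.3, -1.0]))
print(explanation)  # local importance of each of the two input features
```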

LIME can be used for tabular and textual data, but since this work deals with image data, the usage of LIME for those data types will be omitted. Nevertheless, it should be noted that LIME for images works differently than for textual and tabular data. As was said previously, LIME produces perturbations of data, but image data consists of many pixels, and it would not make much sense to loop through each pixel individually, since arguably one individual pixel does not contribute much to the classification of an image. If one random pixel is changed, this would probably not affect the model's prediction at all. Thus, an image is segmented into what are called “superpixels”. Superpixels are just pixels grouped together based on similar color. These superpixels are then turned “on” and “off”, with a switched-off superpixel replaced by a predefined arbitrary color, usually gray. An example of what LIME output looks like is shown in Figure 1 below, taken from the article “Why Should I Trust You?” Explaining the Predictions of Any Classifier.

Figure 1. Example of LIME explanation, trustworthy model. Example taken from “Why Should I Trust You?” Explaining the Predictions of Any Classifier by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin

Figure 1 shows an example of LIME output with explanations of predictions. The original images in a) are the input; based on what the researchers wanted to predict, explanations were produced by the LIME explainer in b) through d). In b), where the target was an electric guitar, the explanation shows the neck of a guitar with strings, along with human fingers and an elbow. The neck makes a lot of sense; the fingers and elbow may not be direct attributes of a guitar, but the dataset may well have contained many images of people holding guitars, which would explain why human body parts got into the electric guitar explanation. In c), on the other hand, the explanation is not immediately clear, at least to a non-expert, but if one knows that an acoustic guitar is the type of guitar in the original image, it makes sense - the model has spotted the body of the guitar and is showing it in the image. It also shows some parts of the clothing, which is strange at a glance. That is where faith in the model's predictions should come in: the model has shown the integral part of an acoustic guitar, and a user may be at peace that it has also taken into consideration a part of the image that does not belong to the guitar itself. The same goes for example d) - the dog's head is present in the explanation, so a user can be assured that the explanation is suitable and can be trusted, although the clothing part of the image can once again diminish that trust. That was an example of a model with trustworthy predictions, so let's take a look at a not so trustworthy output.

Figure 2. Example of LIME explanation, not trustworthy model. Example taken from “Why Should I Trust You?” Explaining the Predictions of Any Classifier by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin

Figure 2 is an example also taken from “Why Should I Trust You?” Explaining the Predictions of Any Classifier. This is an example of a bad explanation. The model in question was trained to predict whether a canine in the picture is a husky or a wolf, and the model showed high accuracy. However, when the predictions were run through a LIME explainer, it turned out that the model actually decides whether an image shows a wolf based on the presence of snow in that image. This makes some sense: the dataset is unlikely to include many images of wolves outside the wild, while huskies could be photographed inside a house. That is speculation, of course. Nevertheless, the explanation shows that the model is not trustworthy and should not be relied upon when classifying a wolf versus a husky, as it could literally be a life-or-death situation.

Both figures above have shown examples of reasonable and unreasonable model explanations, and basically became a great introduction to how LIME works. Now let's discuss the advantages and disadvantages of the method.

One of the most critical advantages of LIME is that it is truly model-agnostic, meaning it can work with tabular data, texts, and images. Its model agnosticism does not end there - in fact, even if the original machine learning model is replaced, the same local interpretable models can still be used for explanations. If one previously used a support vector machine model but then decided to switch to, say, XGBoost, the same kind of local surrogate can still be used for explanations. LIME explanations are also very easy to interpret and are suitable for lay people in a subject. At the same time, because LIME provides such accessible explanations, machine learning developers can use them to debug models without consuming much time.
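As a hedged sketch of how such an image explanation could be produced with the lime Python package - assuming a trained classifier named model that maps a batch of images to class probabilities (e.g. a Keras model), and the banner already loaded into a NumPy array named image:

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

# `model` and `image` are assumed to exist already (see the lead-in above).
def predict_fn(images: np.ndarray) -> np.ndarray:
    return model.predict(images)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,
    predict_fn,
    top_labels=2,      # explain the two most probable classes
    hide_color=0,      # switched-off superpixels are painted black; None averages them
    num_samples=1000,  # number of perturbed images fed to the classifier
)
# Keep only the superpixels that pushed the prediction towards the top class
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=True
)
highlighted = mark_boundaries(temp / 255.0, mask)  # image with superpixel outlines
```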

LIME also has disadvantages, one being that a correct definition of “neighborhood” remains an unsolved problem when it is used on tabular data. In addition, sampling in the LIME algorithm has room for improvement: while sampling data, it ignores the correlation between features and assumes a Gaussian distribution. It should also be noted that there is a big inherent problem - the explanations are not stable. This means that when trying to explain very similar data points, LIME explainers can come up with very different explanations. For example, with images of two quite similar dogs, one explanation may highlight the face and the paws of a dog, and the other only its nose. Both of these explanations are reasonable, but it can be confusing why the explanations are so different for such similar images. Such instability can make it difficult to trust LIME's explanations, so a researcher should stay vigilant.

In conclusion, LIME is a very promising method for machine learning explanations, but it is still in the development phase and therefore should be applied with some caution.

The application of neural networks in creative advertising strategies

The goal of this work was to provide value for communication agencies in their pursuit to deliver value to the client as efficiently as they can. Today every business industry strives to utilize technological advances. The industry most discussed in the section devoted to the technical aspects of this work was, in fact, medicine, because it demonstrates many pitfalls in using convolutional neural networks for image analysis. However, this does not mean that convolutional neural networks for image classification are not successfully applied in medicine; they are used more and more frequently as time progresses and new advancements take place. Convolutional neural networks are successfully used in many industries, such as banking, agriculture, robotics, and so on. There are also projects that already utilize artificial intelligence for the benefit of the advertising industry.

The use of artificial intelligence in advertising is not something new. As was shown in the literature review in this work, large companies such as Amazon and Tencent already utilize convolutional neural networks in their advertising networks in order to predict the click-through rate of an advertisement creative. However, the use of artificial intelligence in the advertising industry is not exclusive to the large technology companies which actually comprise the online advertising market. For example, Omnicom Media Group has previously acquired a company called Smart Digital in order to strengthen its internal AI-based user personalization systems (https://www.omnicomgroup.com/newsroom/omnicom-strengthens-its-ai-based-user-centric-personalization-solutions-by-acquiring-smart-digital/). The same company has also created an internal tool for advertisement banners, which helped to improve the efficiency of advertisement campaigns in the process (https://www.businessinsider.com/the-ad-agency-giant-omnicom-has-created-a-new-ai-tool-that-is-poised-to-completely-change-how-ads-get-made-2018-7). There are also cases when agencies utilize the power of neural networks to identify what parts of an advertisement are appealing to the consumer (https://richards.com/blog/getting-creative-machine-learning-artificial-intelligence-no-really-possible/).

So, artificial intelligence has been in use, or at least under discussion, by advertising agencies for quite a while. However, artificial intelligence is a very broad concept, and advertising agencies have a lot of operations that could be better off with automation, the absence of human error, and greater efficiency. It can be said that whatever a human hand touches would be better off done by something less prone to error. Moreover, advertising agencies are always in pursuit of new talent as well as trying to keep expertise in house - a great topic for a human resources research article, since keeping experts can be trouble. That being said, artificial intelligence could be used not only to perform operations without a trembling hand, but also to acquire and keep expertise, so that it remains with the agency in the long run. This idea was a great inspiration for me to start this work. However, the choice of the topic was dictated not only by my interest in the subject, but also by the discovery of a particular need in the process of creating advertisement campaigns for clients.

Companies need advertising for many reasons, and one of them is improving their top-of-mind awareness metric. Simply put, if you were to ask ten people what fizzy drink brands they know, and nine said Coca Cola while one said Sprite, Coca Cola would have 90% top-of-mind awareness. For this purpose, companies conduct large-scale advertising campaigns, almost always backed by an advertising agency as the contractor. The advertising agency provides many services for the client on a turn-key basis, from research to execution and everything in between. As a result, it is in the agency's best interest that the client gets the most out of the whole experience; otherwise the agency will lose their business. And in order to execute an advertising campaign well, one must prepare a communication strategy.

There are many frameworks which approach building a communication strategy, but in this paper RACE (https://2012books.lardbucket.org/books/public-relations/s10-the-public-relations-process-r.html) by John Marston will be examined. RACE is an acronym that stands for Research, Action planning, Communication, and Evaluation. The research step has the goal of analyzing the current situation of the company, its public relations, customer preferences, and so on. A strategic action plan is then developed based on this research, given clear goals and objectives, and effective strategies and tactics are laid out. Afterwards, in the communication part, the tools and tasks that will help achieve the goals are considered. And finally, one has to be able to measure the effect of the executed strategy, for if it could not be evaluated, how could one determine whether it was successful? In this work, emphasis is placed on the research stage, because the need for this research became obvious while analyzing the process of creating advertising banners and launching them into the world.

