Measuring Task Learning Curve with Usage Graph Eccentricity Distribution Peaks

Interaction logs (or usage data) are abundant in the era of Big Data, but making sense of these data with Human-Computer Interaction (HCI) in mind is becoming a bigger challenge. Interaction log analysis involves tackling problems such as automatic task identification, modeling task deviation, and computing task learning curves. In this work, we propose a way of measuring task learning curves empirically, based on how task deviations (represented as eccentricity distribution peaks) decrease over time. From the analysis of 427 event-by-event logged sessions (captured with users' consent) of a technical reference website, this work shows the different types of learning curves obtained by computing how deviations decrease over time. The proposed technique supported the identification of 6 different task learning curves in the set of 17 tasks, differentiating tasks that are easy to perform (e.g., view content and login) from tasks in which users face more difficulties (e.g., register user and delete content). With such results, HCI specialists working with large datasets can focus on reviewing the specific tasks in which users faced difficulties during real interaction.

Keywords: Interaction log analysis, task modeling, task deviation, empirical user studies, client-side events, usage logging, usage modeling, usability.


I. INTRODUCTION
This work is part of a long-term project on the understanding of interaction logs, aiming at making sense of such data, supported by Human-Computer Interaction (HCI) theory, techniques, and methods. A proposal about using eccentricity distribution peaks (EDPs) as a way of modeling task deviations was published in the proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems. Eccentricity is a vertex metric, in the context of Graph Theory, that measures the greatest distance in hops between a vertex and any other vertex of the graph [27]. From the analysis of the eccentricity distribution of a graph, it is possible to grasp the graph topology in a summarized way. Hence, if there is a peak in the eccentricity distribution of a directed and connected graph, it means that there are multiple ways of reaching the most distant node. For graphs representing detailed interaction events such as mouse events, keyboard actions, and browser functioning, this means that there are multiple ways of performing a certain task.
The overlap between the previous work and this one encompasses the dataset studied and the use of EDPs.
However, this work expands the previous one by using the EDPs as a metric for measuring task learning curve.
The literature offers different tools and models that support the understanding of user actions while interacting with websites through the analysis of server log files. This approach has long been adopted for different reasons (e.g., ease of obtaining such data from Web servers). Examples of such tools are Descubridor de Conhecimento en la Web [9], LumberJack [7], Web Utilization Miner [23], WebCANVAS [5], WebQuilt [26], and WebSIFT [8]. However, server logs do not provide details on how users interacted with user interface (UI) elements. More recent initiatives consider client-side data in order to understand user actions in detail, for instance, MouseTrack [1], MultimodalWebRemUSINE [19], UsaProxy [2], WELFIT [22], WUP [6], WebHint [25], and USABILICS [24]. Hence, client-side data emerged as a way of gathering detailed interaction data, allowing a better understanding of user actions while users interact with UIs.
One of the invariants regarding the existing systems is that tools focus on providing insights about the usability level of the evaluated UIs. According to the International Organization for Standardization (ISO) [13], usability is the capacity of a product to be used by specific users to achieve certain tasks with effectiveness, efficiency, and satisfaction, in a certain context of use. Nielsen states that usability can also be defined in terms of 5 quality components [17]:
• Learnability: How easy is it for users to accomplish basic tasks the first time they use the design?
• Efficiency: Once users have learned the design, how quickly can they perform tasks?
• Memorability: When users return to the design after a period, how easily can they reestablish proficiency?
• Errors: How many errors occur, how severe are they, and how easily can users recover from them?
• Satisfaction: How pleasant is it to use the design?
Considering these definitions, task emerges as a key term. According to Lewis and Rieman [15], "To get a good interface you have to figure out who is going to use it to do what." Thus, supporting the understanding of the task learning curve is fundamental for grasping the overall user behavior during interaction and the usability level of the UI being used. Key questions emerging in the context of the task learning curve are:
• How can we measure learnability?
• How can we measure memorability?
• How can we measure task learning curve from real and detailed interaction logs?
When dealing with big datasets, answering these questions can reduce the amount of data to consider in in-depth analysis. For example, knowing which task (or type of task) is hard for certain users supports filtering data by specific tasks for specific user profiles. In addition, previous work [21] presented first results on how graph topology metrics (e.g., number of shortest paths and betweenness) can serve as proxies for usability metrics, considering data gathered during a user test with 10 participants, run in a controlled environment. In this paper we extend the idea of usability proxies to investigate learning curves from a real dataset of detailed interaction data. Our proposal involves using EDPs to measure task learning curve from the usage data logged during real tasks, remotely and asynchronously, totaling 427 sessions. Hence, this work contributes a method to measure task learning curve by using EDPs and the clusters of sessions based on the number of peaks. This work is organized as follows: section II presents related works; section III details the method, dataset, and the graph structure considered; section IV shows and discusses the results obtained; and section V concludes by presenting outcomes, limitations of this study, and future steps.

II. RELATED WORK
Initiatives on modeling user interface usage and tasks in Human-Computer Interaction have grown since initiatives such as GOMS (Goals, Operators, Methods, and Selection rules) [12] and CTT (ConcurTaskTrees) [18]. Task modeling aims at decomposing tasks into smaller activities and defining the distance between the task specification and the task executions. When analyzing this distance over time, it is possible to identify the learning curve. However, such approaches depend on creating and maintaining task models as the system evolves.
One alternative to task modeling involves data-driven approaches. The literature offers proposals for representing actions as graphs considering clicks [16] or detailed client-side data, including the identification of interaction patterns and usability problems [20]. In addition, probabilistic models are also present in the literature, supporting the analysis of commonly taken paths [3]. Moreover, additional approaches involving graphs consider user profiles [9] or logs of search queries performed by users [10]. Data-driven approaches represent valuable initiatives towards evaluating interaction data at scale. However, to the best of our knowledge, none of the presented prior art offers a method to identify task learning curves from detailed interaction data. In summary, although the literature offers approaches for detecting deviation from previously defined task models and data-driven approaches for identifying interaction patterns via graphs, queries, and profiles, existing works do not provide means of evaluating task learning curve from interaction logs at scale without depending on task specifications.

III. METHOD
This section details the dataset analyzed, the data structure used, and how the analysis was performed.

A. Dataset
The dataset considered in this work is composed of 427 logged sessions captured during a two-year period. The website where the sessions occurred is called WARAU (Websites Adaptation to Requirements of Accessibility and Usability). WARAU is a technical reference and UI evaluation repository. The website supports the development of high-quality websites integrating technologies such as HTML, CSS, and JavaScript, aiming at Accessibility and Usability.
The event streams related to the 427 sessions analyzed were captured by the evaluation tool WELFIT (Web Event Logger and Flow Identification Tool) [22]. The logging of UI events occurred remotely with users' consent, after the acceptance of an invitation to participate in this study. The invitation was presented once to every user who accessed the website. The data logged comprise all events triggered at the UI while users performed real tasks. The dataset contains 241,413 events (mean of 564.4 events per session).
The following list presents descriptive information regarding accesses to the reference website in the period this study took place:
• Total of 220,448 sessions (mean of 9,185 per month);
• New sessions represent 89.03%;
• The average duration of the session is 38 seconds;
• Users view on average 1.21 pages per session.
These characteristics highlight the role of the website as a source of technical information, since most of the users land on the website coming from a search engine, interact with the content, and then leave the website.
The following tasks were identified in the dataset:
This dataset was considered because it contains details of the actions performed during usage, allowing the present analysis of task deviations. Moreover, since the website is commonly used as a reference, it is interesting to identify task deviations and task completion characteristics in order to characterize how users use a technical reference website.

B. Data structure
In order to perform the data analysis and to compare with other techniques summarizing usability information of observational data, the logged data was structured according to the technique presented in [20]. The graph structure considered (also called usage graph) is a weighted directed graph G = (V, E, w), where:
• V is the set of vertices, comprising the vertices start and end plus one vertex for each action/event triggered at a certain user interface element (e.g., the event mouseover triggered on a submit button is represented by one vertex, say vi, and then a click over the same submit button is represented by a second vertex, say vj). The vertices start and end represent the start and the end of the logged session, visit, pageview, or any other period being represented in the usage graph.
• Each v ∈ V carries information representing the mean distance in hops from start to v, represented as d(v), the mean timestamp in milliseconds from the start to v, represented as t(v), and the total occurrences of the same event over the same UI element.
• E ⊆ V × V is the set of directed edges, where e connects two vertices vi and vj if vj occurred immediately after vi in the logged data, represented as (vi, vj).
• w(vi, vj) is the total number of occurrences of (vi, vj) in the event stream.
• A vertex vi is marked as a usability problem candidate according to the following heuristic: if d(vi) > mean(d(vj)) + 2 stdev(d(vi)), for vj representing all outgoing vertices of vi. The intuition behind this is to identify cyclic actions, indicating repeated attempts at performing a task or using UI elements.
• Considering time differences, nodes are also marked as usability problem candidates if any of the following is true:
1. t(vi) > mean(t(vj)) + 2 stdev(t(vi)); or
2. t(vi) - t(vj) > 10 seconds.
Here, vj represents all outgoing vertices of vi. The 10-second limit follows Nielsen's 3 important limits, according to which 10 seconds is about the limit for keeping users' attention on the task at hand. Figure 1 presents an example of the usage graph of one of the sessions analyzed and how the usability problem candidates are pointed out by the heuristic used. In the figure, the ellipses represent UI events; boxes represent UI elements. This example shows how cyclic actions impact the distances (d) and how usability problem candidates are pointed out as highlighted ellipses.
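The structure above can be illustrated with a minimal Python sketch. It is not the algorithm of [20] nor WELFIT's implementation: the record layout (timestamp_ms, event_name, element_id), the function name build_usage_graph, and the choice of taking the position of each occurrence in the event stream as its distance in hops from start are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_usage_graph(events):
    """Build a weighted directed usage graph from one session's event stream.

    `events` is a time-ordered list of (timestamp_ms, event_name, element_id)
    tuples. This record layout is hypothetical, not WELFIT's log format.
    """
    # One vertex per (event, element) pair, plus the start and end vertices.
    sequence = ["start"] + [(name, elem) for _, name, elem in events] + ["end"]

    # Time offsets from the first event; start gets 0 and end reuses the
    # offset of the last event (an assumption of this sketch).
    t0 = events[0][0] if events else 0
    offsets = [0] + [ts - t0 for ts, _, _ in events]
    offsets.append(offsets[-1])

    hops = defaultdict(list)   # vertex -> positions in the stream (assumed hops from start)
    times = defaultdict(list)  # vertex -> time offsets in milliseconds
    for position, (vertex, offset) in enumerate(zip(sequence, offsets)):
        hops[vertex].append(position)
        times[vertex].append(offset)

    # d(v): mean distance in hops, t(v): mean timestamp, count: total occurrences.
    vertices = {
        v: {
            "d": sum(hops[v]) / len(hops[v]),
            "t": sum(times[v]) / len(times[v]),
            "count": len(hops[v]),
        }
        for v in hops
    }

    # w(vi, vj): number of times vj occurred immediately after vi.
    edges = Counter(zip(sequence, sequence[1:]))
    return vertices, edges


# Hypothetical session: the user hovers and clicks the submit button twice.
session = [
    (0, "mouseover", "submit"),
    (400, "click", "submit"),
    (900, "mouseover", "submit"),
    (1300, "click", "submit"),
]
vertices, edges = build_usage_graph(session)
```

In this toy session the repeated hover-and-click cycle yields only four vertices, while the edge from the mouseover vertex to the click vertex accumulates a weight of 2, which is how repeated attempts become visible in the graph.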

C. Data analysis
As mentioned previously, data capture was performed via WELFIT [22]. Then, raw logs were downloaded from the tool and usage graphs were built following the algorithm presented in [20]. Usage graphs were built for each of the sessions in separate DOT files. DOT is a graph representation format used by the Graphviz software. Graphviz was used to generate visualizations of usage graphs such as Figure 1, Figure 2 (b), Figure 3 (b), and Figure 4 (b). In addition, DOT files were also used as input for computing metrics related to diameter, centrality, degree, community detection, among others, via the Gephi software. Finally, for each session, the eccentricity distribution was analyzed and the main characteristics of the distributions were summarized as:
• The presence of peaks in the eccentricity distribution;
• Number of peaks.
The eccentricity of vertex v in a connected graph G is the maximum graph distance between v and any other vertex u of G [27]. In the eccentricity distribution, a peak is considered a point in the distribution, say x, with a respective count value f(x) surrounded by x-1 and x+1, so that f(x-1) < f(x) and f(x) > f(x+1).
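The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch and not the Gephi computation used in the study; it rests on two assumptions: in a directed usage graph the eccentricity of a vertex is taken over the vertices reachable from it, and eccentricity values absent from the distribution are treated as having frequency 0. The function names and the example graph are hypothetical.

```python
from collections import Counter, deque


def eccentricities(adj):
    """Eccentricity of each vertex: the largest shortest-path distance (in hops)
    from that vertex to any vertex reachable from it.

    Restricting to reachable vertices is an assumption of this sketch, since a
    start-to-end usage graph is generally not strongly connected.
    """
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = max(dist.values())
    return result


def eccentricity_distribution(adj):
    """Map each eccentricity value x to f(x), the number of vertices having it."""
    return Counter(eccentricities(adj).values())


def find_peaks(distribution):
    """Return every x with f(x-1) < f(x) and f(x) > f(x+1).

    Values missing from the distribution count as frequency 0 (assumption).
    """
    return sorted(
        x for x in distribution
        if distribution.get(x - 1, 0) < distribution[x] > distribution.get(x + 1, 0)
    )


# Hypothetical graph in which two alternative paths (via B or via C) lead to D.
graph = {
    "start": ["A"],
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["end"],
    "end": [],
}
ecc = eccentricities(graph)
distribution = eccentricity_distribution(graph)
peaks = find_peaks(distribution)
```

Here vertices B and C both have eccentricity 2, so f(2) = 2 while every other value has frequency 1, and the distribution has a single peak at x = 2.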
The next examples show how the EDP supports insights in relation to task deviations. It also allows the comparison of large datasets of detailed actions, supporting the understanding of how users performed tasks and when in the session deviations/cyclic actions occurred. Consider the eccentricity distribution for the resulting graph after a task deviation occurred (Figure 3). Note that the peak represented in Figure 3 indicates that a deviation occurred in some of the nodes with eccentricity equal to the value indicated by the peak; in this case, nodes C and D have eccentricity equal to 2. Now consider a simple graph showing a task deviation at the beginning of the task and then actions leading to task conclusion. In [21], one motif representing this type of task deviation is presented in Figure 4 (a). Building on top of this result, the eccentricity distribution for the motif is presented in Figure 4 (b), summarizing the same concept of task deviation. Note that the EDP represents inversely when, in the session, the deviation occurred. Hence, Figure 4 (b) shows that the deviation occurred in the first quarter of the session. Once graph metrics were calculated, correlations were computed and the eccentricity distributions were analyzed, highlighting deviations from tasks and how the summarized results can provide details of how users performed tasks.
The information regarding the number of peaks was used to cluster sessions in order to point out tasks in which users faced difficulties. After clustering sessions based on the number of peaks, each of the sessions was analyzed in order to identify the tasks it relates to. The rationale here is to cluster sessions that have a similar number of task deviations, so that similar distributions correspond to similar behaviors related to task deviations across the evaluated website. Then, each task was evaluated considering its presence in the clusters containing different numbers of EDPs and how these peaks change for a specific task over time. Finally, task learning curves were analyzed in order to reveal patterns associated with certain types of tasks, aiming at the analysis of tasks that need to be redesigned, restructured, or simplified.
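The clustering step can be sketched as a simple grouping, assuming each observation has already been reduced to a (task, number of EDPs) pair. The function name peak_profiles, the one-record-per-task-observation layout, and the sample data are hypothetical; the sketch only shows how the per-task frequency over peak counts could be tabulated.

```python
from collections import Counter, defaultdict


def peak_profiles(observations):
    """Tabulate, for each task, the relative frequency of each EDP count.

    `observations` is an iterable of (task, number_of_peaks) pairs, one per
    session in which the task was identified (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for task, n_peaks in observations:
        counts[task][n_peaks] += 1

    profiles = {}
    for task, counter in counts.items():
        total = sum(counter.values())
        # Relative frequency of sessions per number of peaks, in ascending order.
        profiles[task] = {k: counter[k] / total for k in sorted(counter)}
    return profiles


# Hypothetical observations, not taken from the studied dataset.
observations = [
    ("login", 1), ("login", 1), ("login", 2),
    ("register user", 3), ("register user", 5),
]
profiles = peak_profiles(observations)
```

A task whose profile concentrates on the 1-peak mark would then be read as easier to perform than one whose profile is shifted towards higher peak counts.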

IV. RESULTS
Table I presents a summary of descriptive information involving the 427 usage graphs. The presented information was generated for each usage graph through WELFIT and Gephi software. It is possible to see that the high standard deviation values are related to the multiplicity of tasks, i.e., some tasks resulted in small usage graphs with a few tens of vertices, while other sessions resulted in usage graphs with a few thousand vertices. This effect can also be seen in the number of shortest paths, vertices, edges, among others. On the other hand, the eccentricity distribution is proposed as a more valuable metric, highlighting deviations and providing a richer semantic result than a sole number, e.g., task deviations occurred mostly during the first quarter of the sessions. Considering correlations, the Spearman test (ρ) was applied in order to find significant correlations between EDPs and other metrics computed from usage graphs. The Spearman test is a nonparametric way of evaluating correlations between two variables. In this study we used R software to perform the correlation analysis. Next, we present metrics with significant positive correlation with the number of EDPs:
• Average path length (ρ = 0.618, p-value < 0.001);
• Modularity (ρ = 0.601, p-value < 0.001);
• Diameter (ρ = 0.595, p-value < 0.001);
• Number of shortest paths (ρ = 0.413, p-value < 0.001);
• Number of communities (ρ = 0.380, p-value < 0.001);
• Number of usability problem candidates (ρ = 0.354, p-value < 0.001).
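For readers unfamiliar with the Spearman coefficient, the sketch below computes ρ as the Pearson correlation of average ranks. It is only an illustration in Python: the study itself used R, the sketch does not compute p-values, and the function names and sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-session values, not taken from the studied dataset.
edp_counts = [1, 1, 2, 3, 5]
diameters = [4, 5, 7, 9, 20]
rho = spearman_rho(edp_counts, diameters)
```

Because the coefficient depends only on ranks, the large diameter of the last hypothetical session does not dominate the result as it would in a Pearson correlation on raw values.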
Figure 5 shows the heat map of all eccentricity distributions and provides a high-level overview of how tasks were performed in the studied website. It shows that task deviations usually occur in the first quarter of the session. Recall that the EDP represents inversely when in the session the deviation occurred. In order to relate tasks and eccentricity distributions, each of the event streams was analyzed in detail to identify the tasks the users were performing; then clusters were generated based on the number of EDPs and the tasks performed, allowing the analysis of tasks in which users faced difficulties. Fig. 6 shows task occurrence in the clusters generated by considering the number of peaks in the eccentricity distribution. It indicates that, in absolute numbers, sessions in the cluster of 5 peaks require detailed analysis, more specifically how users performed tasks 3 (View content), 4 (View accessibility evaluation form sample), 5 (View the "about page" presentation), 6 (View heuristic evaluation form sample), and 7 (View topics index). The presence of the same task in different EDP clusters shows that the same task is performed through different possible paths present in the event streams, which paves the way for measuring task learning curve considering the presence of peaks in the eccentricity distributions. Moreover, in huge datasets it is also possible to identify the minimum number of EDPs for a certain task, for instance, tasks 10 and 14. Thus, the eccentricity distribution used to measure task learning curve can be considered a summary of a huge number of sessions that represent detailed interaction data. Moreover, Fig. 6 shows that the tasks present in the sessions with 4 and 5 peaks are not the most common ones, highlighting that these tasks need to be reviewed and that related UI components might need improvement.
Figure 7 shows all 17 learning curves resulting from the analysis of EDPs in the dataset. It is possible to see that most tasks (9 out of 17) have distributions with the highest frequencies at the 1-peak mark, which is expected given that the website does not require login for most of the identified tasks and most of them are related to simply viewing different types of content. In addition, tasks with isolated peaks often had fewer observations. Task 4 (View accessibility evaluation form sample) has a high frequency in the cluster of 5 EDPs, indicating that this task should be analyzed in detail to verify what happened and how it can be simplified. Task 7 (View topics index) and task 11 (Access the administration page) have high frequency in the cluster with 3 EDPs, which indicates that these tasks require further analysis aimed at simplifying them, so that they appear in the clusters containing eccentricity distributions with fewer than 3 peaks. No specific task was identified with 0 EDPs, indicating that, in this dataset, most tasks identified have a minimum of 1 EDP.
The following analysis involved grouping tasks with similar learning curves. Task 13 (View a comment), task 15 (Create a heuristic evaluation), and task 17 (View the "about page") represent the fastest learning curves in this dataset (Figure 8). There is no occurrence of sessions with 5 or 4 EDPs, a less frequent occurrence of sessions with 3 and 2 EDPs, and a predominant occurrence of sessions with 1 EDP.
Figure 9 shows a group of tasks with a slightly different shape. They have few instances with 4 and 3 EDPs, while most observations of these tasks resulted in eccentricity distributions with 1 or 2 EDPs.
Figure 10 shows a group of learning curves with a valley-like shape, with a considerable number of occurrences of sessions with 5, 4, and 3 EDPs; however, the most common observations related to these tasks have eccentricity distributions with 1 peak. Finally, Figures 11, 12, and 13 show the groups of learning curves presenting the worst cases for eccentricity distributions in the dataset. Sessions with 1 EDP are less frequent in these groups and the curves are shifted towards the 3, 4, or 5-peak marks.

A. Discussion
The correlations found reveal that the EDPs relate to different topology metrics. The average path length and diameter are impacted as more task deviations occur, although these graph-wide metrics represent less detailed information regarding performance than eccentricity distributions. The number of communities was not as correlated with EDPs (ρ = 0.380), even though modularity reached ρ = 0.601, probably due to the way the directed graph is built from detailed interaction data.
Bearing in mind the different types of learning curves found, the proposed method can be used to support the redesign of UIs and user tasks. The proposed method of measuring task learning curves can be used to identify different ways users perform the same task. Moreover, it is possible to identify how EDPs change over time and the empirical minimum/maximum values reached by real users. In these cases, values differing from the expected ones would reveal situations unforeseen during project phases, reinforcing approaches involving real users, in real environments, performing real tasks.

V. CONCLUSIONS
This work presented outcomes gathered from an investigation on how to measure task learning curves from detailed interaction logs captured while users performed real tasks. The proposed way of measuring task learning curves builds on top of modeling task deviations as eccentricity distribution peaks (EDPs). Hence, from the analysis of how frequent these deviations are for groups of tasks, it is possible to evaluate big datasets and to identify tasks that appear in sessions with higher/lower occurrence of EDPs. Thus, this approach supports the identification of the most/least complicated tasks based on task learning curves. This can be used to represent the learning curve for the population of users, supporting the analysis of how task deviations decay over time.
The proposed way of measuring learning curve supports the summarization of multiple sessions represented by event streams of highly detailed interaction data, allowing HCI practitioners to select groups of tasks, sessions, or groups of users related to high occurrences of EDPs. Regarding how smoothly the tasks were performed, the task learning curves based on EDPs are an interesting proxy, since the fewer the deviations from tasks, the smoother the eccentricity distribution will be. Regarding learning curves, the smoother the eccentricity distribution is, the closer to the 0 or 1-peak marks the resulting learning curve will be.
The proposed way of measuring task learning curves can be used by HCI practitioners and Data Scientists in multiple cases, for instance, in A/B tests comparing two solutions or in usability tests as a quantitative way of measuring learnability, memorability, or error rate. In the case of automated tools, the proposed way of measuring learning curves can be used in dashboards, from whole systems down to UI elements, in order to identify situations where users are facing difficulties and thus to offer online support, e.g., in a shopping cart page of an e-commerce website. This work is part of a long-term initiative to build a usage behavior model based on detailed logged data, investigating how to identify tasks, model task deviation, and measure learning curves. Regarding limitations of this work, the dataset considered comes from a technical website. The target audience is composed of developers, content producers, and digital designers. Thus, it represents neither the whole Web nor the whole Web audience. However, the focus of this paper is to present how the eccentricity distribution supports depicting task learning curves. Another point to consider as a limitation is that, of the 220,448 sessions that occurred in the last two years, only 427 (0.19%) were logged. This small share of the users of the studied website is due to the need for users to accept participating in the study, allowing the data logger to capture detailed interaction data. Thus, the number of participants was reduced in favor of privacy and of users' choice on whether to provide detailed data on how they perform tasks. Moreover, this is a requirement of the tool used.
Future work involves differentiating UI learning curves from content learning curves based on EDPs and interaction log analysis.

Fig. 1. Example of the usage graph of one of the analyzed sessions; highlighted nodes are the usability problem candidates pointed out by the heuristic.

Fig. 3. Simple graph with a cyclic action (a) and its corresponding eccentricity distribution (b); the point (2,2) represents a peak in the distribution.

Fig. 5. Eccentricity distribution heat map for all the studied sessions; eccentricity and frequency are normalized.

Fig. 6. Task presence in the clusters of sessions built based on the number of EDPs; the size of the bubbles represents the number of occurrences found in the sessions.

TABLE I. SUMMARY OF THE 427 USAGE GRAPHS ANALYZED.