<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://christop.club/feed.xml" rel="self" type="application/atom+xml" /><link href="https://christop.club/" rel="alternate" type="text/html" /><updated>2025-11-24T10:41:28+00:00</updated><id>https://christop.club/feed.xml</id><title type="html">christop.club</title><subtitle>Christopher S. Corley is a software engineer in Chattanooga, TN
</subtitle><entry><title type="html">Reviews for “Web Usage Patterns of Developers”</title><link href="https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers/" rel="alternate" type="text/html" title="Reviews for “Web Usage Patterns of Developers”" /><published>2015-08-01T00:00:00+00:00</published><updated>2015-08-01T00:00:00+00:00</updated><id>https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers</id><content type="html" xml:base="https://christop.club/2015/08/01/reviews-for-web-usage-patterns-of-developers/"><![CDATA[<p>I’m really racking them up this year at <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>!  This one is an Industry
track paper I did with the good folks over at <a href="http://codealike.com">Codealike</a>.</p>

<p>Here is the <a href="/publications/pdfs/Corley-etal_2015b.pdf">preprint</a>. You’ll find
more information about this paper on my <a href="/publications">publications</a> page as I
add it.</p>

<h1 id="abstract">Abstract</h1>

<blockquote>
  <p>Developers often rely on web-based tools for troubleshooting, collaboration,
issue tracking, code reviewing, documentation viewing, and a myriad of other
uses. Developers also use the web for non-development purposes, such as reading
news or social media. In this paper we explore whether web usage is detrimental to
a developer’s focus on work, using a sample of over 150 developers. Additionally,
we investigate whether highly focused developers use the web differently than other
developers. Our qualitative findings suggest highly focused developers use the
web differently, but we are unable to predict a developer’s focus based on web
usage alone. Further quantitative findings suggest that web usage does not
negatively impact a developer’s focus.</p>
</blockquote>

<h1 id="review-1">Review 1</h1>

<p>This is an interesting paper that primarily presents data - which is great.</p>

<p>In a few cases - such as the Office Collaboration tools - I would have liked to see some attempt at better understanding why the quartiles were so inconsistent (or didn’t fall off in what might be an expected way). I don’t know if you have anything in your data that might be revealing.</p>

<h1 id="review-2">Review 2</h1>

<p>Summary</p>

<p>The authors studied how developers use online content and to what degree their development focus is influenced.
The aim of their study is to find out whether software quality might be affected by internet content, given that online tools in general are required for development.
They concluded that using the internet for general purposes need not influence developers, but highly focused developers use the web rarely anyway.</p>

<p>Strong Aspects</p>

<p>In general, it is a good direction of research to better understand developers’ behaviour, and especially state-of-the-art influences on their daily work.
Studying open web content and the rising accessibility to it during work is an interesting field as well.
The paper described the data set and its processing in a good manner, and the paper is written in good style.</p>

<p>Weak Aspects</p>

<p>While I accept the overall topic and the paper’s quality, I miss a more qualitative discussion. Perhaps the authors should get in contact with psychology groups, especially on the topic of attention and productivity.</p>

<p>For example, I miss a discussion of productivity patterns such as the Pomodoro technique (i.e., no one is able to keep attention active during the whole day, so one intentionally gets distracted in a time-boxed manner).
Furthermore, social interaction is very important for anyone, also to raise productivity. In a similar manner, people differ in how much interaction they need to be as efficient / focused as they can be. The authors did not study, for each individual, how the effect might change if, for example, access to social networks were blocked.
Additionally, the authors did not discuss whether some participants might have worked for companies that already block social networks.</p>

<p>On a detailed level, I miss a clearer separation between the web content categories. For example, how did you decide between technical blogs and general-purpose blogs not relevant to their work?</p>

<p>Finally, there is no discussion of a threat to validity due to the influence of the measurement itself on the participants. If they knew that their behaviour was being analysed, they might have acted differently.</p>]]></content><author><name></name></author><category term="reviews" /><category term="open science" /><category term="web activity" /><category term="mining" /><category term="developer focus" /><category term="codealike" /><summary type="html"><![CDATA[I’m really racking them up this year at ICSME’15! This one is an Industry track paper I did with the good folks over at Codealike.]]></summary></entry><entry><title type="html">Reviews for “Exploring the Use of Deep Learning for Feature Location”</title><link href="https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location/" rel="alternate" type="text/html" title="Reviews for “Exploring the Use of Deep Learning for Feature Location”" /><published>2015-07-25T00:00:00+00:00</published><updated>2015-07-25T00:00:00+00:00</updated><id>https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location</id><content type="html" xml:base="https://christop.club/2015/07/25/reviews-for-doc2vec-feature-location/"><![CDATA[<p>I’ve been blessed with a second publication at <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>!  This one is an
Early Research Achievements (ERA) track paper on using Gensim’s <a href="http://radimrehurek.com/gensim/models/doc2vec.html">Doc2Vec</a> for
feature location.</p>
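<p><em>Aside: the ranking step in an FLT like this one boils down to comparing a query’s inferred vector against each program element’s vector. Here is a minimal pure-Python sketch of cosine-similarity ranking; the method names and tiny 3-dimensional vectors are invented for illustration (real vectors would be inferred by a trained Doc2Vec model):</em></p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_elements(query_vec, element_vecs):
    """Return (element, similarity) pairs sorted best-first."""
    scored = [(name, cosine(query_vec, vec))
              for name, vec in element_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented 3-dimensional stand-ins for inferred document vectors.
elements = {
    "DiagramRenderer.redraw": [0.9, 0.1, 0.0],
    "ConfigParser.load":      [0.1, 0.8, 0.2],
    "Logger.flush":           [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # invented vector for the query "diagram"

ranking = rank_elements(query, elements)
```

<p><em>The top of <code>ranking</code> is then presented to the developer as the most likely location of the feature.</em></p>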

<p>Overall, I am very happy with the comments we received!  Below are the reviews.</p>

<p>Here is the <a href="/publications/pdfs/Corley-etal_2015a.pdf">preprint</a>. You’ll find
more information about this paper on my <a href="/publications">publications</a> page as I
add it.</p>

<h1 id="review-1">Review 1</h1>

<p>Summary:</p>

<p>The goal of this work is to support feature location using deep learning approaches. The authors claim that deep learning provides the ability to incorporate the order of terms as opposed to traditional feature location techniques, which have treated source code as an unordered set of terms. The authors report improvements in performance (using mean reciprocal rank) using a particular deep learning model over a set of six software systems. Additionally, the authors estimate the average time to rank per query and the model training time.</p>
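<p><em>Aside: for anyone unfamiliar with mean reciprocal rank, it is just the average of 1/rank of the first relevant result over all queries. A tiny pure-Python sketch (the ranks below are made up, not from the paper):</em></p>

```python
def mean_reciprocal_rank(ranks):
    """MRR over the 1-based ranks of the first relevant result per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the goldset method for four queries.
ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.25 + 0.1) / 4 = 0.4625
```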

<p>Strengths:</p>

<ul>
  <li>+Emerging area in SE research</li>
  <li>+Using real systems</li>
</ul>

<p>Weaknesses:</p>

<ul>
  <li>-Conceptual gaps in the presentation</li>
  <li>-Study design</li>
</ul>

<p>Comments:</p>

<p>Feature location is a critical maintenance task, and bringing deep learning models to bear is a new approach certainly worth examination. Please consider the following comments and questions to strengthen the paper.</p>

<p>“Therefore, when querying for diagram, program elements where redraw is also present are considered more relevant and thus are boosted in the rankings.” Technically, LDA would consider the co-occurrence in this example too. (Yes, LDA would discard information on the order.) So what precisely will distinguish the approach based on deep learning models in this respect? Is it exclusively the order of terms? Would n-gram topic models be relevant here?</p>

<p>“We also suggest directions for future work on the use of DVs (or other deep learning models) to improve developer effectiveness in feature location.” I recommend rewording this statement since the concern is not improving <em>developer</em> effectiveness per se. The concern is improving the effectiveness of feature location engines.</p>

<p>“A deep learning neural network encodes source code identifiers, in the order they appear in the source code.” Technically, this statement is not correct. Imagine a software corpus with |I| source code identifiers. Consider a neural network with several hidden layers and an input layer with size |I|, where each unit in the input layer corresponds to an identifier. If we represent source code files as vectors of relative frequencies of identifiers and train the model to learn a compact representation of its input, then this deep learning neural network does not encode identifiers in the order they appear in the source code.</p>

<p>The semantic similarity example given in Sec. II.B should be ported to the SE domain for effect. Moreover, at the end of Sec. II, I still don’t have a good understanding of the model nor how the model is designed to address concerns inherent to feature location in a new or more efficient way.</p>

<p>Please consider adding research questions to Sec. III.</p>

<p>A general summary of tests supporting the statistical significance and effect of the results reported in Sec. III would rigorously support the authors’ claims on the performance gains.</p>

<p>Re: Tab. IV: Is model training time really a concern? LDA—the baseline model—appears to be on the order of one second (for 100 topics).</p>

<p>Why is the related work started by referring to n-grams, which do not appear to bear on the problem at hand? I would expect a reference to n-grams in the paper (aside from the second sentence of the abstract) if they are in fact related.</p>

<p>“statistical models of natural language text able to capture more complex patterns while being trained using smaller corpora relative to the n-gram model [3] [8]” It is not clear how these references substantiate the claim of using smaller corpora. I also don’t see the need to even emphasize smaller corpora.</p>

<p>Another related paper that should be discussed is on configuring LDA for SE tasks: Panichella, A., Dit, B., Oliveto, R., Di Penta, M., Poshyvanyk, D., and De Lucia, A., “How to Effectively Use Topic Models for Software Engineering Tasks? An Approach based on Genetic Algorithms”, in Proceedings of 35th IEEE/ACM International Conference on Software Engineering (ICSE’13), San Francisco, CA, May 18-26, 2013, pp. 522-531</p>

<p>Minor points:</p>

<ul>
  <li>
    <p>“Deep learning models are a class of neural networks.” I recommend rewording this statement. Fundamentally, deep learning models are characterized by multiple “levels” of nonlinear transformations. Neural networks are a convenient abstraction for deep learning, but I would shy away from explicitly subtyping deep learning models from neural networks even though neural networks dominate deep learning applications;</p>
  </li>
  <li>
    <p>“that has shown promising results in modeling natural language” needs citation(s);</p>
  </li>
  <li>
    <p>In the third paragraph of the Introduction, the last sentence needs a citation;</p>
  </li>
  <li>
    <p>“4Hz processor” should probably be 4GHz processor.</p>
  </li>
</ul>

<p>There are several grammatical and typographical issues in the current version. Should the paper be accepted, the authors should fix these issues to ensure the paper is in the best possible shape for the camera-ready version.</p>

<h1 id="review-2">Review 2</h1>

<p>Summary</p>

<p>The paper describes an initial study of using a neural network machine learning approach for feature location.  They use document vectors (DV) and compare it to LDA.</p>

<p>Comments</p>

<p>Pretty straightforward paper without any big surprises or critical missing information.  They leveraged Dit et al.’s work for the experiment and setup.</p>

<p>DV appears to be about the same accuracy as LDA but looks to be much faster in query time and training time.  So some trade-offs.</p>

<p>DV is a technique not previously applied to this problem and could be another useful tool for addressing SE problems.</p>

<p>Overall I see these as pretty interesting results that would be a nice addition to the technical program.</p>

<h1 id="review-3">Review 3</h1>

<p>In this paper, the authors investigate the use of a deep learning model, document vectors (DV), for feature location. The authors compare DV with LDA on 633 queries from 6 versions of 4 software systems.</p>

<p>The paper is clearly written and presents an intriguing new approach worthy of further investigation. Although DV has been applied to source code in the past by White, et al., this work is the first time it has been trained on specific software systems, rather than a larger corpus.</p>

<p>The authors cited a number of FLTs in the related work. This paper would be strengthened by comparing with other approaches. I was disappointed not to see a simple baseline such as tf-idf included. At minimum, the choice of LDA should be justified. Is it the current most accurate FLT out there? Is it the most similar to DV?</p>

<p>Key points:</p>

<ul>
  <li>+ results of study intriguing for future FLT investigation (important for community)</li>
  <li>+ clearly written</li>
  <li>- stronger case if other FLTs included</li>
</ul>

<p>Specific comments:</p>

<p>I applaud the author’s use of the standard IR measure of mean reciprocal rank (MRR) to evaluate their proposed FLT. However, the authors incorrectly attribute the definition of MRR to Poshyvanyk, et al. It would be more appropriate to say something like: “Similar to the study by Poshyvanyk, et al., we use MRR to compare…”</p>]]></content><author><name></name></author><category term="reviews" /><category term="feature location" /><category term="open science" /><category term="doc2vec" /><category term="word2vec" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[I’ve been blessed with a second publication at ICSME’15! This one is an Early Research Achievements (ERA) track paper on using Gensim’s Doc2Vec for feature location.]]></summary></entry><entry><title type="html">Reviews for “Modeling Changeset Topics for Feature Location”</title><link href="https://christop.club/2015/07/03/reviews-for-changeset-feature-location/" rel="alternate" type="text/html" title="Reviews for “Modeling Changeset Topics for Feature Location”" /><published>2015-07-03T00:00:00+00:00</published><updated>2015-07-03T00:00:00+00:00</updated><id>https://christop.club/2015/07/03/reviews-for-changeset-feature-location</id><content type="html" xml:base="https://christop.club/2015/07/03/reviews-for-changeset-feature-location/"><![CDATA[<p>Here are the set of reviews for my <a href="http://www.icsme.uni-bremen.de/">ICSME’15</a>
main track paper! Unfortunately, this bad boy was initially rejected from
<a href="http://www.saner.polymtl.ca/">SANER’15</a>, but we made many changes to the paper
and it got in at ICSME. You can find a link to the PDF, code, and everything
else in my <a href="/publications">publications</a>.</p>

<h1 id="saner-rejection">SANER Rejection</h1>

<h2 id="reviewer-1">Reviewer 1</h2>

<p>This paper proposes to apply topic-modelling-based information retrieval
techniques (i.e., LDA and LSI) for feature location from the incremental
changesets of source code. Because an online learning algorithm based on
changesets is adopted, it is not necessary to retrain frequently to obtain
updated topic models. The authors further conduct an evaluation on 14 open
source Java projects to show the feasibility and effectiveness of the
changesets approach.</p>
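<p><em>Aside: the “online” idea the reviewer summarizes is that the model folds in each changeset as it arrives instead of re-processing the whole snapshot. This toy sketch only maintains term counts incrementally (it is not LDA or LSI), but it illustrates the update pattern:</em></p>

```python
from collections import Counter

class OnlineCorpusModel:
    """Toy stand-in for an online topic model: term counts are updated
    one changeset at a time rather than recomputed from a full snapshot."""

    def __init__(self):
        self.counts = Counter()

    def update(self, changeset_terms):
        # Fold in only the terms touched by this commit.
        self.counts.update(changeset_terms)

model = OnlineCorpusModel()
history = [
    ["diagram", "redraw", "widget"],  # commit 1
    ["redraw", "buffer"],             # commit 2
]
for changeset in history:
    model.update(changeset)
# model.counts now reflects both commits without revisiting commit 1's files.
```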

<p>Overall, this paper presents an interesting idea of using changesets for better
feature location. Although LDA and LSI have been widely investigated in feature
location domain, it is innovative to use the changesets from the version
control system (e.g., SVN or Git) for feature location. The approach of
modelling changeset topics is originally from reference [7]. This paper’s
contributions mainly lie in the application of the approach of modelling
changeset to feature location problem.  The evaluation also seems to be solid.
The authors publish the experiment data for public review. In the evaluation,
it is good to use Wilcoxon signed-rank test with Holm correction to determine
the statistical significance of the difference between results from LDA and
LSI. However, as the authors mention in the evaluation, few of the evaluated
systems presented a statistically significant difference between the
snapshot-based approach and the changeset-based approach.</p>
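<p><em>Aside: the Holm correction mentioned above adjusts p-values when many hypothesis tests are run together. A small pure-Python sketch of the step-down adjustment (the example p-values are made up):</em></p>

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for k, idx in enumerate(order):
        adj = min(1.0, (m - k) * pvalues[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])
```

<p><em>An adjusted value is compared against the usual 0.05 threshold just like a raw p-value.</em></p>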

<p>The following issues need to be clarified: First, in this paper the authors use
commit messages in Git and SVN as the representation of a changeset in a version
control system. Although the information among multiple versions of the project
is used, the paper still focuses on feature location in a single version of the
product. My concern is that if a feature that needs to be located is involved
in several changes, how well can the proposed approach handle it? The authors
may also want to show the effectiveness of the approach for features that are
relevant and irrelevant to commit messages.</p>

<p>Second, the authors may also need to relate their work to the work on feature
location on multiple versions of products. They may refer to the following
literature and discuss the application of modelling changeset topics for
feature location in multiple versions.</p>

<p>Yinxing Xue, Zhenchang Xing, Stan Jarzabek: Feature Location in a Collection of
Product Variants. WCRE 2012: 145-154</p>

<p>Third, for evaluation, the authors may try different parameter setups and
measures of retrieval accuracy. Currently, the number of topics is set to 500.
Actually, 500 topics may work well for normal documents based on natural
language like English (see S.T. Dumais. LSI meets TREC: A status report, in
Proceeding of Text Retrieval Conference, pp. 137-152. 1992), but a larger
number of topics may be preferred for information retrieval on source code,
considering the greater number of identifiers in source code. With regard to
the measures, the authors only use the mean reciprocal rank (MRR). The authors
may also consider some measures used in the information retrieval domain, like
Percentage of Relevant Queries (PRQ), Mean Average Precision (MAP), and Average
Percentage of Code Units Investigated (APCUI). The different measures may
reveal different aspects of the results.</p>

<p>Below are also some detailed comments on the presentation and language of the paper:</p>

<ol>
  <li>
    <p>In introduction, in the last third paragraph, “Our results show that not
only is our changeset approach feasible and practical, but in some cases
out-performs current snapshot approaches.” Here, the authors should be more
specific about the cases in which the proposed approach performs better.</p>
  </li>
  <li>
    <p>The approach section is a bit too simple. You may add more details, or merge
it with some content in Section II.A and Section II.B.</p>
  </li>
  <li>
    <p>In the fourth paragraph of section IV.C, in the first sentence, “we our
partitioning is inclusive of that commit.” should be “our partitioning is
inclusive of that commit.”</p>
  </li>
</ol>

<h2 id="reviewer-2">Reviewer 2</h2>

<p><em>Note: I had to leave this beauty verbatim…</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Review follows A.J. Smith 4/90 IEEE/Computer 

Recommendation. 
maybe

Summary and Significance. 
 What is the purpose?   Is the the problem clearly stated? 
	Incremental modeling of text-based retrieval systems for program comprehension. This is a significant goal for the SANER audience.

Is there an early description of the accomplishments? 
          No. in particular, the authors fail to mention that the method works only sometimes. 

 Is the problem new?   Using I/R for program comprehension is not; the incremental change set approach is.

     Has the design been built before?  no

     Has the problem been solved before?  no

     Is this a trivial variation on or an extension of a previous result?  no

    Is the author aware of of related work? yes

     Does the author cite previous work and make distinctions from  it?  yes

    If an implementation, are there new ideas?  yes

 Is the method of approach valid?  yes

        Is the approach sufficient for the purpose? yes

        Sufficient discussion of new ideas? no; reasons for failure to reject null hypothesis need to be clarified.

Is the actual execution of the research correct? yes

      Algorithms correct? Convincing?  yes

      Did the author do what was claimed?  no

 Are the correct conclusions drawn from the results?  no

      What are the applications/implications of the results?  I'm not sure.

      Adequate discussion of these results? 
         There is discussion of what happened; not why it happened.

 Is the presentation satisfactory? 

      Readability? yes

      Does abstract describe the paper?  please use a structured abstract.

     Does the introduction describe the problem and the framework?  yes

     Appropriate amount of detail?  yes

     Figures/tables appropriate? too many.

     Self-contained? yes
</code></pre></div></div>

<h2 id="reviewer-3">Reviewer 3</h2>

<p>The authors present an incremental topic-model approach to feature location
based on change sets. They evaluate the technique by comparing change sets,
snapshots, and temporal change sets.</p>

<p>Excellent job sharing materials and making the work replicable by others.</p>

<p>Although the writing was clear, it was difficult to follow the thread of the
research and how the study design answered the research questions. Especially
missing are the big takeaway messages — what should a researcher or
practitioner take away from this study in using change sets or snapshots for
FLT?</p>

<p>Specific comments:</p>

<p>The explanation in section III seems unclear. Intuitively, I would think the
topic model is run once on a snapshot, and then run incrementally on all the
change sets after that point (up to the commit being searched). This approach
is hinted at in the introduction (“online topic models can be instantiated once
and incrementally updated over time.“) However, the wording in the following
sentence:</p>

<p>“The changeset topic modeling approach requires two types of document
extraction: one for the snapshot of the state of source code at a commit of
interest, such as a tagged release, and one for the every changeset in the
source code history leading up to that commit.”</p>

<p>Sounds like topic modeling is run on all the changes leading up to the
snapshot. Is this the target usage scenario? Please clarify the writing to make
the target usage scenario &amp; algorithmic steps more clear. Figure 1 is a good
start, but doesn’t clearly show how the change sets are involved. Figure 1
seems to show that the topic modeler is run on the whole snapshot every time,
which I thought the purpose of the work was to avoid this?</p>

<p>I think the key insight behind the approach — “The key intuition to our
approach is that a topic model such as LDA or LSI can infer any given
document’s topic proportions regardless of the documents used to train the
model.” — needs to be expanded. Isn’t this idea one of the main contributions
of the work? A concrete example showing why this intuition is valid would help.</p>

<p>In section IV.C., the purpose of \theta_Queries is not yet clear, and it is
difficult to see how this fits into the larger study. It would be helpful if
there were a big-picture paragraph in the methodology section describing the
parts of the study and how they are used to answer the research questions
before diving into the details. For instance, in this section I don’t yet know
what temporal simulations look like, although that is one of the contributions
of the work. It seems as if someone within the research team would perfectly
comprehend section IV.C, but it is not written so that a reader familiar with
feature location can discern what is being evaluated and why when reading the
paper from beginning to end.</p>

<p>Section IV.E: “To answer RQ1, we run the experiment on the snapshot and
changeset datasets as outlined in Section IV-C. We then calculate the MRR
between the two.” What two? How does this comparison help us answer RQ1? And
then: “To answer RQ2, we run the experiment temporally as outlined in Section
IV-C” the high-level goals of the temporal experiment and how it differs from a
traditional experiment have not yet been described. Why are traceability links
important to answering the research questions? It seems that the authors had
some trouble making use of the Moreno data set. What is the advantage to
keeping it in? More replications? Why include both Tables I &amp; II, if only the
data from Table II is used in the study?</p>

<p>It seems as if some of the high-level information I’m looking for might be
partially buried in the discussion section in G, rather than being up front to
help the reader understand the design of the experiment.</p>

<p>The work of Rao et al. seems closely related. In section II, can you
differentiate why such an approach is less desirable than your proposed
approach? (or evaluate?)</p>

<p>Typos:</p>

<ul>
  <li>p. 5 C: “the process is varies slightly”</li>
  <li>p. 5 C: “we our partitioning is“</li>
  <li>conclusion: In this paper WE? conducted a study</li>
</ul>

<h2 id="the-mytical-reviewer-4">(the mythical) Reviewer 4</h2>

<p>Dear authors,</p>

<p>We would like to thank you for your submission, which has led to a lively
discussion in the program committee. The main concerns raised by the committee
pertain to:</p>

<ul>
  <li>
    <p>the paper’s claim that the proposed approach analyzes multi-versions of
changeset data, yet it seems that the paper did not really make good use of
multi-version changeset data in the proposed approach and in the evaluation.</p>
  </li>
  <li>
    <p>the fact that one of the reviewers familiar with this domain was not able to
understand the approach since the paper has multiple issues making key points
clear</p>
  </li>
</ul>

<h1 id="icsme-acceptance">ICSME Acceptance</h1>

<h2 id="reviewer-1-1">Reviewer 1</h2>

<p>The authors present a new approach in the context of feature location. They use
information available in a software configuration management system to
incrementally perform concept location, thus reducing the time to perform such
a task. I found the idea behind the authors’ proposal very interesting even if
it is not completely new in the context of software maintenance. The results
support the validity of the new approach. The paper flow is adequate even if in
some points I had some difficulties. Because of these difficulties I was not
able to be completely confident with the work done. Also, further details and
justifications could be provided by the authors in the experimental part of the
paper. All in all, I’m happy enough with the work done. It is one of the best
papers I have reviewed so far this year at ICSME.</p>

<p>In the following I’ll elaborate on the weak points I see. I hope the
authors will find them useful.</p>

<p>In the motivation part of the introduction, there are some points that seem to
contrast with each other. In particular, the authors wrote: “Indeed, given the
current state-of-the-art in TR, it is impossible for an FLT to satisfy all
three criteria while following the standard methodology.” Did Rao [10] and
Hoffman et al. [9] make a contribution to satisfying all three criteria?
Reading the paper (and the Introduction, in particular) it seems YES.</p>

<p>Online (using fold-in and fold-out) LSI has also been applied in the context of
architecture recovery. Mentioning this paper in the introduction section could
further motivate your work:</p>

<p>Michele Risi, Giuseppe Scanniello, Genoveffa Tortora: Using fold-in and
fold-out in the architecture recovery of software systems. Formal Asp. Comput.
24(3): 307-330 (2012)</p>

<p>The part where the approach is highlighted in the introduction section needs to
be rewritten because in its current form it is not easy to follow. I read that
paragraph again and again, but my comprehension level did not change: completely
unclear.</p>

<p>Please discuss [10] and [18] in more detail in the related work section. In
addition, it is not completely clear to me what the difference is between the
proposed approach and [28].</p>

<p>Regarding the experimental part of the paper, I found it very hard to understand
the methodology (especially the second paragraph). In the last paragraph, the
authors mentioned the dataset by Dit et al. Was the dataset by Moreno et al.
treated differently? Why?</p>

<p>Reading the description of the experiment, I was not able to understand whether
the authors simulated the use of GitHub. I mean, were all the applications and
the change sets in the used datasets in GitHub?</p>

<p>Last paragraph in section IV.E is not clear. I mean the place where the authors
justify why RQ2 has been studied only on one dataset.</p>

<p>In section IV.F, the authors discussed the fact that the p-value was greater
than 0.05. In particular, they wrote: “This suggests that changeset topics are
just as accurate as snapshot topics at the method-level, especially since there
is a lack of statistical significance for any of the cases.”  Since the null
hypothesis has not been rejected, the authors can only discuss descriptive
statistics. That is, it seems that the authors accept the null hypothesis, and
this is definitely incorrect.</p>

<p>A statistical test (i.e., that chosen) verifies the presence of significant
difference between two groups (in your case), but it does not provide any
information about the magnitude of such a difference (if present). The
magnitude of such a difference could be computed using a (non-parametric)
effect size measure (e.g., Cliff’s d). You could also use the average
percentage improvement/reduction.</p>
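<p><em>Aside: Cliff’s d, which the reviewer suggests here, is a non-parametric effect size: the probability that a value from one group exceeds a value from the other, minus the reverse. A minimal sketch with made-up samples:</em></p>

```python
def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y), estimated over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

d = cliffs_delta([3, 4, 5], [1, 2, 3])  # 8 of 9 pairs favor the first group
```

<p><em>Values near 0 mean negligible effect; values near ±1 mean one group almost always dominates the other.</em></p>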

<p>Why did the authors not analyze execution time?</p>

<p>In the threats to validity you should also consider biases related to the
statistical analysis performed (conclusion validity). The readability could be
improved by organizing threats into: Internal, External, Conclusion, and Construct.</p>

<p>Typing and formatting minor issues:</p>

<p>At the end of section III.C, there is (between brackets) a strange symbol.</p>

<p>Figure 2 is not comprehensible if the paper is printed in black and white.</p>

<p>Please remove orphans.</p>

<p>Section 4.B - the description of the experimental objects does not read well
as currently written.</p>

<h2 id="reviewer-2-1">Reviewer 2</h2>

<p>This paper proposes a topic-modeling-based feature location technique in which
the text retrieval model (i.e., topic model) is built incrementally from source
code history. The technique uses an online learning algorithm to train topic
models based on change sets, and thus can maintain an up-to-date model without
incurring computational cost associated with retraining traditional
snapshot-based topic models. The proposed technique has been evaluated and the
results indicate that the accuracy of the technique is similar to that of a
snapshot-based feature location technique.</p>

<p>This paper reports an interesting exploration of applying incrementally built
topic models for feature location. It has the potential of improving current
IR-based feature location methods with lower computational cost for building text
retrieval models. But I think the paper still has much room for improvement.</p>

<p>First, the motivation of the paper is not clear, and it is not well reflected
in the evaluation. It seems that the main benefit of the proposed technique is
the saving of computational cost associated with retraining traditional
snapshot-based topic models. However, there is no analysis of how much
computational cost can be saved. If the training of a snapshot-based topic
model only takes a short time (e.g., several minutes), it is acceptable that
the topic model is retrained for each release. Moreover, the saving of
computational cost is not evaluated in the experimental study.</p>

<p>Second, the proposed technique is not well described. In the section presenting
the technique (Section III), Sections III-A and III-C respectively present
terminology and the rationale for using change sets. Section III-B introduces
the proposed technique itself, and it is very short. Some important details
are missing; for example, how is the changeset corpus combined with the snapshot
corpus when training topic models? The process described in Figure 1 (B) does not
reflect the incremental manner of the proposed technique.</p>

<h2 id="reviewer-3-1">Reviewer 3</h2>

<p>The paper presents a topic-modeling-based Feature Location Technique (FLT)
where, to reduce the computational cost, the model is updated incrementally
from the changesets of commits from the project history instead of entire
snapshots. The approach is evaluated on 1,200 defects from a publicly available
dataset (14 open-source Java projects) and is shown to exhibit accuracy
not lower than the accuracy of more traditional models built on entire
snapshots. The data and source code for the analysis are provided in an online
appendix.</p>

<p>The idea is novel and the approach has potential. Not much work has addressed
the issue of incremental model building in IR based feature location (the paper
misses some related work – see below). The motivation behind building a model
incrementally is to reduce the computational cost of rebuilding a model from
every snapshot. The approach presented in the paper is sensible and the results
indicate that it is a direction worth following. However, the paper also has
several points where it needs some improvement.</p>

<p>The original motivation suggests that the changesets will update the model
incrementally. My expectation was that every changeset will be considered
separately, i.e., the model will be updated using a changeset. However, neither
of the two research questions actually evaluates the approach in that setting.
In RQ1 the changeset-based model is built using all changesets at once. In RQ2,
the changesets are grouped into partitions based on the bug report that they
are linked to, and the model is updated using a partition. The first question
here is: why group changesets, and why not update the model after each
commit? And if a grouping is to be made, why not approximate a more
realistic setting, i.e., update the model with every 10 consecutive commits,
for example? Consecutive commits will address different bugs and thus will
certainly have different topic distributions. My doubt here is to what extent
the grouping in RQ2 may have introduced a bias in the results. By the way,
the part describing the historical simulation is somewhat confusing – at least
I had to read it twice to fully understand what exactly is being done.</p>

<p>When investigating the accuracy of the models built on the changesets, the
thresholds are selected without justification or tuning. For instance, the
number of topics in all analyzed projects is fixed at 500. The paper
justifies the lack of parameter tuning with the fact that the “goal is to show
the performance of the changeset-based FLT against snapshot-based FLT under the
same conditions” and that “the measurements collected are fair and that the
results are not influenced by selective parameter tweaking”. However, poor
selection of the parameters may lead to poor results and thus unrealistic
optimism that the proposed changeset-based FLT performs as well as traditional
snapshot-based FLTs. This doubt is somewhat confirmed by the results shown in
Tables 1 and 2: the Mean Reciprocal Rank (MRR) is used to measure the
effectiveness of an FLT for a set of queries; the higher the value, the better
the result. The values for MRR shown in Tables 1 and 2 are quite low, and this
is true for both models. For example, for the project Pig v0.8.0, the MRR is ~
0.011. This score of MRR would mean that the minimum rank for a relevant class
would be on average ~ 90 (out of 442 classes in this project). The MRR reported
by Moreno et al. varies depending on the settings and type of information that
is considered but stays between 0.18 and 0.26 for the same project. This
corresponds to ranks 6 and 4 (again out of 442). Thus, the doubt here is that
the results of the snapshot-based FLT using the selected parameters are poor
and the only thing that one can conclude is that the changeset-based FLT is not
making the poor results worse. Now whether the poor results are due to the
underlying techniques (i.e., LDA and/or LSI) or only to the parameter selection
is not clear, but it is probably worth investigating.</p>

<p>RQ1 should perhaps be rephrased as a hypothesis: “Changeset-based FLT is less
accurate than snapshot-based FLT.” The data then shows that this cannot be
confirmed.</p>

<p>Regarding RQ2, it is not clear how the accuracy’s “fluctuation” of the CFL
technique is measured as a project evolves. I do not think the MRR metric by
itself measures such fluctuation, or at least this is not explained in the
paper. The MRR only measures accuracy. I would think that series analysis on
the MRRs across time would be the way to go or other analysis of this kind.
Now, it seems that the goal was onlu to compare the accuracy when changesets
are used to incrementally update the topic model, as opposed to update the
model at once with all the changesets. Unfortunately, it is not clear whether
what the goal really is. I suggest to clarify this issue and perhaps
reformulate RQ2. After all, the main goal of the paper is to test how the CFL
would perform in a realistic environment where the model is incrementally
updated with changes in commits.</p>

<p>The paper omits the LSI results “for brevity”. If they are omitted completely,
it is best not to even mention them; at the very least, mention how they
compare with the LDA results.</p>

<p>Detailed comments:</p>

<p>p1:</p>

<ul>
  <li>“By training an online learning algorithm using changesets, the FLT maintains
an up-to-date model without incurring the non-trivial computational cost
associated with retraining traditional FLTs.”: As shown in Fig. 1 the
snapshots are still used for indexing. Thus, the computational cost is saved
in the process of building the topic model. What exactly is the saved
computational cost? To better motivate the paper, I recommend giving a
citation or an example of how long it takes to create a topic model for a
large system such as Eclipse using LDA. It is also a good idea to report
the cost savings of the online LDA technique compared to standard LDA.</li>
  <li>“It follows from the first two observations (1: Like a class/method
definition, a changeset has program text; 2: Unlike a class/method
definition, a changeset is immutable.) that it is possible for an FLT
following our methodology to satisfy all three of the criteria above. “: It
is not clear how the first criterion is satisfied, i.e., “(1) accurate like a
TM-based FLT”</li>
  <li>“We then used a subset of over 600 defects and features to conduct a
historical simulation that demonstrates how the FLTs perform as a project
evolves.”: Why 600?</li>
  <li>The preprocessing often includes stemming, but stemming is not mentioned
here. Later (p.6, Section IV Study) it becomes clear that no stemming is
applied, without justification of why.</li>
</ul>

<p>p2:</p>

<ul>
  <li>“Normalizing: replace each upper case letter with the corresponding lower
case letter”: Lawrie et al. use “normalization” for vocabulary normalization
(i.e., the alignment of the vocabulary found in source code with that found
in other software artifacts). See: D. Lawrie, D. Binkley, and C. Morrell.
Normalizing source code vocabulary. In Proceedings of the Working Conference
on Reverse Engineering (WCRE), pages 3-12, 2010</li>
  <li>“corpus is a set of documents (i.e., methods)”: “i.e.,” -&gt; “e.g.,”</li>
</ul>

<p>p5:</p>

<ul>
  <li>Section IV.C. (Methodology) can be broken down into subsections based on the
RQs.</li>
  <li>To answer RQ2 (Does the accuracy of a changeset-based FLT fluctuate as a
project evolves?), the paper describes the so-called historical simulation
where commits are related to each query (or issue) and partitions of
mini-batches of changesets are created. The model is then updated using a
mini-batch. An index of topic distributions with the snapshot corpus is then
inferred. I don’t understand why for the historical simulation, commits are
grouped into partitions of mini-batches instead of updating the model after
every commit.</li>
  <li>“on all documents extracted.” -&gt; extracted documents</li>
</ul>

<p>p6:</p>

<ul>
  <li>The paragraph starting with “After extracting tokens, we split … “ is not
needed. The preprocessing, except the stemming, is already explained in
Section II.A.</li>
  <li>Thresholds are missing justifications: K, the number of topics, is chosen to
be 500; the two parameters that control how much influence a new mini-batch
has on the model when training are 1024 and 0.9. No justification is given
for the selected values. What are the values selected in related work?</li>
</ul>

<p>p10:</p>

<ul>
  <li>Ref. [2]: the publication date is 2013.</li>
  <li>The references should be consistent. For example, the venue of the references
7, 17, 19 and 20 have the following form: “Software Engineering, IEEE
Transactions on”; instead of “IEEE Transactions on Software Engineering”.</li>
</ul>

<p>Missing references to related work:</p>

<p>Hsin-yi Jiang, Tien N. Nguyen, Carl K. Chang, and Fei Dong, “Traceability Link
Evolution Management with Incremental Latent Semantic Indexing”, in Proceedings
of the 31st IEEE International Computer Software and Applications Conference
(IEEE COMPSAC 2007), pages 309-316, July 24-27, 2007</p>

<p>Hsin-yi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang,
“Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution
Management”, in Proceedings of the 23rd ACM/IEEE International Conference on
Automated Software Engineering (ACM/IEEE ASE 2008), September 15-19, 2008</p>

<p>Ratanotayanon, Sukanya, Hye Jung Choi, and Susan Elliott Sim. “Using transitive
changesets to support feature location.” Proceedings of the IEEE/ACM
International Conference on Automated Software Engineering, 2010</p>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="changesets" /><category term="feature location" /><category term="open science" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[Here are the set of reviews for my ICSME’15 main track paper! Unfortunately, this bad boy was initially rejected from SANER’15, but we made many changes to the paper and it got in at ICSME. You can find a link to the PDF, code, and everything else in my publications.]]></summary></entry><entry><title type="html">papers.bib or: How I Learned to Stop Worrying and Love the Reference Manager</title><link href="https://christop.club/2015/05/23/papers-bib/" rel="alternate" type="text/html" title="papers.bib or: How I Learned to Stop Worrying and Love the Reference Manager" /><published>2015-05-23T00:00:00+00:00</published><updated>2015-05-23T00:00:00+00:00</updated><id>https://christop.club/2015/05/23/papers-bib</id><content type="html" xml:base="https://christop.club/2015/05/23/papers-bib/"><![CDATA[<p>I recently completed and passed my phd thesis proposal. During my time
struggling to get myself together and organized, I gave up on trying to manage
a BibTeX file by hand. Here, I’m going to describe the software and strict
workflow I’ve been using to manage a single thesis bibliography, <code class="language-plaintext highlighter-rouge">papers.bib</code>.</p>

<p>My criteria were the following:</p>

<ol>
  <li>Cross-platform</li>
  <li>Open source as heck</li>
  <li>Keeps all machines in sync</li>
  <li>Easy to restore an old version</li>
  <li>On-demand opening/viewing of the source PDF</li>
  <li>Usable <em>offline</em></li>
</ol>

<h1 id="software">Software</h1>

<p>I use three primary programs to manage my <code class="language-plaintext highlighter-rouge">papers.bib</code> file:</p>

<ol>
  <li><a href="http://jabref.sourceforge.net/">JabRef</a>
    <ul>
      <li>I chose JabRef as the reference manager because it is open source,
cross-platform, and can open PDFs or URLs directly.</li>
      <li>As an added plus, it has support for plugins and a BibTeX downloader.</li>
    </ul>
  </li>
  <li><a href="http://git-scm.com/">git</a>
    <ul>
      <li>git was the obvious choice for versioning of the <code class="language-plaintext highlighter-rouge">papers.bib</code>.</li>
      <li>Easy to back up on Github/Bitbucket/whatever.</li>
    </ul>
  </li>
  <li><a href="https://syncthing.net/">Syncthing</a>
    <ul>
      <li>Syncthing was chosen because it was a solid open-source replacement for
Dropbox.</li>
      <li>Used for keeping previously downloaded PDF files in sync.</li>
    </ul>
  </li>
</ol>

<p>I can satisfy all my criteria with a combination of these three tools.</p>

<h2 id="setup">Setup</h2>

<p>The first thing I do (did?) is to start a git repository in <code class="language-plaintext highlighter-rouge">~/papers/</code>. In
this folder I place my main <code class="language-plaintext highlighter-rouge">papers.bib</code> file for JabRef to manage.</p>
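As a concrete sketch, the one-time setup amounts to a few commands. These are my own approximation of the steps described in this post (the git identity flags are only there so the snippet runs on a fresh machine):

```shell
# One-time setup for the central bibliography (a sketch, not a prescribed script).
mkdir -p "$HOME/papers"
cd "$HOME/papers"
git init -q
touch papers.bib                  # the single central bibliography JabRef manages
printf '*.pdf\n' > .gitignore     # git skips PDFs; Syncthing will sync them instead
git add papers.bib .gitignore
git -c user.name=me -c user.email=me@example.com commit -q -m "start papers.bib"
```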

<h3 id="jabref">JabRef</h3>

<p>There’s only one essential configuration option I rely on: the BibTeX key
generator. In JabRef’s preferences, I set the default pattern to be
<code class="language-plaintext highlighter-rouge">[auth.etal]_[year]</code>. You can use whatever you fancy, but be sure to use
something with file system safe characters (e.g., avoid special characters like
<code class="language-plaintext highlighter-rouge">:</code>).</p>
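As a toy illustration of what that pattern produces — this is my own approximation, not JabRef's actual key generator, and the helper name is made up:

```python
# Toy approximation of JabRef's [auth.etal]_[year] key pattern.
# (JabRef's real generator handles more edge cases; this is only a sketch.)
def bibtex_key(authors, year):
    """Surname of the first author, '.etal' if there are coauthors, then the year."""
    surname = authors[0].split()[-1]       # last word of the first author's name
    suffix = ".etal" if len(authors) > 1 else ""
    return f"{surname}{suffix}_{year}"

print(bibtex_key(["Christopher Corley", "Nicholas Kraft"], 2015))  # Corley.etal_2015
print(bibtex_key(["Ada Lovelace"], 1843))                          # Lovelace_1843
```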

<h3 id="syncthing">Syncthing</h3>

<p>I set up <code class="language-plaintext highlighter-rouge">~/papers</code> for syncing within Syncthing. In Syncthing, you can set up
<em>ignore patterns</em> for files it should ignore during sync. I use these:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.git
.git*
papers.bib
</code></pre></div></div>

<p>This means it will skip git-related stuff and the main <code class="language-plaintext highlighter-rouge">papers.bib</code>, but <code class="language-plaintext highlighter-rouge">git</code>
will take care of all of those.</p>

<h3 id="git">git</h3>

<p>Likewise, I set git up to ignore the PDF files by placing this into the
<code class="language-plaintext highlighter-rouge">.gitignore</code>: <code class="language-plaintext highlighter-rouge">*.pdf</code>. Syncthing will be managing those PDFs.</p>

<h1 id="workflow">Workflow</h1>

<p>When I run across a paper I want to cite, or even <em>read</em>, I download all of the
following:</p>

<ol>
  <li>
    <p>BibTeX: sometimes I copy in an entry directly from the web source, and other
times I use the JabRef web search &amp; downloading feature. <em>Very rarely</em> have
I needed to make entries manually. Manual entries are usually theses or
books.</p>
  </li>
  <li>
    <p>DOI/URL: All papers must have the DOI or URL. If I by chance lose a PDF, I
can find the original source again.</p>
  </li>
  <li>
<p>PDF: Every paper must have the PDF associated with it. The only exceptions
are library sources, for which, <em>if possible</em>, I make sure the URL points to
the library’s online catalog entry or a place to buy it online (Amazon).</p>
  </li>
</ol>

<p>Once the BibTeX is in JabRef, the first thing I make sure to do is insert the
DOI/URL if it doesn’t already have one. One thing I noticed is that sites like
<a href="http://ieeexplore.ieee.org">IEEEXplore</a> don’t include the DOI in the downloaded BibTeX, but list it on
the paper’s web page. I make sure to grab that.</p>

<p>The second thing I do is attach the PDF file. If you right click an entry,
“Attach file” will be in the menu. Normally, the downloaded PDF name is
horrible and gross. Hence, I use a handy plugin to help with that.</p>

<h2 id="renamefile">renameFile</h2>

<p>There’s a plugin that is critical for my JabRef use: <a href="https://github.com/korv/Jabref-plugins">renameFile</a>.</p>

<p>renameFile comes with two configuration options: folder and name pattern. I use
both, leaving the “folder” blank (i.e., it uses whatever directory <code class="language-plaintext highlighter-rouge">papers.bib</code>
is in to place PDFs). Because of the way we’ve configured JabRef’s key
generation option, I leave the name pattern as <code class="language-plaintext highlighter-rouge">[bibtexkey]</code>.</p>

<p>After attaching a file, I simply hit “rename” in the plugin window, verify the
file is being renamed as expected, and I’m done. It will rename the PDF file to
match the BibTeX key and move the file to the <code class="language-plaintext highlighter-rouge">~/papers/</code> folder. Yay!</p>

<h2 id="git-commit">git commit</h2>

<p>After I finish up adding new sources or finish writing for the day, I make sure
to check in the <code class="language-plaintext highlighter-rouge">papers.bib</code> file into git. When committing, <strong><em>I always check
the <code class="language-plaintext highlighter-rouge">git diff</code> to make sure nothing was removed, only added</em></strong>. That last bit
is critical, because it can tell you when something is amiss. I also push the
changes to a public-facing <a href="https://github.com/cscorley/papers">Github repository</a>.</p>
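That “additions only” check can be scripted; here is a self-contained sketch of the idea in a scratch repository (the entry keys are made up, and the identity flags just let it run anywhere):

```shell
# Verify that a day's edits to papers.bib only add lines, never remove them.
repo="$(mktemp -d)"
cd "$repo"
git init -q
printf '@misc{corley_2014,}\n' > papers.bib
git add papers.bib
git -c user.name=me -c user.email=me@example.com commit -q -m "start"
printf '@misc{corley_2015,}\n' >> papers.bib          # today's new entry
removed=$(git diff -- papers.bib | grep -c '^-[^-]')  # removed lines (skips the '---' header)
echo "lines removed: $removed"                        # 0 means nothing was lost
```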

<h1 id="writing-and-collaboration">Writing and collaboration</h1>

<p>Having a single, dedicated <code class="language-plaintext highlighter-rouge">papers.bib</code> comes with one major caveat when trying
to collaborate: people are going to insert things into the <em>working
bibliography</em>, hence breaking the workflow of the <em>central bibliography</em>
entirely! I’m not sure there’s much to do about that, but here’s my current
workaround.</p>

<p>Each paper I work on has its own separate git repo. I always merge in my
<code class="language-plaintext highlighter-rouge">papers.bib</code> file as the “main” source and check it into git. That means <em>git
is managing two separate versions of the “same file” in two separate repos</em>,
which can certainly be confusing. Luckily, <code class="language-plaintext highlighter-rouge">diff</code> makes it easy to determine
the differences between the working and central bibliographies.</p>

<p>Whenever someone makes a change to the working bibliography, I make sure to
<em>immediately</em> merge the new entries into my central bibliography by following
the workflow I describe above. If it is going to be in a paper with my name in
it, I am going to have it for future reference. I do this by literally checking
<code class="language-plaintext highlighter-rouge">diff -us ~/papers/papers.bib path/to/collab/papers.bib</code> manually every time I
begin writing. I know, this part sucks. You could also make sure by checking
<code class="language-plaintext highlighter-rouge">git whatchanged</code> after a <code class="language-plaintext highlighter-rouge">git pull</code>.</p>

<p>After the new changes are merged into the central bibliography, I overwrite the
working one with the central one. This ensures I can see whenever a change is
introduced after I add in the DOI, URL, or PDF fields.</p>
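Mechanically, the round trip is just compare, merge, overwrite. A self-contained sketch with throwaway files standing in for the central and working bibliographies (the entry keys are invented):

```shell
# Compare the central and working bibliographies, merge, then overwrite.
tmp="$(mktemp -d)"
printf '@misc{a,}\n@misc{b,}\n' > "$tmp/central.bib"   # central: entries a, b
printf '@misc{a,}\n@misc{c,}\n' > "$tmp/working.bib"   # a collaborator added c
diff -u "$tmp/central.bib" "$tmp/working.bib" || true  # 1) spot the new entry c
printf '@misc{c,}\n' >> "$tmp/central.bib"             # 2) merge c into central (done by hand in JabRef)
cp "$tmp/central.bib" "$tmp/working.bib"               # 3) central overwrites working
```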

<h1 id="summary">Summary</h1>

<p>I know that seems like a lot of work – oh, it is – but trust me, it becomes
so much easier to use after it is set up and working.  Be vigilant in
maintaining it and future you will thank you for having a central source for
the references, along with links and PDFs.</p>

<p>One immediate need I’ve noted has to do with collaboration. While the workflow
worked really well for my proposal as I was the only one working on it,
collaboration immediately exposed flaws. For now, I’m manually working around
this limitation.</p>]]></content><author><name></name></author><category term="writing" /><category term="references" /><category term="bibliography" /><category term="bibtex" /><category term="jabref" /><summary type="html"><![CDATA[I recently completed and passed my phd thesis proposal. During my time struggling to get myself together and organized, I gave up on trying to manage BibTeX file by hand. Here, I’m going to describe the software and strict workflow I’ve been using to manage a single thesis bibliography, papers.bib.]]></summary></entry><entry><title type="html">My reviews from ICMSE2014 Tool track</title><link href="https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools/" rel="alternate" type="text/html" title="My reviews from ICMSE2014 Tool track" /><published>2015-01-09T00:00:00+00:00</published><updated>2015-01-09T00:00:00+00:00</updated><id>https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools</id><content type="html" xml:base="https://christop.club/2015/01/09/my-reviews-from-icsme2014-tools/"><![CDATA[<p>This past year, I had the privilege to serve on the
<a href="http://www.icsme.org/">ICSME2014</a> Tool demo track.</p>

<p>Of the four papers I helped review, two were accepted. Here are those reviews.</p>

<h2 id="paper-1">Paper 1</h2>

<p><em><a href="http://dx.doi.org/10.1109/ICSME.2014.110">Context-sensitive Code Completion Tool for Better API Usability</a></em></p>

<p>By Muhammad Asaduzzaman, Chanchal K. Roy, Kevin Schneider and Daqing Hou.</p>

<ul>
  <li>Overall: 3 (strong accept)</li>
  <li>Confidence: 3 (medium)</li>
</ul>

<p>This paper presents a tool for code completion. In particular, it builds
a model of common patterns of API usage and uses the context of the code
currently being written to find a similar pattern for suggestions.
The benefits of this model are that the autocompletion is quick and that it
can recommend without needing to know what the developer is looking for
(e.g., any method starting with a typed letter).</p>

<p>Suggestions for improvement:</p>

<ul>
  <li>References 10-13 would be better off as footnote URLs.</li>
  <li>There is a bad citation at the top of the second column of the first page.</li>
</ul>

<p>Overall, this paper is clean and straightforward. I like the use of the
context of the code currently being written. While the demo video was
geared toward code being written for the first time, I wonder how it
performs in a maintenance context.</p>

<h2 id="paper-2">Paper 2</h2>

<p><em><a href="http://dx.doi.org/10.1109/ICSME.2014.107">Reviewer Recommender of Pull-Request in GitHub</a></em></p>

<p>By Yue Yu, Huaimin Wang, Gang Yin and Charles Ling.</p>

<ul>
  <li>Overall: -1 (weak reject)</li>
  <li>Confidence: 4 (high)</li>
</ul>

<p>This paper presents a tool for automatically recommending code reviewers
to pull requests (PR) on Github. A reviewer is considered as anyone that
has commented on a PR in the past. Using past PRs, they combine the
semantic similarity of the text of the new PR and the social network of
developers of previous PRs. The semantic similarity is a simple VSM.
The social network is built by extracting developer mentions in the
comments. They report on a study of several popular Github projects,
reaching 0.74 precision and 0.71 recall for top-1 and top-10
recommendation, respectively.</p>

<p>Problems:</p>

<ul>
  <li>
    <p>In the approach, what stemmer is used?</p>
  </li>
  <li>
    <p>What are the list of stopwords?</p>
  </li>
  <li>
    <p>It is unclear if developers commenting on their own PR are included.
Several projects use Github PRs as a code review tool, and
a conversation occurs between contributors, including the PR
requester. Including or excluding the original requester based on
their developer status at the time may affect the results.</p>
  </li>
  <li>
    <p>It is unclear exactly how the recommendation from the vector space
model is combined with the social network. Is more weight put into
the semantic similarity or the network? Subsection 3-D, reviewer
recommendation, needs elaboration. It is a key factor to how the
approach works.</p>
  </li>
  <li>
    <p>I could not find a way to download and use the tool on the given
website. How do I run this on my own projects? The website presented
seems mostly like a browser for output of the actual tool.</p>
  </li>
  <li>
    <p>The demo video seems more of a presentation than a demo. Perhaps this
is due to my previous bullet point.</p>
  </li>
</ul>

<p>Overall, I think the approach is interesting. But I don’t see how I can
apply this tool on other projects.</p>]]></content><author><name></name></author><category term="reviews" /><category term="icsme" /><category term="open science" /><summary type="html"><![CDATA[This past year, I had the privilege to serve on the ICSME2014 Tool demo track.]]></summary></entry><entry><title type="html">Reviews from MUD2014</title><link href="https://christop.club/2015/01/09/reviews-from-mud2014/" rel="alternate" type="text/html" title="Reviews from MUD2014" /><published>2015-01-09T00:00:00+00:00</published><updated>2015-01-09T00:00:00+00:00</updated><id>https://christop.club/2015/01/09/reviews-from-mud2014</id><content type="html" xml:base="https://christop.club/2015/01/09/reviews-from-mud2014/"><![CDATA[<p>To keep up with practicing some <a href="http://en.wikipedia.org/wiki/Open_Science">open
science</a>, here are the reviews to
the MUD’2014 paper I “recently” published.</p>

<p>You can find a link to the PDF, code, slides, and talk in my
<a href="/publications">publications</a>.</p>

<h2 id="review-1">Review #1</h2>

<p>This paper describes an evaluation of the inputs to LDA topic models.  Topic models are a very valuable tool in software engineering research, and too often they are used without much configuration.  This paper presents a study of an aspect of this configuration, to help other researchers: whether change sets or “snapshots” produce more-distinct topics and use the same vocabulary.</p>

<p>The authors found mixed results, in that the changesets did seem to result in more-distinct topics for 2 systems, but in the other 2 systems, there were no noticeable differences.  Likewise, the vocabulary used in the changesets was measurably different than the snapshots.</p>

<p>While the scale of the study is small, and the results somewhat mixed, the paper does have the capacity to cause good discussion at the workshop, given the importance of topic models in SE.  For example, SE researchers can take guidance from this paper that it may be necessary to try both changesets and snapshots.</p>

<p>The chief improvement to this paper would be to increase the number of programs studied.  With more systems, it might be possible for the paper to recommend one or the other dataset more strongly.</p>

<h2 id="review-2">Review #2</h2>

<p>Summary:
The paper investigates whether topics extracted from changesets are different from topics extracted from snapshots. The study was performed on four systems, and the authors exploited LDA to extract topics. Results are somewhat inconsistent across the four object systems.</p>

<p>Evaluation:
The paper is well written and easy to follow. The posed research questions make sense, and the paper’s topic is certainly of interest to the MUD audience. However, I am not sure what I can learn from such a paper.</p>

<p>I mean, I cannot understand how the findings reported in the paper can be used in any SE application or can impact the way of conducting empirical SE studies. The authors should spend some words (during the results discussion and the conclusions) explaining why their findings are of interest to the research community. For instance, what should I learn from the fact that the cosine distance between the two corpora (i.e., changeset and release) is very small for three out of the four systems? Does PostgreSQL have something special? The authors could remove Figure 3 (not useful at all) and use the saved space to better present and discuss the implications behind their findings.</p>

<h2 id="review-3">Review #3</h2>
<p>Desc.:</p>

<p>Most bug localization, feature location, and link traceability studies extract topics from one snapshot of a software repository. Rather than extracting topics from one snapshot, another alternative is to extract topics from the differences (lines added and lines removed) between two consecutive revisions in a repository. The paper extracts topics this way and evaluates the quality of the resultant topics using the concept of topic distinctness. To extract changeset topics, several steps are performed: first, git diff is used to get the changeset; second, tokens are extracted and split based on camel case, underscores, and non-letters; third, stop words are removed; finally, the documents are input to an LDA implementation (Gensim’s LDA). An experiment on changeset corpora from 4 systems (Ant, AspectJ, Joda-Time, and PostgreSQL) has been performed. The experiment shows that for two of the systems, the words that appear in a changeset corpus are similar to words that appear in a corpus extracted from one snapshot of a software repository (release corpora). Furthermore, for two out of the four systems, the topics extracted from a changeset corpus have higher topic distinctness scores than topics extracted from a release corpus.</p>

<p>Pros:</p>

<ul>
  <li>The paper analyzes 4 software systems and compares the topics extracted from changeset corpus and release corpus using topic distinctness.</li>
  <li>Experiment shows that at least for some software systems word distribution in a changeset corpus are rather different than word distribution in a release corpus (cosine distance of 0.3 or higher).</li>
  <li>Experiment shows that in two software systems the topic distinctness score of topics extracted from a changeset corpus is higher than the topic distinctness score of topics extracted from a release corpus.</li>
</ul>

<p>Comments for Improvement:</p>

<ul>
  <li>
<p>It seems Thomas et al. have also modelled changeset topics before (Reference [3]). It is not clear what the differences are between Thomas et al.’s approach and the proposed approach. The paper states: “we find similar topic distinctness scores” and “our approach is feasible, as it captures distinct topics while not needing post-processing and is always up-to-date with the source code repository”. What kind of post-processing was performed by Thomas et al.’s approach that is not performed by the proposed approach? Is it bad to perform post-processing? Can’t Thomas et al.’s approach generate topics that are up-to-date with a source code repository? Please elaborate more. If the technical difference between the paper and Thomas et al.’s approach is small, it is better to reposition the paper as a replication study. It seems the paper investigates more systems, and the findings provide additional insights not provided by Thomas et al.’s paper.</p>
  </li>
  <li>
    <p>It will be good to add some additional details to the paper to answer the following questions:</p>
  </li>
  <li>(Section IIID) Does a higher topic distinctness score indicate a better set of extracted topics? Please elaborate more.</li>
  <li>(Section IIIE) After the encoding errors are removed, are the words that appear in a release corpus always the same as the words that appear in a changeset corpus? Why do the encoding errors only affect one of the corpora but not both?</li>
  <li>(Section IIIE) Please explain more how cosine distance is computed. Cosine similarity is well known, but cosine distance is not so well known.</li>
  <li>(Section IIIE) Please provide more insight on the cosine distance scores. Is a cosine distance of 0.00396 good or bad? Why do some systems have a much higher cosine distance score than others (e.g., 0.33957 vs. 0.00396)?</li>
  <li>
    <p>(Section IIIE) “Ant and PostgreSQL have drastically more documents in their respective change set corpora than Joda-Time and AspectJ” It is good to also mention how many documents are in the change set corpus of each of the four systems.</p>
  </li>
  <li>
    <p>There are many other studies that use topic modelling for software maintenance; it will be good to add them to the related work section especially those that use topic model for bug localization, feature location, or traceability link recovery which is the motivation of the work (as stated in the abstract), e.g.,:</p>
  </li>
  <li>Stacy K. Lukins, Nicholas A. Kraft, Letha H. Etzkorn: Bug localization using latent Dirichlet allocation. Information &amp; Software Technology 52(9): 972-990 (2010)</li>
  <li>Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, Chengnian Sun: Duplicate bug report detection with a combination of information retrieval and topic modeling. ASE 2012: 70-79</li>
  <li>Tien-Duy B. Le, Shaowei Wang, David Lo: Multi-abstraction Concern Localization. ICSM 2013: 364-367</li>
</ul>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="mining unstructured data" /><category term="open science" /><category term="lda" /><category term="topic models" /><summary type="html"><![CDATA[To keep up with practicing some open science, here are the reviews to the MUD’2014 paper I “recently” published.]]></summary></entry><entry><title type="html">Reviews from MSR2014</title><link href="https://christop.club/2014/07/16/reviews-from-msr2014/" rel="alternate" type="text/html" title="Reviews from MSR2014" /><published>2014-07-16T00:00:00+00:00</published><updated>2014-07-16T00:00:00+00:00</updated><id>https://christop.club/2014/07/16/reviews-from-msr2014</id><content type="html" xml:base="https://christop.club/2014/07/16/reviews-from-msr2014/"><![CDATA[<p>I’ve been reviewing some papers for the <a href="http://icsme.org/">ICSME 2014</a>
tool demo track, and it occurred to me that I could post my own
reviews from previous published papers.
This will (hopefully) offer some insight to fledgling researchers
(cough cough, me) on what a short paper review
would roughly contain.</p>

<p>So, here goes.</p>

<p>“New Features for Duplicate Bug Detection” was a study conducted
by an <a href="http://reu.cs.ua.edu">REU</a> student over the summer of 2013,
with mentoring and guidance from <a href="http://nkraft.cs.ua.edu/">Dr. Kraft</a> and myself.
Here is a link to the <a href="http://cscorley.students.cs.ua.edu/publications/pdfs/Klein-etal_14.pdf">preprint [PDF]</a>.
We submitted this to <a href="http://2014.msrconf.org/">MSR 2014</a> short paper
track, and it was accepted.</p>

<p>Below are the three reviews this paper received.
Note that these reviews were for the submission, and these
comments were geared toward that copy.
I no longer have that submitted copy anywhere that I can find,
but these reviews should give you an idea.</p>

<p><em>Note: I’ve slightly modified these with whitespace so that they render in markdown</em></p>

<h2 id="review-1">Review #1</h2>

<p>Summary: The paper proposes a technique that predicts if a pair of bug reports is a duplicate pair or not. It extends the previous work by Alipour et al. by introducing additional features that are based on the differences in the words, topics, priority, reporting time, and components of two bug reports. Several machine learning algorithms from Weka have been used to investigate the effectiveness of the proposed features. Experiments have been performed on the same Android bug report dataset as Alipour et al. The results of the experiments show that the proposed features could improve the result of Alipour et al.’s method by 3.33%, 7.24%, and 11.76% in terms of Accuracy, AUC, and Kappa.</p>

<p>Recommendation: Weak Accept</p>

<p>Pros:</p>

<ul>
  <li>A number of new features have been proposed. These features capture differences between two bug reports in terms of their words, topics, priority, reporting time, and component.</li>
  <li>Experiments using 6 classifiers have been conducted to demonstrate the value of the proposed features.</li>
  <li>The experiments on the Android datasets show that the proposed approach could improve Alipour et al.’s approach by 3.33%, 7.24%, and 11.76% in terms of Accuracy, AUC, and Kappa.</li>
</ul>

<p>Suggestion for improvement:</p>

<ul>
  <li>
    <p>Reference [2] is not the paper referred to by Alipour et al. It should be changed to:</p>

    <p>Chengnian Sun, David Lo, Siau-Cheng Khoo, Jing Jiang: Towards more accurate retrieval of duplicate bug reports. ASE 2011: 253-262</p>

    <p>Reference [2] is related to your proposed approach though since it also uses topic modelling. It has not been compared with Alipour et al.’s method. Thus please refer to it too and mention the differences between your approach and the paper, e.g., difference in setting (see next comment).</p>
  </li>
  <li>
    <p>The setting that your paper considers and the setting considered by Sun et al.’s approach are different. I think there is a need to highlight the difference in the paper.</p>

    <p>In Sun et al.’s approach, the setting is: given a bug report, return a list of top-k most similar bug reports.</p>

    <p>In your approach (and Alipour et al.’s approach), the setting is: given two bug reports, predict if they are a duplicate of each other or not.</p>

    <p>Alipour et al.’s setting is first considered in the following paper:</p>

    <p>David Lo, Hong Cheng, Lucia: Mining closed discriminative dyadic sequential patterns. EDBT 2011: 21-32 (See Case Study section)</p>

    <p>This setting is actually easier, since it is easier to differentiate between “two completely random bug reports” and “duplicate bug reports”, than to differentiate between “two similar bug reports that are not duplicate of each other” and “two similar bug reports that are duplicate of each other”.</p>
  </li>
  <li>Please describe more about the evaluation metrics (i.e., Accuracy, AUC, and Kappa). In particular, please describe Kappa since it is not a very frequently used metric.</li>
  <li>Please add a related work section that more comprehensively describes work in the area of duplicate bug report detection.</li>
  <li>“Alipour et al” =&gt; “Alipour et al.”</li>
  <li>“International Workshop on Mining Software Repository” =&gt; “Working Conference on Mining Software Repositories”</li>
  <li>Weimar et al =&gt; Please add a reference …</li>
  <li>I think a better title could have been: “New Features for Duplicate Bug Detection”.</li>
</ul>

<p>In general, I have no major concern with the paper. The writing could be improved in a number of ways though. There is still one more page that the authors can use to improve the writing.</p>

<h2 id="review-2">Review #2</h2>

<p>The paper proposes a new set of features to identify duplicate bugs. The efficacy of these new metrics/features is evaluated using 6 machine learning algorithms from Weka. The paper builds on the work of Alipour et al. and uses the same Android bug dataset. The experiments indicate that these new features result in an improvement in accuracy compared to Alipour et al.’s for all 6 learners considered.</p>

<p>Though the paper is not significantly novel, the idea of considering the first shared identical topic seems new. The results, at least for the Android data set, seem encouraging.</p>

<p>That said, it is generally necessary to evaluate a new metric rigorously and on several benchmark data sets before we claim that the metric is better. Since using shared identical topics seems to make sense intuitively, this is OK for a short paper.</p>

<p>Few suggestions:</p>

<ol>
  <li>
    <p>Since you have 1 additional page, and you use the same data set as Alipour et al., it would be good to show some examples of pairs of bug reports that are actually duplicates and could not be detected by Alipour et al.’s approach but were detected using your new metrics. And also vice-versa. This will also help describe your new metrics in more detail with examples.</p>
  </li>
  <li>
    <p>Ideally, one would want to know the efficacy of each <em>individual</em> metric. Which of your metrics would have the best performance? Can we rank them?</p>
  </li>
  <li>
    <p>There is a problem with Table III. You say that you added REPTree that Alipour et al. did not use, but then you show the performance improvement over Alipour’s metric. This needs some additional explanation.</p>
  </li>
  <li>
    <p>In Section V, you mention Weimar et al. but do not provide any reference.</p>
  </li>
  <li>
    <p>The exposition can be improved. The new metrics (especially the one that you think is very novel) should be highlighted in the introduction itself. Currently, one needs to read till page 2 to figure out that the attributes in Table 1 are the new metrics you refer to in the paper.</p>
  </li>
</ol>

<h2 id="review-3">Review #3</h2>

<p>The authors replicate a prior bug deduplication study and apply new
metrics and evaluate performance. They improve performance across a
wide range of learners and provide a new learner that works as well.
Furthermore their technique is far more generalized than the work they
replicate and thus is more automatable.</p>

<p>First and foremost, I think these incremental improvements in mining
are actually best served in short form. I think the length of this
submission is almost appropriate (although you had an extra page for,
say, better descriptions of the results or more comparison).</p>

<p>Second, they argue they have a considerable improvement over other techniques.</p>

<p>Third, they don’t make it clear in their paper, but their results all
suggest that LDA-based comparisons improve bug deduplication
performance far more than priority, time, component, or bug type,
implying that, at least in Android, the metadata is poor.</p>

<p>Questions:</p>

<ul>
  <li>
    <p>In Table III, what was REPTree compared to?</p>
  </li>
  <li>
    <p>Clarify this statement, you have space: To protect the validity of
our study, we ensured that no two pairs contained identical reports.</p>
  </li>
</ul>

<p>Issues:</p>

<ul>
  <li>
    <p>I think the wrong style was used for this submission: the expected
style is sig-alternate and you’re using something else; for instance,
numbering is in roman numerals.</p>
  </li>
  <li>
    <p>There’s an extra page to go…</p>
  </li>
  <li>
    <p>I think in Alipour et al. and in this study that the application of
KNN is inappropriate. While it works, I think it violates the triangle
inequality.</p>
  </li>
  <li>
    <p>I want to see some time and space given to describing what was in
your true positives and false negatives and true negatives and false
positives.</p>
  </li>
</ul>

<p>Conclusions:</p>

<p>I think it is a nice short replication. A little more presentation
work would be appreciated but these features are easy to calculate and
easy to integrate into any deduper framework.</p>]]></content><author><name></name></author><category term="reviews" /><category term="mining software repositories" /><category term="open science" /><category term="lda" /><category term="bugs" /><summary type="html"><![CDATA[I’ve been reviewing some papers for the ICSME 2014 tool demo track, and it occurred to me that I could post my own reviews from previous published papers. This will (hopefully) share some insight to fledgling researchers (cough cough, me) on what a short paper review would roughly contain.]]></summary></entry><entry><title type="html">Using Gensim for LDA</title><link href="https://christop.club/2014/05/06/using-gensim-for-lda/" rel="alternate" type="text/html" title="Using Gensim for LDA" /><published>2014-05-06T00:00:00+00:00</published><updated>2014-05-06T00:00:00+00:00</updated><id>https://christop.club/2014/05/06/using-gensim-for-lda</id><content type="html" xml:base="https://christop.club/2014/05/06/using-gensim-for-lda/"><![CDATA[<p>This is a short tutorial on how to use Gensim for LDA topic modeling.
What is topic modeling? It is basically taking a number of documents (news
articles, wikipedia articles, books, &amp;c) and sorting them out into different
topics. For example, documents on Babe Ruth and baseball should end up in the
same topic, while <a href="https://en.wikipedia.org/wiki/Dennis_Rodman">Dennis Rodman</a> and basketball should end up in another.</p>

<p>LDA is an extension of LSI/pLSI using some crazy statistical stuff.
Most of that will not matter to us since we aren’t implementing LDA.
One important thing to consider about LDA, however, is that it is a
<a href="https://en.wikipedia.org/wiki/Mixture_model">mixture model</a>, which is statistical mumbojumbo for “documents can be
associated with more than one topic.” That is, an article about Dennis Rodman
could be related to multiple topics: basketball, tattoos, and crazy hair colors.</p>

<p>Right now, Gensim is in the process of being ported to Python 3.
This tutorial is written for Gensim 0.9.1.
I’ll assume that you’ve got Gensim installed and working on Python 2 already.</p>

<p>Let’s start, go ahead and import gensim:</p>

<div class="in_prompt">in [1]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">import</span> <span class="nn">gensim</span></code></pre></figure>

<p>In LDA, we infer a certain number of topics from a given corpus.
I prefer the Mallet format for corpora,
namely because each document has an associated document name or id.
Other formats require you to maintain this separately with a key file,
but that’s just dumb.</p>

<p>I’ve got handy a corpus of every title (already preprocessed) of the Android
issue report database.
You can download that <a href="https://drive.google.com/open?id=0BxrXGxfAKIwfUjVuSnhSVVBTZVU">here</a>.</p>

<p>Here are the first three lines (aka the first three documents (aka the first
three issue report titles))
of the corpus file:</p>

<div class="in_prompt">in [2]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="mi">3</span> <span class="n">android</span><span class="p">.</span><span class="n">mallet</span></code></pre></figure>

<div class="output_prompt">out [2]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 en incorrect url address project
2 en good luck
3 en http proxy support
</code></pre></div></div>
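<p>As the output shows, each Mallet-format line is just a document id, a language tag, and then the preprocessed tokens. If you ever need to peek at one without Gensim, a throwaway parser is enough (a sketch, not part of Gensim’s API):</p>

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, language, tokens)."""
    parts = line.split()
    return parts[0], parts[1], parts[2:]

print(parse_mallet_line('1 en incorrect url address project'))
# prints ('1', 'en', ['incorrect', 'url', 'address', 'project'])
```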

<p>Luckily, Gensim supports reading this format directly!
So, let’s load up our corpus into something Gensim can use internally:</p>

<div class="in_prompt">in [3]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">corpus</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">corpora</span><span class="p">.</span><span class="n">MalletCorpus</span><span class="p">(</span><span class="s">'android.mallet'</span><span class="p">)</span></code></pre></figure>

<p>This might take awhile, because it is building some metadata about the corpus
itself.</p>

<p>Typically, you would use the corpus in a loop like so:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">document</span> <span class="ow">in</span> <span class="n">corpus</span><span class="p">:</span>
    <span class="n">blah</span><span class="p">(</span><span class="n">document</span><span class="p">)</span>
</code></pre></div></div>

<p>But, just for our purposes, let’s look at the first document it’s holding:</p>

<div class="in_prompt">in [4]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">corpus</span><span class="p">))</span></code></pre></figure>

<div class="output_prompt">out [4]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(6936, 1), (15314, 1), (300, 1), (10981, 1)]
</code></pre></div></div>

<p>Um, what? That doesn’t look anything like the first document from before.
That’s because this is the internal representation Gensim (and all of its
modeling algorithms) uses.
This is a document, but instead of a list of words, it is a list of tuples where
each tuple is
a <em>word id</em> and frequency pair.</p>

<p>So we can see word #6936 appears 1 time in the first document.
But what is word #6936?
Again, let’s do that crazy <code class="language-plaintext highlighter-rouge">next(iter(</code> business so we don’t end up going over
every document in the corpus.
Check this out:</p>

<div class="in_prompt">in [5]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">word_id</span><span class="p">,</span> <span class="n">freq</span> <span class="ow">in</span> <span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">corpus</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">[</span><span class="n">word_id</span><span class="p">],</span> <span class="n">freq</span><span class="p">)</span></code></pre></figure>

<div class="output_prompt">out [5]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>incorrect 1
url 1
address 1
project 1
</code></pre></div></div>

<div class="in_prompt">in [6]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="mi">1</span> <span class="n">android</span><span class="p">.</span><span class="n">mallet</span></code></pre></figure>

<div class="output_prompt">out [6]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 en incorrect url address project
</code></pre></div></div>

<p>Badass, yeah?</p>

<p><br /><br /></p>

<p>Okay, not really, that’s not very interesting.
I did something a little different here, and that’s using the <code class="language-plaintext highlighter-rouge">corpus.id2word</code>
attribute.
It’s simply a Python dictionary that maps <code class="language-plaintext highlighter-rouge">id-&gt;word</code> for all words in the
corpus.</p>

<p>Alright, let’s actually generate a model (go ahead and get a sandwich, it’ll be
a minute):</p>

<div class="in_prompt">in [7]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">LdaModel</span><span class="p">(</span><span class="n">corpus</span><span class="p">,</span> <span class="n">id2word</span><span class="o">=</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="s">'auto'</span><span class="p">,</span> <span class="n">num_topics</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'android.lda'</span><span class="p">)</span>
<span class="c1">#model = gensim.models.LdaModel.load('android.lda')</span></code></pre></figure>

<p>We can save/load the model for later use
instead of having to rebuild it every time, as shown in the comment.
As much as I enjoy sandwiches, I don’t want to do this all the time.</p>

<p>There are a couple of parameters other than the corpus that I’ve set there.
Let’s talk about those for a sec:</p>

<ol>
  <li><strong>id2word</strong>: Although you can build a model from just a corpus, I’ve gone
ahead and let the LdaModel know about the <code class="language-plaintext highlighter-rouge">corpus.id2word</code>.
It just makes some of the things I’ll show you next nicer.</li>
  <li><strong>alpha</strong>: This particular LDA implementation uses something that can
automatically update the <code class="language-plaintext highlighter-rouge">alpha</code> value for us.
This determines how ‘smooth’ the model is, which makes no damned sense if you
aren’t working in the area (it doesn’t make much sense to me).
Here’s what alpha does: as it gets smaller, each document is going to be <em>more
specific</em>, i.e., likely to be made up of only a few topics. As it gets
bigger, a document can begin to appear in multiple topics, which is what we want.
It’s not good to have a large alpha either, because then all our topics will
start intermingling and making out and that’s gross.
I have no idea how the <code class="language-plaintext highlighter-rouge">'auto'</code> setting really works, but it seems pretty legit
to me so I’ll just use that for now.</li>
  <li><strong>num_topics</strong>:
The <code class="language-plaintext highlighter-rouge">num_topics</code> parameter just determines how many topics we want the model to
give us.
I’ve used 25 here since we are only looking at a corpus of titles.</li>
</ol>
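<p>To build a little intuition for what <code class="language-plaintext highlighter-rouge">alpha</code> does, you can sample document–topic mixtures from a symmetric Dirichlet yourself using the standard normalized-gamma trick. This is a standalone illustration of the prior, not Gensim’s implementation:</p>

```python
import random

def dirichlet_sample(alpha, num_topics, rng):
    """Draw one document's topic mixture from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
sparse = dirichlet_sample(0.01, 5, rng)  # small alpha: a few topics dominate
smooth = dirichlet_sample(10.0, 5, rng)  # large alpha: topics share the mass
```

<p>With a tiny alpha nearly all of a document’s probability mass tends to land on one or two topics; with a large alpha each topic tends toward an even 1/5 share, which is the “intermingling” described above.</p>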

<p>Let’s look at a few random topics:</p>

<div class="in_prompt">in [8]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">show_topics</span><span class="p">(</span><span class="n">topics</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">topn</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span></code></pre></figure>

<div class="output_prompt">out [8]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['0.047*link + 0.027*ui + 0.018*main + 0.017*level + 0.016*locale',
 '0.107*tap + 0.047*popup + 0.045*appears + 0.031*request + 0.029*tab',
 '0.120*play + 0.096*ics + 0.084*music + 0.049*bug + 0.030*android',
 '0.106*device + 0.078*google + 0.060*talk + 0.057*voice + 0.044*icon',
 '0.191*screen + 0.055*button + 0.034*change + 0.032*page + 0.032*lock']
</code></pre></div></div>

<p>These are the top 5 words associated with 5 random topics.
The decimal number is the <em>weight</em> of the word it is multiplying,
i.e., how much does this word influence the particular topic.
The model knows how to do this because we gave it the <code class="language-plaintext highlighter-rouge">id2word</code> dictionary.
Without it, we wouldn’t be able to read this output (still).</p>
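<p>If you’d rather work with those weights programmatically than squint at the strings, they split apart easily (a quick sketch against the string format shown above):</p>

```python
def parse_topic(topic_str):
    """Turn a topic string like '0.120*play + 0.096*ics' into
    a list of (weight, word) pairs."""
    pairs = []
    for term in topic_str.split(' + '):
        weight, word = term.split('*')
        pairs.append((float(weight), word))
    return pairs

print(parse_topic('0.120*play + 0.096*ics + 0.084*music'))
# prints [(0.12, 'play'), (0.096, 'ics'), (0.084, 'music')]
```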

<p>Now, let’s do something actually useful: query the model.</p>

<p>Let’s say we would like to know which topics a certain string is most associated
with.</p>

<div class="in_prompt">in [9]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="s">'google maps broken navigation'</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">split</span><span class="p">()</span>
<span class="n">query</span></code></pre></figure>

<div class="output_prompt">out [9]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['google', 'maps', 'broken', 'navigation']
</code></pre></div></div>

<p>We query the model by indexing it with our query!
But first, we need to transform it into a representation the model understands.
We can’t just do this (yet):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">]</span>
</code></pre></div></div>
<p>That will definitely cause us some heartache, because the query is just words.
LDA technically knows nothing about the actual words, just the ids we’ve given
them.</p>

<p>So, let’s build something to translate those words back to ids and their
frequencies.
Gensim has an awesome built in way of doing this called a Dictionary.
Sure, we <em>could</em> use regular old Python <code class="language-plaintext highlighter-rouge">dict</code>s to map <code class="language-plaintext highlighter-rouge">id-&gt;word</code> and build the
<code class="language-plaintext highlighter-rouge">(word, frequency)</code> pairs ourselves,
but I’m a fancy person that enjoys fancy things.</p>
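<p>For the curious, the plain-<code class="language-plaintext highlighter-rouge">dict</code> version is only a few lines. The ids below are made up for illustration; the real mapping lives in the corpus dictionary:</p>

```python
from collections import Counter

# Hypothetical word -> id mapping; the real one comes from corpus.id2word.
word2id = {'google': 1754, 'maps': 6081, 'broken': 8441, 'navigation': 9208}

def doc2bow(words, word2id):
    """Convert a token list into sorted (word_id, frequency) pairs,
    dropping any word the mapping has never seen."""
    counts = Counter(w for w in words if w in word2id)
    return sorted((word2id[w], n) for w, n in counts.items())

print(doc2bow(['google', 'maps', 'broken', 'navigation'], word2id))
# prints [(1754, 1), (6081, 1), (8441, 1), (9208, 1)]
```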

<p>Here’s what we do:</p>

<div class="in_prompt">in [10]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">id2word</span> <span class="o">=</span> <span class="n">gensim</span><span class="p">.</span><span class="n">corpora</span><span class="p">.</span><span class="n">Dictionary</span><span class="p">()</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">id2word</span><span class="p">.</span><span class="n">merge_with</span><span class="p">(</span><span class="n">corpus</span><span class="p">.</span><span class="n">id2word</span><span class="p">)</span></code></pre></figure>

<p>This creates an empty special Dictionary, and then we merge our original corpus
dictionary into it. Whatever merge_with returns isn’t important to us, so throw
it in the Python garbage bin, underscore.</p>

<p>This doesn’t seem to gain us much, until we want to translate an entire document
into <code class="language-plaintext highlighter-rouge">(word, frequency)</code> pairs:</p>

<div class="in_prompt">in [11]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="n">id2word</span><span class="p">.</span><span class="n">doc2bow</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="n">query</span></code></pre></figure>

<div class="output_prompt">out [11]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(1754, 1), (6081, 1), (8441, 1), (9208, 1)]
</code></pre></div></div>

<div class="in_prompt">in [12]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">]</span></code></pre></figure>

<div class="output_prompt">out [12]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(3, 0.20387470260323143),
 (9, 0.35862973787398261),
 (15, 0.010585652382570768),
 (16, 0.010899567346349904),
 (18, 0.011132829837161632),
 (21, 0.22968681811101002),
 (22, 0.010344492016793241),
 (23, 0.010589823218917306),
 (24, 0.010154742173706556)]
</code></pre></div></div>

<p><em>Note: your results absolutely should differ slightly from mine, given
the probabilistic nature of the model</em></p>

<p>Awwwwww yeahhhhhhhhhhh.
Now we’re <em>cookin’ with gas</em>.</p>

<p>From this list, we have each topic and the likelihood that the <code class="language-plaintext highlighter-rouge">query</code> relates
to that topic.
So, if we sort this a little more meaningfully:</p>

<div class="in_prompt">in [13]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">model</span><span class="p">[</span><span class="n">query</span><span class="p">],</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span></code></pre></figure>

<div class="output_prompt">out [13]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(24, 0.010154742173743013)
(9, 0.35859622416422271)
</code></pre></div></div>

<p>We can see the least related and the most related topic for our query.
Let’s check out what words are most associated with those two topics.</p>

<div class="in_prompt">in [14]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">print_topic</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c1">#least related</span></code></pre></figure>

<div class="output_prompt">out [14]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0.063*apps + 0.062*wifi + 0.058*calendar + 0.044*exchange + 0.035*changing + 0.030*location + 0.027*latitude + 0.024*automatically + 0.021*event + 0.020*disappears'
</code></pre></div></div>

<div class="in_prompt">in [15]:</div>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">print_topic</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c1">#most related</span></code></pre></figure>

<div class="output_prompt">out [15]:</div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0.155*maps + 0.086*issue + 0.054*google + 0.034*android + 0.031*unlock + 0.024*books + 0.020*coming + 0.016*failed + 0.015*note + 0.013*word'
</code></pre></div></div>

<p>So, the first one looks like garbage for our query, but the second seems to be
mostly about the Google specific applications, including maps! Not the best
results, so this model’s number of topics probably needs to be a bit higher, or
<code class="language-plaintext highlighter-rouge">alpha</code> values played with until results pan out.</p>

<p>Note how our initial query only returned nine or so related topics. Didn’t we
ask for 25 of them? Well, we did, but Gensim defaults to only showing the top
ones that meet a certain threshold (<code class="language-plaintext highlighter-rouge">&gt;= 0.01</code>). Digging deeper than that is
ugly, so for now we will just deal with these results.</p>

<p>I am getting pretty tired of looking at this, so I think this will conclude the
tutorial on using Gensim’s LDA stuff for now. Go ahead and try out this code for
yourself.</p>

<p>This notebook on “Using Gensim for LDA” is available for download
<a href="/notebooks/Using Gensim for LDA.ipynb">here</a>.</p>]]></content><author><name></name></author><category term="python" /><category term="topic modeling" /><category term="gensim" /><category term="lda" /><summary type="html"><![CDATA[This is a short tutorial on how to use Gensim for LDA topic modeling. What is topic modeling? It is basically taking a number of documents (new articles, wikipedia articles, books, &amp;c) and sorting them out into different topics. For example, documents on Babe Ruth and baseball should end up in the same topic, while Dennis Rodman and basketball should end up in another.]]></summary></entry><entry><title type="html">Blogging with IPython and Jekyll</title><link href="https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll/" rel="alternate" type="text/html" title="Blogging with IPython and Jekyll" /><published>2014-02-21T00:00:00+00:00</published><updated>2014-02-21T00:00:00+00:00</updated><id>https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll</id><content type="html" xml:base="https://christop.club/2014/02/21/blogging-with-ipython-and-jekyll/"><![CDATA[<p>Lately I’ve been using <a href="http://ipython.org/">IPython</a> to do most of my tinkering work.
It’s pretty neat, to say the least.</p>

<p>I’ve seen people around the Internet using IPython as a way to blog.
I thought this would be a pretty neat way to go about things, and would
probably save a large amount of time when editing code-centric blog posts.
However, the methods I found were either outdated,
output raw HTML (usually with gross CSS conflicts),
were hacks for other blogging software, or required a plugin.</p>

<p>Since I use <a href="http://pages.github.com/">Github Pages</a> (read: <a href="http://jekyllrb.com/">Jekyll</a>) to auto-render my blog, I
decided to code up my own method.
It outputs Markdown files with the Jekyll front matter pre-filled.
This way, I can still add blog posts in the same format as before and edit
them if needed.
No plugins are required this way, either. Sure, it means converting notebooks
manually, but that’s pretty much the only way to get around the plugin issue
<em>and</em> still be able to use Github Pages.</p>

<p>Here are the files you will need to publish a notebook to Jekyll: <a href="https://gist.github.com/cscorley/9144544">https://gist.github.com/cscorley/9144544</a></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">jekyll.py</code>: This is the config file used for conversion. It should be placed
wherever the profile you are using is. Default is <code class="language-plaintext highlighter-rouge">~/.ipython/profile_default/</code></li>
  <li><code class="language-plaintext highlighter-rouge">jekyll.tpl</code>: I plop all my template files into <code class="language-plaintext highlighter-rouge">~/.ipython/templates</code>, but
put <code class="language-plaintext highlighter-rouge">jekyll.tpl</code> wherever suits you best (just be sure to change the jekyll.py
to point to that location, also)</li>
</ul>

<p>Everything will output into a folder named <code class="language-plaintext highlighter-rouge">notebooks</code>.
You can change this by replacing every instance of ‘notebooks’ in the config
with whatever you want.</p>

<p>One variable in the config, <code class="language-plaintext highlighter-rouge">BLOG_DIR</code>,
controls where the markdown and any support files the notebook needs
are generated.
It reads from the environment variable of the same name, so you will need to
export <code class="language-plaintext highlighter-rouge">$BLOG_DIR</code> to use
this script as-is.
If you just want files to land in the current directory, change it in the
config to an empty string.</p>
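<p>If you’re curious, the heart of what <code class="language-plaintext highlighter-rouge">jekyll.py</code> does with this variable boils down to something like the following sketch (a sketch only, not the actual config; the <code class="language-plaintext highlighter-rouge">NOTEBOOK_DIR</code> name is illustrative):</p>

```python
import os

# Read the blog's root from the environment; fall back to the current
# directory (empty string) if $BLOG_DIR is not exported.
BLOG_DIR = os.environ.get('BLOG_DIR', '')

# Converted markdown and any support files land under a 'notebooks' folder.
NOTEBOOK_DIR = os.path.join(BLOG_DIR, 'notebooks')
```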

<p>Finally, you can now run your conversion <em>on a single file</em> with the command:
<code class="language-plaintext highlighter-rouge">ipython nbconvert --config jekyll.py &lt;FILENAME&gt;</code>.</p>

<p>I did this whole <code class="language-plaintext highlighter-rouge">$BLOG_DIR</code> and <code class="language-plaintext highlighter-rouge">notebooks</code> mess because Jekyll was pooping out
whenever a markdown file appeared in the notebooks folder I was using. I also
wanted the notebooks folder so nbconvert would know where to place any support
files, and so Jekyll would blindly copy them into the generated site. Plus, it
gives me a place to put the notebook files themselves so they can be <a href="/notebooks/Blogging with IPython and Jekyll.ipynb">downloaded
directly</a>! Nice, yeah?</p>

<p>Here’s a shell function I wrote to convert a notebook file and then move any
markdown files created into the <code class="language-plaintext highlighter-rouge">_drafts</code> folder.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export BLOG_DIR="/Users/cscorley/git/cscorley.github.io"
nbconvert(){
    ipython nbconvert --config jekyll.py $@;
    find ${BLOG_DIR}/notebooks/ -name '*.md' -exec mv {} ${BLOG_DIR}/_drafts/ \;
    cp $@ ${BLOG_DIR}/notebooks/
}
</code></pre></div></div>

<p>That’s all. I just do <code class="language-plaintext highlighter-rouge">nbconvert FILE</code> now and it just works. Jekyll doesn’t
kill itself over it. When I’m done checking that the post is ready to go live, I
move it into the <code class="language-plaintext highlighter-rouge">_posts</code> folder. No big deal, right?</p>

<p>Below is some example code!</p>

<p><strong>In [1]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Pizza</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">toppings</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">toppings</span> <span class="o">=</span> <span class="n">toppings</span>
        
    <span class="k">def</span> <span class="nf">is_yummy</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>

<span class="n">p</span> <span class="o">=</span> <span class="n">Pizza</span><span class="p">([</span><span class="s">'pineapple'</span><span class="p">,</span> <span class="s">'cheese'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">is_yummy</span><span class="p">())</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>

<p><strong>In [2]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Populating the interactive namespace from numpy and matplotlib
</code></pre></div></div>

<p>Some code copied from <a href="https://en.wikipedia.org/wiki/Matplotlib">Wikipedia</a>:</p>

<p><strong>In [3]:</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">a</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>

<p><img src="/notebooks/blogging-with-ipython-and-jekyll_files/blogging-with-ipython-and-jekyll_4_0.png" alt="png" /></p>]]></content><author><name></name></author><category term="python" /><category term="notebook" /><summary type="html"><![CDATA[Lately I’ve been using IPython to do most of my tinkering work. It’s pretty neat, to say the least.]]></summary></entry><entry><title type="html">Object Environment</title><link href="https://christop.club/2014/02/19/object-environment/" rel="alternate" type="text/html" title="Object Environment" /><published>2014-02-19T00:00:00+00:00</published><updated>2014-02-19T00:00:00+00:00</updated><id>https://christop.club/2014/02/19/object-environment</id><content type="html" xml:base="https://christop.club/2014/02/19/object-environment/"><![CDATA[<p>Students often have trouble grasping the difference between objects,
classes, and the variables which hold them. This article aims to explain
object oriented programming by example in Python.</p>

<h2 id="review">Review</h2>

<p>First, let us review a few things.</p>

<h3 id="variables">Variables</h3>

<p>To create a variable in Python, we simply need to assign it a value:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span></code></pre></figure>

<p>Let’s consider mapping these variables out as we go into something I’m
going to call an <em>environment</em>. Environments are simply tables that map
the known variables to their values. For example, the code above would
have the following environment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | int      | 10
        b           | str      | "Tacos"
</code></pre></div></div>

<p>That is, <code class="language-plaintext highlighter-rouge">a</code> is a variable that holds the integer 10. We can add new
variables to the environment at will.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">not_my_gpa</span> <span class="o">=</span> <span class="mf">4.0</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | int      | 10
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
</code></pre></div></div>

<p>That isn’t very interesting. Neither would be changing a variable.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
</code></pre></div></div>

<p>If we wanted to use a variable, then Python would have to look up its
value in the environment table.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="c1"># finds variable b and gives it to the 'print' function</span></code></pre></figure>

<p>Sometimes while debugging through a program, it is handy to keep an
environment table updated for each step of execution in the program.
This is known as <em>tracing a program</em>.</p>
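<p>In fact, Python keeps this table around as a real, inspectable dictionary: <code class="language-plaintext highlighter-rouge">globals()</code> returns the current environment’s name-to-value mapping, so you can watch it grow yourself.</p>

```python
a = [1, 2, 3]
b = "Tacos"
not_my_gpa = 4.0

env = globals()          # the environment table, as an actual dict
print(env['b'])          # the same lookup that print(b) performs: Tacos
print('not_my_gpa' in env)  # True
```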

<h3 id="functions">Functions</h3>

<p>Functions are little snippets of code that complete tasks for us. Say we
wanted to write a function that calculates the square of a number. It
might look like this:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">square</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">val</span> <span class="o">*</span> <span class="n">val</span></code></pre></figure>

<p>Now, some cool stuff happens here when we create <code class="language-plaintext highlighter-rouge">square</code>. First,
it is added to the environment table. Yep, <code class="language-plaintext highlighter-rouge">square</code> is pretty much just
a variable name.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
</code></pre></div></div>

<p>I’ve left the value empty because functions are special. Something <em>is</em>
there and it’s the body of the function.</p>

<p>Let’s call square and see what happens to our environment table.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">c</span> <span class="o">=</span> <span class="n">square</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span></code></pre></figure>

<p>There are several steps that happen here. First, we can see that the
result is going to be stored into a variable <code class="language-plaintext highlighter-rouge">c</code>, but we don’t actually
know <em>what</em> value yet. So, Python will evaluate the function call for
us. Whenever Python sees a variable name followed by some parentheses,
possibly with arguments such as <code class="language-plaintext highlighter-rouge">10</code>, it knows it’s got to do some stuff
for us.</p>

<p>Python will first retrieve the value of the variable <code class="language-plaintext highlighter-rouge">square</code> in our
environment. Then, it will execute the code associated with it (the value),
given the arguments. Something special happens with those
arguments: when the function is evaluated, they are set up in
<em>yet another environment table</em>, specifically for this <em>single</em> call to
<code class="language-plaintext highlighter-rouge">square</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |

Function call-&gt; square(10):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 10
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">square</code>
finishes up, it will return the value <code class="language-plaintext highlighter-rouge">100</code>, which we can then assign to
a new variable <code class="language-plaintext highlighter-rouge">c</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        c           | int      | 100
</code></pre></div></div>

<p>Note that the <code class="language-plaintext highlighter-rouge">square(10)</code> environment is destroyed because it is no longer
needed! If we called <code class="language-plaintext highlighter-rouge">square</code> again, a new environment would be created
specifically for that call and whatever argument we gave it.</p>
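<p>You can see the destruction for yourself: once <code class="language-plaintext highlighter-rouge">square</code> returns, its <code class="language-plaintext highlighter-rouge">val</code> variable no longer exists anywhere.</p>

```python
def square(val):
    return val * val

c = square(10)
print(c)        # 100

# val only lived inside the call's environment, which is now gone
try:
    val
except NameError:
    print("val is not defined out here")
```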

<p>Let’s look at another example:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">power_of_c</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">z</span> <span class="o">*</span> <span class="n">c</span>
    <span class="k">return</span> <span class="n">z</span></code></pre></figure>

<p>Oh geez, this function is <em>drunk</em>. It uses something that is given as an
argument, creates its own variables, and even uses some outside of it.
How is that possible? It is possible through something known as
<em>scoping</em>. If we call <code class="language-plaintext highlighter-rouge">power_of_c</code>, an environment is created
specifically for it, just like when <code class="language-plaintext highlighter-rouge">square</code> was called.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="n">power_of_c</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100

Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
</code></pre></div></div>

<p>Now the function begins to execute. The first thing that happens is that
it creates a new variable, <code class="language-plaintext highlighter-rouge">z</code>, and gives it the value 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100

Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 1
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">z</code> is created <em>within</em> the <code class="language-plaintext highlighter-rouge">power_of_c(3)</code> environment. Next,
we begin our loop and start updating <code class="language-plaintext highlighter-rouge">z</code> with <code class="language-plaintext highlighter-rouge">z * c</code>. First loop
through <code class="language-plaintext highlighter-rouge">z</code> will become 100, since <code class="language-plaintext highlighter-rouge">c</code> is 100 and <code class="language-plaintext highlighter-rouge">1 * 100 == 100</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 100
</code></pre></div></div>

<p>A second time,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function call-&gt; power_of_c(3):
                Variable    | Type     | Value
                ------------------------------
                val         | int      | 3
                z           | int      | 10000
</code></pre></div></div>

<p>And I think we can see how this ends: with <code class="language-plaintext highlighter-rouge">z</code> holding the integer 1000000.
Finally, <code class="language-plaintext highlighter-rouge">power_of_c(3)</code> returns the value held within <code class="language-plaintext highlighter-rouge">z</code>, the
environment is destroyed, and our new variable is created.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100
        d           | int      | 1000000
</code></pre></div></div>

<p>But how did <code class="language-plaintext highlighter-rouge">power_of_c</code> know where to find <code class="language-plaintext highlighter-rouge">c</code> if it wasn’t in its
environment? It knows because the environments are <em>nested</em>, in a sense.
That is, if a variable does not exist within the innermost environment,
Python will try to look it up in the next environment up: the
environment that was in <em>scope</em> when our new environment was created,
which in our case is the main environment we started with. Let’s go
ahead and give that environment a name; how about <code class="language-plaintext highlighter-rouge">global</code>? Sounds good
to me.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | list     | [1, 2, 3]
        b           | str      | "Tacos"
        not_my_gpa  | float    | 4.0
        square      | function |
        power_of_c  | function |
        c           | int      | 100
        d           | int      | 1000000
</code></pre></div></div>

<p>This environment table is special to our program: it’s basically where
everything is going to be defined.</p>
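<p>One wrinkle worth noting while we have these tables in view: lookups fall through to the outer environment, but plain assignments never do. Assigning to a name inside a function creates a fresh variable in that call’s environment rather than updating the global one.</p>

```python
b = "Tacos"

def shadow():
    b = "Pizza"   # creates a new b in this call's environment
    return b

print(shadow())   # Pizza
print(b)          # Tacos, the global b was never touched
```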

<h2 id="classes-and-objects">Classes and Objects</h2>

<p>Alright, now that we’re good with how environments work, let’s finally
create some classes. Let’s start with a fresh, empty environment.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>

<span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span></code></pre></figure>

<p>This class will represent a fraction. A fraction has two parts:
a numerator and a denominator. Now our global environment looks
something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
</code></pre></div></div>

<p>Again, I’ve left the value of the <code class="language-plaintext highlighter-rouge">Fraction</code> variable empty. Why?
Because it’s going to operate just like a function did in a sense. Let’s
make some stuff and see what happens!</p>

<p>To use a class, we call it just like we would a function:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span></code></pre></figure>

<p>Python knows what’s up when we do this, and handles “calling” the class
specially. First, we create a new <code class="language-plaintext highlighter-rouge">Fraction</code> with values 1 and 2. What
happens is that Python realizes we are trying to do a call on a class,
hands off everything to the constructor, known in Python as  <code class="language-plaintext highlighter-rouge">__init__</code>,
and calls it instead.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Create object--&gt; Fraction(1,2):
                Variable    | Type     | Value
                ------------------------------
                self        | object   | *
                n           | int      | 1
                d           | int      | 2
</code></pre></div></div>

<p>Or, more specifically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
                Variable    | Type     | Value
                ------------------------------
                self        | object   | *
                n           | int      | 1
                d           | int      | 2
</code></pre></div></div>

<p>So, if you were like me back when I was first learning this stuff, you
are asking yourself, <em>“what the hell is <code class="language-plaintext highlighter-rouge">self</code> and why does <code class="language-plaintext highlighter-rouge">__init__</code>
get called with three parameters when I only gave Fraction two
arguments?</em>” It’s because the <code class="language-plaintext highlighter-rouge">self</code> parameter is going to be the object
we just created. Python is giving us a chance to <em>init</em>ialize some
values for this new object before it returns it and assigns it to the
variable <code class="language-plaintext highlighter-rouge">half</code>. (Real answer: mostly because Python is stupid.)</p>

<h3 id="what-the-hell-is-an-object">What the hell is an object?!</h3>

<p>Aye. Now we’re at the meat of the subject. An object is simply a thing.
Alright, cya next time!</p>

<p>&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;&lt;br &gt;</p>

<p>Just kidding.</p>

<p>A handy thing to do is to think of objects as their own <em>environments</em>.
So, when <code class="language-plaintext highlighter-rouge">__init__</code> is called, it is given 1 and 2, and some object
we’ve named <code class="language-plaintext highlighter-rouge">self</code>. This <code class="language-plaintext highlighter-rouge">self</code> variable is just a reference to a new
environment table!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
            Variable    | Type     | Value
            ------------------------------
            self        | object   | *------\
            n           | int      | 1      |
            d           | int      | 2      |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Right now it’s empty, but that’s because <code class="language-plaintext highlighter-rouge">__init__</code> has just started to
execute. What does it do?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span></code></pre></figure>

<p>Hm. It uses some sort of dot notation to assign the arguments to
variables. Where are these variables created? Within <code class="language-plaintext highlighter-rouge">self</code>! Think of
that dot as “we must go deeper in the environments.”</p>

<!---- office space gif, way way down ----->

<p>First it creates a new variable <em>within</em> <code class="language-plaintext highlighter-rouge">self</code> named <code class="language-plaintext highlighter-rouge">numerator</code>, and
assigns it the value of <code class="language-plaintext highlighter-rouge">n</code>. Then the same for the <code class="language-plaintext highlighter-rouge">denominator</code> and
<code class="language-plaintext highlighter-rouge">d</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |

Method call---&gt; Fraction.__init__(*, 1,2):
            Variable    | Type     | Value
            ------------------------------
            self        | object   | *------\
            n           | int      | 1      |
            d           | int      | 2      |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Welp, that about wraps that up. <code class="language-plaintext highlighter-rouge">__init__</code> finishes (strictly
speaking it returns <code class="language-plaintext highlighter-rouge">None</code>; the class call machinery is what hands back
the new object), and its environment is destroyed. We are now left with
something that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Note how the value of <code class="language-plaintext highlighter-rouge">half</code> points to that environment representing the
new object. These are known as <em>pointers</em> in other languages, such as C.
(Yep, we’re real creative with names in computer science.) Also, its
<em>type</em> is a <code class="language-plaintext highlighter-rouge">Fraction</code>.</p>

<p>So, let’s do something with our new fraction. What is its value
represented as a float (decimal)?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="n">half</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="n">half</span><span class="p">.</span><span class="n">denominator</span></code></pre></figure>

<p>Again, notice the dot notation and how it allows us to access the
environment within <code class="language-plaintext highlighter-rouge">half</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
                                            |
                                            |
    /---------------------------------------/
    |
    V
&lt;Fraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 1
        denominator | int      | 2
</code></pre></div></div>

<p>Let’s create a few more fractions and have some fun.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">third</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">almost_pi</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span></code></pre></figure>

<p>Now our set of environments looks like this (I’ve left out the calls to
<code class="language-plaintext highlighter-rouge">__init__</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                            |   |   |
    /---------------------------------------/   |   |
    |                                           |   |
    V                                           |   |
&lt;Fraction&gt; object #1:                           |   |
        Variable    | Type     | Value          |   |
        ------------------------------          |   |
        numerator   | int      | 1              |   |
        denominator | int      | 2              |   |
                                                |   |
    /-------------------------------------------/   |
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
                                                    |
    /-----------------------------------------------/
    |
    V
&lt;Fraction&gt; object #3:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 22
        denominator | int      | 7
</code></pre></div></div>

<p>Converting our fraction to a float might be useful enough to put in its
own function.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="n">f</span><span class="p">.</span><span class="n">denominator</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
                                            |   |   |
                                           ... ... ...
</code></pre></div></div>

<p>To use <code class="language-plaintext highlighter-rouge">to_float</code>, we give it an entire Fraction object. Yup. The whole
thing.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">many_three</span> <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">third</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
                                            |   |   |
                                           ...  |  ...
                                                |
                                                |
    /-------------------------------------------+---\
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
                                                    |
Function call-&gt; to_float(third):                    |
                Variable    | Type     | Value      |
                ------------------------------      |
                f           | Fraction | *----------/
</code></pre></div></div>

<p>Notice when <code class="language-plaintext highlighter-rouge">to_float(third)</code>’s environment is created, its parameter
<code class="language-plaintext highlighter-rouge">f</code> points to the same fraction as the argument <code class="language-plaintext highlighter-rouge">third</code>. When
<code class="language-plaintext highlighter-rouge">to_float</code> begins execution, it will use the dot notation to access
values <em>within</em> <code class="language-plaintext highlighter-rouge">f</code>, or as it is here, <code class="language-plaintext highlighter-rouge">third</code>.</p>
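<p>You can check this aliasing yourself. A quick sketch (the <code class="language-plaintext highlighter-rouge">is</code> operator isn’t covered in this post — it tests whether two names point at the very same object — and the helper names below are made up for illustration):</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

third = Fraction(1, 3)

def is_same_object(f):
    # `is` checks identity: True only when both names
    # point at the very same object.
    return f is third

def set_numerator(f, n):
    # Assigning through `f` changes the one shared object...
    f.numerator = n

print(is_same_object(third))   # True
set_numerator(third, 2)
print(third.numerator)         # ...so `third` sees the change: 2
```

Because the parameter is a pointer to the same object, anything the function does through <code class="language-plaintext highlighter-rouge">f</code> happens to the object the caller passed in.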

<p>We can apply <code class="language-plaintext highlighter-rouge">to_float</code> a few times to different <code class="language-plaintext highlighter-rouge">Fraction</code>s and the
same thing will happen every time.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">zero_five</span> <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">half</span><span class="p">)</span>
<span class="n">pi_ish</span>    <span class="o">=</span> <span class="n">to_float</span><span class="p">(</span><span class="n">almost_pi</span><span class="p">)</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        d           | float    | 0.5        |
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
        to_float    | function |            |   |   |
        many_three  | float    | 0.333...   |   |   |
        zero_five   | float    | 0.5        |   |   |
        pi_ish      | float    | 3.14...    |   |   |
                                            |   |   |
                                           ... ... ...
</code></pre></div></div>

<p>Neat-o.</p>

<h3 id="methods">Methods</h3>

<p>Alright. Time to introduce something new. A method, as defined in the
Oxford English Dictionary, is:</p>

<p>method, <em>n.</em></p>

<p>A procedure for attaining an object.</p>

<ol>
  <li>A recommended or prescribed medical treatment for a specific disease.</li>
  <li>More generally: a way of doing anything, esp. according to
 a defined and regular plan; a mode of procedure in any activity,
 business, etc.</li>
</ol>

<p>Actually, this is close enough I can stop here, because if you have
learned anything in computer science yet, you know that we name things
in a <em>sort-of-but-not-really</em> fashion. Here’s our definition of method:</p>

<p>method, <em>n.</em></p>

<p>A procedure related to an object.</p>

<ol>
  <li>See definition for <em>function</em>.</li>
</ol>

<p>What I’m trying to get at is that there is no practical difference
between functions and methods other than <em>methods are defined within
a class and become part of the environment for objects created from that
class</em>.</p>

<p>Let’s suppose our Fraction class had the <code class="language-plaintext highlighter-rouge">to_float</code> function built right
in. Starting with a fresh global environment:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>

<span class="k">class</span> <span class="nc">Fraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">=</span> <span class="n">n</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span> <span class="o">=</span> <span class="n">d</span>

    <span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">numerator</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">denominator</span>

<span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">third</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">almost_pi</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span></code></pre></figure>

<p>Now all our environments are structured like <em>this</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                            |   |   |
    /---------------------------------------/   |   |
    |                                           |   |
    V                                           |   |
&lt;Fraction&gt; object #1:                           |   |
        Variable    | Type     | Value          |   |
        ------------------------------          |   |
        numerator   | int      | 1              |   |
        denominator | int      | 2              |   |
        to_float    | function |                |   |
                                                |   |
    /-------------------------------------------/   |
    |                                               |
    V                                               |
&lt;Fraction&gt; object #2:                               |
        Variable    | Type     | Value              |
        ------------------------------              |
        numerator   | int      | 1                  |
        denominator | int      | 3                  |
        to_float    | function |                    |
                                                    |
    /-----------------------------------------------/
    |
    V
&lt;Fraction&gt; object #3:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 22
        denominator | int      | 7
        to_float    | function |
</code></pre></div></div>

<p>P rad, yeah? Now each <code class="language-plaintext highlighter-rouge">Fraction</code> object has its <em>own</em> <code class="language-plaintext highlighter-rouge">to_float</code>, much
like how it has its own numerator and denominator. So, how can we use
it?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">zero_five</span> <span class="o">=</span> <span class="n">half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
<span class="n">many_three</span> <span class="o">=</span> <span class="n">third</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
<span class="n">pi_ish</span> <span class="o">=</span> <span class="n">almost_pi</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<!-- mind blown gif  -->

<p>Yep, we use the <em>same</em> dot notation as before, only this time we attach
a <code class="language-plaintext highlighter-rouge">()</code> to the end so Python knows we’re calling a <s>function</s>
method.</p>
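<p>One way to convince yourself that a method really is just a function defined in a class: you can call it <em>through the class</em> and hand it the object as <code class="language-plaintext highlighter-rouge">self</code> explicitly. (A sketch — calling through the class like this isn’t something you’d normally write; it’s only to show the two forms are equivalent.)</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

half = Fraction(1, 2)

# The usual way: Python fills in `self` for us.
a = half.to_float()

# The long way: look the function up on the class and pass
# the object in ourselves.
b = Fraction.to_float(half)

print(a, b)  # 0.5 0.5
```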

<p>A call to <code class="language-plaintext highlighter-rouge">third.to_float()</code> creates environments just like before, only
now <code class="language-plaintext highlighter-rouge">self</code> is the pointer to <code class="language-plaintext highlighter-rouge">third</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | 1
        b           | str      | "Tacos"
        Fraction    | class    |
        half        | Fraction | *----------\
        third       | Fraction | *----------)---\
        almost_pi   | Fraction | *----------)---)---\
                                            |   |   |
                                           ...  |  ...
                                                |
    /---------------------------------------+---/
    |                                       |
    V                                       |
&lt;Fraction&gt; object #2:                       |
        Variable    | Type     | Value      |
        ------------------------------      |
        numerator   | int      | 1          |
        denominator | int      | 3          |
        to_float    | function |            |
                                            |
Method call-&gt; third.to_float():             |
            Variable    | Type     | Value  |
            ------------------------------  |
            self        | Fraction | *------/
</code></pre></div></div>

<p><strong>*busts an air guitar solo*</strong></p>

<h3 id="most-things-are-object-like">Most things are object-like</h3>

<p>In Python, you can treat just about everything like an object, even
strings.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">b</span> <span class="o">=</span> <span class="s">"Tacos"</span>
<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>       <span class="c1"># prints "Tacos" to screen
</span><span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">upper</span><span class="p">()</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">swapcase</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>       <span class="c1"># prints "TACOS" to screen
</span><span class="k">print</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>       <span class="c1"># prints "tACOS" to screen</span></code></pre></figure>

<p>Neat, yeah? So that means… <em>dun dun dunnnnnnnnnnnnnn</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable    | Type     | Value
        ------------------------------
        a           | int      | *------------------------------\
        b           | str      | *------------------------------)---\
        Fraction    | class    |                                |   |
        half        | Fraction | *----------\                   |   |
        third       | Fraction | *----------)---\               |   |
        almost_pi   | Fraction | *----------)---)---\           |   |
        c           | str      | *----------)---)---)---\       |   |
        d           | str      | *----------)---)---)---)---\   |   |
                                            |   |   |   |   |   |   |
                                           ... ... ... ... ... ...  |
                                                                    |
    /---------------------------------------------------------------/
    |
    V
&lt;str&gt; object #1: "Tacos"
        Variable    | Type     | Value
        ------------------------------
        upper       | function |
        swapcase    | function |
        ...         | ...      |
</code></pre></div></div>

<!-- mother of god -->

<p>Yeah, I left a lot out. I am getting lazy and all this taco-talk is
making me hungry, but I think you get the idea: the environment really
just holds <em>pointers</em> from each variable to its object.</p>
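<p>You can even watch those pointers in action with the <code class="language-plaintext highlighter-rouge">is</code> operator (not covered in this post; it tests whether two variables point at the same object). A minimal sketch:</p>

```python
b = "Tacos"
c = b          # no new string is made; c points at the same object
d = b.upper()  # upper() builds a brand-new string object

print(b is c)  # True  -- two variables, one object
print(b is d)  # False -- two different objects
print(d)       # TACOS
```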

<h3 id="classes-holding-objects-that-are-classes-holding-objects-that-are">Classes holding objects that are classes holding objects that are…</h3>

<p>Alright, let’s get real crazy here before I go eat. In addition to our
Fraction class, we’ll add ourselves a MixedFraction. MixedFractions are
whole numbers (ints) and Fraction objects combined together like peanut
butter and jelly. It’s beautiful.</p>

<p>And while we’re at it, let’s go on and create a <code class="language-plaintext highlighter-rouge">to_float</code> method that
will convert the mixed fraction into a floating point number.</p>

<p>Here goes:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">MixedFraction</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">whole_num</span><span class="p">,</span> <span class="n">fraction_obj</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span> <span class="o">=</span> <span class="n">whole_num</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span> <span class="o">=</span> <span class="n">fraction_obj</span>

    <span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">val</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span><span class="p">)</span>
                <span class="c1"># float() is a built-in function that
</span>                <span class="c1"># can convert integers to floats.
</span>
        <span class="n">val</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
                <span class="c1"># ask the fraction for its floating point value!
</span>
        <span class="k">return</span> <span class="n">ret</span></code></pre></figure>

<p>That’s pretty straightforward, yeah? This is known as an <em>aggregation</em>
relationship, as MixedFraction is composed of a Fraction but isn’t
responsible for it (i.e., the Fraction was created outside of the class).</p>

<p>Let’s make some MixedFractions and look at the environment.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">half</span> <span class="o">=</span> <span class="n">Fraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">one_and_a_half</span> <span class="o">=</span> <span class="n">MixedFraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">half</span><span class="p">)</span></code></pre></figure>

<p>Now, our environment holds this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
                                                |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)-------/
    |                                       |
    V                                       |
&lt;MixedFraction&gt; object #1:                  |
        Variable    | Type     | Value      |
        ------------------------------      |
        whole_num   | int      | 1          |
        fraction_obj| Fraction | *----------/
        to_float    | function |
</code></pre></div></div>
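<p>Before we call anything, it’s worth seeing what that shared pointer means in practice. A self-contained sketch (using the classes from above, with <code class="language-plaintext highlighter-rouge">to_float</code> returning <code class="language-plaintext highlighter-rouge">val</code>): since <code class="language-plaintext highlighter-rouge">half</code> and <code class="language-plaintext highlighter-rouge">one_and_a_half.fraction_obj</code> point at the same Fraction, a change through one name is visible through the other.</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

class MixedFraction:
    def __init__(self, whole_num, fraction_obj):
        self.whole_num = whole_num
        self.fraction_obj = fraction_obj

    def to_float(self):
        val = float(self.whole_num)
        val += self.fraction_obj.to_float()
        return val

half = Fraction(1, 2)
one_and_a_half = MixedFraction(1, half)

# Both names point at the one Fraction object, so modifying it
# through `half` changes what `one_and_a_half` computes.
half.denominator = 4
print(one_and_a_half.to_float())  # 1.25
```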

<p>If we were to, for example, call <code class="language-plaintext highlighter-rouge">to_float</code> on <code class="language-plaintext highlighter-rouge">one_and_a_half</code>, what would
happen?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">z</span> <span class="o">=</span> <span class="n">one_and_a_half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<p>I’ll work this one step by step. I just ordered Jimmy John’s for
delivery so we got time.</p>

<p>First, we <em>ask</em> <code class="language-plaintext highlighter-rouge">one_and_a_half</code> to execute the <code class="language-plaintext highlighter-rouge">to_float</code> method. A new
temporary environment is created for it to work in, but isn’t very
interesting since <code class="language-plaintext highlighter-rouge">MixedFraction.to_float</code> takes no parameters besides <code class="language-plaintext highlighter-rouge">self</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
                                                |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
</code></pre></div></div>

<p>This should look familiar, because it is the same thing as when we did
<code class="language-plaintext highlighter-rouge">third.to_float()</code> before. However, the <code class="language-plaintext highlighter-rouge">MixedFractions</code> version of
<code class="language-plaintext highlighter-rouge">to_float</code> is a whole lot different when it executes.</p>

<p>Here’s <code class="language-plaintext highlighter-rouge">MixedFraction</code>’s <code class="language-plaintext highlighter-rouge">to_float</code> for reference:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">to_float</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">val</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">whole_num</span><span class="p">)</span>
            <span class="c1"># float() is a built-in function that
</span>            <span class="c1"># can convert integers to floats.
</span>
    <span class="n">val</span> <span class="o">+=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fraction_obj</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span>
            <span class="c1"># ask the fraction for its floating point value!
</span>
    <span class="k">return</span> <span class="n">ret</span></code></pre></figure>

<p>First, on line 2, it gets the floating point of the whole number part
and stores it to a variable cleverly named <code class="language-plaintext highlighter-rouge">val</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                           ...     ...
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
            val         | float    | 1.0
</code></pre></div></div>

<p>Then, on line 6, it does something we haven’t seen before: double dots!
But by now, you should be able to smell what The Rock is cookin’.</p>

<ol>
  <li>The first dot resolves <code class="language-plaintext highlighter-rouge">self</code> to the <code class="language-plaintext highlighter-rouge">MixedFraction</code> object.</li>
  <li>The second dot resolves <code class="language-plaintext highlighter-rouge">fraction_obj</code> to the <code class="language-plaintext highlighter-rouge">Fraction</code> object.</li>
  <li>Then, we ask <em>that</em> <code class="language-plaintext highlighter-rouge">Fraction</code> to execute <em>its</em> <code class="language-plaintext highlighter-rouge">to_float</code> method.</li>
</ol>
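<p>Here’s a self-contained sketch that unrolls those double dots into two single-dot steps using a temporary variable (the name <code class="language-plaintext highlighter-rouge">frac</code> is made up for illustration) — it does exactly what the chained version does:</p>

```python
class Fraction:
    def __init__(self, n, d):
        self.numerator = n
        self.denominator = d

    def to_float(self):
        return self.numerator / self.denominator

class MixedFraction:
    def __init__(self, whole_num, fraction_obj):
        self.whole_num = whole_num
        self.fraction_obj = fraction_obj

    def to_float(self):
        val = float(self.whole_num)

        # The chained dots, one step at a time:
        frac = self.fraction_obj  # dot #1: follow self to the Fraction
        val += frac.to_float()    # dot #2: ask *that* object for its float

        return val

one_and_a_half = MixedFraction(1, Fraction(1, 2))
print(one_and_a_half.to_float())  # 1.5
```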

<p>By the time we’ve done all of that, we’ve got this mess:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                               ... ...
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      +-------)---\
        numerator   | int      | 1          |       |   |
        denominator | int      | 2          |       |   |
        to_float    | function |            |       |   |
                                            |       |   |
                                            |       |   |
    /---------------------------------------)---+---/   |
    |                                       |   |       |
    V                                       |   |       |
&lt;MixedFraction&gt; object #1:                  |   |       |
        Variable    | Type     | Value      |   |       |
        ------------------------------      |   |       |
        whole_num   | int      | 1          |   |       |
        fraction_obj| Fraction | *----------/   |       |
        to_float    | function |                |       |
                                                |       |
                                                |       |
                                                |       |
Method call-&gt; one_and_a_half.to_float():        |       |
            Variable    | Type     | Value      |       |
            ------------------------------      |       |
            self        | MixedF...| *----------/       |
            val         | float    | 1.0                |
                                                        |
Method call-------&gt; self.fraction_obj.to_float()        |
                    Variable    | Type     | Value      |
                    ------------------------------      |
                    self        | Fraction | *----------/
</code></pre></div></div>

<p>UGH.</p>

<p>We are still talking about line 6. Note that the environment for this
call has its <em>own</em> <code class="language-plaintext highlighter-rouge">self</code> within. That <code class="language-plaintext highlighter-rouge">self</code> is the <code class="language-plaintext highlighter-rouge">Fraction</code>.
Thankfully this method doesn’t do a whole lot: it returns the
<code class="language-plaintext highlighter-rouge">Fraction</code> represented as a floating point value pretty much
immediately. So, that temporary environment is destroyed and we are left
with this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                           ...     ...
                                            |       |
    /---------------------------------------)---+---/
    |                                       |   |
    V                                       |   |
&lt;MixedFraction&gt; object #1:                  |   |
        Variable    | Type     | Value      |   |
        ------------------------------      |   |
        whole_num   | int      | 1          |   |
        fraction_obj| Fraction | *----------/   |
        to_float    | function |                |
                                                |
                                                |
Method call-&gt; one_and_a_half.to_float():        |
            Variable    | Type     | Value      |
            ------------------------------      |
            self        | MixedF...| *----------/
            val         | float    | 1.5
</code></pre></div></div>

<p>Finally, we have our <code class="language-plaintext highlighter-rouge">MixedFraction</code> as a float, and this method call
environment returns <code class="language-plaintext highlighter-rouge">val</code> and is destroyed. Now we can update our global
environment with <code class="language-plaintext highlighter-rouge">z</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        half            | Fraction | *----------\
        one_and_a_half  | MixedF...| *----------)---\
        z               | float    | 1.5        |   |
                                                |   |
    /---------------------------------------+---/   |
    |                                       |       |
    V                                       |       |
&lt;Fraction&gt; object #1:                       |       |
        Variable    | Type     | Value      |       |
        ------------------------------      |       |
        numerator   | int      | 1          |       |
        denominator | int      | 2          |       |
        to_float    | function |            |       |
                                            |       |
    /---------------------------------------)-------/
    |                                       |
    V                                       |
&lt;MixedFraction&gt; object #1:                  |
        Variable    | Type     | Value      |
        ------------------------------      |
        whole_num   | int      | 1          |
        fraction_obj| Fraction | *----------/
        to_float    | function |
</code></pre></div></div>

<p>Awesome.</p>

<h1 id="inheritance">Inheritance</h1>

<p>What if we were drunk and decided to make <code class="language-plaintext highlighter-rouge">MixedFraction</code> inherit from
<code class="language-plaintext highlighter-rouge">Fraction</code>? That seems like a totally reasonable thing to do, right?
After all, isn’t a mixed fraction just a special representation of
a fraction?</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">MixedFraction</span><span class="p">(</span><span class="n">Fraction</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">whole_num</span><span class="p">,</span> <span class="n">numerator</span><span class="p">,</span> <span class="n">denominator</span><span class="p">):</span>
        <span class="n">new_num</span> <span class="o">=</span> <span class="n">numerator</span> <span class="o">+</span> <span class="p">(</span><span class="n">whole_num</span> <span class="o">*</span> <span class="n">denominator</span><span class="p">)</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">new_num</span><span class="p">,</span> <span class="n">denominator</span><span class="p">)</span></code></pre></figure>

<p>And look at that, we are pretty much done! <code class="language-plaintext highlighter-rouge">MixedFraction</code> will inherit
the <code class="language-plaintext highlighter-rouge">Fraction</code> version of <code class="language-plaintext highlighter-rouge">to_float</code>, and because of how we wrote our
constructors, everything will just <em>work</em>. So what about this <code class="language-plaintext highlighter-rouge">super()</code>
business?</p>

<p>Let’s start with a clean environment and make ourselves a <code class="language-plaintext highlighter-rouge">MixedFraction</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">one_and_a_half</span> <span class="o">=</span> <span class="n">MixedFraction</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">taco</span> <span class="o">=</span> <span class="n">one_and_a_half</span><span class="p">.</span><span class="n">to_float</span><span class="p">()</span></code></pre></figure>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |

Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *------\
            whole_num   | int      | 1      |
            numerator   | int      | 1      |
            denominator | int      | 2      |
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>When its constructor begins executing, we calculate a <code class="language-plaintext highlighter-rouge">new_num</code> value
that represents the whole number added back into the fraction’s numerator.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *------\
            whole_num   | int      | 1      |
            numerator   | int      | 1      |
            denominator | int      | 2      |
            new_num     | int      | 3      |
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Alright, now things get cray cray. We make a call to <code class="language-plaintext highlighter-rouge">super()</code>, and then
use the dot notation on that? What the…?</p>

<p>Since it is just a function call, what does <code class="language-plaintext highlighter-rouge">super()</code> return?  Well,
that’s for another discussion, but it returns something we can call
the “super object”. The super object is an object that we can ask, just
as before, to execute methods for us, but using the versions <em>from the
superclass of the class we are in</em>. That lets us reach a method even
when both the subclass and the class it inherits from define one.</p>

<p>In this instance, <code class="language-plaintext highlighter-rouge">super()</code> basically operates as an alias for
<code class="language-plaintext highlighter-rouge">Fraction</code>: it is how we tell Python which version to run when a method,
like the constructor, is defined in both classes.</p>
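<p>Here is that whole dance as a runnable sketch. The <code class="language-plaintext highlighter-rouge">Fraction</code> definition is my reconstruction of the one from earlier in the post; the <code class="language-plaintext highlighter-rouge">MixedFraction</code> is the inheriting version from above:</p>

```python
class Fraction:
    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def to_float(self):
        return self.numerator / self.denominator


class MixedFraction(Fraction):
    def __init__(self, whole_num, numerator, denominator):
        new_num = numerator + (whole_num * denominator)
        # super() hands back a proxy that runs Fraction.__init__
        # with self still bound to this MixedFraction instance
        super().__init__(new_num, denominator)


one_and_a_half = MixedFraction(1, 1, 2)
print(one_and_a_half.numerator, one_and_a_half.denominator)  # 3 2
print(one_and_a_half.to_float())  # 1.5
```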

<p>So, we make the call to the constructor of <code class="language-plaintext highlighter-rouge">Fraction</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *----------\
            whole_num   | int      | 1          |
            numerator   | int      | 1          |
            denominator | int      | 2          |
            new_num     | int      | 3          |
                                                |
Method call----&gt; Fraction.__init__(self, 3, 2)  |
                Variable    | Type     | Value  |
                ------------------------------  |
                self        | MixedF...| *------+
                numerator   | int      | 3      |
                denominator | int      | 2      |
                                                |
    /-------------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
</code></pre></div></div>

<p>Now we begin execution of the constructor of <code class="language-plaintext highlighter-rouge">Fraction</code>. Notice how
the <code class="language-plaintext highlighter-rouge">self</code> within its environment is the <code class="language-plaintext highlighter-rouge">MixedFraction</code>! Baller! It
completes and is destroyed, leaving us this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method call-&gt; MixedFraction.__init__(*, 1, 1, 2):
            Variable    | Type     | Value
            ------------------------------
            self        | MixedF...| *----------\
            whole_num   | int      | 1          |
            numerator   | int      | 1          |
            denominator | int      | 2          |
            new_num     | int      | 3          |
                                                |
    /-------------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 3
        denominator | int      | 2
        to_float    | function |
</code></pre></div></div>

<p>Anywhozzles, once the constructor of <code class="language-plaintext highlighter-rouge">MixedFraction</code> completes, we are
left with an environment that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global:
        Variable        | Type     | Value
        ----------------------------------
        Fraction        | class    |
        MixedFraction   | class    |
        one_and_a_half  | MixedF...| *------\
                                            |
    /---------------------------------------/
    |
    V
&lt;MixedFraction&gt; object #1:
        Variable    | Type     | Value
        ------------------------------
        numerator   | int      | 3
        denominator | int      | 2
        to_float    | function |
</code></pre></div></div>
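<p>A quick sanity check on the claim that <code class="language-plaintext highlighter-rouge">MixedFraction</code> inherits the <code class="language-plaintext highlighter-rouge">Fraction</code> version of <code class="language-plaintext highlighter-rouge">to_float</code> (again, the <code class="language-plaintext highlighter-rouge">Fraction</code> body is my reconstruction):</p>

```python
class Fraction:
    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def to_float(self):
        return self.numerator / self.denominator


class MixedFraction(Fraction):
    def __init__(self, whole_num, numerator, denominator):
        super().__init__(numerator + whole_num * denominator, denominator)


one_and_a_half = MixedFraction(1, 1, 2)
# The subclass really shares the very same to_float function:
print(MixedFraction.to_float is Fraction.to_float)  # True
# And a MixedFraction now counts as a Fraction, too:
print(isinstance(one_and_a_half, Fraction))         # True
print(one_and_a_half.to_float())                    # 1.5
```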

<p>Cool, right? Okay, my sandwich is here. Time to go. Until next time…</p>]]></content><author><name></name></author><category term="programming" /><category term="object oriented" /><category term="education" /><category term="python" /><category term="lecture notes" /><summary type="html"><![CDATA[Students often have trouble grasping the difference between objects, classes, and the variables which hold them. This article aims to explain object oriented programming by example in Python.]]></summary></entry></feed>