About Me

I graduated from the University of Texas at Dallas with an MSc in Computer Science, with a concentration in Data Science. I work as a Data Engineer and am a CCA175 Cloudera-certified Spark and Hadoop Developer. I also have broad experience in machine learning and deep learning, especially Natural Language Processing and Convolutional Neural Networks.

My favorite hobby is photography. Check my Flickr page for photos.

Skills

  • Programming Languages: Python, Java, Scala, JavaScript
  • Data Engineering: HDFS, Hadoop, Spark, Hive, Sqoop, Flume
  • Machine Learning: Scikit-learn, Keras, TensorFlow, PySpark, Deeplearning4j
  • Databases: MySQL, MongoDB, HBase
  • Cloud Services: GCP, AWS
  • OS: Linux, Windows

Social Media

  • LinkedIn:
  • GitHub:
  • Flickr:

Projects & Research

Agile User Story Point Estimation

Effort estimation is an important part of software project management: cost and schedule overruns create risk for software projects, and accurate effort estimates help project managers and clients ensure that projects finish on time. In modern Agile development, software is developed through repeated cycles (iterations). A project consists of a number of iterations; each iteration is a short period (usually 2–4 weeks) in which the development team designs, implements, tests, and delivers a distinct product increment, e.g. a working milestone version or a working release. Each iteration requires the completion of a number of user stories, which are a common way for agile teams to express user requirements.
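A text-based baseline for this task can be sketched as follows. This is only an illustrative example, assuming a TF-IDF + ridge regression pipeline over user story text; the stories and their point labels below are hypothetical toy data, not the project's actual model or dataset.

```python
# Minimal sketch: estimate story points from user story text.
# TF-IDF features + ridge regression (hypothetical toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

stories = [
    "As a user I can log in with my email",
    "As an admin I can export monthly billing reports",
    "As a user I can reset my password",
    "As an admin I can configure role-based access control",
]
points = [2, 8, 3, 13]  # hypothetical story point labels

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(stories, points)

# Estimate points for an unseen story.
estimate = model.predict(["As a user I can update my profile"])[0]
print(round(float(estimate), 1))
```

In practice a model like this would be trained on thousands of labeled stories and evaluated with mean absolute error against the team's actual estimates.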

Code Presentation

Toxic Comment Classification

The aim of this project is to categorize toxic comments by type of toxicity: toxic, severely toxic, obscene, threat, insult, and identity hate. Several machine learning techniques, including Logistic Regression, Support Vector Machines, and Naive Bayes, are implemented to detect the six toxicity types. Data: the dataset is taken from a Kaggle competition and contains a large number of comments from Wikipedia talk page edits, labeled by human raters for toxic behavior.
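Because each comment can carry several toxicity types at once, this is a multi-label problem. A minimal sketch of that setup, assuming a TF-IDF + one-vs-rest Logistic Regression baseline (the comments and multi-hot labels below are illustrative toy data, not the Kaggle training set):

```python
# Minimal multi-label sketch: one binary classifier per toxicity type.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "i will find you and hurt you",
    "great article, well sourced",
]
# Toy multi-hot targets: one column per toxicity type.
y = np.array([
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
])

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(comments, y)

# Predict a 0/1 flag for each of the six toxicity types.
pred = clf.predict(["what a wonderful contribution"])
print(dict(zip(labels, pred[0])))
```

The same pipeline shape works for the SVM and Naive Bayes variants by swapping the inner estimator.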

Code

Complementing Global and Local Contexts in Representing API Descriptions to Improve API Retrieval Tasks

We present D2Vec, a neural network model that considers two complementary contexts to better capture the semantics of API documentation. First, we connect the global context of the current API topic under description to all the text phrases within the description of that API. Second, the local orders of words and API elements in the text phrases are maintained in computing the vector representations for the APIs. We conducted an experiment to verify two intrinsic properties of D2Vec's vectors: 1) similar words and relevant API elements are projected into nearby locations; and 2) some vector operations carry semantics. We demonstrate the usefulness and good performance of D2Vec in three applications: API code search (text-to-code retrieval), API tutorial fragment search (code-to-text retrieval), and mining API mappings between software libraries (code-to-code retrieval). Finally, we provide actionable insights and implications for researchers in using our model in other applications with other types of documents.
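The first intrinsic property above (relevant items land near each other, enabling retrieval by vector similarity) can be sketched in miniature. This example is only an assumption-laden stand-in: it uses TF-IDF vectors in place of D2Vec's learned representations, and the API names and descriptions are hypothetical.

```python
# Minimal sketch of text-to-code retrieval by cosine similarity
# (TF-IDF vectors stand in for learned D2Vec embeddings).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

api_docs = {
    "java.io.FileReader": "reads character files using a default buffer",
    "java.net.URL": "represents a uniform resource locator pointer",
    "java.util.HashMap": "hash table based implementation of the Map interface",
}

vec = TfidfVectorizer().fit(api_docs.values())
doc_matrix = vec.transform(api_docs.values())

def search(query):
    # Rank APIs by cosine similarity between the query vector and each
    # description vector (TF-IDF rows are L2-normalized, so the dot
    # product equals cosine similarity).
    q = vec.transform([query])
    scores = (doc_matrix @ q.T).toarray().ravel()
    names = list(api_docs)
    return names[int(np.argmax(scores))]

print(search("reads character files from disk"))  # java.io.FileReader
```

In the actual model, the query and the API are embedded in the same learned vector space, so the same nearest-neighbor lookup serves all three retrieval directions (text-to-code, code-to-text, code-to-code).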

Published Paper

Statistical Learning of API Fully Qualified Names in Code Snippets of Online Forums

Software developers often use online forums such as StackOverflow (SO) to learn how to use software libraries and their APIs. However, the code snippets in such forums often contain undeclared, ambiguous, or largely unqualified external references. Such declaration ambiguity and external-reference ambiguity make it hard for developers to learn to use the APIs correctly. In this paper, we propose ST, a statistical approach to resolve the fully qualified names (FQNs) of the API elements in such code snippets. Unlike existing approaches, which are based on heuristics, ST integrates two factors. First, we learn from a large training code corpus which FQNs often co-occur. Then, to derive the FQN for an API name in a code snippet, we use that knowledge together with the context of neighboring API names. To realize these factors, we treat the problem as statistical machine translation from source code with partially qualified names to source code with the FQNs of the APIs. Our empirical evaluation on real-world code and StackOverflow posts shows that ST achieves very high accuracy, with 97.6% precision and 96.7% recall, 16.5% relatively higher than the state-of-the-art approach.
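The two factors above (co-occurrence statistics plus neighboring-name context) can be sketched with a toy scorer. This is only an illustrative simplification, not the paper's phrase-based translation model; the corpus and API names below are hypothetical.

```python
# Minimal sketch of context-aware FQN resolution: score each candidate
# FQN by its corpus frequency plus how often it co-occurs with the
# other FQNs already resolved in the snippet.
from collections import Counter

# Toy "training corpus": each snippet is the set of FQNs appearing together.
corpus = [
    {"java.util.List", "java.util.ArrayList"},
    {"java.util.List", "java.util.Map"},
    {"java.awt.List", "java.awt.Frame"},
]

def resolve(simple_name, context_fqns):
    # For each snippet containing a candidate FQN for simple_name,
    # add 1 (base frequency) plus the overlap with the known context.
    scores = Counter()
    context = set(context_fqns)
    for snippet in corpus:
        for fqn in snippet:
            if fqn.rsplit(".", 1)[-1] == simple_name:
                scores[fqn] += 1 + len(snippet & context)
    return scores.most_common(1)[0][0]

# "List" is ambiguous; the java.util.Map context disambiguates it.
print(resolve("List", ["java.util.Map"]))  # java.util.List
```

The actual approach generalizes this idea by translating whole token sequences, so the order of neighboring API names also contributes to the resolution.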

Published Paper