A tyro's blog

Friday, 9 December 2016

Clustering With K-Means in Python

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar to each other than they are to elements of other groups. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their most frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a data set X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each of the points in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters $C_k$ and the K centroids $\mu_k$ can be expressed as:

Minimize $\displaystyle \sum_{k=1}^K \sum_{\mathrm{x}_n \in C_k} ||\mathrm{x}_n - \mu_k ||^2$ with respect to $\displaystyle C_k, \mu_k$ .
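As a small illustration of this objective, the sketch below evaluates the double sum for a given assignment of points to clusters. The function name `kmeans_objective` and the four sample points are made up for this example.

```python
def kmeans_objective(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, k in zip(points, labels):
        total += sum((p - c) ** 2 for p, c in zip(point, centroids[k]))
    return total

# Hypothetical toy data: two pairs of points, one pair around each centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good = kmeans_objective(points, [0, 0, 1, 1], centroids)  # each point with its nearest centroid
bad = kmeans_objective(points, [0, 1, 0, 1], centroids)   # two points sent to the far centroid
print(good, bad)  # 4.0 204.0
```

The assignment that pairs each point with its nearest centroid gives the smaller value, which is what the minimization is looking for.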

Lloyd’s algorithm

Finding the exact solution is, unfortunately, NP-hard. Nevertheless, an iterative method known as Lloyd’s algorithm exists that converges (albeit possibly to a local minimum) in a few steps. The procedure alternates between two operations. (1) Once a set of centroids $\mu_k$ is available, the clusters are updated to contain the points closest in distance to each centroid. (2) Given a set of clusters, the centroids are recalculated as the means of all points belonging to each cluster.

$\displaystyle C_k = \{\mathrm{x}_n : ||\mathrm{x}_n - \mu_k|| \leq ||\mathrm{x}_n - \mu_l|| \mathrm{\,\,for\,\,all\,\,} l\}\qquad(1)$

$\displaystyle \mu_k = \frac{1}{|C_k|}\sum_{\mathrm{x}_n \in C_k}\mathrm{x}_n\qquad(2)$

The two-step procedure continues until the assignments of clusters and centroids no longer change. As already mentioned, convergence is guaranteed but the solution might be a local minimum. In practice, the algorithm is run multiple times from different starting centroids and the run with the lowest value of the objective is kept. For the starting set of centroids, several methods can be employed, for instance random assignment.

Below is a simple implementation of Lloyd’s algorithm for performing k-means clustering in Python:
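What follows is a minimal sketch of the two alternating steps, written with only the standard library so that it runs without extra packages. The function names `squared_distance` and `lloyd_kmeans`, the `max_iter` and `seed` parameters, and the synthetic two-blob data are all assumptions made for illustration; the starting centroids are simply K randomly chosen data points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Random initialization: use k of the data points as starting centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step (1): assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids will too
        labels = new_labels
        # Step (2): move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Synthetic data (an assumption for the demo): two blobs around (0, 0) and (5, 5).
rng = random.Random(42)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

centroids, labels = lloyd_kmeans(points, k=2)
print(sorted(centroids))
```

For well-separated blobs like these, the two centroids typically end up near the blob centers. Since the result depends on the starting centroids, changing `seed` is a simple way to try several runs.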

Linear Regression in Python using Spyder

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable (independent variable) is called simple linear regression.

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.
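The correlation coefficient mentioned above can be computed directly from its definition. The sketch below does so for a small set of made-up height and weight values; the function name `pearson_r` and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

heights = [150, 160, 170, 180, 190]  # made-up values, in cm
weights = [52, 58, 69, 77, 85]       # made-up values, in kg

r = pearson_r(heights, weights)
print(round(r, 3))  # close to 1: a strong positive association
```

A value near 1 or -1 suggests that a straight line could describe the data well, while a value near 0 suggests that a linear model is unlikely to be useful.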

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

There are several types of regression analysis available to the researcher.

• Simple linear regression
1 dependent variable (interval or ratio), 1 independent variable (interval or ratio or dichotomous)

• Multiple linear regression
1 dependent variable (interval or ratio), 2+ independent variables (interval or ratio or dichotomous)

• Logistic regression
1 dependent variable (binary), 1+ independent variable(s) (interval or ratio or dichotomous)

• Ordinal regression
1 dependent variable (ordinal), 1+ independent variable(s) (nominal or dichotomous)

• Multinomial regression
1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio or dichotomous)

• Discriminant analysis
1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)

When selecting the model for the analysis, another important consideration is the model fit. Adding independent variables to a linear regression model never decreases, and in practice almost always increases, the explained variance of the model (typically expressed as R²). However, adding more and more variables makes the model inefficient, and over-fitting occurs. Occam's razor describes the problem extremely well – a model should be as simple as possible but not simpler. Statistically, if the model includes a large number of variables, the probability increases that some of them test as statistically significant purely by chance.
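The effect on R² can be checked numerically. The following sketch fits an ordinary least-squares model by solving the normal equations, first with one predictor and then with an added predictor that is pure noise. Everything here is an assumption made for illustration: the helper names `solve` and `r_squared`, the synthetic data, and the choice to use only the standard library.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yv for row, yv in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yv - fv) ** 2 for yv, fv in zip(y, fitted))
    ss_tot = sum((yv - mean_y) ** 2 for yv in y)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(30)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
noise = [rng.gauss(0, 1) for _ in range(30)]  # unrelated to y by construction

r2_one = r_squared([x], y)
r2_two = r_squared([x, noise], y)
print(r2_one, r2_two)
```

The second value is never smaller than the first, even though the added variable carries no information about `y`. This is why R² alone is a poor guide when deciding how many variables to include.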

The second concern of regression analysis is under-fitting. This means that the regression analysis' estimates are biased. Under-fitting occurs when including an additional independent variable in the model would reduce the effect strength of the independent variable(s) already in it, that is, when a relevant variable has been left out. Under-fitting mostly happens when linear regression is used to prove a cause-effect relationship that is not there. This might be due to the researcher's empirical pragmatism or the lack of a sound theoretical basis for the model.

Linear regression in Python:
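A minimal sketch of simple linear regression, assuming made-up height and weight data and using the closed-form least-squares estimates for the intercept a and the slope b of Y = a + bX. The function name `fit_line` is a hypothetical helper, and only the standard library is used.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

heights = [150, 160, 170, 180, 190]  # made-up values, in cm
weights = [52, 58, 69, 77, 85]       # made-up values, in kg

a, b = fit_line(heights, weights)
print(f"weight = {a:.2f} + {b:.2f} * height")  # weight = -76.30 + 0.85 * height
```

The fitted slope says that, in this toy data set, each additional centimetre of height goes with about 0.85 kg of additional weight. The intercept has no physical meaning here because a height of 0 lies far outside the observed range.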

Sunday, 27 December 2015

Dolibarr managing Gingy's Super Store

'Gingy's Super Store' is a unique company for the heroes and villains of Gotham.
The company is managed and run with the Dolibarr ERP application. Visit Gingy's Super Store.

As the HR executive of the company, I would like to give you a tour of all the modules and functions of the company and its activities.

The price, sales, purchase and desired minimum stock details are all provided here. You can add, delete and modify products and services in Products/Services:

For warehouse information such as input stock value and value to sell, go to
Warehouses--> List:

The list of suppliers and customers can be found in Third Party --> List:

You can go to the Accountancy/Treasury area from the Financial module. Salaries, customer invoices and supplier invoices can be accessed from here:

Human Resource Management can be accessed from the HRM module.

The company's information, as well as the different modules, menus, boxes, etc., can be managed from Setup.

You can define limits, precisions & optimisations used by Dolibarr in Setup --> Limits and accuracy:

To run through the application, please provide your User ID and password in the designated boxes.

For demo purposes: User ID- User6

Password- 5qkdzjm0

Signing off,

HR Executive

Gingy's Super Store

Hostinger- A new dimension in website building

As a part of the Business Information System assignment, Prof. Mukerjee asked us to build a website. Now, for people who have never written a line of code, building a whole, functional website was a challenge.

Hostinger came to our rescue!

What is Hostinger?

Hostinger provides the most reliable & feature-rich hosting free of cost. With cloud computing, their uptime is 99.99%.

Hostinger supports PHP & MySQL without any restrictions

Everybody can build websites with Hostinger

One can choose from over 100 ready templates and have their own website ready within minutes.

From Help Desk (Hesk) to Customer Relationship Management (SugarCRM), every need can be met in Hostinger.

For an e-commerce site, TomatoCart is the easiest application.

Image Courtesy: http://theebotom.16mb.com/

There are many more interesting templates to browse and play around with. But the most exciting application which I've come across in Hostinger is Dolibarr.

What does Dolibarr do?

Dolibarr ERP & CRM is a free, open source software package to manage small or medium companies, freelancers or foundations. We can say Dolibarr is an ERP or a CRM or both (depending on the activated modules).

It is built by adding modules (you enable only the features you need), on a WAMP, MAMP or LAMP server (Apache, MySQL, PHP for all operating systems).

Dolibarr was developed to offer ERP and CRM software whose main goal is to be simple.

Why did I choose Dolibarr?

Because it's SIMPLE.

Simple to install

Simple to use

Simple to develop

Also, to get extra credits!

Courtesy: http://wiki.dolibarr.org/index.php/What_Dolibarr_Do

Thursday, 8 October 2015

Computer System Architecture

System Architecture

System architecture is the representation of the hardware and software components of a system. It explains which application runs on which software platform, which platform runs on which hardware, and how they are connected to each other.

Different tiers of systems architecture:

i). Two-Tier: Client-Server Architecture

In this architecture, the data is separated from the application and stored on a different platform, while each user has a copy of the application. Communication takes place directly between client and server.

Advantages:

Maintenance and modification are easy
Faster Communication

Disadvantages:

Performance degrades as the number of users increases

ii). Three-Tier

It is divided into three parts:

Business Logic: All business logic is written in this layer. It is essentially an interface between the client layer and the data layer and helps them communicate faster.
Client Layer: This is the user interface.
Data Layer: Contains the methods that connect to the database.
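
To make the separation concrete, here is a rough sketch of the three layers in Python. All names (`DataLayer`, `BusinessLogic`, `client`, the `products` table and the sample product) are hypothetical, and an in-memory SQLite database stands in for a real database server. In an actual three-tier deployment the layers would usually run on separate machines; they share one process here only to keep the example short.

```python
import sqlite3

# Data layer: the only code that talks to the database.
class DataLayer:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE products (name TEXT, price REAL)")

    def add_product(self, name, price):
        self.conn.execute("INSERT INTO products VALUES (?, ?)", (name, price))

    def get_price(self, name):
        row = self.conn.execute(
            "SELECT price FROM products WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else row[0]

# Business logic layer: the rules live here, between client and data.
class BusinessLogic:
    def __init__(self, data):
        self.data = data

    def order_total(self, name, quantity):
        price = self.data.get_price(name)
        if price is None:
            raise ValueError(f"unknown product: {name}")
        return price * quantity

# Client layer: user interface only; it never touches the database directly.
def client(logic, name, quantity):
    return f"{quantity} x {name} = {logic.order_total(name, quantity):.2f}"

data = DataLayer()
data.add_product("widget", 12.5)
logic = BusinessLogic(data)
print(client(logic, "widget", 4))  # 4 x widget = 50.00
```

Because the client only calls the business logic, the database could be swapped for another one by changing `DataLayer` alone.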

iii). Multi-Layer Architecture:

In this architecture, the presentation, the application processing and the data management are logically separate processes.