Down to the Bottom – Weights Update When Minimizing the Error of the Cost Function for Linear Regression

The cost function for Linear Regression is Mean Squared Errors. It goes like below:

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left ( \hat{y}^{(i)}-y^{(i)} \right )^{2} = \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}\left ( x^{(i)}\right ) -y^{(i)} \right )^{2}

x^{i} is the data point i in the training dataset. h_{\theta}\left ( x^{(i)}\right ) is a linear function for the weights and the data input, which is

h_{\theta}(x)=\theta^{T}x = \theta_{0}x_{0}+ \theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{3}+\cdot\cdot\cdot+\theta_{j}x_{j}

To find the best weights that minimize the error, we use Gradient Descent to update the weights. If you have been following Machine Learning courses, e.g. Machine Learning Course on Coursera by Andrew Ng, you should have learned that to update the weights, you need to repeat the process below until it converges:

\theta_{j} = \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-\hat{y})\cdot x_j^{(i)} for j=0…n (n features)

In Andrew Ng’s course, it it also expanded to:
\theta_{0} = \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-\hat{y})\cdot x_0^{(i)}
\theta_{1} = \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-\hat{y})\cdot x_1^{(i)}
\theta_{2} = \theta_{2} - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-\hat{y})\cdot x_2^{(i)}

\theta_{j} = \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-\hat{y})\cdot x_j^{(i)}

However, when I first studied the course a couple of years ago, I did get stuck for a little trying to figure out where that exactly came from. I wish someone had give me some more concrete expansion so I could figure it out faster. Let me do that here so you can examine the detailed breakdown and get through this stage quickly. That’s the whole objective of this blog.

Let’s make this less abstract by putting down exact data point with a small sample and feature sizes. Let’s say, we have a set of data with 4 features like below and we only select 3 samples from it for simplicity.
\begin{bmatrix} x_0^{(1)} & x_1^{(1)}  & x_2^{(1)}  & x_3^{(1)}  & x_4^{(1)}  \\ x_0^{(2)} & x_1^{(2)}  & x_2^{(2)}  & x_3^{(2)}  & x_4^{(2)} \\ x_0^{(3)} & x_1^{(3)}  & x_2^{(3)}  & x_3^{(3)}  & x_4^{(3)} \\ x_0^{(4)} & x_1^{(4)}  & x_2^{(4)}  & x_3^{(4)}  & x_4^{(4)} \\ x_0^{(5)} & x_1^{(5)}  & x_2^{(5)}  & x_3^{(5)}  & x_4^{(5)}  \end{bmatrix} \cdot\begin{bmatrix}\theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3\\ \theta_4  \end{bmatrix}

The prediction for each sample is:

\hat{y}^{(1)} = \theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}
\hat{y}^{(2)} = \theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)}
\hat{y}^{(3)} = \theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)}

In this case the cost function wold be:

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left ( \hat{y}^{(i)}-y^{(i)} \right )^{2} = \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}\left ( x^{(i)}\right ) -y^{(i)} \right )^{2}
=((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)})-y^{1})^2 +((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^{2}))^2 +((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^{3}))^2

To minimize the cost, we need to find the partial derivatives of each \theta . The method to update the weights is

\theta_j = \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)

Let’s find the derivative of the \theta s here, step by step. From the cost function above, by applying Chain Rule, we have:
\frac{\partial}{\partial\theta_0} J(\theta) =
2((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}) - y^1 )\cdot x_{0}^{(1)}
+ 2((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^2 )\cdot x_{0}^{(2)}
+ 2((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^3 )\cdot x_{0}^{(3)}
= \sum_{i=1}^3 2(\hat{y}^{(i)}-y^i)\cdot x_0^{(i)}

Do the same for the rest of the \theta s

\frac{\partial}{\partial\theta_1} J(\theta) =
2((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}) - y^1 )\cdot x_{1}^{(1)}
+ 2((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^2 )\cdot x_{1}^{(2)}
+ 2((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^3 )\cdot x_{1}^{(3)}
= \sum_{i=1}^3 2(\hat{y}^{(i)}-y^i)\cdot x_1^{(i)}

\frac{\partial}{\partial\theta_2} J(\theta) =
2((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}) - y^1 )\cdot x_{2}^{(1)}
+ 2((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^2 )\cdot x_{0}^{(2)}
+ 2((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^3 )\cdot x_{2}^{(3)}
= \sum_{i=1}^3 2(\hat{y}^{(i)}-y^i)\cdot x_2^{(i)}

\frac{\partial}{\partial\theta_3} J(\theta) =
2((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}) - y^1 )\cdot x_{3}^{(1)}
+ 2((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^2 )\cdot x_{3}^{(2)}
+ 2((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^3 )\cdot x_{3}^{(3)}
= \sum_{i=1}^3 2(\hat{y}^{(i)}-y^i)\cdot x_3^{(i)}

\frac{\partial}{\partial\theta_4} J(\theta) =
2((\theta_{0}x_{0}^{(1)}+ \theta_{1}x_{1}^{(1)}+\theta_{2}x_{2}^{(1)}+\theta_{3}x_{3}^{(1)}+\theta_{4}x_{4}^{(1)}) - y^1 )\cdot x_{4}^{(1)}
+ 2((\theta_{0}x_{0}^{(2)}+ \theta_{1}x_{1}^{(2)}+\theta_{2}x_{2}^{(2)}+\theta_{3}x_{3}^{(2)}+\theta_{4}x_{4}^{(2)})-y^2 )\cdot x_{4}^{(2)}
+ 2((\theta_{0}x_{0}^{(3)}+ \theta_{1}x_{1}^{(3)}+\theta_{2}x_{2}^{(3)}+\theta_{3}x_{3}^{(3)}+\theta_{4}x_{4}^{(3)})-y^3 )\cdot x_{4}^{(3)}
= \sum_{i=1}^3 2(\hat{y}^{(i)}-y^i)\cdot x_4^{(i)}

If you are at first unclear about the functions and notations in the courses or some other documentations, I hope the expansion above would help you figure it out.

A User Story in an Architect’s Eyes

You, at home, browsing Facebook with your smart phone, receive a push notification from WhatsApp. It is a message from your boss asking you to check out an article on your Intranet. You remember the link to the article was shared on Microsoft Teams previously, so you open Teams on your phone, find the link in that Channel under that Team. You click on it. Your browser (Edge) opens. After keying in your username and password, you receive a push notification from your Authenticator App, tap on Approve. Now you see the article on your phone.

Not too tedious of a process from the user’s perspective, is it? Did I even mention anything about VPN? That belongs to the past generation.

The entire process is secured. Try copying the content of the article out. It cannot be done! Try opening another corp app that you could access through VPN previously. It is not accessible!

In the eyes of an architect, the process above is like below:


What are involved:

  • ADFS
  • Azure AD Conditional Access
  • Azure AD App Proxy
  • Microsoft Intune

Succeed with Speed in Data Migration Projects

If you work in the IT department of an organization or employed by an IT service provider, likely you have been involved in a data migration project, which most of the time has a very tight timeline. The question here is how can you meet the timeline? For example, how can you migrate 10 TB of data (structured and unstructured) from System A to System B within 3 months?

It is often said that to succeed in an IT project, you need three things: People, Process, Tool. In this article, I would like to add some insights into some of the underestimated factors that are essential for a successful migration.

When it comes to migration speed, too many people are asking what software to use to have a fast migration speed as if there is a magic tool that can make it happen in the blink of an eye. The reality is the tool does not matter as much. The more crucial factors are the infrastructure and the migration team.

  • Infrastructure. Network latency and bandwidth, Disk I/Os, CPU and RAM etc.. In many of the migration projects I’ve been involved in or learned about, the importance of infrastructure is as often underestimated as it turns out to be the shortest bar of the wooden barrel when all other factors are good. And infrastructure is usually not easy to change compared to other factors. If your infrastructure is not strong enough to support the goal of the project, try to see how you can improve it as much as you can. Otherwise, even if you have a team with superb driving skills, just because the truck you give them is too small to carry the load, you cannot expect the project to complete too soon.
  • People. This is intangible but is also often under-appreciated. (Experienced consultants know what I am talking about). An experienced team would know how best the project can be finished with what’s available at hand. Things are always not perfect. Mastery is about making the most of what you have. Following a user guide step-by-step does not cut it in a migration project with a timeline to commit.

To summarize, to succeed with speed, you need the right PEOPLE, PROCESS and TOOL with a good INFRASTRUCTURE.


Blockchain and Cryptocurrencies 102

If this is the first time you hear about Blockchain or cryptocurrencies (which is unlikely), you may want to look for a real 101 intro on YouTube or just Google it. However, if you have been hearing about it, but would like to know more, this article can give you some directions.

For technical folks, you would want to know how Blockchain actually works. No other demo can show you in a more effective way than the one built by Anders. Check the two links below which lead you to both demo videos as well as the demo environment you can play with.

There is also a must-read – the Whitepaper that declared the birth of Bitcoin:

For non-tech folks who is considering investing in cryptocurrencies, I won’t be able to give you any advice on whether you should do it or what type of cryptocurrency to invest in and when. I am currently losing money myself, hahaha… But here is some more information for you to explore more:

Find the right wallet for yourself:

Other general info:

Design with the Mind in Mind

All designs are to serve the purpose of getting the info across efficiently to the audience. Therefore, the Design Principles are derived from how human perceive, process and memorize information.

Visual Perception

The first aspect to work with is how human visually perceive information presented to them. Human eyes and brains are very powerful in recognizing:

  • Features such as: color, value, angle, slope, length, texture and motion.
  • Patterns such as: proximity, closure and continuation, symmetry, similarity, common area etc.

With the characteristics above in mind, here are the design principles that we can derive:

  • User “pop out” to attract attention. This is to use contrast and emphasis using visual features to make the important elements standout. One of the ways the rotating banner attracts attention is through motion. Usually on a webpage, there is only banner or web part that rotates while other elements stay still.
  • Associate and organize items for skippability.

Nielson Norman Group conducted a study of how people read on a webpage by tracking their eye fixation. A common eye fixation pattern, F-pattern, is shown in the picture below. User start from the first row, reading most of the content of row and move down to the next row. However, as they move down, they tend to read less and less of each row. If the information does not appear in the first few lines, it is buried in the page with little visibility to the reader. This usually happens when no clear guidance is provided to the user and the user cannot easily discover the information they need. That is to say, this is a patter to be avoided. More on NNGroup’s latest article.


An alternative pattern that indicates good reading experience is layer cake gaze pattern. Users work their way down the page, stopping to glance at each heading. They either read more of the section if it is relevant, or skip the section if they find it not relevant. This helps users locate the information they are after quickly, without them wasting time searching through the page line by line.

On a homepage of a portal, content is usually organized into sections ( or web parts), and the title of each section is clear and stands out. This is to help the user navigate their way through the page and find the information that’s of interest to them.


Human memorize information by both short-term and long-term memory.

Short-term memory

A research conducted by George Miller (1956) shows that 7±2 is the “magic number” for the limit on human capacity for processing information with short-term memory[1]. This means human can only process 5-9 items at one shot. Another cognitive scientist Cowan (2001) even provided evidence that that 4±1 is a more realistic figure[2]. This means when providing lists and options on the page, it is important to keep the list short. If the list must be long, for example, you might have a list of the 20 announcements from 5 different departments, all important to publish, you can choose to group them into sub groups and organize them under tabs.

Long-term Memory

People’s long-term memory can also be leveraged for a good UX design. This is to associate the new things (the portal) to the audience’s existing knowledge, which makes it easier for users to learn and adopt. This is the idea behind some of the design considerations on the intranet homepage. Users can easily discover the elements below even without the “pop out” effect.

  • Global Navigation menu at the top of the page.
  • Search Box at the top right corner.
  • A Button to access More Info at the top/bottom right corner of each web part.

These elements exist as part of the convention for a webpage. Therefore, there is no need to emphasize them. On the contrary, it is recommended to neutralize them to minimize the distraction, and to make other important elements stand out on the page.

[1] Miller 1956:

[2] Cowan 2001:


SharePoint Infrastructure Workshop Topics

I’ve run Infrastructure Workshops for many SharePoint implementation projects over the years. The objective of the workshop is to figure our how the farm(s) will be implemented in the customer’s environment. A few important decisions are made based on the outcome of the workshop:

  • Server and storage sizing
  • Network placement
  • Service account management
  • DR strategy
  • Farm configuration
  • etc.

I am sharing the checklist here in case it helps. 🙂 Feel free to leave your questions or feedback in the comment.

Infra Workshop Topics