# Helping ANTS find food faster using the Golden Ratio!

Hi all,

After a (very) long time, I am back with another interesting post on a puzzle that I recently came across during a research project. It is a cool number-theory puzzle that has a very interesting application to efficient foraging i.e. in designing efficient algorithms to locate resources in an unknown area. I must confess that the credit for this formulation of the problem and most of the solution goes to my advisor Jared (and to Melanie and Matthew Fricke for introducing us to this problem). I only plan to give an overview of the main idea here. Our full paper in under submission and a draft will be available on my webpage soon. Hope you enjoy this post as much as I did writing it!

Caution: As with all my posts where I discuss puzzles, I assume some math background and knowledge of standard computer-science terminology.

Imagine that an army of N ants have been secretly transported to an unknown location (all at the same place) where they have no idea about their surroundings. To make matters worse, these ants are heavily mutilated in the sense that their capability to smell food from a large distance or communicate with each other, through signals or otherwise, has been completely lost; and they are very hungry (perhaps from the long trip) and need to find something to eat fast. The ants have no idea about how many other ants have been subject to the same destiny (i.e. they do not know N), nor do they know anything about the distribution of food around them. The only capabilities these ants have are to (1) be able to navigate in any direction they want to and return back to the last location, and (2) be awesome at math! Of course, these ants can detect food when they step on it and detect collisions (without exchanging any information), but they cannot leave any pheromone trails (or similar) to communicate any information to the other ants. Each ant to its own! — Aah imagine the misery.

Can we help these poor ants find some food to eat before they all starve?

Imagine there is a big heap of sugar particles somewhere around where these ants are located (let’s call this location their nest). The distance of this heap and its size are all unknown to the ants. For simplicity, assume that the heap is circular and has diameter D and that the distance of its center to the nest is L. We will also assume that $D = O(L)$. So the task here is for the ants to find the heap, without knowing L or D in advance. Is there a fast way to do so?

Let us start simple. Assume for now that L is known to the ants. How fast do you think we can find the food in this case? — Well, the simplest way is to pick an arbitrary direction and go until distance L, and then walk along a circle of radius L, centered at the nest. Certainly, at some point, the food will be located and the ants can satisfy their hunger. The maximum distance covered by any ant, in this case, is L + ($2\pi L$ – (D/2)) = O(L). Not bad! Can we do better? Well, the minimum distance any ant has to cover is at least L – (D/2) = $\Omega(L)$, before it even hopes to find food. Thus, asymptotically, this algorithm is tight and hence, no scope of improvement.

Let us look into one more simple problem. Suppose L is unknown but D is known. What can we do now? — Well, intuitively, a larger D means a larger heap of food, which should make it easier to find. However, not knowing L seems to be a difficult roadblock to overcome, since no matter how large the diameter D is, if L – (D/2) is unknown (since L is unknown), we do not know how much minimum distance one must travel out from the nest.

Let us think of it this way. We know that the angle that the heap of food makes with the nest is proportional to D/L. So, if we just moved in a random direction (to say a very large distance), then we locate the food with probability proportional to D/L.

Hmm! This suggests the following algorithm: Start with making S spokes of length 1, at random angles. Here, a spoke means going out from the nest in a straight line and then coming back along that same line. We will determine S in a short while. Once these spokes are made, double the spoke length, and repeat. Keep doing this until the food is found.

How long does this process take?

Let us call the time during which the spoke lengths remain the same as an iteration. Thus, in iteration 1, the spoke lengths are 1; in iteration 2, the spoke lengths are 2; in iteration 3, the spoke lengths are 4; … and so on. In short, the spoke lengths in iteration $i$ are $2^{i-1}$. If it takes a total of M iterations to find the food, then the maximum distance covered is $S + 2S + 4S + \dots + 2^{M-2}S + 1 = 1 + (2^{M-1}-1)S$ (I am ignoring the factor of 2 to account for the return trip along the spokes). Now, we just need to find M and set S such that this distance is as low as we can make it.

1. Clearly, the spoke lengths must be at least L – (D/2), or else the ants don’t even reach the food. Thus, M must be $\Omega(\log L)$.
2. For all iterations before this M, the likelihood of finding food along any spoke (or back) is zero.
3. For an iteration starting from M, each spoke has some fixed probability p of finding food (assume only single ant (N = 1) here). Of course, p is a function of D/L and hence, is unknown, since L is unknown.

Since we are making S spokes in each iteration, the probability that the food is not found in a given iteration is $(1-p)^S$. Thus, the average number of iterations it will take to find the food is $\frac{1}{(1-p)^S}$, and hence, the average number of iterations is at least $\log L + \frac{1}{(1-p)^S}$. The distance covered in these many iterations is at least $L + 2^{(1-p)^{-S}}$ (ignoring the constants). Now, for this distance to be comparable to what we had when L was known (which, recall, was O(L)), we need the term $2^{(1-p)^{-S}}$ to be $O(L)$. This is true when $(1-p)^S \geq \frac{1}{\log L}$, or equivalently, when $S \leq \frac{\log L}{-\log (1-p)}$. We cannot set S to be a function of L since we do not know it. So, say we set S = 1 (cannot get lower than that!). Then, the distance we cover is at least $L + 2^{(1-p)^{-1}} = \Omega(L + 2^{L/D})$, which is clearly much larger than when only L was known. Thus, we see that handing unknown D can be expensive if we want to keep the total time low.

Can we do better than $\Omega(L + 2^{L/D})$ when both L and D are unknown?

The answer is yes! We can actually go to as low as $O \left( L + \frac{L^2}{D} \log \frac{L^2}{D} \right)$, but not any asymptotically lower than that (reporting the result for N=1 only). Thus, when the diameter is very large, say $D = 0.99L$, then we can find the food in time $O(L + L\log L)$, which is faster than if you performed a spiral around the nest. In case D is small, say some constant, then the spiral is actually better for N = 1, since $O(L + L^2 \log L)$ is larger than the time taken by the spiral, which is $O(L^2)$. The problem is we do not know D vs. L in advance, and hence, we will err on the side of a slightly more expensive algorithm, since it optimizes for the case when D is large.

Let’s get back to our original problem now. I will only discuss spoke-like algorithms in this post — for more general information, stay tuned for the full paper.

Think of our problem as the ant deciding a sequence of spoke lengths which will find the food no matter what L and D are, in a small amount of time. Thus, for any given L and D, when the ant searches along spokes specified by this sequence,  it must eventually find the food. How do we construct such a sequence?

Let us make some seemingly-easy but important observations.

1. The sequence must have all positive entries — of course, since spoke lengths have to be positive. The ants cannot travel along a spoke of length -2 🙂
2. For every integer m, the sequence must have at least one element of size at least m — if not, then the sequence will never be able to locate the food if $L \geq m$.
3. Not only that but for every integer m, the sequence must have sufficiently many elements of size at least m — or else, it is possible that the food is still not located. The critical question is how many are sufficiently many. We will return to this in a while.

Thus, basically, the sequence of spokes that the ant must make should contain infinitely many distinct elements and for each element, sufficiently many copies of each to be able to handle any L or D. Remember, the challenge is still (1) defining sufficiently many, and (2) doing so without assuming anything about L or D. We can formulate this problem as follows:

Number-stream problem: Construct a sequence of positive integers in such a way that for any x and y, the total sum of the elements in the sequence before at least y elements of size at least x are seen is small. For example, suppose x = 1 and y = 2. Then, the algorithm must keep producing elements until at least 2 elements of size at least 1 are seen. We want to minimize the total sum of the elements until this condition is met. —- Why can’t we just produce 1,1? — We do not know x and y in advance 🙂

We call this the number-stream problem. In our paper, we show that the cost of this sequence must be $\Omega(xy \log (xy))$ for any x and y, and also give an algorithm that matches this lower bound. I will not go into the details of how we achieve this in this post (see the figure on top of this post for a hint), but I hope the connection of this problem to the problem of foraging is clear. I will, however, touch upon what sufficiently many means above, since that is an interesting contribution of our result.

As we saw in the example with known D but unknown L, even if we search along many spokes of sufficient length, that does not guarantee that the food is found unless we can ensure that these spokes are in some way getting closer to the food over time. In other words, we would like to ensure that these spokes are spaced apart from each other so that we search in many different directions. The problem is that if we keep the number of spokes of a fixed length constant, then as the spoke lengths increase, these spokes become further away from each other and hence, more and more area is unexplored at larger distances. Thus, as the spoke lengths increase, we must also increase the number of spokes.

However, we must be very careful here since if we increase the number of spokes by too much, then the ant might have to travel a large distance before the food is found. We need to carefully balance the number of spokes vs. the length of these spokes. This is where the cool trick comes.

Our idea is to ensure that the likelihood of locating the food increases with time, even if all the spokes are kept the same length (which is sufficiently large). To do this, all one needs to do is to make sure that the spokes get closer to each other with time. A naive way to do this is to remember the maximum angular separation between any two spokes taken so far and take the next spoke right in between these spokes. This ensures that we always decrease the separation between the spokes.

But remembering these separations can be very memory-heavy for the poor little ant. If only there was a way to do this without remembering much!

Well, as with all cool results in mathematics, there is always an age-old theorem that comes alive to help. We invoke the famous Three-Gap Theorem (also known as the Steinhaus conjecture — although it is no longer a conjecture) with the Golden Ratio for this. Without going into the details, what this theorem does for us is the following: if the ant just remembers the direction of the last spoke it makes, and for the next spoke, just adds $2 \pi \phi$ to the direction (in radians, anti-clockwise), where $\phi$ is the Golden Ratio, then the spokes automatically keep getting closer with time. Not only that, the rate at which they get closer is optimal, as was shown by the ever famous Knuth in his book on Sorting and Searching 🙂 There it is! We are done.

The ant now knows how to make the spokes – choose directions based on the Golden Ratio instead of choosing them randomly (except for the first spoke) and just use the sequence produced by our algorithm to find the food as fast as it is possible with this approach 🙂 We actually conjecture in our paper that our algorithm is not an optimal spoke algorithm, but also an optimal search algorithm for this problem — we are yet to prove this though.

As an end note, I must mention that if you observed the capitalized ANTS in the title, it was not by mistake. The problem I discussed here was actually first formalized by Feinerman, Korman and their colleagues and termed as the Ants Nearby Treasure Search (ANTS) problem. What we do is optimize their result for large heap size and ensure tolerance against failures.

Anyway, I hope you enjoyed this post 🙂 If you did (or have anything you’d like to point out), please let me know in comments 🙂 Till then, tada!

# Dr. Blockchain and her Global Clinic

Hey!
It has been some time since I posted here. I was struggling to find an interesting topic to write about, and what more interesting topic can one find other than blockchains! It has already conquered the news, the business world, the research community and very soon, who knows it may even be the new Internet of the modern age. Different than my previous posts, this post is not a puzzle or an experimental study, but more of a collection of my thoughts on an area I think blockchains can help the health care industry.

So, as part of my internship with Visa Research, I get to meet a lot of interesting people who are doing some cutting-edge work in the field of cryptocurrencies and using blockchains for a multitude of applications. During one of these conversations, I got curious about how people are trying to apply the blockchain technology to the field of medical science. However, most of the links I could find online were concerned with trying to secure medical data and make them globally available, while preserving the privacy of the people that helped in the collection of that data. This one particular web page does a nice job of listing different ways in which the healthcare industry is benefitting or can benefit from the blockchain technology.  Some of these applications include contract verification between health plan providers and patients, aggregation and availability of patient databases to a large population, trust management between record keepers and record users, real time decentralized movement and update of records, among many other ways in which the technology can help bring a unified and secure view of health care across the globe.

Although these applications do seem to be some obvious beneficiaries of the technology, I was surprised to find almost no content on the web about the use of blockchains for providing global medical services. One of the key things I learned recently since starting to look more deeply into blockchains is that one can treat them as bulletin boards, where information once recorded stays forever and any addition of new information must be agreed upon for validity by the people using the blockchain. With this view, imagine a global clinic where Dr. Block, a blockchain disguised as a doctor, treats her patients all around the world. She learns how to perform diagnosis through the experts in the industry and through treating her patients over time. Every time she treats someone or some human doctor performs diagnosis for a disease,  they update their information with Dr. Block, which she uses later for her future patients. Let me explain this is a little more detail.

Let us say that Dr. Block is an expert in the diagnosis of a disease D. Initially, some highly experienced doctor, say Dr. Expert_1, somewhere in the world, who works in a top hospital or a medical research center trains a model (with the help of his team or students) that helps him/her classify patients who have disease D or not based on the symptoms. This model can be trained on patient data in a way that preserves the privacy of these patients. For example, if this model is based on a neural network, once the training is complete, the weights of the network reveal no information about the data that was used to train the network. This doctor is altruistic by nature, so when he observes that his model provides reasonably good accuracy in the disease diagnosis, he wants to help people all around the world to use his model to determine whether they have D or not. He decides to contact Dr. Block about this.

Dr. Block suggests Dr. Expert_1 to upload his trained model (or some form of it’s hash) on the blockchain. (Let’s not deviate ourselves by focussing too much on efficiency here, but there are many other ways of making this model available on the blockchain in a way that makes it easy for people to verify and access it later. ) Since Dr. Expert_1 is still human and wants some credit for his work, he charges some money from anyone who wants to use this model to diagnose his patients, and reward people who contribute to the model. For this purpose, he also devises a cryptocurrency, say DOCT, and accepts/provides payments in this currency. So once this model (or it’s hash) is available on the blockchain for people to use, anyone in the world now can use it in exchange of some DOCTs and get rewarded in DOCTs for contributing to the model.

Say I am Dr. Expert_2 and I reside in a remote village in India where health care is beyond the reach of people who live in my village. I come to know of Dr. Block that he is very accurate in the diagnosis of disease D and I want to use his expertise in helping my patients here. I apply for access to Dr. Block by paying the required DOCTs (over the same blockchain) and download the trained model. This payment is done in a similar manner to that of cryptocurrencies today. Once I download this model, I use it on my patients and help them take appropriate measures. For some extra DOCTs, I use my technical expertise (which I gained as my hobby) to provide this diagnosis as a service through phones and self-service booths around my village. Since no network connectivity is required to make use of this model, I can do it in the remotest part of my village.

Over time, I discover that some patients were diagnosed correctly while some others were not. I am sad about the false negatives, but I am happy I was able to take preventive measures before things went out of hand for them. However, I do not want future patients to be identified falsely for the disease D. So, I train this model further using my patients’ data and results (incremental training) and suggest the improved model to Dr. Block. The typical consensus algorithm that allows for blocks to be added to the blockchain now contains experienced doctors from around the world who will now check if the new model I am proposing provides better accuracy against their test dataset. If a consensus is reached in agreement with my model, Dr. Block accepts my new model and Dr. Expert_1 pays me my reward in the form of some DOCTs. I can now use these DOCTs later to upgrade my model later, if necessary. However, if the experts feel that the new model is not good, they can not include it. This way, it is almost like all the doctors in the world have joined hands together in curing the disease D.

Of course, the discussion above is very high level and in some sense, assumes altruism and good will of the people. The typical questions of adversarial environments and malicious tampering that arise with modern cryptocurrencies also map to this setting, but we leave that for people who actually get interested in this idea and want to implement it. In a nutshell, I wanted to convey my thoughts about how blockchains are a little more powerful in their capabilities than what is currently thought of them and one should really think of them as a global service provider that is secure and tamper evident and by their very design and intention, provide a payment model for the services as well.

Hope you liked this post and will give it a thought. I am very open to actually bringing this idea to life so if any of you want to join hands with me, let’s collaborate and try to bring the best of technology and health care in making this world a safer and a healthier place! Until then, ciao! 🙂

Featured image source.

# Mutated Randomness

Hi, all!

This post is about an experiment I performed during Fall 2015 semester under the supervision of Prof. Stephanie Forrest as part of her course on complex adaptive systems here at UNM. The aim was to evolve a population of seemingly random binary sequences into sequences that are provably random with respect to a given set of tests of randomness and independence. My motivation for this experiment was to explore the possibility of using a genetic algorithm as a wrapper on the RNG (random number generator) code. This may be useful in situations when we require the user to not be able to regenerate the sequence(s) given initial RNG parameters.

For the experiment, the initial population of sequences was generated using the Mersenne twister inside Python 2.7.10’s random.randint(0,1) function and then evolved using a genetic algorithm (GA). The idea was to weigh these sequences with respect to the p-values of five statistical tests of randomness and independence performed on each of them and apply some selection, crossover, and mutation to evolve this population into one that has a majority of high fitness individuals.

The sequences in the initial population were generated using different seeds. This defined the original aim of the experiment: to evolve to a set of sequences that are *random enough* and show negligible dependence on the seed used for the random number generator (RNG) that created them. Thus, the GA was halted as soon as the population evolved to contain at least some given $\epsilon > 0$ fraction of high fitness individuals.

An interesting observation was made when the results were obtained. The algorithm showed a strong aversion to mutation (when one would expect that mutation would actually help). Even when setting the mutation probability to something as low as $0.01$, the population did not seem to converge to contain high fitness individuals up to the fraction we desired. This suggested that the space of all random sequences (as generated by this RNG) contained very small (maybe even singleton sets of) neutral regions with respect to the fitness function used and that there was perhaps a high correlation between the bits in the sequence and their position in it. The plot below shows the results obtained, where the X-axis represents number of generations and the Y-axis represents fitness.

The plot below shows the results obtained, where the X-axis represents number of generations and the Y-axis represents fitness. Mutation probability was set to $0.01$ per bit and the crossover probability was kept very close to $1$. As can be seen, the maximum fitness in the population decreases with successive generations.

Even more interesting is the fact that a high probability single point crossover operation supported evolution in our favor, and produced a population of distinct sequences having high fitness values. So, if it was indeed the case that the neutral regions are small, one would expect the crossover to not do so well either. So, to verify this, I ran some simulations with low crossover and mutation rates and observed that the population hardly evolved. This behavior has made me sleepless since.

Some questions and extensions I am considering for this project are as follows:

1. Can $p$-values be replaced by some other measure of goodness on the tests of randomness/independence?
2. Does the order of applying the tests matter? In other words, given a sequence of tests $T_1,T_2,\dots,T_n$, when does there exist (or not) a sequence $s$ such that $s$ is not random with respect to the set $\{T_1,\dots,T_i \}$ for all $i < n$, but is random with respect to the whole set $\{T_1,T_2,\dots,T_n\}$?
3. How about another design of the fitness function, other than just the product of the $p$ values?
4. Does the nature of crossover matter in this setting?
5. Is there an analytical explanation to the small sized neutral regions?
6. Define a measure for goodness of evolution and prepare plots for this goodness against crossover rates and mutation rates.

I plan to update this post with the actual algorithm I used and plots for the results, along with a more analytical explanation of the situation and possibly the questions above, so that some of you can suggest edits and help me solve the mystery. Stay tuned 🙂

References :

Image source

# A *bug* in democracy

Remember those times when the leaders were honest and you could believe in the fact that whatever they propose will be for the good of people in general? Well, some leaders are still like that and for some others, even carbon dating wouldn’t be able to determine the last time this happened.

On a more serious note, this post is an interesting problem from Peter Winkler’s book on mathematical puzzles [PW], brought to my attention by my advisor Jared during our weekly group meeting two weeks back. It tends to highlight the power of majority voting schemes in driving all the economy in the hands of only one person. Although the practical scenario may be largely different, the puzzle below does demonstrate potential scheme, which Jared referred to as a bug (and I agree), that exploits democracy in the favor of the ruler. Hope you find it an interesting read!

Consider a democracy with $n \geq 3$ people, one of whom is the leader (elected or otherwise). At the time of his election, each of the $n$ people have some initial amount of money they possess. The aim of the puzzle is for the leader to propose a series of money redistributions among the people so that at the end of the process, he is able to collect as much as money from the people as possible. The only caveat here is that for whatever change he proposes, he needs a majority of the people to vote in favor of the proposal for it to pass. We assume that every person (except the leader) votes YES to any scheme that increases the money he currently possesses, votes NO to any scheme that proposes a smaller amount that what he currently owns and ABSTAINS otherwise. However, remember that since the leader is one of the $n$ people to vote, he has a say in the voting as well.

With respect to the discussion we had in our group meeting and acknowledging all other members who contributed towards coming up with this solution, here is what we think the leader can do best.

Theorem: For a democracy with $n \geq 3$ people that starts with a total of $T$ units of money distributed among the people, a pseudo-greedy leader is always able to gather $T-1$ units of the money to himself in $\mathcal{O}(\log n)$ rounds of voting.

Let us first look at an example before we try to attempt proving the theorem above. Suppose there are $5$ people in the democracy, each starting with a dollar. For the sake of representation, let this fact be represented by {A:1, B:1, C:1, D:1, E:1}, where without loss of generality, we assume that A represents the leader.

To begin, the leader suggests the following redistribution of money: {A:1, B:2, C:2, D:0, E:0}. When the voting begins on this scheme, D and E vote NO but B and C vote YES. Then, it is up to the leader to break the tie and he does this by voting YES since he foresees this scheme to help him in the future. The scheme is then passed and the money has now being redistributed as proposed. The next scheme that the leader proposes is {A:1,B:4,C:0,D:0,E:0}, on which B votes YES, C votes NO and D and E ABSTAIN. Again, to break the tie, the leader votes YES and the scheme passes with this redistribution of the money.

In the last step, the leader proposes {A:4, B:0, C:1, D:0, E:0}. For this scheme, B votes NO, C votes YES and the others ABSTAIN. Hence, the leader votes YES and the scheme passes with the majority vote. Note that the leader has been able to grab $4 out of the total of$5 in the population in just 3 steps, by being non-greedy sometimes. Also, note that the remaining $1 cannot go to the leader since he will never have a majority to vote in favor of any proposition that does this. Thus, the leader ends with taking over$4 in just 3 steps, taking full advantage of the power of majority voting scheme. In fact, since the first two schemes he proposed did not increase his dollars, he must have come out as being generous to some people in the process, building confidence in his leadership which he would exploit later. Hence, the term pseudo-greedy in the theorem statement. Sounds like a serious bug to me!

Now that we have seen an example of how the leader can drive all but one dollar to himself, the theorem statement above can be easily proved. For the sake of brevity, I will not present a formal proof of the same, but rather give an informal idea on how one can devise a set of schemes which end in a distribution in favor of the leader.

The main idea is for the leader to always keep the decision of voting YES on the schemes he wants to pass with himself. The leader can do this by enforcing fewer than a half of the people to vote NO (since this is unavoidable) and by keeping at least these many people to vote YES (to obtain a tie which he will break in his favor). The latter requires him to be nongreedy sometimes, bu that’s OK since he knows in the long run, this will benefit him. Typical political hypocrisy!

Thus, as demonstrated above, the leader takes money from fewer than a half of the people (as may as he can) and transferring that money to the other half which will vote YES on the redistribution. This way, once all the money (except the dollars with the leader) reach one person (who is not the leader), then the trick is to give one dollar to a person who doesn’t have any money and put the remaining money with the leader. This scheme will always pass because the poor person who got this one dollar will vote YES to the scheme and the majority will be achieved when the leader votes. It’s always easy to lure the most suffering men into voting YES for anything that even remotely seems favorable to them.

Hence, in $O(\log n)$ steps, the leader has collected all but one dollar from the people in the democracy. An important point to note here is that the puzzle doesn’t require any person to know the full distribution of money he is voting for in any step. In other words, as long as a person sees that the leader has proposed something in which his balance will increase, he votes YES regardless of what others get or lose. This is again very representative of real life in which hardly anyone looks at the full budget proposed by the government and checks complete money flow before voting for the same. Ignorance may be a bliss for some, but is probably driving our money away from us in this scenario.

Well, I hope this puzzle was an interesting mathematical insight into what can go (or is going) wrong with the democracies all around. A simple majority voting scheme can be adversarially designed to capture all the money or resources and numerous such algorithms may exist in place already. I would also like to say that this post is not pointed to anyone in particular, and just presents a mathematical puzzle from a curious point of view.

Moral of the story : DO NOT ABSTAIN. Practice your voting rights even if the leaders propose something that may not favor or harm you. As can be seen, if the people in the democracy above had not abstained, the leader would never be able to gather all the money to himself.

Until next time, have fun and stay tuned!

References :

[PW] Winkler, Peter. “Mathematical Puzzles: A Connoisseur’s Collection. 2004.” AK Peters.

Image source

# Count to sample

Hey all!

This week’s post is about an interesting relation between counting and sampling, motivated from an enlightening paper [MD02] by Prof. Martin Dyer. More specifically, the paper introduces dynamic programming as a technique to aid in approximate counting and uniform sampling using a dart throwing approach, however, this blog post is only about the example presented in the beginning of the paper where Prof. Dyer uses the count of the solutions to the 0-1 knapsack problem to sample from the set of these solutions uniformly at random in expected polynomial time (in the number of variables). You are encouraged to read [WK16] to learn more about the knapsack problem. I hope you like this post 🙂

So, the problem at hand is to be able to produce a solution to a given 0-1 knapsack problem which has been sampled uniformly at random from the set of all such solutions. Why would one require to do so? Well, among the many applications of the ability to sample uniformly at random from combinatorial sets, the one I encounter the most is in computer simulations and producing approximations. It is well known that the general knapsack problem is hard to solve, so counting the number of solutions will only be harder, let alone the problem of sampling one of these solutions. Hence, if we can approximate the count using some efficient technique, (almost) uniform sampling will become readily available. How?

Well, to answer this question, let’s first formalize the problem a little bit. The 0-1 knapsack problem can be written as the inequality $\sum_{j=1}^n a_j x_j \leq b$ for $x_i \in \{0,1\}$ for all $i$. Here, we assume that $0 \leq a_1 \leq \dots \leq a_n \leq b$ are all integers. Note that any single linear inequality in 0-1 variables with rational coefficients can be written in this form [MD02, WL75]. Denote by $S$ the set of all solutions to this inequation. Then, the problem at hand is to sample uniformly at random from $S$. So, back to the question. What if we don’t know $|S|$?

One way of sampling from $S$ without any knowledge of its size is to exploit the fact that we can efficiently check if a given assignment of 0’s and 1’s to our variables satisfies the inequality above, i.e. whether a given 0-1 assignment is in $S$ or not. This is crucial for obvious reasons, most important of which is that we will rely on this capability to accept only those assignments that lie in $S$ and reject others in a technique popularly known as rejection sampling. Start with randomly producing a 0-1 assignment of the variables and accept it if it satisfies the inequality and repeat otherwise. Simple, but not always efficient. Why?

Suppose $S$ is very small, say $O(1)$. Then, the probability of acceptance in one iteration when we have $n$ variables is $O(1/2^n)$, which is $o(1)$. Hence, in expectation, exponentially many rejections will happen before the first assignment is sampled, making the algorithm extremely slow. Can we do better? Well, let’s try Prof. Dyer’s approach.

The main idea behind the technique I am now going to talk about is eliminating the rejections above and directly sampling a solution to the problem. Since each solution must be sampled with probability $1/|S|$, it will be good to know $|S|$. Let us say we do. Now, fix a variable, say $x_3$. What is the probability that $x_3 = 0$ in a satisfying assignment to the problem? Well, since we are sampling uniformly at random from $S$, this probability is equal to the ratio of the number of solutions in which $x_3 = 0$ to $|S|$. Hence, it will be good to know the number of solutions in which $x_3 = 0$ after which the problem becomes trivial. So, how to compute this number?

[Enter dynamic programming.]

Let $F(r,s)$ be the number of solutions to inequality $\sum_{j=1}^r a_j x_j \leq s$, where $a_j$‘s and $x_j$‘s are the same as above. Clearly, we can write $F(1,s) = 1$ if $s < a_1$ and 2 otherwise. Also, observe that $|S| = F(n,b)$. To recursively compute $F(r,s)$, note that if we set $x_r = 0$, then we have the inequality $\sum_{j=1}^{r-1} a_j x_j \leq s$, whose number of solutions is $F(r-1,s)$. However, if we set $x_r = 1$, then we have the inequality $\sum_{j=1}^{r-1} a_j x_j \leq s-a_r$, whose number of solutions is $F(r-1,s-a_r)$. Thus, we can write $F(r,s) = F(r-1,s) + F(r-1,s-a_r)$ and solve the dynamic program recursively in $O(nb)$ time.

We are not quite done yet. We still do not know the number of solutions in which $x_3 = 0$. At least not directly. However, this is where the genius in [MD02] shines. We sample an assignment from $S$ as follows. With probability $F(n-1,b)/F(n,b)$, set $x_n = 0$ and with probability $F(n-1,b-a_n)/F(n,b)$, set $x_n = 1$. Once we have this done, use the dynamic program above to recursively set the values of the other variables until we are done. The claim here is that the resulting assignment is a valid solution to the inequality and is sampled uniformly at random from $S$. How? Here is why.

To see why the resulting assignment is in $S$ is easy. Once we assign the value to a variable, the dynamic program lets us sample the value of the next variable (or the previous one, whichever way you see it) based on the number of solutions in which the next (previous) variable is assigned a particular value. In other words, say we just assigned the value to $x_4$. Then, if $x_3 = 1$ is not a valid solution, the dynamic program will take care of it by itself.

To see why the resulting assignment is uniformly random, let us compute the probability $p$ that a solution $\{ x_1 = v_1, \dots, x_n = v_n \}$ is produced for some $v_i \in \{0,1\}$ such that $\sum_{j=1}^n a_j v_j \leq b$. We can write $p = Pr \{ x_n = v_n \mid x_1 = v_1, \dots, x_{n-1} = v_{n-1} \} Pr \{x_1 = v_1, \dots, x_{n-1} = v_{n-1} \}$. Since the assignment to $x_n$ is independent of other variables, we can write this as $p = \frac{F(n-1,b-a_n)v_n + F(n-1,b)(1-v_n)}{F(n,b)}Pr \{x_1 = v_1, \dots, x_{n-1} = v_{n-1} \}$. Now, we can expand $Pr \{x_1 = v_1, \dots, x_{n-1} = v_{n-1} \}$ similarly as $Pr \{x_{n-1} = v_{n-1} \mid x_1 = v_1, \dots, x_{n-2} = v_{n-2} \} Pr \{x_1 = v_1, \dots, x_{n-2} = v_{n-2} \}$. Again, the assignment of $x_{n-1}$ is independent of all others and hence, $Pr \{ x_{n-1} = v_{n-1} \} = \frac{F(n-2,b)(1-v_{n-1}) + F(n-2,b-a_{n-1})v_{n-1}}{F(n,b)} + \frac{F(n-2,b-a_n)(1-v_{n-1}) + F(n-2,b-a_n-a_{n-1})v_{n-1}}{F(n,b)}$. Keep going on like this to obtain that the probability $p = 1/F(n,b)$, which is what we wanted to prove.

Hence, we just saw a technique using counting and dynamic programming that allows us to sample exactly from a set of objects which would be difficult to sample from in general. This technique can be extended to many other combinatorial problems, however, the dynamic program to count may not be straight forward. However, it should now be clear that if exact counts are replaced by approximate counts (through some technique), then uniform sampling becomes almost uniform sampling. An important application of this is approximate counting of the number of perfect matchings in a given graph and then sampling (almost) uniformly at random one of these matchings.

The reverse direction is, however, pretty straight forward. If we knew how to sample, we can count easily. However, an exact count may not be possible using this approach, but we can count to a precision that is arbitrarily close to the exact count by using a technique similar to rejection sampling.

I hope this post was helpful in understanding this fascinating connection between sampling and counting of combinatorial objects. It is a huge area of research and lots of interesting applications have been explored. I am trying to learn more in this area and hope to come up with more fascinating examples in the future posts. Until then, ciao!

References

[MD02] Dyer, Martin. “Approximate counting by dynamic programming.” In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pp. 693-699. ACM, 2003.

[WK16] https://en.wikipedia.org/wiki/Knapsack_problem

[WL75] Wolsey, Laurence A. “Faces for a linear inequality in 0–1 variables.” Mathematical Programming 8.1 (1975): 165-178.

# Oddtown Chronicles

Hola!

Today, I will be talking about a classic puzzle that uses linear algebra to attack a combinatorics problem. I first learned of this puzzle during our research group seminar here at UNM, where Jared presented this problem to us as an exercise. Later on, I read through a couple of formalizations for the same, and will now be presenting my take on the problem taking helpful references from [MIT307], [UC726] and [UC07]. My interest in this problem originates from the fact that upon first hearing the problem statement, it doesn’t strike to be something that can be made easier using linear algebra, however, the approach baffles my mind every time I think about it. Let’s dive in.

The problem, which is referred to as Elwyn Berlekamp Theorem in [UC726], is as follows: In Oddtown, there are $n$ citizens and $m$ clubs satisfying the rules that each club has an odd number of members and each pair of clubs shares an even number of members. We then need to show that $m \leq n$, i.e. the number of clubs cannot be more than the number of citizens.

Before proceeding to the proof, I must point out that this problem has a close connection to Fisher’s inequality [MIT307], which is similar to a slight modification of the original problem. Apart from this, it is related to the study of designs, set systems with special intersection patterns. [MIT307] shows how such a system can be used to construct a graph on $n$ vertices, which does not have any clique or independent set of size $\omega \left( n^{1/3} \right)$. I will briefly talk about these results towards the end of this post.

Let us now focus on proving that the number of clubs in Oddtown cannot exceed the number of citizens. Formally, let $\mathcal{C} = \{ C_1, \dots, C_m \}$ be the set of clubs in Oddtown and $\mathcal{P} = \{ P_1,\dots,P_n \}$ be the set of citizens (people). We start by paying attention to the fact the no two clubs are allowed to share an odd number of members. The entire problem is full of even-odd type of constraints, which suggests that we should attack this problem using some form of parity checking. The easiest such setting is to work in the field with characteristic 2. In other words, we will perform all addition and multiplication operations modulo 2 from now on.

Within this setting, for each club $C_i \in \mathcal{C}$, define its membership vector $v_i = \left[ v_{i,1},\dots,v_{i,n} \right] \in \mathbb{Z}_2^n$, where $v_{i,j} = 1$ if citizen $j$ is a member of club $i$, and 0 otherwise. Intuitively, this is similar to each club maintaining a ledger in which all the citizen names are listed and only those names are marked that correspond to member citizens for that club. These ledgers must satisfy the property that for any pair of clubs, if we compare their ledgers, the number of common members must be even (which allows no common members as well). Mathematically, we can represent this constraint using dot product of the membership vectors of the clubs.

Notice that if the number of citizens common to any given pair of clubs is even, then in our field, the dot product of the membership vectors of these clubs will be zero. Hence, in the magical world of modulo 2, all clubs have membership vectors that are orthogonal to each other.  More importantly, note that none of these vectors is identically zero since the number of members in each club must be odd. Hence, the puzzle now reduces to asking the maximum number of pairwise independent vectors that exist in $\mathbb{Z}_2^n$. This is where our friend, linear algebra, kicks in.

Form a matrix $M \in \mathbb{Z}_2^{n \times m}$ whose columns are the membership vectors of the clubs. Note that since all columns are pairwise independent and none of them are identically zero, the rank of this matrix is exactly $m$. However, we know that the rank of any matrix must never exceed the number of rows or the number of columns. Voila! We’re done. The number of rows in this matrix is $n$, which immediately proves that $m \leq n$. Try holding your right ear using your left hand while swirling your arm around the back of your head. I felt exactly like that my first time with this puzzle!

Another way to see this without forming the matrix is by noticing that the number of pairwise independent vectors in $\mathbb{Z}_2^n$ cannot be more than the dimension of $\mathbb{Z}_2^n$, which is $n$. Hence, the same result as above. No matter how you prove it, we always obtain the condition we wanted to prove. See! that’s why we must befriend linear algebra.

A small technicality. Can we have exactly $n$ clubs, if not more? The answer is, trivially, yes. Since the number of common members to any pair of clubs must be even, set it to zero! Have a dedicated club for every man and you’re done. If you call this cheating, think about this. How cool will be it to have your own clubhouse?!

Maximality

Given this construction above, we can prove some more exciting results about the fate of people in Oddtown. Specifically, we can prove that for every $n$, we can always save the Mayor’s money by building no more than two clubs that satisfy all the constraints. Furthermore, we can prove that this system is maximal, in the sense that no more clubs can be added without violating at least one of the conditions. Let’s see how.

So the task at hand is to divide the $n$ citizens into 2 clubs, say $C_1$ and $C_2$, such that (1) the number of citizens common to both the clubs is even, (2) the number of citizens in each club is odd and 3) adding even one more club will violate (1) and (2). If $n$ is even, then one way to divide the citizens is to allocate one citizen to the first club and the remaining citizens to the second club. This satisfies (1) and (2). To see if this satisfies (3), add another club, say $C_3$ to the town. Since this club must have even number of common members with the other two clubs,  the number of common members between $C_3$ and $C_1$ must be zero. This requires all the members in $C_3$ to also belong to $C_2$, which immediately gives a contradiction since (2) will require this number of common members to be even while (1) requires them to be odd. Hence, this distribution of citizens into two clubs is maximal.

What happens when $n$ is odd? Trivially, put all the members into one club and the problem is solved. The second club is not needed at all. The addition of any more clubs will not be possible because of the same argument as above. Hence, one club is maximal in this case. Thus, in both the cases, no more than two clubs are required for a maximal membership of citizens.

Fisher’s inequality

We now discuss a result that is closely related to a slight modification of the rules in Oddtown. We remove the restriction on the size of the clubs and require that every two clubs share a fixed number $k$ of members. We assume that if two clubs have exactly the same members, then they are the same. Fisher’s inequality then states that the number of non-empty clubs is at most $n$, similar to the result above. The proof of this inequality is slightly involved, although the basic principle is the same. We consider the membership vectors of the clubs and prove them to be linearly independent in some field, which in this case will be the field of real numbers $\mathbb{R}^n$.

To see how this proof works, an important observation needs to be made : There is at most one club with exactly $k$ members. Wondering why? Well, let’s assume otherwise and try to get a contradiction. Let there be at least two clubs with exactly $k$ members. Then each of these must have the same members by the condition of the problem. This contradicts the fact that these clubs are distinct (because of our assumption) and hence, we have proved that at most one club can have exactly $k$ members.

Now, with this critical observation in hand, we proceed with the proof as follows. Let $\mathcal{C} = \{ C_1,\dots,C_m \}$ be the clubs of size $|C_1|,\dots,|C_m|$, respectively (assuming each of these is non zero). The size here refers to the number of members in the clubs. Represent by $C_i \cap C_j$ the set of common members in clubs $C_i$ and $C_j$. Then, according to the given problem, $\left| C_i \cap C_j \right| = k$ for each pair $i,j \in \{ 1,\dots,m \}$ with $i \neq j$. We define the membership vectors of each club similar to the proof above as $v_i = \{ v_{i,1},\dots,v_{i,n} \}$, where $v_{i,j} = 1$ if citizen $j$ belongs to club $i$ and 0 otherwise. Clearly, in $\mathbb{R}^n$, we have the dot product of any $v_i$ and $v_j$ to be $k$, if $i \neq j$. All we have to do now is to prove that these membership vectors are linearly independent in $\mathbb{R}^n$.

To see this, assume they are not. Then, using standard practice in such proofs, assume there exist real numbers $\alpha_1,\dots,\alpha_m$ such that $\sum_{i=1}^m \alpha_i v_i = \mathbf{0}$, where $\mathbf{0}$ is the zero vector in $\mathcal{R}^n$. Then, we must also have $\parallel \sum_{i=1}^m \alpha_i v_i \parallel_2 = 0$, where $\parallel v \parallel_2$ denotes the 2-norm of the vector $v$. Hence, we have $0 = \left( \sum_{i=1}^m \alpha_i v_i \right) \cdot \left(\sum_{i=1}^m \alpha_i v_i \right)$, since the 2-norm can be written as the dot product of the vector with itself. We can rewrite this sum as $0 = k \left( \sum_{i=1}^m \alpha_i \right)^2 + \sum_{i=1}^m \alpha_i^2 \left( |C_i| - k \right)$. Using our critical observation now, each of the two terms on the right of this equation are non-negative, which implies that they are both identically zero.

Hence, we have (1) $\left( \sum_{i=1}^m \alpha_i \right)^2 = 0$, which implies $\sum_{i=1}^m \alpha_i = 0$, and (2) $\sum_{i=1}^m \alpha_i^2 \left( |C_i| - k \right) = 0$. From (2), since at most one club, say $C_r$ can have exactly $k$ members, each $\alpha_i = 0$ whenever $i \neq r$. However, then, from (1),  $\sum_{i=1}^m \alpha_i = \alpha_r = 0$, which contradicts the fact that the membership vectors are linearly dependent. Hence, Fisher’s inequality holds.

Fisher’s inequality has a number of cool applications. Two interesting examples are presented in [MIT307], where they prove the following : (1) For a fixed $k$, let $G$ be a graph whose vertices are triples $T \in {[k] \choose 3}$ and $\{ A, B \}$ is an edge if $|A \cap B| = 1$. Then $G$ does not contain any clique or independent set of size more than $k$. (2) Suppose $P$ is a set of $n$ points in the plane, not all on one line. Then pairs of points from $P$ define at least $n$ distinct lines.

I hope you liked this post of mine. My next post will likely be on another interesting puzzle, maybe from one of the CACM journals. Until then, stay tuned. Ciao!

References

[MIT307] http://math.mit.edu/~fox/MAT307-lecture15.pdf

[UC07] http://people.cs.uchicago.edu/~laci/REU07/appuzzles.pdf

# Notorious Coins

Have you ever found yourself reading a problem in an exam and never forget it again because it gives you chills as well as makes you want to solve it every time you come across it? Well, the geek in me found one such problem in the final exam I took with my advisor Jared in the course CS561 (Data Structures and Algorithms) at UNM. I wanted to share this awesome problem with you, with the hope that you will find it as challenging and intriguing as I did. So here is what it says.

You and an opponent are playing a game using a row of n coins of values $v_1, \dots, v_n$ where $n$ is even. Players take turns selecting either the first or last coin from the row, removing it from the row, and receiving the value of the coin. Assume you play first. Following are some examples, assuming optimal play for both players:

• 2, 4, 8, 10 : You can collect maximum value 14 (10 + 4).
• 8, 20, 3, 2 : You can collect maximum value 22 (2 + 20).

Assume that your opponent does not always play optimally. In particular, if $k$ coins remain, then they choose the optimal move with probability $p_k$ (for example $p_k$ may decrease as $k$ grows). Describe an algorithm to optimize your expected winnings.

Sounds challenging, right! This was one of the bonus problems in the final exam, which still gives me chills every time I attack it. However, with a very clever use of dynamic programming, this problem can be attacked in a nice way. I will first focus on the easy case first, where $p_k = 1$. This means the opponent plays optimally always. For this easy case, before jumping to the dynamic programming solution, let us first try to convince ourselves why a greedy approach won’t work (this is standard practice with dynamic programming problems).

Greedy (non optimal) solution for $p_k = 1$

To see why a greedy solution won’t work for this problem, we have to find an example in which the greedy solution fails to provide an optimal solution. I will formalize my proof once we understand this example.

So, the problem at hand is to devise a sequence of coin values on which the greedy approach fails. Recall that the greedy approach picks the coin of highest value from either end of the sequence, alternating between the players. Assuming I go first, consider the sequence $S = 8, 20, 3, 2$, the same sequence as in the example above. The greedy approach makes me pick coins with values 8 and 3, making my total value 11. However, as pointed above, the optimal value is 22, which is significantly higher. KMN!

So why does the greedy approach not work? To see this, we use this trick of constructing a sequence $S$, which given a value $v$, gives me a total value of $v$ if I use the greedy approach, but a value $v' > v$ in the optimal play. This would prove that we will need to dig a bit deeper to solve this problem. Here is how this construction works.

Let $v > 4$ be given as the value. (Note : The lower bound on $v$ is just a technicality. The solution can be generalized to all $v > 0$ easily, although I will spare that discussion here. Also, I am assuming that coin values can be arbitrary rational numbers and not necessarily integers. However, I again claim that extending the solution to integer coin values is fairly trivial and hence, I leave that discussion from this post.)

Consider this sequence : $\left[ \frac{v}{2} -1, \frac{v}{2} - 1, \frac{3v}{4}, \frac{v}{2}+1 \right]$. For example, given $v = 40$, the sequence we are constructing is $[19, 19, 30, 21]$. Using the greedy solution, the total value I obtain is 21 + 19 = 40, as required. However, an optimal strategy will yield me a value of 19 + 30 = 49. Generalizing this, the greedy strategy yields a total value of $\frac{v}{2} +1 +\frac{v}{2} -1 = v$, while the optimal play yields $\frac{v}{2} -1 + \frac{3v}{4} = \frac{5v}{4} - 1$, which is more than $v$ when $v > 4$. So, what on Earth is this optimal strategy!!

[Enter dynamic programming.]

Dynamic programming solution for $p_k = 1$

In cases of optimization like these, if greedy doesn’t work, dynamic has to. Ofcourse, this comes at the cost of efficiency and the requirement of precomputing the entire solution in the beginning, but hey! correctness is way more important than efficiency here. So, let’s take a deep breath and dive in.

Let $n > 0$ and $S = [v_1,\dots,v_n]$ be a sequence of coin values. Assuming I play first, define $f(i,j)$ to be the maximum value of the coins I can obtain using an optimal play in the sequence $[v_i,\dots,v_j]$ only. Clearly, if $i > j$, then $f(i,j) = 0$ and when $i = j$, then $f(i,j) = v_i$. Another easy case is $f(i,i+1) = \max \{ v_i, v_{i+1} \}$. We are interested in the case when $i < j-1$. Let me first state the solution and then justify it.

High five if you just rolled your eyes! Understanding this solution isn’t hard if you keep in mind that both players are playing optimally. This means both of them are trying to maximize their total values. Hence, when I am presented with a choice of two coins, I need to look ahead into what possible strategies my opponent can play next, and then choose a move that maximizes my gain over all those strategies. The max in front of the solution justifies this fact. I can either choose coin $i$ to gain value $v_i$ or I can choose coin $j$ to gain value $v_j$. This choice will depend on what my opponent may do next. Let us discuss these two cases separately.

Case I : If I choose coin $i$, my opponent has to choose a coin from values $[v_{i+1},\dots,v_j]$. If he chooses $v_{i+1}$, I will then be making my choice from values $[v_{i+2},\dots,v_j]$, in which case I gain $f(i+2,j)$, by defn. Else, if my opponent chooses $v_j$, I will have to make my choice from the values$[v_{i+1},\dots,v_{j-1}]$, in which case I gain $f(i+1,j-1)$. Since the play is optimal for both of us, I will assume my worst case for his choice and minimize over these two cases. Hence, the equation on the top.

Case II : Works similarly.

Clever, right! The beauty of this solution lies in the fact that even without knowing what my opponent will choose, just knowing his strategy is able to let me exactly compute what he will go for. Note that using techniques like memoization, I can precompute my optimal value in just $\mathcal{O}(n)$ time, which is linear in the number of coins. Once I have this precomputed solution, all I need to do is make my evil laugh! Bwahahaha!!!

Not so optimal opponent

All good so far, but beware! Common sense is not so common in this world. Finding such an intelligent player may be rarer than we want. So, why not give the poor opponent some slack and allow him to deviate from optimality with some probability. More formally, let $p_k$ be the probability that when $k$ coins remain, the opponent chooses to play optimally, and not otherwise.

An important assumption to compute the expected value I gain here is that I know $p_k$ ahead of time for all $k$. If not, I may be in trouble. We can argue about how practical is it to assume this, but we can always engineer the game to make this assumption hold. I will simply let my opponent flip a coin in which the probability of heads is $p_k$ when he has to choose a coin from $k$ coins. Refer to my previous post to ensure that I can always have such a coin with me.

Now, assuming that I have found a friend who is ready to bow to my rules of this game, how can I compute my expected value at the end of the game? Well, not so tricky this time! The min in the equation will just be replaced by max when the opponent is not playing optimally. Voila! Here we go. Thus, out final solution is that $f(i,j) = 0$, when $i > j$, it is $v_i$ when $i=j$ and $\max \{ v_i, v_{i+1} \}$ if $i = j-1$. Remember that I am still playing optimally. For $i < j-1$, we will now have the following. (Note, here $\mathbb{E}(f(i,j))$ stands for the expected value of $f(i,j)$.)

Double high five if you rolled your eyes again! One more crazy step? Why assume that I play optimally? I have to be realistic right? Similar to what I am assuming for my opponent, let me assume that with probability $q_k$, I play optimally when given $k$ coins, and not optimally otherwise. I will then use a similar trick, where I will choose the max in the front of the equation above with probability $q_{j-i+1}$, and I will choose min otherwise.

We’re done! So much for choosing coins! As you saw, dynamic programming has the potential to make us rich. Of course, at the cost of going through the math. An interesting question before I close this post. What if I do not know $p_k$ and/or $q_k$? As stated earlier, I can engineer my opponent to play such that I know the probability of optimal play every time he makes a choice, but to be more considerate of his feelings, I need to make myself more adaptive of his moves. Every time I realize that he is not picking the coin which the optimal strategy suggests, I will have to revise my knowledge of his probability of erring on the side of non-optimality. We have to agree that such a strategy will only work when the number of coins is large, to begin with, in which case I can hope to come close to guessing his exact probability of erring. In any case, realism is not a friend of mathematics, especially when it comes to modeling human behavior.

Thanks for reading this post. Stay tuned for my next post, which is likely to be on some cool application of linear algebra in a puzzle type setting. Until then, ciao!

# “Irrational” Coin Flips

Recently I came across this very interesting problem : How can you simulate an unbiased coin flip with a biased coin? What about the other way round?.

Although the problem is not new (to me as well), this time I found myself with a good solution for the same, which I though would be nice to share here. I will try to provide my implementation (in Python3) of the problems and also discuss some theoretical aspects of the more interesting problem : What is the minimum number of biased coin flips required to generate an unbiased flip? I will be referring to the paper [MU08] for this.

Unbiased Flips from biased coins

So, let’s talk about the easy case first. Given an unbiased coin, how do I simulate a biased flip. Slight formalization : Let a coin flip be either $0$ or $1$, where $Pr (1) = p$ and $Pr(0) = q = 1-p$ for some $p \in (0,1)$. Thus, a flip is said to be unbiased if $p = 1/2$ and biased, otherwise.

We have two sub-cases here. First, when $p$ is not known to us, and second, when it is. For the first case, we have to find a way around simulating an unbiased flip without making any assumption about $p$ in our algorithm. An important motivation for dealing with such a case is when dealing with sources of randomness whose accuracy is not known. With this respect, I will be briefly discussing the following question : How can we deal with biased coin flips on a bit-precision computer? In other words, $p$ is never really a perfect real number, when working on modern computers, due to the bit precision. We can only make it as small as the smallest floating point number that the computer can represent in its word size. So, how does our solution change in such a situation? Turns out that the solution for unknown $p$ works here as well. Try reasoning this out yourself (or leave a comment otherwise) after reading the algorithm below.

Consider flipping the biased coin twice. Note that among the four possible outcomes, the events $01$ and $10$ are equally likely. Each occurs with probability $pq$. Thus, the following algorithm immediately becomes a candidate solution to our problem.

def unbiasedFlipCase1():
(x,y) = (flip(), flip())
if x==y:
return biasedFlipCase1()
else:
return x

Here, $\texttt{flip()}$ is a function that returns $1$ with probability $p$ and $0$, otherwise. Recall that $p$ is unknown to us, and we don’t care about it either as long as we are assured that all calls to $\texttt{flip()}$ are identical and independent of all previous calls. To see why this solution works, it will suffice if we show that $Pr(\texttt{unbiasedFlipCase1() = 1}) = 1/2$. This can be easily seen through the geometric sum $\sum_{k \geq 0} \left( p^2 + q^2 \right)^k pq = \frac{pq}{1 - \left( p^2 + q^2 \right)} = \frac{1}{2}$.

Neat, right! So the next time someone flips a coin for you to make some decision and you are not sure if you trust the coin, flip it twice and do as the above algorithm suggests and you can always be sure of an unbiased outcome. One can also see that the expected number of recursive calls before this function returns something is computed as $\sum_{k \geq 0}2k \left( p^2 + q^2 \right)^k pq = \frac{p^2 + q^2}{2pq} = O(\frac{1}{p(1-p)})$. Can we do better than this? Turns out that if $p$ is unknown, we probably can not.

Let us now turn our focus to the known $p$ case. Wondering why I am discussing this after the seemingly harder case? The reason(s) will be apparent shortly. Note that when we know $p$, we can take advantage of this knowledge to produce an unbiased bit in fewer expected number of steps than the above algorithm, which also works perfectly fine here. A little math before we proceed with this thought : let us look at the variance of the number of recursive calls of the above function. This is easily calculated to be $\frac{p^2 + q^2}{4(pq)^2}$. Hence, by using Chebyshev bounds from [AS72], the probability that the number of recursive calls differs by more than $k$ from the mean is at most $\frac{p^2 + q^2}{4(pqk)^2}$, which decreases quadratically as fast as $k$. Hence, it is unlikely that this number deviates too much from the mean, implying that the solution is efficient. But, it is the most efficient we can get? The answer is negative. We can do much better than this. Let’s see how.

Recall the concept of entropy and information from any of your undergraduate probability theory classes. The entropy of a biased coin flip is given by $H(p) = -p\log p - (1-p)\log (1-p)$, which denotes the average information gained on one flip of this biased coin. You can now see where I am going with this. [MU08] shows that the most we can get out a biased coin is this information, which directly implies that $\left \lceil \frac{1}{H(p)} \right \rceil$ flips of the biased coin will produce one unbiased flip. (Note that [MU08] was not the first person to prove this result. However, I find the discussion in this report very approachable.)

For $p = \frac{1}{2}$, we have $H(p)$ attains its maximum value of $1$, which gives $\left \lceil \frac{1}{H(\frac{1}{2})} \right \rceil = 1$. Hence, one coin flip suffices, which justifies well with intuition. As $p$ deviates in either direction, $H(p)$ decreases and hence, the number of biased coin flips increases. Once we have these flip outcomes, a decision is made to return $0$ or $1$ using the Advanced Multilevel algorithm as in [MU08]. Comparing $\left \lceil \frac{1}{H(p)} \right \rceil$ with $\left \lceil \frac{p^2 + q^2}{2pq} \right \rceil = \left \lceil \frac{2p^2 - 2p + 1}{2p(1-p)} \right \rceil$, here is the plot I obtained.

As you can see, the entropy based solution always produces lesser number of biased coin flips. Here, I have only varied $p$ in increments of $\frac{1}{100}$, but the graph remains similar for more refined values as well. However, the entropy-based algorithm is quite complicated to understand (at least for me). If you have any interest in coding theory or information theory, help me understand it!

Biased Flips from an unbiased coin

Let us now turn look at the case of simulating a biased coin flip from an unbiased one. On the face of it, this problem seems to be more of a mathematical puzzle than one having any perceivable real word application, but that’s not even remotely true. A ton of applications require us to make decisions which are not always equally favorable to all its possible outcomes. A very simple example being as follows. Suppose we want to simulate the outcome of the sum of a pair of six-sided die. Clearly, all outcomes here are not equally likely. It becomes increasingly impractical to throw two die everytime we want a sample. If only we had a way around!

Unbiased coin flips to the rescue! It turns out that any biased coin flip can be simulated by an appropriate number of unbiased coin flips. The only caveat here is that the unbiased coin flips must be perfectly random, or else we land into the territory of another very interesting question : Given a biased coin with $Pr(1) = p_1$, what is the minimum number of coin flips required to generate a biased flip with $Pr(1) = p_2$ for some $p_2 \neq p_1$? Although I will not spend much time on this question in this post and assume that all unbiased coin flips are available perfectly to me, a brief answer to this question is to consider the maximum amount of information that is obtained about the coin to be simulated from a single flip of the coin used in the simulation. Then, the problem becomes very similar to the technique discussed in [MU08], as described above.

So, how should we pray to the perfection of our unbiased flip in order to be bestowed with a biased Gold coin? Let us discuss an easy-to-understand algorithm to begin with. Recall that if we were instead given a uniform random number generator (RNG), say $\texttt{rand()}$, which produces a real number in the range $0,1$ uniformly at random, we could have easily been able to simulate a biased flip by the following algorithm.

def biasedFlipFromRNG(p):
u = rand()
if (u<=p):
return 1
else:
return 0

This works because the probability that a uniformly generated random number is no more than $p$ is exactly $p$. However, the world is not so generous to bestow us with such a Godly RNG. Lose no hope! There is an easy fix.

We can simulate the algorithm above exactly by treating our sequence of $0$‘s and $1$‘s, as generated by repeatedly flipping our unbiased coin, as the binary representation of a special number, which I call $\tilde{u}$. I use the tilde sign to partially indicate that it is just an estimate of the actual $u$ in the algorithm above. With this approach, all we have to do is flip the unbiased coin, say $n$ times (starting with $n=1$, of course) and record the outcomes in an ordered tuple $(b_1, \dots,b_n)$, and then obtain our estimate at the end of $n^{th}$ iteration as $\tilde{u}_n = \sum_{i = 1}^n b_i 2^{-i}$. Now comes the trick! Let $p_n \leq p$ be a number whose binary expansion matches that of $p$ up to $n$ bits and then it is all zeros. For example, if $p = 0.125$ and $n = 2$, then $p_n = 0$ since the first two bits in the binary expansion of $p$ are zero. Compare $\tilde{u}_n$ with $p_n$. If it so happens that $\tilde{u}_n = p_n$, then repeat this process with $n+1$. Otherwise, return $1$ if $\tilde{u}_n < p_n$ and $0$ if not. The rationale behind this approach is that once $\tilde{u}_n$ becomes less than $p_n$, no further addition of bits will make it larger. Similarly for the case when $p_n$ is smaller. Only when the two are equal, can we make no decision. The algorithm to do the above is summarized in the code below.

def getBinDigit(i,p,q):
if (q + 2**(-i) <= p):
return 1
else:
return 0

def biasedCoinFlip(p):
flip = 1
value = 0
while(True):
unbiasedFlip = unbiasedCoinFlip()
decimalDigit = getBinDigit(flip,p,value)
if decimalDigit:
value = value + 2**(-flip)
if unbiasedFlip < decimalDigit:
return 1
if unbiasedFlip > decimalDigit:
return 0
flip = flip+1

Here, the function $\texttt{unbiasedCoinFlip()}$ simulates the unbiased coin for us. Note that in the code above, I do not produce the entire tuple of bits for $p_n$ every time. Instead, I generate bits only if I require them. A natural question that arises here is the expected number of iterations of the $\texttt{while}$ loop before a bit is returned by the function. But first, let us convince ourselves that the probability that this function returns $1$ is indeed $p$.

To prove this, let us assume that the binary expansion of $p$ has bits $b_1,b_2,\dots$. Note that we are only considering the bits after the decimal point, since $p < 1$. Now, the question becomes the following. Given some $n \geq 1$ and a sequence of unbiased bits $u_1,u_2,\dots,u_n$, what is the probability of the event $\{ u_i = b_i \quad \forall 1 \leq i \leq n \}$? Clearly, this is equal to $2^{-n}$, since $p$ is fixed. Hence, with probability $2^{-n}$, we enter into yet another iteration of the $\texttt{while}$ loop. Once we are done with these iterations, we return $1$ if $\sum_{i=1}^{n+1} u_i 2^{-i} < \sum_{i=1}^{n+1} b_i 2^{-i}$, which is the same as $u_{n+1} < b_{n+1}$. This happens with probability $0$ if $b_{n+1} = 0$ and probability $\frac{1}{2}$ otherwise. Thus, the probability that we return $1$ after $k \geq 0$ iterations of the $\texttt{while}$ loop can be written as $\frac{2^{-k}b_{k+1}}{2} = b_{k+1}2^{-(k+1)}$.

We are almost there! All that remains to be done is to sum the above expression for $k \geq 0$, which gives $\sum_{k \geq 0}b_{k+1}2^{-(k+1)} = p$, by definition. Hence, the algorithm is correct. Phew! One last thing before I wrap up. How efficient is this algorithm? We can calculate the expected number of loop iterations before returning $1$ using the expression above as $\sum_{k \geq 0} kb_{k+1}2^{-(k+1)}$. Wait! This seems like a tricky sum. However, we can compute an upper bound for this sum. Writing $k = (k+1)-1$ and using $b_{k+1} \leq 1$ for all $k \geq 0$, we get that the average number of loop iterations before the algorithm returns a bit is no more than $2-p$. This may not be tight, but at least it tells us that in fewer than two iterations, the algorithm will return a bit, which is pretty fast if you ask me! A tighter analysis may be non-trivial since I am not aware of any closed form expression for $b_{k+1}$ in terms of $k$ and $p$. If you do, please let me know!

So people! I hope you enjoyed reading this post as much as I enjoyed writing it. Leave your comments below and I will try my best to address them. I have of course left out a lot of details and other cool algorithms to tackle the problems above, but I want to confess that I understand what I presented the most. I would love to hear about more techniques. Do stay tuned for my next post, in which I will talk about one of my other favorite topics : Dynamic programming. Until then, ciao!

References :

[MU08] Mitzenmacher, Michael. “Tossing a biased coin.” (2008).

[AS72] Abramowitz, M. and Stegun, I. A. (Eds.). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, p. 11, 1972.

# Mid South Theory Day 2016

Hey all!

I will be attending the Mid South Theory Day 2016 at the Louisiana State University on 9th December, 2016 to give a talk on our latest result in the field of secure multiparty interactive computation. Stay tuned for an awesome algorithm that compiles a noise free protocol (asynchronous) into a robust protocol that can tolerate a bit flipping oblivious adversary with an unbounded budget, with high probability while incurring a overhead that’s within log factors of the conjectured optimal. I will upload the slides here shortly.

# The ‘Clever’ Algorithm – HITS

Just last night I finished reading this paper [KLEIN99], that presumably acted as the starting point for the research into the popular Pagerank algorithm used by Google. It presents a clever algorithm for searching queries over the World Wide Web (WWW) by addressing two major problems (scarcity and abundance) faced by the state-of-the-art text-based search engines at the time. Scarcity appears when the queries are so specific that only a very few pages in the WWW contain the query word, while abundance is when this page count is huge, typically when dealing with broad-topic queries. The examples which the paper gives for such queries are “Does Netscape support the JDK 1.1 code-signing API?” for specific queries and “Find information about the Java programming language.” for broad-topic queries. Thus, to provide an efficient search, one needs to filter out the more relevant pages, which the author calls authoritative. However, an immediate consequence of this approach is the problem while quantifying the term authoritative. More concretely, in the words of Kleinberg, given a particular page, how do we tell whether it is authoritative?

A simple heuristic that strikes now is to order the pages, that contain the given input query, on the basis of their in-degrees. This way, we are giving more weight to pages that are referenced by a large number of pages in the WWW, indicating a sense of importance, or in this case, possible authority over the other pages. However, an easy hack around this situation is for the owner of a page to create a bunch of dummy purposes whose sole purpose is to have links pointing to the page of interest. This increase in in-degree will enhance the authority of the page to an arbitrary extent and make it appear as a search result for whatever query this owner wants.

The solution to the problem is to try creating a focussed subgraph of the WWW upon which the search will be performed. This graph should be small (to not overload the servers) and must contain many of the strong authorities (pages that are most relevant to the query). Kleinberg makes an attempt at this construction through the use of principal eigenvectors of $AA^T$ and $A^TA$, where $A$ is the adjacency matrix for a reduced subgraph $G_\sigma$ that is constructed as follows. A primary assumption made here is that $A$ has a non-zero spectral gap, which implies that the graph is expanding [KOT12].

Assume that the WWW is represented as a directed graph $G=(V,E)$ where the pages form the vertices and there is an edge $(i,j) \in E$ if page $i$ has a link to page $j$. Then, given a broad-topic query $\sigma$, determine a subgraph $S_\sigma$ on which the search query will be efficient to run. To obtain $S_\sigma$:

1. Select a parameter $t$, say 200, and obtain $t$ highest ranked pages for $\sigma$ using a text-based search engine (e.g. Alta Vista at the time). Call this set $R_\sigma$ (augment this with the links between pages in $R_\sigma$ to obtain a graph). Note that $R_\sigma$ is small (by keeping $t$ small) and potentially contains many strong authorities (relying on our faith in the search engine used). However, $R_\sigma$ fails to be rich in relevant pages. This is because of the problems in text based searching where queries like “What is a good search engine?” will probably never return any of the existing search engine websites in the output because most of these websites do not contain the words “search engine” in them.
2. Expand $R_\sigma$ by adding pages (and links) that enter and leave $R_\sigma$ and call this new graph $S_\sigma$. We have now added potential relevant pages to our graph under the assumption that these “neighbor” pages contain information that is crucial to the query.

Once this graph $S_\sigma$ has been obtained, some more tweaking needs to be done to avoid returning pages that contain many navigational links. For example, a page that contains many links to various parts of itself (frequently used in pages containing long articles with sections), will be assumed to be of high authority because we are biasing our pages based on their indegree. Hence, obtain the final graph $G_\sigma$ by removing all those edges from $S_\sigma$ which connect pages in $R_\sigma$. Finally, order pages in $G_\sigma$ in decreasing order of in-degree as an estimate of their authority.

The final trick employed by Kleinberg to weight authority of a page relative to its neighbors in $G_\sigma$ is to mark those set of pages as strong authorities which have a high in-degree in $G_\sigma$ and have a significant overlap in the pages linking to them. The idea is based on the existence of hubs, which are major sources of links to these high-in-degree pages. Essentially, this ensures that a page is authoritative if it pointed by a strong hub, and a hub is strong if it points to an authority. To work around this circularity, Kleinberg operates an iterative algorithm to update the degree of authority and degree of hub-ness, through parameters called authority weight and hub weight, respectively:

1. To each page $p$ in $G_\sigma$, assign an authority weight $x^{\langle p \rangle} = 1$ and a hub weight $y^{\langle p \rangle} = 1$.
2. Normalize these weights to maintain the invariant $\sum_p \left( x^{\langle p \rangle} \right)^2 =\sum_p \left( y^{\langle p \rangle} \right)^2 = 1$.
3. ($\mathcal{I}$-operation) Update the authority weights as $x^{\langle p \rangle} \gets \sum_{q:(q,p) \in E(G_\sigma)} y^{\langle q \rangle}$. Here, $E(G_\sigma)$ is the set of edges in $G_\sigma$. Essentially, this is saying that the authority weight of the page $p$ is updated as the sum of hub weights of all the pages pointing to it.
4. ($\mathcal{O}$-operation) Similarly, update the hub weights as $y^{\langle p \rangle} \gets \sum_{q:(q,p) \in E(G_\sigma)} x^{\langle q \rangle}$.
5. Normalize both $x^{\langle p \rangle}$ and $y^{\langle p \rangle}$ for each page $p$.
6. Repeat steps 3-5 until convergence.
7. Return the top $k$ pages (for required $k$) that have the highest authority weights.

The paper presents a proof that the weight vectors will converge (assuming a non-zero spectral gap in $G_\sigma$) and provides experimental evidence to show this typically happens within 20-30 steps in practice. Clearly, this clever reduction in the search space for the query to a graph that is very likely to contain the strongest authorities and many relevant pages has siginificantly reduced the search time as well as improved the quality of search results (as is evidenced by the results in the paper). This cleverness is hence, undoubtedly, used as a starting point for the giant of search engines today.

References :

[KLEIN99] Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.”Journal of the ACM (JACM) 46.5 (1999): 604-632.

[KOT12] Kotowski, Marcin, and Micha l Kotowski. “Lectures on expanders.” (2012).