Sunday, February 22, 2015

Introduction to tries through Facebook Hacker Cup 2015 Round 1 - Autocomplete

In the last week's post I discussed the solution to the problem "Homework" from Facebook Hacker Cup 2015 Round 1. This post is about the problem "Autocomplete" from the same round.

You can  read the problem statement available on the Facebook website. If you don't have a Facebook account, you can download problem statements from Codeforces Gym.

I will re-state the problem in fewer lines. You have a new phone which has an auto-complete feature. But, it will complete your word only if there is there is no ambiguity on how the word should be completed. Formally, it will complete a word iff there is exactly one way to complete the word. It so happens that the phone's dictionary is empty by default. So, for the auto-complete feature to work, before typing a word you need to add that word to the dictionary. You need to send a message containing \(N\) words. What is the minimum number of characters must you type to send all \(N\) words? (Note that the typing involved in adding a word to the dictionary is not counted.) You need to solve \(T\) instances of this problem.

Here are the constraints:
  • \(1 \le T \le 100\)
  • \(1 \le N \le 100,000\)

The solution idea is pretty straightforward. For any word \(w\), the minimum characters that have to be typed is equal to length of the shortest prefix \(p\) of \(w\) such that there is no word in the dictionary that has a prefix \(p\). It turns out that there is a very elegant solution using tries.

What is a trie?

A trie is a data structure used to store string keys for fast look-up. There are many resources online that have explained tries better than I can. So, I am just gonna give an brief overview here and point you to useful online resources. Later, I will how tries can be implemented easily in a programming contest environment.

Typically, a trie looks like this.

Notice that every node contains 26 pointers corresponding to each letter in the English alphabet. The topmost node in the above figure is called the root node. We see that only the pointers corresponding to M, P and T are depicted. We can assume the other pointers to be NULL i.e. there are no words starting with characters other than M, P and T. When we follow the M pointer we reach another node which has pointers only for A and E. This means that there are words in the trie with prefixes MA and ME. If we continue following the E branch we reach a node with a Δ symbol. The Δ symbol signifies that a word ends there. This means that the word \(\text{MENDEL}\) is present in the trie. This node also has a pointer corresponding to E. This is not the beginning of a new word. It is just a continuation of \(\text{MENDEL}\). Following the pointers leads us to the word \(\text{MENDELEEV}\). Similarly, the words \(\text{MAXWELL}\), \(\text{PASTEUR}\), \(\text{PAVLOV}\), \(\text{PEANO}\), \(\text{POINCARE}\), \(\text{POISSON}\) and \(\text{TURING}\) are present in the trie.

To know more about tries, follow these links:

If you have visited above links you would have noticed that tries have complicated and messy implementation involving dynamic memory allocation. This is unsuitable for programming contests. Here is a more contest-friendly trie implementation.
int a[N][26];
char s[N];
int n = 1;

scanf("%s", s);
int t = 1;
for (int i = 0; s[i]; ++i) {
  int p = s[i] – 'a'
  if (a[t][p] == 0) {
    n++;
    a[t][p] = n;
  }
  t = a[t][p];
}
The above shows how to implement the insert operation in a simple way. All we are doing here is simulating dynamic memory allocation in place of actually using it. In the above code a[][] denotes the trie and n denotes the index of the next available node. If you have read the GeeksforGeeks article it should not be very difficult to understand how the above code works. Can you figure out how to write code for the look-up operation? Here is my implementation:
bool isPresent(char *s) {
  int t = 1;
  for (int i = 0; s[i]; ++i) {
    int p = s[i] - 'a';
    if (a[t][p] == 0) return false;
    else t = a[t][p];
  }
  return true;
}
Know that we know how tries work and how to implement them, let's solve the original problem. As I pointed out earlier, we need to find, for every word, how much of the word has to be typed such that the rest of the word can be uniquely identified. This is the same as finding the shortest prefix of the word that is not present in the trie. Let's follow this algorithm:
for every word w do:
  add it to the trie in the following way:
    during adding count the length of the longest prefix that is already present
    call this length L
  add L + 1 to the answer
repeat
This algorithm leads to a simple implementation. You don't even need to write a separate function for look-up. Here is my implementation:
const int N = 1000050;

int a[N][26];
char s[N];

int main() {
  int t;
  scanf("%d", &t);
  long long ans;
  for (int tt = 1; tt <= t; ++tt) {
    printf("Case #%d: ", tt);
    int nn;
    scanf("%d", &nn);
    ans = 0;
    int n = 1;
    for (int i = 0; i < nn; ++i) {
      scanf("%s", s);
      int t = 1, ct = 0;
      bool end = false;
      for (int j = 0; s[j]; ++j) {
        int p = s[j] - 'a';
        if (!end) ct++;
        if (a[t][p] == 0) {
          n++;
          a[t][p] = n;
          end = true;
        }
        t = a[t][p];
      }
      ans += ct;
    }
    printf("%lld\n", ans);
    memset(a, 0, sizeof(a));
  }
  return 0;
}
You can test your solution by submitting it in the Codeforces Gym or by downloading the official input and output files. Check the Hacker Cup Facebook page for links.

Thanks for reading!

Saturday, February 14, 2015

Problem discussion: Facebook Hacker Cup 2015 Round 1 - Homework

It's been a long time since I last wrote on this blog and a lot has happened since then. The most important of all is that my personal website is up at anirudhrb.com and consequently this blog can now be accessed at blog.anirudhrb.com. Also, you can now reach through email at mail@anirudhrb.com.

Let's get down to business now. The online part of Facebook Hacker Cup 2015 concluded recently and the top 25 contestants have been selected to compete in the onsite finals. The problem I am writing about in this post appeared in the Round 1 (there were 3 rounds).

The title of the problem is "Homework" and here is the abridged statement:
Given three integers \(A\), \(B\) and \(K\), find the number of integers in range \(\big[A, B\big] \) that have a "primacity" of \(K\). Primacity of a number \(N\) is defined as the number of distinct prime factors of \(N\). You have to process \(T\) such queries.
The constraints are as follows:
  • \(2 \le A \le B \le 10^{7}\)
  • \(1 \le K \le 10^{9}\)
  • \(1 \le T \le 100\)
The first thing to notice here is that the number of prime factors of \(10^{7}\) is not very large. In fact, it is at most \(8\). So, the maximum primacity of a number within the given constraints is \(8\). To know more about this read about Primorial on Wikipedia.

We can modify the sieve algorithm to calculate the primacity of each number in the range \( \big[1, 10^{7}\big] \) in the following way: whenever you "cross-out" a number designating it as non-prime, increment the value at that position. This works because for any number \(x\), the primacity is equal to the number of primes less that \(x\) that divide it which is the same the number of times it is crossed out in the sieve algorithm.

Now that we know the primacity every possible number within the constraints, we need to think about how to answer the queries efficiently. Notice that the answer can be expressed as follows
answer = answer for \(\big[1, B\big]\) \(-\) answer for \(\big[1, A - 1\big]\)
We can answer queries in \(\mathcal{O}(1)\) time with some pre-computation. The pre-computation can be done this way
Let c[n][p] denote the number of integers in [1, n] with primacity p then,
c[n][p] = c[n - 1][p] + 1   if primacity of n is p
c[n][p] = c[n - 1][p]       if primacity of n is not p
We do this pre-processing for all possible values of primacity and answer each query in \(\mathcal{O}(1)\) time.

You can test the correctness of your solutions by submitting it in the Codeforces Gym (login required) or you can download the official input and output files and test your solution against them. Here's my C++ code (shortened). You can find the full code on GitHub.

const int N = 10000001;

bool mark[N];
int p[N], ct[N][8];

int main() {
  mark[1] = false;
  for (int i = 2; i < N; ++i) {
    if (mark[i]) continue;
    for (int j = 1; i * j < N; ++j) {
      mark[i * j] = true;
      p[i * j]++;
    }
  }
  for (int i = 2; i < N; ++i) {
    for (int j = 0; j < 8; ++j)
      ct[i][j] = ct[i - 1][j];
    ct[i][p[i] - 1]++;
  }
  int t, a, b, k;
  scanf("%d", &t);
  for (int tt = 1; tt <= t; ++tt) {
    scanf("%d %d %d", &a, &b, &k);
    printf("Case #%d: ", tt);
    if (k > 8) printf("0\n");
    else {
      int ans = ct[b][k - 1] - ct[a - 1][k - 1];
      printf("%d\n", ans);
    }
  }
  return 0;
}

I will be back next week with the next problem "Autocomplete" from the same round.

Thanks for reading!