pkgsrc-Users archive


textproc/cabocha memory usage fix



Hi all,
the attached patch finally brings the memory use of the "new" cabocha
version back below 2GB. That is the limit in my bulk builds and a very
reasonable limit in general. The exact RAM use depends on the STL
implementation, e.g. libc++ has a 32-byte std::string class because it
is optimised for short strings. The patch relies on two ideas:

(1) Avoid resizing the feature_trie_output vector. It is very large
(8M+ elements) and the pair_weight hash map is already huge (32M
elements). The patch counts the matching entries first and reserves
the exact size up front.

(2) Avoid storing the stringified keys for as long as possible. Most
importantly, defer the stringification until after the point where
pair_weight has been freed again.
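
For readers who want the two ideas in isolation, here is a minimal
sketch. It is not the cabocha code: std::unordered_map stands in for
cabocha's hash_map, and collect() and keep() are hypothetical names
with a simplified filter (the real one also checks freq and minsup).
It only shows the count-then-reserve pattern and the storing of compact
64-bit ids in place of strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified filter: keep weights outside (sigma_neg, sigma_pos).
bool keep(float w, float sigma_neg, float sigma_pos) {
  return w <= sigma_neg || w >= sigma_pos;
}

// Collect (id, weight) pairs with a single allocation of the exact size.
std::vector<std::pair<uint64_t, float> > collect(
    const std::unordered_map<uint64_t, float> &pair_weight,
    float sigma_neg, float sigma_pos) {
  // Pass 1: only count the entries that will be kept.
  std::size_t n = 0;
  for (const auto &kv : pair_weight)
    if (keep(kv.second, sigma_neg, sigma_pos)) ++n;

  // Reserve once, so push_back never reallocates and never needs the
  // old and the new buffer alive at the same time.
  std::vector<std::pair<uint64_t, float> > out;
  out.reserve(n);

  // Pass 2: store the 8-byte id, not a std::string key. Any
  // stringification is left to a later stage.
  for (const auto &kv : pair_weight)
    if (keep(kv.second, sigma_neg, sigma_pos))
      out.push_back(std::make_pair(kv.first, kv.second));

  assert(out.size() == n);
  return out;
}
```

The second pass over the map costs time but no extra memory, which is
the trade-off the patch makes as well.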

The code can likely be optimised for speed, since e.g. compareIds can
likely avoid the stringification, but getting it to work was my
priority.
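
A rough sketch of what a comparator without stringification could look
like follows. Everything in it is an assumption: splitId() and
encodeFixed() are hypothetical stand-ins for decodeFromUint64 and
encodeFeatureID, with a made-up layout (i1 in the high 32 bits, each id
encoded as 4 big-endian bytes). Comparing the decoded ids directly only
gives the same order as comparing the keys if the real encoding is
order-preserving in this way, which would have to be checked against
the cabocha source first.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <utility>

typedef std::pair<uint64_t, float> IdWeight;

// Hypothetical layout: i1 in the high 32 bits, i2 in the low 32 bits.
void splitId(uint64_t id, uint32_t *i1, uint32_t *i2) {
  *i1 = static_cast<uint32_t>(id >> 32);
  *i2 = static_cast<uint32_t>(id & 0xffffffffu);
}

// Hypothetical order-preserving encoder: 4 big-endian bytes per id.
std::string encodeFixed(uint32_t i1, uint32_t i2) {
  std::string s;
  s.reserve(8);
  const uint32_t ids[2] = {i1, i2};
  for (uint32_t id : ids)
    for (int shift = 24; shift >= 0; shift -= 8)
      s.push_back(static_cast<char>((id >> shift) & 0xff));
  return s;
}

// Compares (i1, i2, weight) lexicographically without building strings.
// Arguments are taken by const reference to avoid copying the pairs.
bool compareIdsNoString(const IdWeight &l, const IdWeight &r) {
  uint32_t l1, l2, r1, r2;
  splitId(l.first, &l1, &l2);
  splitId(r.first, &r1, &r2);
  return std::tie(l1, l2, l.second) < std::tie(r1, r2, r.second);
}

// Debug check: for distinct keys, the integer comparator must agree
// with the order of the encoded strings.
void checkAgreement(const IdWeight &l, const IdWeight &r) {
  uint32_t l1, l2, r1, r2;
  splitId(l.first, &l1, &l2);
  splitId(r.first, &r1, &r2);
  const std::string lkey = encodeFixed(l1, l2);
  const std::string rkey = encodeFixed(r1, r2);
  if (lkey != rkey)
    assert((lkey < rkey) == compareIdsNoString(l, r));
}
```

If the real encoding is variable-length or otherwise not
order-preserving, the output order would change, so the stringifying
comparator in the patch remains the safe choice.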

Joerg
Index: textproc/cabocha/distinfo
==================================================================
--- textproc/cabocha/distinfo
+++ textproc/cabocha/distinfo
@@ -3,5 +3,6 @@
 SHA1 (cabocha-0.67.tar.bz2) = 457a9bd0d264a1146a5eb1c5a504dd90a8b51fb8
 RMD160 (cabocha-0.67.tar.bz2) = 625c4b9a9f4bbc1f454e52988ecdc87e471631e0
 Size (cabocha-0.67.tar.bz2) = 111149677 bytes
 SHA1 (patch-configure) = 1f95f39ee33a95993079e51137e5b61cbfd6dadd
 SHA1 (patch-configure.in) = a6e6e448283821a543d4ed0f534a94f3fc54f1ad
+SHA1 (patch-src_svm.cpp) = 16be1dd67b858e85ba797805a46aec8cf19efd6d

ADDED    textproc/cabocha/patches/patch-src_svm.cpp
Index: textproc/cabocha/patches/patch-src_svm.cpp
==================================================================
--- textproc/cabocha/patches/patch-src_svm.cpp
+++ textproc/cabocha/patches/patch-src_svm.cpp
@@ -0,0 +1,100 @@
+$NetBSD$
+
+--- src/svm.cpp.orig   2014-02-17 23:53:53.000000000 +0000
++++ src/svm.cpp
+@@ -268,6 +268,23 @@ double FastSVMModel::classify(const std:
+   return score * normalize_factor_;
+ }
+ 
++static std::string encodeUint64(uint64_t i) {
++  unsigned int i1 = 0;
++  unsigned int i2 = 0;
++  decodeFromUint64(i, &i1, &i2);
++  return encodeFeatureID(i1, i2);
++}
++
++static bool compareIds(const std::pair<uint64_t, float> l, const std::pair<uint64_t, float> r)
++{
++  std::string lkey = encodeUint64(l.first);
++  std::string rkey = encodeUint64(r.first);
++  if (lkey == rkey)
++    return l.second < r.second;
++  else
++    return lkey < rkey;
++}
++
+ bool FastSVMModel::compile(const char *filename, const char *output,
+                            double sigma, size_t minsup,
+                            size_t freq_feature_size,
+@@ -358,7 +375,7 @@ bool FastSVMModel::compile(const char *f
+     std::vector<float> fweight1(feature_size, 0.0);
+     std::vector<float> fweight2(freq_feature_size *
+                                 (freq_feature_size - 1) / 2, 0.0);
+-    std::vector<std::pair<std::string, float> > feature_trie_output;
++    std::vector<std::pair<uint64_t, float> > feature_trie_output;
+ 
+     // 0th-degree feature (bias)
+     for (size_t i = 0; i < model.size(); ++i) {
+@@ -410,6 +427,20 @@ bool FastSVMModel::compile(const char *f
+       CHECK_DIE(sigma_neg <= sigma_pos);
+ 
+       // extract valid patterns only.
++      size_t output_weights = 0;
++      for (hash_map<uint64, std::pair<unsigned char, float> >::const_iterator
++               it = pair_weight.begin(); it != pair_weight.end(); ++it) {
++        const size_t freq = static_cast<size_t>(it->second.first);
++        const float w = it->second.second;
++        unsigned int i1 = 0;
++        unsigned int i2 = 0;
++        decodeFromUint64(it->first, &i1, &i2);
++        if (i1 < freq_feature_size && i2 < freq_feature_size) {
++        } else if (freq >= minsup && (w <= sigma_neg || w >= sigma_pos)) {
++          ++output_weights;
++        }
++      }
++      feature_trie_output.reserve(output_weights);
+       for (hash_map<uint64, std::pair<unsigned char, float> >::const_iterator
+                it = pair_weight.begin(); it != pair_weight.end(); ++it) {
+         const size_t freq = static_cast<size_t>(it->second.first);
+@@ -426,8 +457,7 @@ bool FastSVMModel::compile(const char *f
+           CHECK_DIE(index >= 0 && index < fweight2.size());
+           fweight2[index] = w;
+         } else if (freq >= minsup && (w <= sigma_neg || w >= sigma_pos)) {
+-          const std::string key = encodeFeatureID(i1, i2);
+-          feature_trie_output.push_back(std::make_pair(key, w));
++          feature_trie_output.push_back(std::make_pair(it->first, w));
+         }
+       }
+     }
+@@ -460,14 +490,17 @@ bool FastSVMModel::compile(const char *f
+       weight2[i] = static_cast<int>(fweight2[i] / normalize_factor);
+     }
+ 
+-    std::sort(feature_trie_output.begin(), feature_trie_output.end());
++    std::sort(feature_trie_output.begin(), feature_trie_output.end(), compareIds);
+     std::vector<size_t> len(feature_trie_output.size());
+     std::vector<Darts::DoubleArray::value_type> val(feature_trie_output.size());
+     std::vector<char *> str(feature_trie_output.size());
+ 
+     for (size_t i = 0; i < feature_trie_output.size(); ++i) {
+-      len[i] = feature_trie_output[i].first.size();
+-      str[i] = const_cast<char *>(feature_trie_output[i].first.c_str());
++      std::string key = encodeUint64(feature_trie_output[i].first);
++      len[i] = key.size();
++      str[i] = strdup(key.c_str());
++      if (str[i] == NULL)
++        abort();
+       val[i] = static_cast<int>(
+           feature_trie_output[i].second / normalize_factor) +
+           kPKEBase;
+@@ -502,6 +535,10 @@ bool FastSVMModel::compile(const char *f
+     CHECK_DIE(weight1.size() > 0);
+     CHECK_DIE(weight2.size() > 0);
+     CHECK_DIE(node_pos.size() > 0);
++
++    for (size_t i = 0; i < str.size(); ++i)
++      free(str[i]);
++    str.clear();
+   }
+ 
+   {


