Perplexity Correlations documentation

Here, you will find documentation for every API method in the perplexity-correlations pip package. See https://github.com/TristanThrush/perplexity-correlations for an overview of the package.

perplexity_correlations.estimation.product(X, y)

In expectation, this function returns a vector proportional to the optimal weight vector relating the per-LLM benchmark error vector (y) to the per-LLM, per-text bits-per-byte matrix (X). In addition to standard high-dimensional regression assumptions, we assume the relationship between X and y is:

y = f(<w,X> + e),

where w is the vector of optimal per-text weights that we want to estimate, e is zero-mean error, and f is a monotonically increasing function which we do not have to know.

This function uses the single-index model parameter estimator from Plan et al. (2016): https://arxiv.org/abs/1404.3749, which is the U-statistic:

x_k*y_k,

averaged over 1<=k<=N, where N is the number of LLMs, x_k is the per-text bits-per-byte vector of the k-th LLM, and y_k is the benchmark error of the k-th LLM.
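
For intuition, here is a minimal sketch (not the package's implementation) that simulates data from the single-index model above and computes this statistic directly; the sigmoid choice of f, the dimensions, and the noise scale are illustrative assumptions only:

>>> import numpy as np
>>>
>>> N, D = 100, 500
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((N, D))          # synthetic per-LLM, per-text bits-per-byte
>>> w_true = rng.standard_normal(D)          # optimal weights (unknown in practice)
>>> e = 0.1 * rng.standard_normal(N)         # zero-mean noise
>>> y = 1 / (1 + np.exp(-(X @ w_true + e)))  # y = f(<w,X> + e) with f = sigmoid
>>>
>>> # The statistic: y_k * x_k, averaged over the N LLMs.
>>> # In expectation (for Gaussian X), this average is proportional to w_true.
>>> manual_estimate = (X * y[:, None]).mean(axis=0)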

NOTE: This estimator is not robust to outliers in X or y.

Parameters:
  • X (numpy.ndarray) – An NxD matrix with bits-per-byte values of N LLMs on D pieces of text.

  • y (numpy.array) – An N-length vector with the benchmark error (1-accuracy) of the N LLMs.

Returns:

The D-length vector estimate of w.

Return type:

numpy.array

Raises:
  • ValueError – If X is not 2 dimensional.

  • ValueError – If y is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import product
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = product(X, y)
perplexity_correlations.estimation.sign(X, y)

In expectation, this function returns a vector proportional to the optimal weight vector relating the per-LLM benchmark error vector (y) to the per-LLM, per-text bits-per-byte matrix (X). In addition to standard high-dimensional regression assumptions, we assume the relationship between X and y is:

y = f(<w,X> + e),

where w is the vector of optimal per-text weights that we want to estimate, e is zero-mean error, and f is a monotonically increasing function which we do not have to know.

This function uses the single-index model parameter estimator from Chen & Banerjee (2017): https://proceedings.mlr.press/v70/chen17a.html, which is the U-statistic:

sign(y_g-y_k)*(x_g-x_k),

averaged over pairs 1<=k,g<=N, where N is the number of LLMs, x_k is the per-text bits-per-byte vector of the k-th LLM, and y_k is the benchmark error of the k-th LLM.
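
As a rough illustration (again, not the package's own code), the pairwise statistic can be written directly with numpy broadcasting; averaging over all ordered pairs, and the synthetic data below, are assumptions made for simplicity:

>>> import numpy as np
>>>
>>> N, D = 50, 200
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((N, D))  # synthetic bits-per-byte
>>> y = rng.random(N)                # synthetic benchmark error
>>>
>>> # sign(y_g - y_k) * (x_g - x_k), averaged over all pairs (g, k):
>>> sign_diff_y = np.sign(y[:, None] - y[None, :])  # shape (N, N)
>>> diff_X = X[:, None, :] - X[None, :, :]          # shape (N, N, D)
>>> manual_estimate = (sign_diff_y[:, :, None] * diff_X).mean(axis=(0, 1))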

NOTE: This estimator is not robust to outliers in X, but is robust to outliers in y.

Parameters:
  • X (numpy.ndarray) – An NxD matrix with bits-per-byte values of N LLMs on D pieces of text.

  • y (numpy.array) – An N-length vector with the benchmark error (1-accuracy) of the N LLMs.

Returns:

The D-length vector estimate of w.

Return type:

numpy.array

Raises:
  • ValueError – If X is not 2 dimensional.

  • ValueError – If y is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import sign
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = sign(X, y)
perplexity_correlations.estimation.sign_cdf(X, y)

In expectation, this function returns a vector of values with the same relative ranks as the values in the optimal weight vector relating the per-LLM benchmark error vector (y) to the per-LLM, per-text bits-per-byte matrix (X). In addition to standard high-dimensional regression assumptions, we assume the relationship between X and y is:

y = f(<w,X> + e),

where w is the vector of optimal per-text weights that we want to estimate, e is zero-mean error, and f is a monotonically increasing function which we do not have to know.

This function uses the single-index model parameter estimator from Thrush et al. (2024): https://arxiv.org/abs/2409.05816, which is the U-statistic:

sign(y_g-y_k)*(CDF(x_g)-CDF(x_k)),

averaged over pairs 1<=k,g<=N, where N is the number of LLMs, x_k is the per-text bits-per-byte vector of the k-th LLM, y_k is the benchmark error of the k-th LLM, and CDF computes the column-wise empirical CDF of the entries in the x vectors.
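
The sketch below is only illustrative (the package's implementation may differ); it replaces each bits-per-byte value with its column-wise empirical CDF value via ranks (assuming no ties) and then applies the same pairwise statistic:

>>> import numpy as np
>>>
>>> N, D = 50, 200
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((N, D))  # synthetic bits-per-byte
>>> y = rng.random(N)                # synthetic benchmark error
>>>
>>> # Column-wise empirical CDF of X (ranks scaled to (0, 1], assuming no ties):
>>> ranks = X.argsort(axis=0).argsort(axis=0) + 1
>>> cdf_X = ranks / N
>>>
>>> # sign(y_g - y_k) * (CDF(x_g) - CDF(x_k)), averaged over all pairs:
>>> sign_diff_y = np.sign(y[:, None] - y[None, :])
>>> diff_cdf = cdf_X[:, None, :] - cdf_X[None, :, :]
>>> manual_estimate = (sign_diff_y[:, :, None] * diff_cdf).mean(axis=(0, 1))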

NOTE: This estimator is robust to outliers in X and y.

Parameters:
  • X (numpy.ndarray) – An NxD matrix with bits-per-byte values of N LLMs on D pieces of text.

  • y (numpy.array) – An N-length vector with the benchmark error (1-accuracy) of the N LLMs.

Returns:

The D-length vector estimate of w.

Return type:

numpy.array

Raises:
  • ValueError – If X is not 2 dimensional.

  • ValueError – If y is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import sign_cdf
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = sign_cdf(X, y)
perplexity_correlations.estimation.sign_sign(X, y)

In expectation, this function returns a vector of values with the same relative ranks as the values in the optimal weight vector relating the per-LLM benchmark error vector (y) to the per-LLM, per-text bits-per-byte matrix (X). In addition to standard high-dimensional regression assumptions, we assume the relationship between X and y is:

y = f(<w,X> + e),

where w is the vector of optimal per-text weights that we want to estimate, e is zero-mean error, and f is a monotonically increasing function which we do not have to know.

This function uses the single-index model parameter estimator from the initial experiments of Thrush et al. (2024) (https://arxiv.org/abs/2409.05816), although the preprint does not yet document it. It is the U-statistic:

sign(y_g-y_k)*sign(x_g-x_k),

averaged over pairs 1<=k,g<=N, where N is the number of LLMs, x_k is the per-text bits-per-byte vector of the k-th LLM, and y_k is the benchmark error of the k-th LLM.
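
A minimal broadcasting sketch of this statistic (illustrative only, not the package's implementation, with synthetic data and ties assumed away) follows; note that each entry of the result is, up to normalization, Kendall's tau between y and the corresponding column of X:

>>> import numpy as np
>>>
>>> N, D = 50, 200
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((N, D))  # synthetic bits-per-byte
>>> y = rng.random(N)                # synthetic benchmark error
>>>
>>> # sign(y_g - y_k) * sign(x_g - x_k), averaged over all pairs:
>>> sign_diff_y = np.sign(y[:, None] - y[None, :])
>>> sign_diff_X = np.sign(X[:, None, :] - X[None, :, :])
>>> manual_estimate = (sign_diff_y[:, :, None] * sign_diff_X).mean(axis=(0, 1))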

NOTE: This estimator is robust to outliers in X and y.

Parameters:
  • X (numpy.ndarray) – An NxD matrix with bits-per-byte values of N LLMs on D pieces of text.

  • y (numpy.array) – An N-length vector with the benchmark error (1-accuracy) of the N LLMs.

Returns:

The D-length vector estimate of w.

Return type:

numpy.array

Raises:
  • ValueError – If X is not 2 dimensional.

  • ValueError – If y is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import sign_sign
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = sign_sign(X, y)
perplexity_correlations.estimation.spearmanr(X, y)

In expectation, this function returns a vector of values with the same relative ranks as the values in the optimal weight vector relating the per-LLM benchmark error vector (y) to the per-LLM, per-text bits-per-byte matrix (X). In addition to standard high-dimensional regression assumptions, we assume the relationship between X and y is:

y = f(<w,X> + e),

where w is the vector of optimal per-text weights that we want to estimate, e is zero-mean error, and f is a monotonically increasing function which we do not have to know.

This function uses the single-index model parameter estimator from Thrush et al. (2024): https://arxiv.org/abs/2409.05816, which is the Spearman rank correlation between y and each column of X, following:

1-6*(sum_{k=1}^N(Rank(y_k)-Rank(x_k))^2)/(N*(N^2-1)),

where N is the number of LLMs, x_k is the per-text bits-per-byte vector of the k-th LLM, y_k is the benchmark error of the k-th LLM, and Rank is computed across the N LLMs (column-wise for the x vectors).
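
Since only ranks are involved, the formula can be checked with a few lines of numpy; this is a sketch assuming no ties and synthetic data, not the package's implementation:

>>> import numpy as np
>>>
>>> N, D = 50, 200
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((N, D))  # synthetic bits-per-byte
>>> y = rng.random(N)                # synthetic benchmark error
>>>
>>> # Rank y and each column of X across the N LLMs (assuming no ties),
>>> # then apply the Spearman formula above column by column:
>>> rank_y = y.argsort().argsort() + 1
>>> rank_X = X.argsort(axis=0).argsort(axis=0) + 1
>>> d = rank_X - rank_y[:, None]
>>> manual_estimate = 1 - 6 * (d ** 2).sum(axis=0) / (N * (N ** 2 - 1))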

NOTE: This estimator is robust to outliers in X and y.

NOTE: The current version of the Thrush et al. paper does not include a proof that this estimator matches the ranks of the optimal weights in expectation, but we have since proved this.

Parameters:
  • X (numpy.ndarray) – An NxD matrix with bits-per-byte values of N LLMs on D pieces of text.

  • y (numpy.array) – An N-length vector with the benchmark error (1-accuracy) of the N LLMs.

Returns:

The D-length vector estimate of w.

Return type:

numpy.array

Raises:
  • ValueError – If X is not 2 dimensional.

  • ValueError – If y is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import spearmanr
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = spearmanr(X, y)
perplexity_correlations.projection.l2(estimate, tau, atol=1e-12)

Given a real-valued vector ‘estimate’ and a positive real-valued vector ‘tau’, this method projects ‘estimate’, minimizing the L_2 distance between ‘estimate’ and the projected vector, subject to:

sum(projected_estimate) = 1

0 <= projected_estimate[i] <= tau[i], for all i

It uses the fast projection solution from Thrush et al. (2024): https://arxiv.org/abs/2409.05816
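
Conceptually, this is the Euclidean projection of the estimate onto the constraint set above. A rough reference implementation of such a projection (a sketch only, not necessarily identical to the package's routine) shifts the estimate by a scalar found via bisection and clips the result to [0, tau]:

>>> import numpy as np
>>>
>>> def l2_projection_sketch(estimate, tau, atol=1e-12):
...     # Euclidean projection onto {x : sum(x) = 1, 0 <= x <= tau}.
...     # The solution has the form x[i] = clip(estimate[i] - lam, 0, tau[i]),
...     # where lam is found by bisection so that the result sums to 1.
...     lo, hi = estimate.min() - tau.max(), estimate.max()
...     for _ in range(10000):
...         lam = (lo + hi) / 2
...         if np.clip(estimate - lam, 0, tau).sum() > 1:
...             lo = lam
...         else:
...             hi = lam
...         if hi - lo <= atol:
...             break
...     return np.clip(estimate - (lo + hi) / 2, 0, tau)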

This projection turns the estimate from one of the estimator methods into a sampling distribution that you could use for training a model on D different domains of text (where len(estimate) == len(tau) == D). tau specifies constraints that prevent you from upsampling a domain of text too much. In Thrush et al., the standard choice for tau[i] is to set it as large as possible such that you won’t duplicate data by sampling the i-th domain with weight tau[i].
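
For example, if you know roughly how many tokens each domain contains and how many tokens you plan to pretrain on, one simple way to build such a tau is sketched below (the token counts and budget are hypothetical, and the exact recipe in Thrush et al. may differ):

>>> import numpy as np
>>>
>>> # Hypothetical per-domain token counts and a target pretraining token budget:
>>> domain_token_counts = np.array([2e9, 5e8, 1e9, 3e9])
>>> target_pretraining_tokens = 4e9
>>>
>>> # Sampling domain i with weight tau[i] then uses each of its tokens at most once:
>>> tau = domain_token_counts / target_pretraining_tokens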

NOTE: unlike projection.linear, the solution here depends on the exact values in the estimate, not just their ranks. To use this projection on estimates from estimation.sign_cdf, estimation.sign_sign, and estimation.spearmanr, you must invert the monotonic trigonometric functions appearing in those estimators' solutions to recover the exact values of the weight estimates, and potentially learn the norm through a hyperparameter search. Even after doing this, it is unlikely that you will be able to recover the true weight values if your LLM bits-per-byte data deviates too much from a Gaussian distribution. Read more at https://arxiv.org/abs/2409.05816.

Parameters:
  • estimate (numpy.ndarray) – A D-length vector returned from one of the perplexity_correlations.estimation methods (or a monotonically transformed estimate if using one of the robust estimators from Thrush et al.).

  • tau (numpy.array) – A D-length vector with the per-domain sampling thresholds.

  • atol (float, optional) – Allowable margin of error for each projected weight. Smaller values will make the bisection search take longer. Default is 1e-12.

Returns:

The D-length projected estimate to be used as a pretraining sampling distribution.

Return type:

numpy.array

Raises:
  • ValueError – If values in tau sum to less than 1.

  • ValueError – If any values in tau are non-positive.

  • ValueError – If estimate is not 1 dimensional.

  • ValueError – If tau is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import sign
>>> from perplexity_correlations.projection import l2
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = sign(X, y)
>>>
>>> # per-domain sampling thresholds
>>> # (the sum of this will almost certainly be >= 1)
>>> tau = np.random.rand(20000)
>>>
>>> projected_estimate = l2(estimate, tau)
perplexity_correlations.projection.linear(estimate, tau)

Given a real-valued vector ‘estimate’ and a positive real-valued vector ‘tau’, this method projects ‘estimate’, maximizing the dot product between ‘estimate’ and the projected vector (a linear projection), subject to:

sum(projected_estimate) = 1

0 <= projected_estimate[i] <= tau[i], for all i

It uses the fast projection solution from Thrush et al. (2024): https://arxiv.org/abs/2409.05816

This projection turns the estimate from one of the estimator methods into a sampling distribution that you could use for training a model on D different domains of text (where len(estimate) == len(tau) == D). tau specifies constraints that prevent you from upsampling a domain of text too much. In Thrush et al., the standard choice for tau[i] is to set it as large as possible such that you won’t duplicate data by sampling the i-th domain with weight tau[i].

NOTE: the solution here does not depend on the exact values in the estimate; it only depends on their ranks. This makes it easy to directly use the estimates from estimation.sign_cdf, estimation.sign_sign, and estimation.spearmanr, which, in expectation, return strictly monotonically increasing trigonometric functions of the optimal weights. Read more at https://arxiv.org/abs/2409.05816.
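
Conceptually, maximizing the dot product under these constraints has a simple greedy solution that only looks at the ordering of the estimate, consistent with the note above. The sketch below is illustrative and not necessarily the package's internal routine:

>>> import numpy as np
>>>
>>> def linear_projection_sketch(estimate, tau):
...     # Maximize <estimate, x> subject to sum(x) = 1 and 0 <= x <= tau by
...     # greedily filling the highest-weighted domains up to their tau caps
...     # until the total sampling weight reaches 1.
...     x = np.zeros_like(tau, dtype=float)
...     budget = 1.0
...     for i in np.argsort(-estimate):
...         x[i] = min(tau[i], budget)
...         budget -= x[i]
...         if budget <= 0:
...             break
...     return x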

Parameters:
  • estimate (numpy.ndarray) – A D-length vector returned from one of the perplexity_correlations.estimation methods.

  • tau (numpy.array) – A D-length vector with the per-domain sampling thresholds.

Returns:

The D-length projected estimate to be used as a pretraining sampling distribution.

Return type:

numpy.array

Raises:
  • ValueError – If values in tau sum to less than 1.

  • ValueError – If any values in tau are non-positive.

  • ValueError – If estimate is not 1 dimensional.

  • ValueError – If tau is not 1 dimensional.

Examples

>>> import numpy as np
>>> from perplexity_correlations.estimation import spearmanr
>>> from perplexity_correlations.projection import linear
>>>
>>> # Bits-per-byte from 100 LLMs on 20000 text domains:
>>> X = np.random.rand(100, 20000)
>>>
>>> # Benchmark error from the 100 LLMs:
>>> y = np.random.uniform(low=0, high=1, size=(100))
>>>
>>> # Estimate the weights for the relationship:
>>> estimate = spearmanr(X, y)
>>>
>>> # per-domain sampling thresholds
>>> # (the sum of this will almost certainly be >= 1)
>>> tau = np.random.rand(20000)
>>>
>>> projected_estimate = linear(estimate, tau)